Stealthy Crawling using Scrapy, Tor and Privoxy

Khalid Alnajjar

Sometimes you need to crawl certain information online as part of a project. However, websites do not like crawlers much, for obvious reasons, and many implement mechanisms for blocking them. In this post, I will explain how to crawl websites without exposing your information and, in case the crawler gets blocked, how to have it change its identity and bypass the block. To do so, we will route traffic through the Tor network and expose it to our crawler as a proxy using Privoxy. The crawler itself is a simple one built with Scrapy.

Installing and Configuring Tor with Privoxy

Now, let’s install Tor and Privoxy. On Debian/Ubuntu, you should be able to install them with the commands below:

sudo apt-get update
sudo apt-get install tor tor-geoipdb privoxy

Configuring Tor

If you just want to set up Tor, you don’t need to edit anything. However, if you’d like to control Tor automatically from a script, you need to set the control port and a control password. First, generate a hash of your password with (replace PASSWORDHERE with your password):

tor --hash-password PASSWORDHERE

Next, copy the generated hash and add the lines below to the end of /etc/tor/torrc (replace GENERATEDHASH with the generated hash):

ControlPort 9051
HashedControlPassword GENERATEDHASH

Configuring Privoxy

With your favorite editor, add the lines below to the end of /etc/privoxy/config. The forward-socks5t directive forwards all requests to Tor’s SOCKS listener on 127.0.0.1:9050 (the trailing dot means no additional HTTP parent proxy). Privoxy itself listens on 127.0.0.1:8118 by default, which is the address our crawler will use as its HTTP proxy.

forward-socks5t / 127.0.0.1:9050 .

# Optional: longer timeouts, since connections through Tor can be slow
keep-alive-timeout 600
default-server-timeout 600
socket-timeout 600

Now that everything is configured, all that is left is to start the services (or restart them if they were already running, so the new configuration takes effect):

sudo service privoxy start
sudo service tor start

To check that everything is working properly, curl http://ifconfig.me to get your current IP, then repeat the request through Tor and through the Privoxy proxy. The IPs returned through Tor and Privoxy should differ from your real one.

curl http://ifconfig.me # get your current IP
torify curl http://ifconfig.me # test Tor
curl -x 127.0.0.1:8118 https://ifconfig.me # test privoxy
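If you also configured the ControlPort, you can check that the hashed password is accepted by talking to the control port directly. This is an optional sanity check, not part of the original setup; it assumes netcat is available and uses the plain-text password you hashed earlier. Tor should answer each command with 250 OK:

printf 'AUTHENTICATE "PASSWORDHERE"\r\nSIGNAL NEWNYM\r\nQUIT\r\n' | nc 127.0.0.1 9051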

Crawling using Scrapy with Tor

Scrapy is a great Python framework for building crawlers: it is easy to use and highly customizable. We will use it in this post, but the approach works just as well in other languages. Let’s create a new project and spider with Scrapy:

python3 -m venv venv # create a virtual environment
source venv/bin/activate # activate it
pip install -U scrapy stem requests[socks] # install dependencies

scrapy startproject mokha; cd mokha # create a project
scrapy genspider ifconfig ifconfig.me # create a spider
mkdir mokha/middlewares; touch mokha/middlewares/__init__.py # create a middlewares package inside the mokha package
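At this point the project should look roughly like this (Scrapy generates a few more files, such as items.py and pipelines.py, which are omitted here); ProxyMiddleware.py is the file we will create in a moment:

mokha/                      # project root (contains scrapy.cfg)
    scrapy.cfg
    mokha/                  # the Python package
        __init__.py
        settings.py
        middlewares/
            __init__.py
            ProxyMiddleware.py   # created in the next step
        spiders/
            __init__.py
            ifconfig.py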

To activate the middleware we are about to write, add the following lines at the end of mokha/settings.py (543 is simply its priority in the downloader middleware chain):

DOWNLOADER_MIDDLEWARES = {
    'mokha.middlewares.ProxyMiddleware.ProxyMiddleware': 543,
}

Create ProxyMiddleware.py inside the new mokha/middlewares folder and place the following code in it. The function new_tor_identity simply signals the Tor controller to give us a new identity. Make sure to change the password PASSWORDHERE to the one you used earlier when configuring Tor. You can call the function from either process_request or process_response: if you want a new identity for every request, call it in the former; if you only want a new identity in certain situations (e.g. after you have been blocked), call it from process_response once you have verified that you were blocked.

from stem import Signal
from stem.control import Controller
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware


def new_tor_identity():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='PASSWORDHERE')
        controller.signal(Signal.NEWNYM)


class ProxyMiddleware(HttpProxyMiddleware):
    def process_response(self, request, response, spider):
        # Get a new identity depending on the response
        if response.status != 200:
            new_tor_identity()
            # Re-schedule the same request; dont_filter prevents the
            # duplicate filter from silently dropping the retry
            request.dont_filter = True
            return request
        return response

    def process_request(self, request, spider):
        # A new identity for each request.
        # Comment this out if you only want a new identity via process_response.
        new_tor_identity()

        # Route the request through Privoxy, which forwards it to Tor
        request.meta['proxy'] = 'http://127.0.0.1:8118'
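One caveat: Tor rate-limits NEWNYM signals (roughly one every ten seconds by default), so asking for a new identity on every request will not always give you a fresh exit IP. If that matters for your use case, a possible variant is sketched below; it is not part of the original middleware, reuses the Controller and Signal imports from the top of ProxyMiddleware.py, and relies on stem’s is_newnym_available() and get_newnym_wait() helpers:

import time

def new_tor_identity_throttled():
    # Same idea as new_tor_identity, but waits until Tor is willing to
    # honour another NEWNYM signal before sending it
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='PASSWORDHERE')
        if not controller.is_newnym_available():
            time.sleep(controller.get_newnym_wait())
        controller.signal(Signal.NEWNYM)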

Lastly, implement the spider (mokha/spiders/ifconfig.py). This is a very simple proof-of-concept crawler: all it does is request a single page and log the IP it sees.

import scrapy

class IfconfigSpider(scrapy.Spider):
    name = 'ifconfig'
    allowed_domains = ['ifconfig.me']
    start_urls = ['http://ifconfig.me/']

    def parse(self, response):
        self.log('IP : %s' % response.css('#ip_address').get())
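With everything in place, run the spider twice and compare the logged IPs:

scrapy crawl ifconfig # first run, note the logged IP
scrapy crawl ifconfig # run again, the logged IP should be different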

If the two runs report different IPs, everything is working as intended. I hope you have found this post helpful.