Python Scrapy Integration

In this guide, we’ll see how you can easily use ScraperAPI with Python’s Scrapy web scraping framework. We will walk you through exactly how to integrate ScraperAPI with your Scrapy spiders so you can get the most out of ScraperAPI.

Full code examples can be found on GitHub here.

Getting Started: Sending Requests With ScraperAPI

Using ScraperAPI as your proxy solution is very straightforward. All you need to do is send the URL you want to scrape to us via our API endpoint, Python SDK, or proxy port, and we will manage everything to do with proxy/header rotation, automatic retries, ban detection, and CAPTCHA bypassing.

First, let’s look at your typical request code in Scrapy:

python

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
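The snippets in this guide omit the parse callback for brevity. If you want a complete, runnable spider, a minimal callback added to the QuotesSpider class could look like the sketch below; the CSS selectors are based on quotes.toscrape.com's markup and are not part of the ScraperAPI integration itself.

python

    def parse(self, response):
        # Each quote on the page lives in a div with the "quote" class.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }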

To integrate ScraperAPI with your Scrapy spiders, we just need to change the Scrapy request below so that it sends your requests to ScraperAPI instead of directly to the website:

python

yield scrapy.Request(url=url, callback=self.parse)

Luckily, reconfiguring this is super easy. You can choose from 3 ways to do so.

API Endpoint

If you want to send your requests to our API endpoint http://api.scraperapi.com/, you can include a simple function get_scraperapi_url in your code that reformats the URL into an API request to our endpoint, and then use this function in your Scrapy requests.

python

import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scraperapi_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=get_scraperapi_url(url), callback=self.parse)


SDK

To save you having to write this function yourself, we’ve created a Python Scrapy SDK that takes care of this for you. All you need to do is install the SDK using pip.

bash

pip install scraperapi-sdk

Then integrate the SDK into your code by initialising the ScraperAPIClient with your API key and using the client.scrapyGet method to make requests.

python

import scrapy
from scraper_api import ScraperAPIClient
client = ScraperAPIClient('YOUR_API_KEY')

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(client.scrapyGet(url=url), callback=self.parse)


Proxy Mode

Alternatively, you can use ScraperAPI like any normal proxy solution: simply set our proxy port http://scraperapi:YOUR_API_KEY@proxy-server.scraperapi.com:8001 as the proxy in the meta parameter, just as you would with any other proxy.

python

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        meta = {
            "proxy": "http://scraperapi:YOUR_API_KEY@proxy-server.scraperapi.com:8001"
            }

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, meta=meta)
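If you would rather not attach the meta dict to every request, a small downloader middleware can set the proxy for you. This is plain Scrapy functionality rather than anything ScraperAPI-specific, so treat the snippet below as a sketch; the module path myproject.middlewares and the priority value are placeholders you would adapt to your own project.

python

## middlewares.py
class ScraperAPIProxyMiddleware:
    def process_request(self, request, spider):
        # Route every outgoing request through the ScraperAPI proxy port.
        request.meta['proxy'] = 'http://scraperapi:YOUR_API_KEY@proxy-server.scraperapi.com:8001'

## settings.py
# Enable the middleware with a priority lower than HttpProxyMiddleware's
# default of 750 so the proxy is set before that middleware runs.
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.ScraperAPIProxyMiddleware': 350,
# }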



SSL Cert Verification – Scrapy skips SSL certificate verification by default, so there is no need to disable SSL verification for these requests.

Using Additional ScraperAPI Functionality

ScraperAPI enables you to customize the API's functionality by adding additional parameters to your requests. For example, you can tell ScraperAPI to render any JavaScript on the target website by adding render=true to your request.

python
import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scraperapi_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'render': 'true'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=get_scraperapi_url(url), callback=self.parse)

The API will accept the following parameters:

render: Activate JavaScript rendering by setting render=true in your request. The API will automatically render the JavaScript on the page and return the HTML response after the JavaScript has been rendered.
country_code: Activate country geotargeting by setting country_code=us, for example, to use US proxies.
premium: Activate premium residential and mobile IPs by setting premium=true. Using premium proxies costs 10 API credits, or 25 API credits if used in combination with JavaScript rendering.
session_number: Reuse the same proxy by setting session_number=123, for example.
keep_headers: Use your own custom headers by setting keep_headers=true along with sending your own headers to the API.
device_type: Set your requests to use mobile or desktop user agents by setting device_type=desktop or device_type=mobile.
autoparse: Activate auto parsing for select websites by setting autoparse=true. The API will parse the data on the page and return it in JSON format.
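These parameters can be combined in a single request. As a minimal sketch that reuses the get_scraperapi_url helper from above, here is how you might enable US geotargeting and a sticky session at the same time (the parameter values are just examples):

python

import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scraperapi_url(url):
    # Route requests through US proxies and reuse the same proxy for
    # every request that carries session_number=123.
    payload = {
        'api_key': API_KEY,
        'url': url,
        'country_code': 'us',
        'session_number': '123',
    }
    return 'http://api.scraperapi.com/?' + urlencode(payload)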

Sending Custom Headers To ScraperAPI

By default, ScraperAPI will select the best header configuration to get the highest possible success rate from your target website. In certain circumstances, however, you may want to use your own headers. In these cases, simply add keep_headers=true to the ScraperAPI request and send your headers as you normally would.

python
import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scraperapi_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'keep_headers': 'true'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        headers = {
            "X-MyHeader": "123"
            }

        for url in urls:
            yield scrapy.Request(url=get_scraperapi_url(url), headers=headers, callback=self.parse)

Configuring Your Scrapy Settings

To get the most out of your ScraperAPI plan, you need to change a couple of settings in your Scrapy project's settings.py file.

Concurrency

ScraperAPI is designed to allow you to increase your scraping from a couple hundred pages per day to millions of pages per day, simply by changing your plan to have a higher concurrent thread limit.

The more concurrent threads you have the more requests you can have active in parallel, and the faster you can scrape. 

To change the number of concurrent threads your spiders have access to, you simply need to set the CONCURRENT_REQUESTS setting to the number of threads available in your plan. 

For reference: Free Plan (5 threads), Hobby Plan (10 threads), Startup Plan (25 threads), Business Plan (50 threads), and Enterprise Plans (up to 5,000 threads).

python
## settings.py

CONCURRENT_REQUESTS = 5  ## Free Plan has 5 concurrent threads.

# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY

Note: make sure that DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY aren't enabled in your settings.py file, as these will lower your concurrency and are not needed with ScraperAPI.

Retrying Failed Requests

For most sites, over 97% of your requests will be successful on the first try, however, it is inevitable that some requests will fail. For these failed requests, the API will return a 500 status code and won’t charge you for the request.

In this case, we want to set our code to automatically retry these failed requests, as virtually 100% of requests will be successful after 3 retries unless there is an issue with the site.

Scrapy already has the functionality built in to catch and retry failed requests (500 responses are retried by its RetryMiddleware by default), so you simply need to set the RETRY_TIMES setting in the settings.py file to 3 or more retries.

python
## settings.py

RETRY_TIMES = 3

Disable Obeying Robots.txt

By default, Scrapy will first send a request to the target website's robots.txt file and verify that it allows you to access the site programmatically. However, this can interfere with ScraperAPI if you send the requests to the API endpoint. To prevent this, you need to set ROBOTSTXT_OBEY = False in your settings.py file.

Your final settings.py file should have the following settings enabled and disabled (along with any other settings you want to set).

python
## settings.py

ROBOTSTXT_OBEY = False

CONCURRENT_REQUESTS = 5  ## Free Plan has 5 concurrent threads.

RETRY_TIMES = 3

# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY

The full code examples can be found on GitHub here.
