Python Scrapy Integration

In this guide, we’ll see how you can easily use ScraperAPI with Python’s Scrapy web scraping framework. We will walk you through exactly how to integrate ScraperAPI with your Scrapy spiders so you can get the most out of ScraperAPI.

Full code examples can be found on GitHub here.

Getting Started: Sending Requests With ScraperAPI

Using ScraperAPI as your proxy solution is very straightforward. All you need to do is send the URL you want to scrape to us via our API endpoint, Python SDK, or proxy port, and we will manage everything to do with proxy/header rotation, automatic retries, ban detection, and CAPTCHA bypassing.

First, let’s look at your typical request code in Scrapy:

python

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
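The snippets in this guide omit the parse callback for brevity. If you want a complete, runnable spider, a minimal callback added to the QuotesSpider class could look like the sketch below; the CSS selectors are based on quotes.toscrape.com's markup and are not part of the ScraperAPI integration itself.

python

    def parse(self, response):
        # Each quote on the page lives in a div with the "quote" class.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }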

To integrate ScraperAPI with your Scrapy spiders, we just need to change the Scrapy request below so that it sends your requests to ScraperAPI instead of directly to the website:

python

yield scrapy.Request(url=url, callback=self.parse)

Luckily, reconfiguring this is super easy. You can choose from 3 ways to do so.

API Endpoint

If you want to send your requests to our API endpoint http://api.scraperapi.com/, you can include a simple function get_scraperapi_url in your code that reformats the URL into an API request to our endpoint, and then use this function in your Scrapy requests.

python

import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scraperapi_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=get_scraperapi_url(url), callback=self.parse)


SDK

To save you having to write this function yourself, we’ve created a Python Scrapy SDK that takes care of this for you. All you need to do is install the SDK using pip.

bash

pip install scraperapi-sdk

Then integrate the SDK into your code by initialising the ScraperAPIClient with your API key and using the client.scrapyGet method to make requests.

python

import scrapy
from scraper_api import ScraperAPIClient
client = ScraperAPIClient('YOUR_API_KEY')

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(client.scrapyGet(url=url), callback=self.parse)


Proxy Mode

Alternatively, you can use ScraperAPI like any normal proxy solution: simply set our proxy port http://scraperapi:YOUR_API_KEY@proxy-server.scraperapi.com:8001 as the proxy in the meta parameter, just as you would with any other proxy.

python

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        meta = {
            "proxy": "http://scraperapi:YOUR_API_KEY@proxy-server.scraperapi.com:8001"
            }

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, meta=meta)
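If you would rather not attach the meta dict to every request, a small downloader middleware can set the proxy for you. This is plain Scrapy functionality rather than anything ScraperAPI-specific, so treat the snippet below as a sketch; the module path myproject.middlewares and the priority value are placeholders you would adapt to your own project.

python

## middlewares.py
class ScraperAPIProxyMiddleware:
    def process_request(self, request, spider):
        # Route every outgoing request through the ScraperAPI proxy port.
        request.meta['proxy'] = 'http://scraperapi:YOUR_API_KEY@proxy-server.scraperapi.com:8001'

## settings.py
# Enable the middleware with a priority lower than HttpProxyMiddleware's
# default of 750 so the proxy is set before that middleware runs.
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.ScraperAPIProxyMiddleware': 350,
# }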



SSL Cert Verification – Scrapy skips SSL certificate verification by default, so there is no need to disable SSL verification for these requests.

Using Additional ScraperAPI Functionality

ScraperAPI enables you to customize the API's functionality by adding additional parameters to your requests. For example, you can tell ScraperAPI to render any JavaScript on the target website by adding render=true to your request.

python
import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scraperapi_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'render': 'true'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=get_scraperapi_url(url), callback=self.parse)

The API will accept the following parameters:

render: Activate JavaScript rendering by setting render=true in your request. The API will automatically render the JavaScript on the page and return the HTML response after the JavaScript has been rendered.
country_code: Activate country geotargeting by setting country_code=us, for example, to use US proxies.
premium: Activate premium residential and mobile IPs by setting premium=true. Using premium proxies costs 10 API credits, or 25 API credits if used in combination with JavaScript rendering.
session_number: Reuse the same proxy by setting session_number=123, for example.
keep_headers: Use your own custom headers by setting keep_headers=true along with sending your own headers to the API.
device_type: Set your requests to use mobile or desktop user agents by setting device_type=desktop or device_type=mobile.
autoparse: Activate auto parsing for select websites by setting autoparse=true. The API will parse the data on the page and return it in JSON format.
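These parameters can be combined in a single request. As a minimal sketch that reuses the get_scraperapi_url helper from above, here is how you might enable US geotargeting and a sticky session at the same time (the parameter values are just examples):

python

import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scraperapi_url(url):
    # Route requests through US proxies and reuse the same proxy for
    # every request that carries session_number=123.
    payload = {
        'api_key': API_KEY,
        'url': url,
        'country_code': 'us',
        'session_number': '123',
    }
    return 'http://api.scraperapi.com/?' + urlencode(payload)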

Sending Custom Headers To ScraperAPI

By default, ScraperAPI will select the best header configuration to get the highest possible success rate from your target website. In certain circumstances, however, you may want to use your own headers. In these cases, simply add keep_headers=true to the ScraperAPI request and send your headers as you normally would.

python
import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scraperapi_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'keep_headers': 'true'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        headers = {
            "X-MyHeader": "123"
            }

        for url in urls:
            yield scrapy.Request(url=get_scraperapi_url(url), headers=headers, callback=self.parse)

Configuring Your Scrapy Settings

To get the most out of your ScraperAPI plan, you need to change a couple of settings in your Scrapy project's settings.py file.

Concurrency

ScraperAPI is designed to allow you to increase your scraping from a couple hundred pages per day to millions of pages per day, simply by changing your plan to have a higher concurrent thread limit.

The more concurrent threads you have the more requests you can have active in parallel, and the faster you can scrape. 

To change the number of concurrent threads your spiders have access to, you simply need to set the CONCURRENT_REQUESTS setting to the number of threads available in your plan. 

For reference: Free Plan (5 threads), Hobby Plan (10 threads), Startup Plan (25 threads), Business Plan (50 threads), and Enterprise Plans (up to 5,000 threads).

python
## settings.py

CONCURRENT_REQUESTS = 5  ## Free Plan has 5 concurrent threads.

# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY

Note: make sure that DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY aren't enabled in your settings.py file, as these will lower your concurrency and are not needed with ScraperAPI.

Retrying Failed Requests

For most sites, over 97% of your requests will be successful on the first try, however, it is inevitable that some requests will fail. For these failed requests, the API will return a 500 status code and won’t charge you for the request.

In this case, we want to set our code to automatically retry these failed requests, as virtually 100% of requests will be successful after 3 retries unless there is an issue with the site.

Scrapy already has the functionality built in to catch and retry failed requests (500 responses are retried by its RetryMiddleware by default), so you simply need to set the RETRY_TIMES setting in the settings.py file to 3 or more retries.

python
## settings.py

RETRY_TIMES = 3

Disable Obeying Robots.txt

By default, Scrapy will first send a request to the target website's robots.txt file and verify that it allows you to access the site programmatically. However, this can interfere with ScraperAPI if you send the requests to the API endpoint. To prevent this, you need to set ROBOTSTXT_OBEY = False in your settings.py file.

Your final settings.py file should have the following settings enabled and disabled (along with any other settings you want to set).

python
## settings.py

ROBOTSTXT_OBEY = False

CONCURRENT_REQUESTS = 5  ## Free Plan has 5 concurrent threads.

RETRY_TIMES = 3

# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY

The full code examples can be found on GitHub here.
