Python Requests and BeautifulSoup Integration

In this guide, we’ll see how you can easily use ScraperAPI with the Python Requests library to scrape the web at scale. We will walk you through exactly how to create a scraper that will:

  • Send requests to ScraperAPI using our API endpoint, Python SDK or proxy port.
  • Automatically catch and retry failed requests returned by ScraperAPI.
  • Spread your requests over multiple concurrent threads so you can scale up your scraping to millions of pages per day.

Full code examples can be found on GitHub here.

Getting Started: Sending Requests With ScraperAPI

Using ScraperAPI as your proxy solution is very straightforward. All you need to do is send the URL you want to scrape to us via our API endpoint, Python SDK or proxy port, and we will manage everything to do with proxy/header rotation, automatic retries, ban detection and CAPTCHA bypassing.

The following is a simple implementation that will iterate through a list of URLs, request each of them via ScraperAPI, and print the HTML of each response.

python
import requests
from urllib.parse import urlencode

API_KEY = 'INSERT_API_KEY_HERE'

list_of_urls = ['http://quotes.toscrape.com/page/1/', 'http://quotes.toscrape.com/page/2/']

for url in list_of_urls:
    ## send the target URL to ScraperAPI via the API endpoint
    params = {'api_key': API_KEY, 'url': url}
    response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
    print(response.text)

Here are a few other points to note:

  • Timeouts – When you send a request to the API we will automatically select the best proxy/header configuration to get a successful response. However, if the response isn’t valid (ban, CAPTCHA, taking too long) then the API will automatically retry the request with a different proxy/header configuration. We will continue this cycle for up to 60 seconds until we either get a successful response or return a 500 error code to you. To ensure this process runs smoothly, either don’t set a timeout in your code or set it to at least 60 seconds.
  • SSL Cert Verification – For your requests to work properly with the API when using proxy mode, your code must be configured to not verify SSL certificates. With Python Requests this is as simple as adding the flag verify=False to the request (see the proxy-mode sketch after this list).
  • Request Size – You can scrape images, PDFs or other files just as you would any other URL, just remember that there is a 2MB limit per request.
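
To make the SSL point above concrete, here is a minimal sketch of sending a request through the proxy port with certificate verification disabled. The proxy connection string and the 70-second timeout are assumptions for illustration; confirm the exact proxy address and credentials in your ScraperAPI dashboard.

python
import requests

API_KEY = 'INSERT_API_KEY_HERE'

## Assumed proxy connection string -- check your dashboard for the exact host and port
proxies = {
    'http': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
    'https': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
}

response = requests.get(
    'http://quotes.toscrape.com/page/1/',
    proxies=proxies,
    verify=False,  ## proxy mode requires SSL certificate verification to be disabled
    timeout=70,    ## leave headroom for the API's 60-second retry window
)
print(response.text)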

Configuring Your Code To Retry Failed Requests

For most sites, over 97% of your requests will be successful on the first try; however, it is inevitable that some requests will fail. For these failed requests, the API will return a 500 status code and won’t charge you for the request.

If you set your code to automatically retry these failed requests, 99.9% will be successful within 3 retries unless there is an issue with the site itself.

Here is some example code, showing you how you can automatically retry failed requests returned by ScraperAPI. We recommend that you set your number of retries to at least 3 retries.

python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

API_KEY = 'INSERT_API_KEY_HERE'
NUM_RETRIES = 3

list_of_urls = ['http://quotes.toscrape.com/page/1/', 'http://quotes.toscrape.com/page/2/']

scraped_quotes = []

for url in list_of_urls:
    params = {'api_key': API_KEY, 'url': url}
    response = None
    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                ## escape the retry loop if the API returns a successful response
                break
        except requests.exceptions.ConnectionError:
            response = None

    ## parse data if 200 status code (successful response)
    if response is not None and response.status_code == 200:

        """
        Insert the parsing code for your use case here...
        """

        ## Example: parse data with BeautifulSoup
        html_response = response.text
        soup = BeautifulSoup(html_response, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        ## loop through each quote section and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            ## add scraped data to the "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

print(scraped_quotes)

As you might have noticed, there is no retry code needed if you are using the ScraperAPI SDK, because we’ve built automatic retries into the SDK for you. The default is 3 retries; however, you can override this by setting the retry flag, for example retry=5, as in the sketch below.
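
Here is a minimal sketch of overriding the retry setting via the SDK. The import path and method signature shown are assumptions based on the description above, so check the SDK’s own documentation for the exact names.

python
## Assumed SDK import path and signature -- verify against the SDK documentation
from scraper_api import ScraperAPIClient

client = ScraperAPIClient('INSERT_API_KEY_HERE')

## retry=5 overrides the SDK's default of 3 retries
result = client.get(url='http://quotes.toscrape.com/page/1/', retry=5)
print(result.text)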

Use Multiple Concurrent Threads To Increase Scraping Speed

ScraperAPI is designed to let you scale your scraping from a couple of hundred pages per day to millions of pages per day, simply by changing your plan to one with a higher concurrent thread limit.

The more concurrent threads you have, the more requests you can have active in parallel and the faster you can scrape.

If you are new to high volume scraping, it can sometimes be a bit tricky to set up your code to make full use of the concurrent threads available in your plan. So to make it as simple as possible to get set up, we’ve created an example scraper that you can easily adapt to your use case.

For the purposes of this example we’re going to scrape Quotes to Scrape and save the scraped data to a CSV file; apart from the parsing logic, however, this code will work on any website.

python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode
import concurrent.futures

API_KEY = 'INSERT_API_KEY_HERE'
NUM_RETRIES = 3
NUM_THREADS = 5


## Example list of urls to scrape
list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]


## we will store the scraped data in this list
scraped_quotes = []

def scrape_url(url):

    params = {'api_key': API_KEY, 'url': url}

    ## send request to scraperapi, and automatically retry failed requests
    response = None
    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                ## escape the retry loop if the API returns a successful response
                break
        except requests.exceptions.ConnectionError:
            response = None


    ## parse data if 200 status code (successful response)
    if response is not None and response.status_code == 200:

        ## Example: parse data with BeautifulSoup
        html_response = response.text
        soup = BeautifulSoup(html_response, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        ## loop through each quote section and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            ## add scraped data to the "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

## spread the requests over NUM_THREADS concurrent threads
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_url, list_of_urls)


print(scraped_quotes)
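
The introduction to this example mentions saving the scraped data to a CSV file, which the script above stops short of. Here is a minimal sketch of that final step, appended after the print statement; the file name quotes.csv is just an illustrative choice.

python
import csv

## write the scraped quotes to a csv file (file name is an arbitrary example)
with open('quotes.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=['quote', 'author'])
    writer.writeheader()
    writer.writerows(scraped_quotes)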

 

These code examples can be found on GitHub here.
