In this guide, we’ll see how you can easily use Scraper API with a NodeJS scraper to scrape the web at scale. We will walk you through exactly how to create a scraper that will:

  • Send requests to Scraper API using our API endpoint, NodeJS SDK, or proxy port.
  • Automatically catch and retry failed requests returned by Scraper API.
  • Spread your requests over multiple concurrent threads so you can scale up your scraping to millions of pages per day.

Full code examples can be found on GitHub here.

Getting Started: Sending Requests With Scraper API

Using Scraper API as your proxy solution is very straightforward. All you need to do is send us the URL you want to scrape via our API endpoint, NodeJS SDK, or proxy port, and we will manage everything to do with proxy/header rotation, automatic retries, ban detection, and CAPTCHA bypassing.

The following is a simple implementation that requests a single page via Scraper API and logs the HTML response from quotes.toscrape.com, using Scraper API as the proxy.

javascript
const rp = require('promise-request-retry');

const API_KEY = 'INSERT_API_KEY_HERE';
const URL = 'http://quotes.toscrape.com/page/1/';

const options = {
    uri: 'http://api.scraperapi.com/',
    qs: {
        'api_key': API_KEY,
        'url': URL
    },
    resolveWithFullResponse: true
};

rp(options)
    .then(response => {
        console.log(response.body);
    })
    .catch(error => {
        console.log(error);
    });

Here are a couple of other points to note:

  • Timeouts – When you send a request to the API we will automatically select the best proxy/header configuration to get a successful response. However, if the response isn’t valid (a ban, a CAPTCHA, or taking too long) the API will automatically retry the request with a different proxy/header configuration. We will continue this cycle for up to 60 seconds until we either get a successful response or return a 500 error code to you. To ensure this process runs smoothly, make sure you either don’t set a timeout or set it to at least 60 seconds.
  • SSL Cert Verification – In order for your requests to work properly with the API when using proxy mode your code must be configured to not verify SSL certificates.
  • Request Size – You can scrape images, PDFs, or other files just as you would any other URL, just remember that there is a 2MB limit per request.
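If you use proxy mode rather than the API endpoint, your API key is passed as the proxy credentials and SSL verification is disabled, per the point above. Here is a minimal sketch of what those options look like with a request-style library. The proxy address (`proxy-server.scraperapi.com:8001`) is an assumption based on Scraper API's documentation at the time of writing, so confirm it on your dashboard:

```javascript
// Sketch of proxy-port mode using request-style options.
// The proxy host below is an assumption -- check your dashboard for the
// current address before using it.
const API_KEY = 'INSERT_API_KEY_HERE';

// Your API key is sent as the proxy password, with 'scraperapi' as the username.
const proxyUrl = `http://scraperapi:${API_KEY}@proxy-server.scraperapi.com:8001`;

const options = {
    uri: 'http://quotes.toscrape.com/page/1/',
    proxy: proxyUrl,
    // Proxy mode intercepts HTTPS, so SSL verification must be disabled:
    strictSSL: false,
    resolveWithFullResponse: true
};
```

The `proxy` and `strictSSL` options shown are standard options of the underlying request library.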

Configuring Your Code To Retry Failed Requests

For most sites, over 97% of your requests will be successful on the first try; however, some requests will inevitably fail. For these failed requests, the API will return a 500 status code and won’t charge you for the request.

If you configure your code to automatically retry these failed requests, 99.9% of them will succeed within 3 retries unless there is an issue with the site itself.
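The retry behaviour described above boils down to a simple pattern: retry on failure, backing off between attempts. Here is a purely illustrative sketch of that pattern (the library used below handles all of this for you, so you never need to write it yourself):

```javascript
// Illustrative sketch of retry-with-backoff: retry a failed request up to
// `retries` times, multiplying the delay by `factor` each attempt
// (delay: 5000, factor: 2 -> waits of 5s, 10s, 20s, ...).
const wait = ms => new Promise(resolve => setTimeout(resolve, ms));

async function requestWithRetry(makeRequest, retries = 3, delay = 5000, factor = 2) {
    let lastError;
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            return await makeRequest();
        } catch (error) {
            lastError = error;
            if (attempt < retries) {
                await wait(delay * Math.pow(factor, attempt));
            }
        }
    }
    throw lastError;
}
```

`makeRequest` here stands in for any function returning a promise, such as a call to the API endpoint shown earlier.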

Here is some example code showing how you can automatically retry failed requests returned by Scraper API with the promise-request-retry library. We recommend setting your number of retries to at least 3.

javascript
const rp = require('promise-request-retry');

const API_KEY = 'INSERT_API_KEY_HERE';
const NUM_RETRIES = 5;

const URL = 'http://quotes.toscrape.com/page/1/';

const options = {
    uri: 'http://api.scraperapi.com/',
    qs: {
        'api_key': API_KEY,
        'url': URL
    },
    retry: NUM_RETRIES,
    verbose_logging: false,
    accepted: [200, 404, 403],
    delay: 5000,
    factor: 2,
    resolveWithFullResponse: true
};

rp(options)
    .then(response => {
        console.log(response.body);
    })
    .catch(error => {
        console.log(error);
    });

Use Multiple Concurrent Threads To Increase Scraping Speed

Scraper API is designed to allow you to increase your scraping from a couple of hundred pages per day to millions of pages per day, simply by changing your plan to have a higher concurrent thread limit.

The more concurrent threads you have the more requests you can have active in parallel, and the faster you can scrape.
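As a back-of-the-envelope illustration (the request times here are made-up example numbers, not measured figures):

```javascript
// Rough throughput estimate: with each thread holding one request open
// at a time,
//   pages per day ~= threads * (seconds per day / avg seconds per request)
const estimatePagesPerDay = (threads, avgSecondsPerRequest) =>
    Math.floor(threads * (86400 / avgSecondsPerRequest));

console.log(estimatePagesPerDay(5, 3));   // 5 threads, ~3s/request -> 144000
console.log(estimatePagesPerDay(50, 3));  // 50 threads, ~3s/request -> 1440000
```

So going from 5 to 50 concurrent threads scales your daily throughput tenfold without any other code changes.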

If you are new to high-volume scraping, it can sometimes be a bit tricky to set up your code to maximize the number of concurrent threads you have available in your plan. So to make it as simple as possible for you to get set up we’ve created an example scraper that you can easily change for your use case.
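The core trick the example scraper relies on is a simple counter: decrement it when a request starts, increment it when the request finishes, and only launch new requests while the counter is positive. Here is that pattern in isolation, with a timer standing in for a real request:

```javascript
// Minimal sketch of the thread-counting pattern: only launch a new task
// while a "thread" is free, and poll briefly when none are.
const wait = ms => new Promise(resolve => setTimeout(resolve, ms));

const MAX_THREADS = 2;
let freeThreads = MAX_THREADS;
let maxObservedInFlight = 0;

const fakeRequest = async (url) => {
    freeThreads--;  // claim a thread (synchronously, before any await)
    maxObservedInFlight = Math.max(maxObservedInFlight, MAX_THREADS - freeThreads);
    await wait(50); // stand-in for a real network request
    freeThreads++;  // release the thread
    return url;
};

(async () => {
    const urls = ['a', 'b', 'c', 'd', 'e'];
    const pending = [];
    while (urls.length > 0) {
        if (freeThreads > 0) {
            pending.push(fakeRequest(urls.shift()));
        }
        await wait(10); // poll until a thread frees up
    }
    await Promise.all(pending);
    console.log(`max in flight: ${maxObservedInFlight}`); // never exceeds MAX_THREADS
})();
```

Because the counter is decremented synchronously before the first `await`, the loop can never launch more than `MAX_THREADS` requests at once.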

For the purposes of these examples we’re going to scrape Quotes to Scrape and log the scraped data with console.log; however, this code will work on any website (apart from the parsing logic).

javascript
const rp = require('promise-request-retry');
const cheerio = require("cheerio");

/*
SCRAPER SETTINGS

You need to define the following values below:

- API_KEY --> Find this on your dashboard, or signup here to create a 
                free account here https://dashboard.scraperapi.com/signup

- NUM_CONCURRENT_THREADS --> Set this equal to the number of concurrent threads available
                in your plan. For reference: Free Plan (5 threads), Hobby Plan (10 threads),
                Startup Plan (25 threads), Business Plan (50 threads), 
                Enterprise Plan (up to 5,000 threads).

- NUM_RETRIES --> We recommend setting this to 5 retries. For most sites 
                over 97% of your requests will be successful on the first try,
                and 99.9% after 3 retries. 

*/


const API_KEY = 'INSERT_API_KEY_HERE'; 
const NUM_CONCURRENT_THREADS = 5;
const NUM_RETRIES = 5;


// Example list of URLs to scrape
const urlsToScrape = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
    'http://quotes.toscrape.com/page/4/',
    'http://quotes.toscrape.com/page/5/',
    'http://quotes.toscrape.com/page/6/',
    'http://quotes.toscrape.com/page/7/',
    'http://quotes.toscrape.com/page/8/',
    'http://quotes.toscrape.com/page/9/'
  ];

let freeThreads = NUM_CONCURRENT_THREADS;
let responsePromises = []

// Store scraped data in this list
let scrapedData = [];


const wait = ms => new Promise(resolve => setTimeout(() => resolve(true), ms));


const checkFreeThreads = (availableThreads, maxThreads) => {
    /*
        Function that returns true or false depending on whether there is a
        concurrent thread free or not. Used to manage the scraper's concurrency.
    */
    return 0 < availableThreads && availableThreads <= maxThreads;
}

const makeConcurrentRequest = async (inputUrl) => {
    /*
        Function that makes a request with the request-promise-retry library, while 
        also incrementing/decrementing the number of concurrent threads
        available to the scraper.
    */
    freeThreads--
    try {
        const options = {
            uri: `http://api.scraperapi.com/`,
            qs: {
                'api_key': API_KEY,
                'url': inputUrl
            },
            retry : NUM_RETRIES, 
            verbose_logging : false,
            accepted: [ 200, 404, 403 ], 
            delay: 5000, 
            factor: 2,
            resolveWithFullResponse: true
        }
        const response = await rp(options);
        freeThreads++
        return response
    } catch (e) {
        freeThreads++
        return e
    }
}




(async () => {
    /*
        MAIN SCRAPER SCRIPT
        While there are still urls left to scrape, it will make requests and 
        parse the response whilst ensuring the scraper doesn't exceed the 
        number of concurrent threads available in the Scraper API plan.
    */

    while(urlsToScrape.length > 0){

        if(checkFreeThreads(freeThreads, NUM_CONCURRENT_THREADS)){

            // take URL from the list of URLs to scrape
            const url = urlsToScrape.shift()

            try {
                // make request and return promise
                const response = makeConcurrentRequest(url)

                // log promise so we can make sure all promises resolved before exiting scraper
                responsePromises.push(response)

                // once the response is received, parse the data from the page
                response.then(fullResponse => {
                    
                    // before parsing, check to see if response is valid.
                    if(fullResponse.statusCode == 200){

                        // load html with cheerio
                        let $ = cheerio.load(fullResponse.body);

                        // find all quotes sections
                        let quotes_sections = $('div.quote')

                        // loop through the quotes sections and extract data
                        quotes_sections.each((index, element) => {
                            const quote = $(element).find('span.text').text()
                            const author = $(element).find('small.author').text()

                            // add scraped data to scrapedData array
                            scrapedData.push({
                                'quote': quote,
                                'author': author
                            })

                        });

                    } else {
                        // if the response status code isn't 200, then log the message
                        console.log(fullResponse.message)
                    }

                }).catch(error => {
                    console.log(error)
                })
   
            } catch (error){
                console.log(error)
            }
                
        }
        // if no freeThreads available then wait for 200ms before retrying.
        await wait(200);
    
    } // end of while loop

    
    // don't output scraped data until all promises have been resolved
    Promise.all(responsePromises).then(() => {
        console.log('scrapedData: ', scrapedData); 
    });


})();

These code examples can be found on GitHub here.