In this guide, we’ll see how you can easily use ScraperAPI with a NodeJS scraper to scrape the web at scale. We will walk you through exactly how to create a scraper that will:
- Send requests to ScraperAPI using our API endpoint, NodeJS SDK, or proxy port.
- Automatically catch and retry failed requests returned by ScraperAPI.
- Spread your requests over multiple concurrent threads so you can scale up your scraping to millions of pages per day.
Full code examples can be found on GitHub here.
Getting Started: Sending Requests With ScraperAPI
Using ScraperAPI as your proxy solution is very straightforward. All you need to do is send the URL you want to scrape to us via our API endpoint, NodeJS SDK, or proxy port, and we will manage everything to do with proxy/header rotation, automatic retries, ban detection, and CAPTCHA bypassing.
The following is a simple implementation that iterates through a list of URLs and requests each of them via ScraperAPI, returning the HTML response from quotes.toscrape.com.
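Here is a minimal sketch of that implementation. It assumes axios as the HTTP client and uses YOUR_API_KEY as a placeholder for your own key, so it may differ slightly from the full example on GitHub.

```javascript
// Minimal sketch: request each URL through ScraperAPI's API endpoint with axios.
const axios = require('axios');

const API_KEY = 'YOUR_API_KEY'; // placeholder -- use your own ScraperAPI key
const urls = [
  'http://quotes.toscrape.com/page/1/',
  'http://quotes.toscrape.com/page/2/',
];

async function scrape() {
  for (const url of urls) {
    try {
      const response = await axios.get('http://api.scraperapi.com/', {
        params: { api_key: API_KEY, url: url },
        timeout: 60000, // give the API the full 60 seconds to find a working proxy
      });
      console.log(response.data); // raw HTML of the requested page
    } catch (err) {
      console.error(`Request for ${url} failed: ${err.message}`);
    }
  }
}

scrape();
```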
Here are a couple of other points to note:
- Timeouts – When you send a request to the API we will automatically select the best proxy/header configuration to get a successful response. However, if the response isn’t valid (ban, CAPTCHA, taking too long) then the API will automatically retry the request with a different proxy/header configuration. We will continue this cycle for up to 60 seconds until we either get a successful response or return a 500 error code to you. To ensure this process runs smoothly, make sure you either don’t set a timeout in your code or set it to at least 60 seconds.
- SSL Cert Verification – In order for your requests to work properly with the API when using proxy mode, your code must be configured to not verify SSL certificates (see the sketch after this list).
- Request Size – You can scrape images, PDFs, or other files just as you would any other URL; just remember that there is a 2MB limit per request.
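If you use the proxy port instead of the API endpoint, the timeout and SSL points above translate roughly to the configuration below. This is a sketch assuming axios; the proxy host, port, and credentials follow ScraperAPI’s documented proxy format, but check your dashboard for the exact values to use.

```javascript
const axios = require('axios');
const https = require('https');

// Proxy-port sketch: host/port/credentials are assumed from ScraperAPI's documented
// proxy format (username 'scraperapi', password = your API key) -- confirm in your dashboard.
const client = axios.create({
  proxy: {
    protocol: 'http',
    host: 'proxy-server.scraperapi.com',
    port: 8001,
    auth: { username: 'scraperapi', password: 'YOUR_API_KEY' },
  },
  // Don't verify SSL certificates in proxy mode (only relevant when scraping https URLs).
  httpsAgent: new https.Agent({ rejectUnauthorized: false }),
  // Leave at least 60 seconds so the API can retry internally before giving up.
  timeout: 60000,
});

client
  .get('http://quotes.toscrape.com/page/1/')
  .then((response) => console.log(response.data))
  .catch((err) => console.error(err.message));
```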
Configuring Your Code To Retry Failed Requests
For most sites, over 97% of your requests will be successful on the first try; however, some requests will inevitably fail. For these failed requests, the API will return a 500 status code and won’t charge you for the request.
In this case, if you set your code to automatically retry these failed requests, 99.9% of them will succeed within 3 retries unless there is an issue with the site itself.
Here is some example code showing how you can automatically retry failed requests returned by ScraperAPI with the request-promise-retry library. We recommend setting the number of retries to at least 3.
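Since only the retry pattern matters here, this sketch shows the same idea as a plain retry loop around axios rather than request-promise-retry; the helper name and retry count are illustrative.

```javascript
const axios = require('axios');

const API_KEY = 'YOUR_API_KEY';
const NUM_RETRIES = 3; // retry failed requests up to 3 times

// Illustrative helper: request a URL through ScraperAPI, retrying on any failure (e.g. a 500).
async function scrapeWithRetries(url, retries = NUM_RETRIES) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const response = await axios.get('http://api.scraperapi.com/', {
        params: { api_key: API_KEY, url: url },
        timeout: 60000,
      });
      return response.data; // success -- return the HTML
    } catch (err) {
      console.warn(`Attempt ${attempt} for ${url} failed: ${err.message}`);
      if (attempt === retries) throw err; // give up after the final attempt
    }
  }
}

scrapeWithRetries('http://quotes.toscrape.com/page/1/')
  .then((html) => console.log(html))
  .catch((err) => console.error('All retries failed:', err.message));
```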
Use Multiple Concurrent Threads To Increase Scraping Speed
ScraperAPI is designed to allow you to increase your scraping from a couple of hundred pages per day to millions of pages per day, simply by changing your plan to have a higher concurrent thread limit.
The more concurrent threads you have, the more requests you can have active in parallel and the faster you can scrape.
If you are new to high-volume scraping, it can sometimes be a bit tricky to set up your code to maximize the number of concurrent threads available in your plan. So to make it as simple as possible to get set up, we’ve created an example scraper that you can easily adapt for your use case.
For the purposes of these examples we’re going to scrape Quotes to Scrape and console.log the scraped data; however, this code will work on any website (apart from the parsing logic, which is site-specific).
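Below is a sketch of one way to structure this: a small pool of worker promises pulling URLs from a shared queue, so the number of in-flight requests never exceeds your plan’s concurrency limit. It assumes axios and cheerio, and the CSS selectors reflect quotes.toscrape.com’s markup; adjust the concurrency, URLs, and parsing for your own plan and target site.

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const API_KEY = 'YOUR_API_KEY';
const CONCURRENCY = 5; // set this to the concurrent thread limit of your plan

// Build the list of pages to scrape (pages 1-10 of quotes.toscrape.com as an example).
const queue = [];
for (let page = 1; page <= 10; page++) {
  queue.push(`http://quotes.toscrape.com/page/${page}/`);
}

async function scrapeUrl(url) {
  const response = await axios.get('http://api.scraperapi.com/', {
    params: { api_key: API_KEY, url: url },
    timeout: 60000,
  });
  // Site-specific parsing: pull the quote text and author out of each .quote block.
  const $ = cheerio.load(response.data);
  $('.quote').each((i, el) => {
    console.log({
      text: $(el).find('.text').text(),
      author: $(el).find('.author').text(),
    });
  });
}

// Simple worker pool: each worker keeps taking URLs off the queue until it is empty,
// so at most CONCURRENCY requests are in flight at any one time.
async function worker() {
  while (queue.length > 0) {
    const url = queue.shift();
    try {
      await scrapeUrl(url);
    } catch (err) {
      console.error(`Failed to scrape ${url}: ${err.message}`);
    }
  }
}

Promise.all(Array.from({ length: CONCURRENCY }, () => worker()))
  .then(() => console.log('Done.'));
```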