In this guide, we’ll show you how to use Scraper API with a NodeJS Puppeteer scraper. We’ll walk through exactly how to integrate Scraper API, and the common mistakes users make, so you can get scraping as quickly as possible.

Full code examples can be found on GitHub here.

Not Recommended Method: Using The API Endpoint

A common issue we see users run into when trying to integrate Scraper API into a Puppeteer scraper is that they try to send their requests to our API endpoint. This isn’t the best integration method as you will run into two issues:

Issue #1 – Because you are sending a GET request to the Scraper API endpoint, if there are any other assets/endpoints that need to be requested by Puppeteer on the page, Puppeteer will look for these on the Scraper API endpoint, not the website’s server. 

For example, when making a request to quotes.toscrape.com, the page needs to request the relative URL /static/main.css. Under normal circumstances the browser would resolve this to http://quotes.toscrape.com/static/main.css, but when using the API endpoint it instead tries to request http://api.scraperapi.com/static/main.css, which returns an error. As a result, Puppeteer only gets access to the first HTML response (which largely defeats the purpose of using a headless browser).

Issue #2 – Similarly, if you want to navigate pages or emulate real user behaviour with Puppeteer, you will often have to click relative links, which won’t work when sending requests via the API endpoint for the same reasons as above.

As a result, we highly recommend that you don’t use the API endpoint if you want to integrate a NodeJS Puppeteer scraper with the API.

Recommended Method: Use Proxy Port

To correctly use Puppeteer with Scraper API you should use our proxy mode, like you would any other proxy. The Scraper API proxy port details are as follows:

javascript

// ScraperAPI proxy configuration
const PROXY_USERNAME = 'scraperapi';
const PROXY_PASSWORD = 'API_KEY'; // <-- enter your API key here
const PROXY_SERVER = 'proxy-server.scraperapi.com';
const PROXY_SERVER_PORT = '8001';

First, you need to set the --proxy-server flag to ScraperAPI’s proxy port in the browser options when launching the browser. You should also set the browser to ignore HTTPS errors.

javascript

const browser = await puppeteer.launch({
  ignoreHTTPSErrors: true,
  args: [
    `--proxy-server=http://${PROXY_SERVER}:${PROXY_SERVER_PORT}`
  ]
});

Then when a new page has been created, you need to authenticate the proxy with your username and password.

javascript

await page.authenticate({
  username: PROXY_USERNAME,
  password: PROXY_PASSWORD,
});

Once this is complete then you can use your Puppeteer scraper as you would normally. Whenever Puppeteer makes a request it will now send the request via our proxy port so your IP address will be hidden from the website you are scraping.

Note: Occasionally, conflicts can occur when using Puppeteer with our proxy port. If you experience any issues, such as certain resources/assets not being loaded, contact our support team and we can look into it for you.

Here is a full code example of a NodeJS Puppeteer scraper that scrapes quotes.toscrape.com using Scraper API as the proxy.
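Putting the steps above together, here is a minimal sketch of such a scraper. It assumes the standard quotes.toscrape.com markup (`.quote`, `.text`, and `.author` CSS classes) and uses an illustrative three-minute navigation timeout, since proxied requests can be slower than direct ones; the `buildProxyArg` helper is our own naming, not part of any library.

```javascript
const puppeteer = require('puppeteer');

// ScraperAPI proxy configuration
const PROXY_USERNAME = 'scraperapi';
const PROXY_PASSWORD = 'API_KEY'; // <-- enter your API key here
const PROXY_SERVER = 'proxy-server.scraperapi.com';
const PROXY_SERVER_PORT = '8001';

// Build the --proxy-server launch argument from the config above
function buildProxyArg(server, port) {
  return `--proxy-server=http://${server}:${port}`;
}

async function scrapeQuotes() {
  // Launch Chromium routing all traffic through the ScraperAPI proxy port
  const browser = await puppeteer.launch({
    ignoreHTTPSErrors: true,
    args: [buildProxyArg(PROXY_SERVER, PROXY_SERVER_PORT)],
  });

  const page = await browser.newPage();

  // Authenticate against the proxy with your API key
  await page.authenticate({
    username: PROXY_USERNAME,
    password: PROXY_PASSWORD,
  });

  // Generous timeout, as proxied requests can take longer than direct ones
  await page.goto('http://quotes.toscrape.com/', { timeout: 180000 });

  // Extract the text and author of every quote on the page
  const quotes = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.quote')).map((el) => ({
      text: el.querySelector('.text').innerText,
      author: el.querySelector('.author').innerText,
    }))
  );

  await browser.close();
  return quotes;
}

// Only run the scraper when this file is executed directly
if (require.main === module) {
  scrapeQuotes()
    .then((quotes) => console.log(quotes))
    .catch((err) => console.error(err));
}

module.exports = { buildProxyArg, scrapeQuotes };
```

Keeping the proxy details in one place like this makes it easy to swap in different ScraperAPI functionality (e.g. geotargeting) later by changing only the username string.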