Node.js is a fast-growing, easy-to-use runtime environment for JavaScript, which makes it perfect for web scraping: efficient, and with a low barrier to entry.

Because JavaScript is one of the most widely used and supported programming languages, it allows developers to scrape a wide variety of websites.

Plus, JavaScript works natively with JSON, which is the most commonly used data format across APIs.
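As a quick, self-contained illustration (plain Node.js, no libraries, with made-up sample data), going from an API-style JSON string to a JavaScript object and back is a one-liner each way:

```javascript
// Parse a JSON string (the shape many APIs return) into a plain object.
const rawResponse = '{"product": "organic sheet set", "salePrice": 99.0}';
const data = JSON.parse(rawResponse);

console.log(data.salePrice); // 99

// And the reverse: serialize an object back to JSON for storage or an API call.
const serialized = JSON.stringify({ title: data.product });
console.log(serialized); // {"title":"organic sheet set"}
```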

Today, we’re going to learn how to build a web scraper and make it find a specific string of data, no matter whether it is a static or a dynamic page.

If you read through to the end of our guide, we’ll teach you a simple trick to get around most major roadblocks you’ll encounter when scraping websites at scale.

Pre-requisites: What You Need to Know About Scraping Using JavaScript

This tutorial is for junior developers, so we’ll cover all the basics you need to understand to build your first JavaScript web scraper.

However, to get the most out of our guide, we would recommend that you:

  • Have experience with JavaScript (or you’re at least familiar with it)
  • Have basic knowledge of a web page structure, and
  • Know how to use DevTools to extract selectors of elements (optional)

Note: If you’ve never used JavaScript before, check out the W3Schools JavaScript tutorial, or for a more in-depth course, go through freeCodeCamp’s JavaScript course.

Of course, web scraping comes with its own challenges, but don’t worry. At the end of this article, we’ll show you a quick solution that’ll make your scraper run smoothly and hassle-free.

Knowing how to build a scraper from scratch is an essential step on your learning journey to becoming a master scraper, so let’s get started.

How to Build a JavaScript Web Scraper for Static Pages

Web scraping can be broken down into two basic steps:

  • Fetching the HTML source code and
  • Parsing the data to collect the information we need.
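Those two steps can be sketched in a few lines of plain JavaScript. Here the fetch step is mocked with a hardcoded HTML snippet (so nothing below touches a live site), and the parsing uses a simple regex where the real scraper will use a proper parser:

```javascript
// Step 1 (fetching) is mocked with a hardcoded HTML string;
// in the real scraper, Axios will download this for us.
const html = '<div class="product"><span class="sale-price">$99.00</span></div>';

// Step 2 (parsing): pull out the piece of data we care about.
// A real scraper should use a proper parser like Cheerio instead of a regex.
const match = html.match(/class="sale-price">([^<]+)</);
const salePrice = match ? match[1] : null;

console.log(salePrice); // $99.00
```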

We’ll explore how to do each of these by gathering the price of an organic sheet set from Turmerry’s website.

1. Install Node.js on your computer

First things first, go to https://nodejs.org/en/download/ to download Node.js and follow the prompts until it’s all done.

The download includes npm, the package manager for Node.js. npm will let us install the rest of the dependencies we need for our web scraper.

After the installation finishes, open your terminal and run node -v and npm -v to verify everything is working properly.

2. Get your workspace ready

After Node.js is installed, create a new folder called “firstscraper”, open a terminal inside it, and run npm init -y to initialize a package.json file.

Then we’ll install our dependencies by running npm install axios cheerio puppeteer and waiting a few minutes for them to install.

Note: installing Puppeteer will take a little longer, as it needs to download Chromium as well.

Axios is a promise-based HTTP client for Node.js that allows us to send a request to a server and receive a response. In simple terms, we’ll use Axios to fetch the HTML code of the web page.

On the other hand, Cheerio is a jQuery implementation for Node.js that makes it easier to select, edit, and view DOM elements.

We’ll talk more about the last library, Puppeteer, when we get to scraping dynamic pages later in this article.

3. Fetch the HTML code using Axios

With everything ready, click on “new file”, name it scraperapi.js, and type the following function to fetch the HTML of the product page we want to collect data from:

const axios = require('axios');
const url = 'https://www.turmerry.com/collections/organic-cotton-sheet-sets/products/percale-natural-color-organic-sheet-sets';
axios(url)
  .then(response => {
    const html = response.data;
    console.log(html);
  })
  .catch(console.error);

We use const axios = require('axios') to import Axios into our project, then declare const url and assign it the URL of the page we want to fetch.

Axios sends a request to the server and returns a response, which we store in const html so we can then print it to the console.
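If the .then()/.catch() chain looks unfamiliar, it’s ordinary promise handling. Here’s the same shape with a hypothetical fakeAxios stub standing in for the real network call, so you can run it offline:

```javascript
// Stub standing in for axios(url): resolves with an object that has a
// `data` property, mimicking the shape of an Axios response.
function fakeAxios(url) {
  return Promise.resolve({ data: `<html>fetched from ${url}</html>` });
}

fakeAxios('https://example.com')
  .then(response => {
    // .then() runs once the promise resolves; response.data holds the "HTML".
    const html = response.data;
    console.log(html);
  })
  .catch(console.error); // .catch() runs if the request fails
```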

After running the scraper using node scraperapi.js in the terminal, it will pull a long and unreadable string of HTML.

Now, let’s introduce cheerio to parse the HTML and only get the information we are interested in.

4. Select the elements you want to collect

Before we actually add cheerio to our project, we need to identify the elements we want to extract from the HTML.

To do this, we’ll use our browser’s dev tools.

First, we open Turmerry’s product page and press Ctrl + Shift + C to open the inspector tool.

There are two different prices on the page. In some cases, you might want to get both prices, but for this example, we want to collect the price they are really selling it for.

When clicking on the $99.00 price, the tool will take you to the corresponding line of code where you can get the element class.

The retail price has a sale-price class applied. Now that we have this information, we can go ahead and add cheerio to our file.

5. Parse the HTML using cheerio

First things first, add const cheerio = require('cheerio') at the top of your file to import the library into the project, then pass the HTML document to Cheerio using const $ = cheerio.load(html).

After loading the HTML, we’ll select the price’s CSS class using const salePrice = $('.sale-price').text(), which stores the text inside elements with that class in salePrice.

If you update your code, it should look something like this:

const axios = require('axios');
const cheerio = require('cheerio');
const url = 'https://www.turmerry.com/collections/organic-cotton-sheet-sets/products/percale-natural-color-organic-sheet-sets';

axios(url)
  .then(response => {
    const html = response.data;
    const $ = cheerio.load(html);
    const salePrice = $('.sale-price').text();
    console.log(salePrice);
  })
  .catch(console.error);

After running your program, it will print the content tagged as .sale-price to the console.

If we provide more URLs, we’ll be able to collect all selling prices for all products in a fraction of the time it would take us to do it by hand.

Note: if you’d like to dive deeper into this library, check cheerio’s documentation.

Scraping Dynamic Pages? Here’s What to Do

When you navigate to a dynamic page, your browser will need to render JavaScript to access the content before displaying it for you. This might not be an issue for us, but it is for our scraper.

Axios, and any HTTP client for that matter, will not wait for the JavaScript to execute before returning a response. So if we use our scraper as it is right now, we won’t really get anything.

Here’s where Puppeteer will come in handy.

In simple terms, Puppeteer is a Node.js library that allows you to control a headless Chromium browser directly from your code. You can do pretty much anything you can imagine: scrolling down, clicking, taking screenshots, and more.

If the content you want to scrape won’t load until you execute a script by clicking on a button, you can script these actions using Puppeteer and make the data available for your scraper to take.

For this example, let’s say that you want to create new content around web scraping and thought to scrape the r/webscraping subreddit for ideas by collecting the titles of the posts.

If we inspect this subreddit, we’ll notice a few things right away: first, classes are randomly generated, so there’s no sense in trying to latch on to them; and second, the titles are tagged as H3, but they are wrapped in anchor tags, with a div between the anchor and the h3.

Let’s create a new js file, name it scraperapi2.js, and add const puppeteer = require('puppeteer') to import the library into the project. Then, add the following code:

const cheerio = require('cheerio');
const puppeteer = require('puppeteer');

let scraped_headlines = [];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

First, we’re using let scraped_headlines = [] to declare where our data will be stored. Then, we write an async function so we can use the await operator.

Note: although you could build a scraper using .then() callbacks, it would limit your scraper’s scalability and make it harder to scrape more than one page at a time.

After that’s set, we’re telling Puppeteer to launch the browser, wait (await) for the browser to be launched, and then open a new page.
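Here’s a stand-alone sketch of why async/await matters for scalability: with await and Promise.all, several page loads run concurrently while the code still reads top to bottom. The fetchPage function below is a stub that resolves after a short delay, standing in for a real Puppeteer page load:

```javascript
// Stub that simulates a page load; a real version would drive Puppeteer.
function fetchPage(url) {
  return new Promise(resolve =>
    setTimeout(() => resolve(`<html>content of ${url}</html>`), 10)
  );
}

// async/await lets us fetch several pages concurrently (via Promise.all)
// while keeping the control flow readable, which gets awkward fast
// with nested .then() chains.
async function scrapeAll(urls) {
  const pages = await Promise.all(urls.map(url => fetchPage(url)));
  return pages.map(html => html.length);
}

scrapeAll(['https://example.com/a', 'https://example.com/b'])
  .then(lengths => console.log(lengths)); // one length per URL
```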

Now, let’s open a try statement and add the next block of code to tell the browser which URL to go to, and have Puppeteer grab the HTML after the page renders:

try {
  await page.goto('https://www.reddit.com/r/webscraping/', {timeout: 180000});
  let bodyHTML = await page.evaluate(() => document.body.innerHTML);

We are already familiar with the next step. Because we got the HTML document, we’ll need to send it to Cheerio so we can use our CSS selectors and get the content we need:

try {
  await page.goto('https://www.reddit.com/r/webscraping/', {timeout: 180000});
  let bodyHTML = await page.evaluate(() => document.body.innerHTML);
  let $ = cheerio.load(bodyHTML);
  let article_headlines = $('a[href*="/r/webscraping/comments"] > div');

  article_headlines.each((index, element) => {
    const title = $(element).find('h3').text();
    scraped_headlines.push({ title: title });
  });
}

In this case, we added 'a[href*="/r/webscraping/comments"] > div' as the first selector to tell our scraper where to look for the data, before asking it to return the text within the H3 tag with const title = $(element).find('h3').text().

Lastly, we use the push() method to store each title as an object with a title key. This makes the results easier to read and to serialize later.
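Stripped of the scraping parts, that accumulation pattern is plain JavaScript (the sample titles below are made up):

```javascript
// Each scraped headline is stored as a small object, so the final
// array is easy to serialize to JSON or write to a file later.
const scraped_headlines = [];

for (const title of ['First post', 'Second post']) {
  scraped_headlines.push({ title: title });
}

console.log(JSON.stringify(scraped_headlines));
// [{"title":"First post"},{"title":"Second post"}]
```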

After updating your code, it will look like this:

const cheerio = require('cheerio');
const puppeteer = require('puppeteer');

let scraped_headlines = [];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  try {
    await page.goto('https://www.reddit.com/r/webscraping/', {timeout: 180000});
    let bodyHTML = await page.evaluate(() => document.body.innerHTML);
    let $ = cheerio.load(bodyHTML);
    let article_headlines = $('a[href*="/r/webscraping/comments"] > div');

    article_headlines.each((index, element) => {
      const title = $(element).find('h3').text();
      scraped_headlines.push({ title: title });
    });
  } catch (err) {
    console.log(err);
  }

  await browser.close();
  console.log(scraped_headlines);
})();

You can now test your code using node scraperapi2.js. It will return all the H3 titles it can find on the rendered page.

Note: For a more in-depth look at this library, here’s Puppeteer’s documentation.

Using Scraper API for Faster Data Scraping

As you’ve seen in this tutorial, building a web scraper is a straightforward project with a lot of potential. The real problem with homemade scrapers is scalability.

One of the challenges you’ll be facing is handling CAPTCHAs, for example. While running your program, your IP address can get flagged as a fraudulent user, getting your IP banned.

If you run your scraper on a server hosted in a data center, you’re even more likely to be blocked instantly. Why? Because data center IPs are less trusted, getting your requests flagged as “non-person requests”.

All these roadblocks are what make web scraping so challenging, and they’re why we developed an API that can handle all of them automatically: from rotating IP addresses and handling CAPTCHAs to rendering JavaScript, ScraperAPI handles everything with a single API call.

To learn how to integrate Scraper API with our scrapers, we’ll need to create a new Scraper API account. You’ll get 1000 free API calls right off the bat, so you have more than enough to try the API out.

After signing up, you’ll get access to your API key and some sample code for you to use as a reference.

With that set, let’s start with our Axios scraper:

Integrating Scraper API with Axios requests

This is super straightforward. All we need to do is add our API key as a const and then point Axios at the Scraper API endpoint:

const axios = require('axios');
const cheerio = require('cheerio');
const API_KEY = 'YOUR_API_KEY';
const url = 'https://www.turmerry.com/collections/organic-cotton-sheet-sets/products/percale-natural-color-organic-sheet-sets';

axios('http://api.scraperapi.com/', {params: {
    'url': url,
    'api_key': API_KEY,
}})

Now, every request will go through Scraper API and return the HTML, which we can then pass to Cheerio as before.

Also, because it’s fully integrated with our scraper, we can add other parameters to our code to add more functionalities through the API.

An excellent use case is scraping a JavaScript site without using a headless browser.

By setting render=true, Scraper API will use a headless Chromium browser to execute the scripts and return the fully loaded HTML.
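If you prefer to build the request URL by hand instead of passing a params object to Axios, Node’s built-in URLSearchParams handles the encoding. The endpoint and parameter names below come from the snippets in this article; the key is a placeholder:

```javascript
// Assemble the Scraper API request URL manually using the built-in URLSearchParams.
const params = new URLSearchParams({
  api_key: 'YOUR_API_KEY', // placeholder: substitute your own key
  url: 'https://www.turmerry.com/collections/organic-cotton-sheet-sets/products/percale-natural-color-organic-sheet-sets',
  render: 'true', // ask the API to execute JavaScript before returning the HTML
});

const requestUrl = `http://api.scraperapi.com/?${params.toString()}`;
console.log(requestUrl);
```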

Here’s the final Axios + Cheerio code:

const axios = require('axios');
const cheerio = require('cheerio');
const API_KEY = 'YOUR_API_KEY';
const render = 'true';
const url = 'https://www.turmerry.com/collections/organic-cotton-sheet-sets/products/percale-natural-color-organic-sheet-sets';

axios('http://api.scraperapi.com/', {params: {
    'url': url,
    'api_key': API_KEY,
    'render': render
}})
  .then(response => {
    const html = response.data;
    const $ = cheerio.load(html);
    const salePrice = $('.sale-price').text();
    console.log(salePrice);
  })
  .catch(console.error);

Integrating Scraper API with Puppeteer

You’re probably thinking: if I can render JavaScript with Scraper API, why would I need a Puppeteer implementation?

The reason is simple. Scraper API will execute the JavaScript necessary for the page to load. However, if you need to interact with the page (like to scroll or click a button), you’ll need to use your own headless browser – in this case, Puppeteer.

For this example, we’ll add the following code to set our proxy configuration, right after declaring our dependencies:

const cheerio = require('cheerio');
const puppeteer = require('puppeteer');

const PROXY_USERNAME = 'scraperapi';
const PROXY_PASSWORD = 'API_KEY'; // <-- enter your API key here
const PROXY_SERVER = 'proxy-server.scraperapi.com';
const PROXY_SERVER_PORT = '8001';

Next, we set our scraper to use Scraper API as a proxy within our async function:

(async () => {
  const browser = await puppeteer.launch({
      ignoreHTTPSErrors: true,
      args: [
          `--proxy-server=http://${PROXY_SERVER}:${PROXY_SERVER_PORT}`
      ]
  });
  const page = await browser.newPage();
  await page.authenticate({
      username: PROXY_USERNAME,
      password: PROXY_PASSWORD,
    });

After updating your code, it should look like this:

const cheerio = require('cheerio');
const puppeteer = require('puppeteer');

const PROXY_USERNAME = 'scraperapi';
const PROXY_PASSWORD = 'YOUR_API_KEY';
const PROXY_SERVER = 'proxy-server.scraperapi.com';
const PROXY_SERVER_PORT = '8001';

let scraped_headlines = [];

(async () => {
  const browser = await puppeteer.launch({
      ignoreHTTPSErrors: true,
      args: [
          `--proxy-server=http://${PROXY_SERVER}:${PROXY_SERVER_PORT}`
      ]
  });

  const page = await browser.newPage();
  await page.authenticate({
      username: PROXY_USERNAME,
      password: PROXY_PASSWORD,
  });

  try {
    await page.goto('https://www.reddit.com/r/webscraping/', {timeout: 180000});
    let bodyHTML = await page.evaluate(() => document.body.innerHTML);
    let $ = cheerio.load(bodyHTML);
    let article_headlines = $('a[href*="/r/webscraping/comments"] > div');

    article_headlines.each((index, element) => {
      const title = $(element).find('h3').text();
      scraped_headlines.push({ title: title });
    });
  } catch (err) {
    console.log(err);
  }

  await browser.close();
  console.log(scraped_headlines);
})();

And there you go, your scraper is ready to use!

To get the most out of your account, you can follow this Scraper API cheat sheet. There you’ll find best practices for web scraping with our API, plus a more detailed look at the major challenges you’ll face.

Also, save this puppeteer integration with Scraper API sample code to follow every time you’re building a new project. It will definitely cut some coding time.

We hope you enjoyed this tutorial and that you learned a thing or two from it. If you have any questions, don’t hesitate to contact our support team; they’ll be happy to help.

Happy scraping!