Today, websites implement more advanced anti-scraping measures, requiring developers to find more innovative approaches to bypass detection while maintaining efficiency.
Puppeteer helps with exactly that: its headless browser control lets you programmatically drive and interact with web pages the way a regular user would.
In this guide, you’ll learn how to use Puppeteer, a headless Chrome automation library, with Node.js to scrape web pages efficiently. We’ll cover everything from how Puppeteer works to:
- Setting up a Node.js project
- Handling dynamic content
- Working with proxies in Puppeteer
- Bypassing common roadblocks when building a web scraper
Let’s get started!
TL;DR: Puppeteer Web Scraping Basics
Scraping a web page with Puppeteer can be done in three steps:
- Create a browser instance and open a new tab.
- Set the URL address in the browser tab and navigate to it.
- Once the page is loaded, download its HTML content.
const puppeteer = require('puppeteer');

const PAGE_URL = "https://www.timeanddate.com/weather/";

const main = async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(PAGE_URL);
  const html = await page.content();
  await browser.close();
  console.log(html);
}

void main();
You can now use a DOM parser to build the page’s HTML structure and extract its content.
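For example, with a DOM parser such as Cheerio (which we use later in this guide), extracting the page title from the downloaded HTML takes only a few lines; this is a minimal sketch assuming Cheerio is installed:
const cheerio = require('cheerio');

// `html` is the string returned by page.content() in the snippet above
const $ = cheerio.load(html);
console.log($('title').text()); // Prints the text of the page's <title> tag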
What is Puppeteer, and how does it work?
Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium browsers and is widely used for tasks like web scraping, browser automation, automated testing, and generating PDFs/screenshots of web pages.
Puppeteer can operate in either headless mode (without a visible browser window) or headful mode (with a visible browser UI), depending on your needs.
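For illustration, here is a small sketch of how you might toggle between the two modes through the launch options (the launchBrowser helper name is our own):
const puppeteer = require('puppeteer');

const launchBrowser = async (debug = false) => {
  // headless: false opens a visible browser window (headful), handy for debugging;
  // headless: true runs without a UI, which is what most scrapers use.
  return puppeteer.launch({ headless: !debug });
};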
Why Puppeteer for Web Scraping in 2025?
Puppeteer remains a top choice for JavaScript-heavy sites and browser automation because it is designed to work seamlessly with Chromium-based browsers and has several key advantages that make it stand out:
- You can fully control a browser: emulate user interactions, execute JavaScript in the page, take screenshots, and read or write cookies and local storage (see the sketch after this list).
- You can interact with heavy websites with dynamic content and wait for selectors, network activity, or arbitrary JavaScript conditions before proceeding.
- You can use headless mode, which disables the graphical user interface, so your web scraper can drive the browser instance without displaying a window.
- Wide adoption and a large developer community keep Puppeteer well maintained, make it easy to find help, and provide a set of plugins that extend its features.
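As a quick illustration of the first point, here is a minimal sketch (the URL is just an example) that takes a screenshot, runs JavaScript inside the page, and reads its cookies:
const puppeteer = require('puppeteer');

const main = async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');

  await page.screenshot({ path: 'example.png' });          // Capture the page as an image
  const title = await page.evaluate(() => document.title); // Execute JavaScript in the page
  const cookies = await page.cookies();                    // Read the page's cookies

  console.log(title, cookies.length);
  await browser.close();
};

void main();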
Project Requirements for Scraping with Puppeteer
Puppeteer runs inside a JavaScript runtime such as Node.js, so you must install Node.js first. At the time of writing, the latest long-term support (LTS) version of Node.js is 22.13.0.
Install Node.js
Run the following commands to install Node.js on Linux or macOS:
# Download and install nvm:
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
# Download and install Node.js:
nvm install 22.13.0
# Verify the Node.js version:
node -v # Should print "v22.13.0".
nvm current # Should print "v22.13.0".
# Verify npm version:
npm -v # Should print "10.9.2".
To install Node.js on Windows, run the following commands:
# Download and install fnm:
winget install Schniz.fnm
# Download and install Node.js:
fnm install 22.13.0
# Verify the Node.js version:
node -v # Should print "v22.13.0".
# Verify npm version:
npm -v # Should print "10.9.2".
Install Puppeteer
Puppeteer is published on npm (the Node.js package manager) as two packages:
- Puppeteer: the complete package. It contains the scraping library and also downloads a compatible version of Google Chrome during installation.
- Puppeteer Core: a lightweight package containing only the library, without a bundled browser.
Note: Install Puppeteer Core if you don't want Puppeteer to manage the browser for you and you only need to connect to a browser runtime that supports the DevTools Protocol.
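For example, with Puppeteer Core you point the library at a browser you already manage; the executable path below is only an illustration and will differ on your machine:
const puppeteer = require('puppeteer-core');

const main = async () => {
  // puppeteer-core does not download a browser, so you must tell it where to find one
  // (the path below is an example for a typical Linux install of Google Chrome).
  const browser = await puppeteer.launch({ executablePath: '/usr/bin/google-chrome' });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(await page.title());
  await browser.close();
};

void main();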
Run the following command to install Puppeteer:
npm install puppeteer
Depending on the operating system, the installation will download between 170MB and 280MB.
How to Use Puppeteer with Node.js for Web Scraping
To showcase Puppeteer's web scraping features, we will build a web scraper that retrieves data about Eventbrite's online events.
This data is loaded dynamically with JavaScript, which makes it an interesting use case for web scraping.
For each online event, we will retrieve the following information:
- The name
- The date
- The price
- The organizer
- The followers count
- The link
We will store the data extracted from the website in a JSON file.
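For reference, a single record in that JSON file will look roughly like the following; the values are placeholders, and the actual text depends on the events listed when you run the scraper:
{
  "title": "Sample online event",
  "link": "https://www.eventbrite.com/e/...",
  "date": "Tomorrow at 10:00 AM",
  "price": "Free",
  "organizer": "Sample organizer",
  "followers": "1.2k followers"
}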
Step 1: Set up the Node.js project
Let’s start by creating a folder that will hold the web scraper’s code source and initializing a Node.js project.
mkdir puppeteer-web-scraping
cd puppeteer-web-scraping
npm init -y
The last command will create a package.json file in the folder. Next, create a file index.js and add simple JavaScript instructions inside.
touch index.js
echo "console.log('Hello world!');" > index.js
Run the file index.js using the Node.js runtime.
node index.js
This execution will print the message Hello world! in the terminal.
Step 2: Install dependencies
We will need the following two Node.js packages to build the web scraper:
- Puppeteer: We will use it to open the web page, wait for the events to be loaded dynamically, and download the HTML content.
- Cheerio: We will use it to extract the information from the HTML content downloaded by Puppeteer.
Run the command below to install these packages:
npm install puppeteer cheerio
Step 3: Identify the DOM selectors to target
Navigate to the Eventbrite online events page and wait for the events to load. Once the events are loaded, inspect the page to display the HTML structure and identify the DOM selector associated with the HTML tag wrapping the information we want to extract.
Based on this inspection, we can define the DOM selectors the web scraper will use to target and extract each event property. The table below maps each event property to its DOM selector.
| Information | DOM selector |
| --- | --- |
| Name | .event-card-details h3 |
| Date | .event-card-details a + p |
| Price | .event-card-details div > div > p:first-child |
| Organizer | .event-card-details div > div + div > p |
| Followers count | .event-card-details div > div + div > p + div |
| Event link | .event-card-details a |
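Before wiring these selectors into the scraper, you can sanity-check them directly in the browser's DevTools console, for example:
// Run in the DevTools console on the Eventbrite page:
document.querySelectorAll('.event-card-details').length;        // How many event cards the selector matches
document.querySelector('.event-card-details h3')?.textContent;  // The name of the first event found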
Step 4: Scrape the web page with Puppeteer
Using Puppeteer, we will navigate to the web page and download the content as HTML. Replace the content of the index.js file with the following code:
const puppeteer = require('puppeteer');

const PAGE_URL = 'https://eventbrite.com/d/online/events/';

const main = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(PAGE_URL, { waitUntil: 'networkidle0' });
  await page.waitForSelector('.eds-icon--xsmall');
  const html = await page.evaluate(() => document.body.innerHTML);
  await browser.close();
  console.log(html);
}

void main();
Let's highlight two things in the above code:
- When navigating to the page, we pass the option waitUntil: 'networkidle0' so that Puppeteer waits until there are no pending network requests before considering the navigation finished. This gives the dynamic content time to load.
- Because that option is not always enough, we also call waitForSelector to wait for a specific selector to be present in the DOM; this selector usually belongs to the dynamically loaded content. You can also wait on an arbitrary JavaScript condition, as shown in the sketch below.
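If no single selector is reliable, you can also wait on an arbitrary JavaScript condition with waitForFunction; here is a minimal sketch, where the threshold of 10 event cards is an arbitrary choice:
// Inside the async main function, after page.goto(...):
// wait until at least 10 event cards are present in the DOM
await page.waitForFunction(
  () => document.querySelectorAll('.event-card-details').length >= 10,
  { timeout: 30000 }
);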
Once the web page is fully loaded, we download its content and print it in the console. Run the code with the command node index.js; the page's HTML will be printed in the terminal.
Step 5: Extract information from the web page
Our web scraper can download the content of the web page with Puppeteer. Now, we must parse and extract the data inside. We installed the Node.js library Cheerio to handle this.
Concretely, we will load the HTML content into Cheerio, which will parse it and build the DOM tree. This lets us retrieve the content by targeting the DOM selectors we identified in step three.
Replace the content of the index.js file with the code below:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

const PAGE_URL = 'https://eventbrite.com/d/online/events/';

const main = async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  console.log('Downloading page...');
  await page.goto(PAGE_URL, { waitUntil: 'networkidle0' });
  await page.waitForSelector('.eds-icon--xsmall');
  const html = await page.evaluate(() => document.body.innerHTML);
  await browser.close();

  console.log('Extracting data from the page...');
  const $ = cheerio.load(html);
  const eventDomElements = $('.event-card-details');
  const events = [];

  eventDomElements.each((_, element) => {
    const title = $(element).find('h3').text();
    const link = $(element).find('a').attr('href');
    const date = $(element).find('a + p').text();
    const price = $(element).find('div > div > p:first-child').eq(0).text();
    const organizer = $(element).find('div > div + div > p').text();
    const followers = $(element).find('div > div + div > p + div').text();

    events.push({
      title,
      link,
      date,
      price,
      organizer,
      followers
    });
  });

  console.log('Events extracted:', events.length);
  console.log(events);
}

void main();
We push each extracted event into an array; once all elements have been processed, the array is printed in the console. Run the code to see the result.
Step 6: Save the data scraped into a file
To make the extracted data usable, we should export it in a format that can easily be shared with third parties, stored in a database, used to enrich other datasets, or exposed through a web API.
We will store the extracted data in a JSON file named online-events.json. Update the file index.js to add the code below:
const fs = require('node:fs');
// Existing package imports here...

const main = async () => {
  // Existing code here...

  console.log('Saving data to file...');
  fs.writeFileSync('./online-events.json', JSON.stringify(events, null, 2));
};

void main();
Re-run the code to see the JSON file in the project directory containing the events data extracted with the web scraper.
Advanced Puppeteer Features
So far, we have used Puppeteer's basic web scraping features: loading a dynamic web page and downloading its HTML. Puppeteer can also perform richer interactions to retrieve more data, such as infinite scrolling, crawling, clicking, and typing into inputs.
Perform Infinite Scrolling with Puppeteer
Infinite scrolling consists of loading and displaying more data to users as they scroll on a web page; most social media platforms display posts this way.
To show how infinite scrolling works with Puppeteer, let's retrieve data from this website listing remote jobs. We will retrieve the job title, the company, the salary, and the publish date.
Before applying the scraping steps we showed earlier, you must identify the page's scroll container and its selector. If you can't find one, you can scroll the window object instead, as we do below.
Create a new file, infinite-scroll.js, and add the code below:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

const PAGE_URL = 'https://remoteok.com/';

const waitFor = (timeInMs) => new Promise(r => setTimeout(r, timeInMs));

const main = async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(PAGE_URL, { waitUntil: 'networkidle0' });

  const TOTAL_SCROLL = 5;
  let scrollCounter = 0;

  while (scrollCounter < TOTAL_SCROLL) {
    await page.evaluate(() => {
      const SCROLL_STEP = 2000;
      window.scrollBy(0, SCROLL_STEP);
    });

    await waitFor(1000);
    scrollCounter++;
  }

  await waitFor(2000);

  const html = await page.content();
  await browser.close();

  console.log('Extracting data from the page...');
  const $ = cheerio.load(html);
  const jobDomElements = $('.job');
  const jobs = [];

  jobDomElements.each((_, element) => {
    const title = $(element).find('.company_and_position h2').text().replaceAll(/[\t\n]+/g, '');
    const company = $(element).find('.companyLink').text().replaceAll(/[\t\n]+/g, '');
    const salary = $(element).find('.company_and_position .location:last-child').text().replaceAll(/[\t\n]+/g, '');
    const date = $(element).find('.time').text().replaceAll(/[\t\n]+/g, '');

    jobs.push({
      title,
      company,
      salary,
      date
    });
  });

  console.log(`Number of jobs extracted: ${jobs.length}`);
  console.log(jobs);
}

void main();
After navigating to the web page, we define the number of scrolls to perform in the variable TOTAL_SCROLL and the scroll height in pixels in the variable SCROLL_STEP. Every second, we perform a scroll with window.scrollBy(0, SCROLL_STEP) until we reach the total number of scrolls.
Note: Increase the total scroll or the scroll height to load more data on the web page.
Run the command node infinite-scroll.js to execute the file and see the output.
Without the infinite scroll, we would have extracted only around 30 jobs.
Note: To scroll a specific element instead of the whole window, use document.querySelector("").scrollBy(0, SCROLL_STEP), passing the container's selector to querySelector.
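If you don't know in advance how many scrolls are needed, another common pattern is to keep scrolling until the page height stops growing. Here is a sketch of that idea (the one-second pause and the cap of 20 attempts are arbitrary), reusing the waitFor helper from infinite-scroll.js:
// Inside main(), after page.goto(...):
let previousHeight = 0;

for (let i = 0; i < 20; i++) {
  const currentHeight = await page.evaluate(() => document.body.scrollHeight);
  if (currentHeight === previousHeight) break; // Nothing new was loaded, stop scrolling

  previousHeight = currentHeight;
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await waitFor(1000); // Give the page time to load more content
}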
Perform Crawling with Puppeteer
Crawling with Puppeteer involves visiting multiple web pages, extracting data, and optionally following links to crawl deeper into a website. This is helpful for websites with dynamic content.
Finding a good generic example of crawling with Puppeteer is not straightforward because the pages reached through links rarely share the same structure, which makes it hard to write a single data extraction function.
For example, the code below shows how to use Puppeteer to build a web crawler. It takes a URL and retrieves the content of links with a maximum depth of two.
const puppeteer = require('puppeteer');

const PAGE_URL = 'https://example.com';
const MAX_DEPTH = 2;

const pageContents = [];

const main = async () => {
  const visitedUrls = new Set();

  const crawl = async (url, depth) => {
    if (visitedUrls.has(url) || depth > MAX_DEPTH) {
      return;
    }

    console.log(`Crawling: ${url} (Depth: ${depth})`);
    visitedUrls.add(url);

    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    try {
      await page.goto(url, { waitUntil: 'domcontentloaded' });

      // Save the page's URL and HTML content
      const content = await page.content();
      pageContents.push({ url, content });

      // Extract all links on the page
      const links = await page.$$eval('a', (anchors) =>
        anchors.map((a) => a.href).filter((href) => href.startsWith('http'))
      );

      console.log(`Found ${links.length} links`);

      // Close the browser to save resources
      await browser.close();

      // Recursively crawl the found links
      for (const link of links) {
        await crawl(link, depth + 1);
      }
    } catch (error) {
      console.error(`Error crawling ${url}:`, error.message);
      await browser.close();
    }
  };

  await crawl(PAGE_URL, 1);
};

void main();
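One design note: the crawler above launches a new browser for every URL, which is simple but slow. Reusing a single browser instance and opening one page per URL is usually more efficient; here is a minimal sketch of that variation (the crawlPage helper is our own naming):
const puppeteer = require('puppeteer');

const main = async () => {
  // Launch the browser once and reuse it for every crawled URL
  const browser = await puppeteer.launch({ headless: true });

  const crawlPage = async (url) => {
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      return await page.content();
    } finally {
      await page.close(); // Close the tab but keep the browser running
    }
  };

  const html = await crawlPage('https://example.com');
  console.log(html.length);

  await browser.close();
};

void main();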
Perform Clicking and Form Input with Puppeteer
On the remote jobs website, let's say we want to search for Content Writing jobs. We will type the keyword into the search input and click the first item in the suggestions list.
To achieve this, we must find the DOM selector for the search input and the search result list.
Create a new file, page-interaction.js, and add the code below:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

const PAGE_URL = 'https://remoteok.com/';

const waitFor = (timeInMs) => new Promise(r => setTimeout(r, timeInMs));

const main = async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(PAGE_URL, { waitUntil: 'networkidle0' });

  // Type into search input
  await page.type('input.search', 'content writing');

  // Wait for search results to appear
  const searchResultSelector = '.search-filter-results';
  await page.waitForSelector(searchResultSelector);

  // Click on first result
  await page.click('.search-filter-results div:first-child');

  await waitFor(2000);

  const html = await page.content();
  await browser.close();

  console.log('Extracting data from the page...');
  const $ = cheerio.load(html);
  const jobDomElements = $('.job');
  const jobs = [];

  jobDomElements.each((_, element) => {
    const title = $(element).find('.company_and_position h2').text().replaceAll(/[\t\n]+/g, '');
    const company = $(element).find('.companyLink').text().replaceAll(/[\t\n]+/g, '');
    const salary = $(element).find('.company_and_position .location:last-child').text().replaceAll(/[\t\n]+/g, '');
    const date = $(element).find('.time').text().replaceAll(/[\t\n]+/g, '');

    jobs.push({
      title,
      company,
      salary,
      date
    });
  });

  console.log(`Number of jobs extracted: ${jobs.length}`);
  console.log(jobs);
}

void main();
Run the command node page-interaction.js to execute the file and see the output.
Now, you can build a robust web scraper by adding all these advanced Puppeteer features.
Common Challenges When Web Scraping with Puppeteer & Solutions
When doing web scraping with Puppeteer, there are some challenges to consider. Addressing them will help build an efficient web scraper.
Web Scraping Speed
Web scrapers are valuable because they can collect a huge amount of data in far less time than a human. The drawback is that performing actions this quickly does not mimic normal user behavior, and collecting data at this pace can overload the website's servers, making the site slow or unavailable.
Websites can put in place solutions to detect these behaviors and prevent the web scraper from working, such as:
- Returning an error if the web scraper IP address sends too many requests in a short time range.
- Blocking the IP address used by the web scraper to send a request so it cannot send requests anymore without changing the IP address.
- Implementing device fingerprinting to block scrapers making too many requests with the same fingerprint.
To solve these challenges, we need to understand how bot detection works and then create measures to bypass them.
Bot Detection
Companies that own websites build solutions to protect themselves from web scraping. Their first challenge is distinguishing human behavior from bot behavior.
Here are some ways websites can spot your bots:
- Verify the User Agent in the request headers to ensure the request comes from a real web browser. In our web scraper, we can set a User Agent before sending requests.
- Identify unusual behavior such as very fast scrolling, clicking, typing, or navigation between pages. In our web scraper, we can add a delay between two actions (both countermeasures are illustrated in the sketch after this list).
- Ask the user to pass a CAPTCHA challenge. Because a web scraper expects a specific page structure, putting a CAPTCHA page in front of it makes the scraper fail to retrieve data from the web page.
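As an illustration of the first two countermeasures, here is a minimal sketch that sets a User Agent and pauses for a random interval between actions; the User Agent string and the delay range are arbitrary example values:
const puppeteer = require('puppeteer');

// Random pause between actions so the scraper looks less robotic (range is arbitrary)
const randomDelay = (min = 500, max = 2000) =>
  new Promise((resolve) => setTimeout(resolve, min + Math.random() * (max - min)));

const main = async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Present a regular desktop browser User Agent (example string)
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );

  await page.goto('https://example.com');
  await randomDelay(); // Pause before the next interaction

  await browser.close();
};

void main();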
Addressing these challenges when building a web scraper can be tedious and time-consuming. Using a proxy can help you overcome these challenges.
Using Proxies with Puppeteer
With the web scraper we built earlier, let's try to scrape another job-posting website to retrieve jobs in Houston.
Instead of the targeted page content, we end up downloading the content of a CAPTCHA validation page. To overcome this challenge, let's use Puppeteer with a proxy.
ScraperAPI provides a proxy service you can combine with Puppeteer. Puppeteer forwards each request to the proxy, which sends it to the website while bypassing bot detection methods, then returns the response so Puppeteer can render the page and interact with it.
To use ScraperAPI, create an account and get 5,000 free API credits to get started. In your ScraperAPI dashboard, retrieve the following proxy information:
- The host
- The port
- The username
- The password.
Create a new file, scraper-proxy.js, and add the code below:
const puppeteer = require('puppeteer');

const PROXY_USERNAME = 'scraperapi';
const PROXY_PASSWORD = 'API_KEY'; // <-- enter your API_Key here
const PROXY_SERVER = 'proxy-server.scraperapi.com';
const PROXY_SERVER_PORT = '8001';

const PAGE_URL = 'https://www.monster.com/jobs/l-houston-tx';

const main = async () => {
  const browser = await puppeteer.launch({
    ignoreHTTPSErrors: true,
    args: [
      `--proxy-server=http://${PROXY_SERVER}:${PROXY_SERVER_PORT}`
    ]
  });

  const page = await browser.newPage();

  await page.authenticate({
    username: PROXY_USERNAME,
    password: PROXY_PASSWORD,
  });

  console.log('Downloading page...');
  await page.goto(PAGE_URL, { waitUntil: 'networkidle0' });

  const html = await page.content();
  await browser.close();

  console.log(html);
}

void main();
Run the command node scraper-proxy.js to execute the file and see the output.
With the help of the proxy, we can get the web page content.
Using Puppeteer Stealth
Puppeteer Stealth is a set of techniques applied to minimize the chance of websites detecting and blocking Puppeteer headless browsers. These techniques mask common characteristics of headless browsers.
Puppeteer Stealth will make the browser behave more like a human browser by hiding headless browser characteristics, customizing the User-Agent, mocking browser permissions, adding more mouse and keyboard simulation, etc.
To use Puppeteer Stealth, you must install two Node.js packages:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
Here is the code to scrape a website with Puppeteer Stealth:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

const PAGE_URL = 'https://eventbrite.com/d/online/events/';

puppeteer.use(StealthPlugin());

const main = async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(PAGE_URL);

  const html = await page.content();
  await browser.close();

  console.log(html);
};

void main();
Simplify Your Puppeteer Web Scraping with ScraperAPI
So far, we used Puppeteer to load web pages, perform interactions on the page, and download the content. ScraperAPI offers an API that simplifies web scraping on dynamic websites. It is called the Rendering Instruction set.
Using the JSON format, you can describe actions to perform on a web page during rendering. These actions include:
- Scrolling
- Clicking
- Typing in an input
- Waiting for a selector
And more.
Once the rendering instruction object is ready, you can request ScraperAPI to execute the task and return the result.
Take the example where we wanted to search for Content Writing jobs and retrieve them. Here are the interactions we did:
- Type the text “Content writing” into the search input.
- Wait for the search results list to appear.
- Click on the first item of the results list.
Here is the rendering instruction set object for this interaction:
[
{ type: 'input', selector: { type: 'css', value: 'input.search' }, value: 'content writing' },
{ type: 'wait_for_selector', selector: { type: 'css', value: '.search-filter-results' } },
{ type: 'click', selector: { type: 'css', value: '.search-filter-results div:first-child' } }
]
Let's use it in an API request to ScraperAPI to execute the actions and retrieve the content. Create a file rendering-instruction-set.js and add the code below:
// Node.js 18+ provides a global fetch, so no extra HTTP library is required
const API_KEY = 'API_KEY'; // <-- enter your API_Key here
const PAGE_URL = 'https://remoteok.com';

const INSTRUCTIONS = [
  { type: 'input', selector: { type: 'css', value: 'input.search' }, value: 'content writing' },
  { type: 'wait_for_selector', selector: { type: 'css', value: '.search-filter-results' } },
  { type: 'click', selector: { type: 'css', value: '.search-filter-results div:first-child' } }
];

const headers = {
  'x-sapi-api_key': API_KEY,
  'x-sapi-render': 'true',
  'x-sapi-instruction_set': JSON.stringify(INSTRUCTIONS)
};

const main = async () => {
  try {
    const response = await fetch(`https://api.scraperapi.com?url=${encodeURIComponent(PAGE_URL)}`, {
      method: 'GET',
      headers: headers
    });

    const html = await response.text();
    console.log(html);
  } catch (error) {
    console.error('Error fetching data:', error);
  }
};

void main();
Run the command node rendering-instruction-set.js to execute the file and see the output. Now, you can use Cheerio to load the HTML content and extract the data.
Check out the Rendering Instruction Set documentation to view the other available instructions and learn how to use them in proxy mode.
By using ScraperAPI, you offload the complexity of scraping data to its API so you can focus on extracting and using website data. ScraperAPI will manage the browser instance to render the web page, handle the CAPTCHA challenge, rotate the IP address to avoid bot detection, and more.
You will not have to manage Puppeteer dependencies and instances in your web scraper; you only need to create an account, and you’re ready to go.
FAQs about Puppeteer Web Scraping
Is Puppeteer still used in 2025?
Yes. In 2025, Puppeteer is still widely used for web automation, scraping, and testing. It remains popular because it is robust and reliable.
Is Puppeteer good for web scraping?
Puppeteer is well suited to web scraping because it offers many useful features: a headless browser you can launch programmatically, rendering of dynamic web pages, user interactions, and integration with proxies such as ScraperAPI.
Is Puppeteer free?
Puppeteer is free for personal and commercial projects; it is open source under the Apache 2.0 license, which allows you to customize and distribute it.
How does Puppeteer compare to Selenium?
- Selenium supports many programming languages, while Puppeteer only works in JavaScript.
- Selenium supports all major browsers, while Puppeteer works with Chromium-based browsers and has limited support for Firefox.
- Selenium has a steeper learning curve than Puppeteer because it supports many browsers and programming languages.
- Selenium is less performant than Puppeteer in most cases because it uses the WebDriver protocol, which introduces some latency, while Puppeteer uses the Chrome DevTools Protocol.
Is Puppeteer faster than Selenium?
Puppeteer is generally more performant for web scraping than Selenium, thanks to the DevTools Protocol and the rich features it provides for browser rendering and interaction. However, this only applies if you are using JavaScript for web scraping.
How does Puppeteer compare to Playwright?
- Playwright supports many programming languages, while Puppeteer only works in JavaScript.
- Playwright supports all major browsers, while Puppeteer works with Chromium-based browsers and has limited support for Firefox.
- Playwright has built-in stealth features, while Puppeteer requires extra dependencies to provide stealth capabilities.
- Playwright scales better than Puppeteer because its built-in browser contexts allow many contexts to run in a single browser instance.
- Playwright has more robust network interception features than Puppeteer, such as the ability to simulate network failures and mock responses.
Which tool is best for web scraping?
All of these are great web scraping tools, and which one is better always depends on the use case. There are cases where Puppeteer beats Playwright or Selenium, and vice versa. Puppeteer is a favorite browser automation tool for many developers and should definitely be among your first choices for web scraping.