
Playwright Web Scraping: Complete Guide for 2025


Struggling to scrape complex websites? Playwright web scraping might be your solution. This powerful tool, developed by Microsoft, is designed to tackle the challenges of modern websites with ease, making it a favorite among developers.

This guide will walk you through everything you need to know about using Playwright for web scraping. You’ll learn to handle even the most challenging scraping tasks, from setting it up to using advanced features like proxies and HTTP request interception. 

Let’s dive in and see how Playwright can simplify web scraping for you!

Simplify web scraping with ScraperAPI
A simple API call lets you bypass even the most advanced bot blockers, including DataDome and PerimeterX, and scale your projects effortlessly.

What is Playwright?

Playwright is an open-source tool that lets you control web browsers through code. It’s excellent for web scraping, testing, and automating tasks. The best part? It works with multiple browsers like Chromium, Firefox, and WebKit so that you can use the same code across different platforms.

Playwright makes scraping easy. You can interact with dynamic websites, run JavaScript, and bypass many anti-bot protections. It’s a tool that simplifies the toughest challenges, giving you the flexibility and power needed to scrape effectively.

Is Playwright Good for Web Scraping?

Playwright has quickly become a favorite for web scraping thanks to its easy handling of modern web technologies. Traditional tools often struggle with JavaScript-heavy pages or websites that use advanced anti-bot systems, but Playwright excels in these areas.

Its multi-browser support allows developers to scrape websites across Chromium, Firefox, and WebKit without modifying their code. This flexibility ensures that scrapers work consistently, regardless of the browser used. Additionally, Playwright integrates seamlessly with Python and other programming languages, making it accessible even to those new to automation.

What sets Playwright apart is its ability to interact with dynamic content, handle complex workflows, and bypass common scraping roadblocks. These features make it an excellent choice for developers looking to extract data from modern websites efficiently.

Project Requirements

Here are the tools and setup required to start web scraping with Playwright and Python:

  1. Install Python: Ensure you have installed Python 3.7 or a newer version. Download it from the official Python website, and verify the installation by running python --version in your terminal.
  2. Install Playwright: Use the command pip install playwright to add Playwright to your project. After installation, run playwright install to download the necessary browser binaries.
  3. Set Up Your Project Directory: Organize your files by creating a dedicated directory for your scraping scripts. Use mkdir playwright_scraper and navigate to it using cd playwright_scraper.
  4. Write a Test Script: Create a simple Python script to verify your setup. Here’s an example that uses Google as the test site:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()    # Launch headless Chromium
    page = browser.new_page()        # Open a new tab
    page.goto("https://google.com")  # Navigate to the test site
    print(page.title())              # Should print "Google"
    browser.close()                  # Release the browser when done

Once these steps are complete, you’re ready to start building and running your scraping projects efficiently!

TL;DR: Playwright Scraping Basics

Below is the full code for scraping a website using Playwright. This will help you understand the core features and functionalities of Playwright and how to apply them in web scraping and automation tasks effectively:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def main():
    with sync_playwright() as p:
        # Launch a browser
        browser = p.chromium.launch(headless=True)
        print("Browser launched!")

        # Open Hacker News homepage
        page = browser.new_page()
        page.goto("https://news.ycombinator.com/")
        page.wait_for_load_state("networkidle")
        print("Page loaded successfully:", page.title())

        # Scrape top headlines (HN currently wraps titles in .titleline spans)
        page.wait_for_selector(".titleline")
        headlines = page.locator(".titleline").all_inner_texts()
        print("Top Headlines:")
        for headline in headlines:
            print("-", headline)

        # Search for "AI"
        search_box = page.locator("input[type='text']")
        search_box.fill("AI")
        search_box.press("Enter")
        page.wait_for_load_state("networkidle")

        # Parse search results with BeautifulSoup
        page_content = page.content()
        soup = BeautifulSoup(page_content, "html.parser")
        first_headline = soup.select_one(".Story_title")
        if first_headline:
            print("First search result headline:", first_headline.text)
        else:
            print("No search results found.")

        # Close the browser
        browser.close()
        print("Browser closed.")

if __name__ == "__main__":
    main()

How It Works

Here’s a breakdown of Playwright basics based on the code:

Navigation (i.e., go to URL)

Playwright’s goto() method directs the browser to a specified URL. This is essential for opening web pages:

page.goto("https://news.ycombinator.com/")

Button Clicking

Simulate clicks with the click() method, or key presses with press(). For example, pressing Enter after typing a search query:

search_box.press("Enter")

Text Input

Use the fill() or type() methods to type text into form fields. For example, to enter “AI” in the search bar:

search_box.fill("AI")

Or, if you want to simulate typing with a delay between keystrokes:

search_box.type("AI", delay=100)

JavaScript Execution

Playwright automatically handles JavaScript execution on pages. You can interact with dynamically loaded elements using:

page.wait_for_selector(".storylink")

This ensures elements are ready before interaction.

Waiting for Content to Load

Ensure the page has fully loaded before proceeding with actions. Use wait_for_load_state() for network idle states:

page.wait_for_load_state("networkidle")

Understanding these basics gives you a solid foundation to scrape websites effectively using Playwright. 

Keep reading to dive deeper into advanced techniques and features that will elevate your scraping skills!

How to Use Playwright for Web Scraping

Step 1: Launch a Browser

The first step in any Playwright project is to launch a browser. Using headless mode (without a visible browser window) is faster and more efficient for scraping tasks. Here’s how to get started:

from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)

If you prefer to see what the browser does during development, set headless=False to enable a GUI.
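
For development runs, here is a minimal sketch that opens a visible window; the slow_mo value is just an illustrative choice that pauses between actions so you can follow the script visually:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # headless=False shows the browser window; slow_mo pauses (in ms)
    # between each action so you can watch the script work
    browser = p.chromium.launch(headless=False, slow_mo=250)
    page = browser.new_page()
    page.goto("https://news.ycombinator.com/")
    browser.close()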

Step 2: Open a Page and Handle Navigation

Once the browser is launched, open the Hacker News homepage using the goto() method. Playwright’s wait_for_load_state() ensures that the page has fully loaded before moving on:

    page = browser.new_page()  # Open a new browser tab
    page.goto("https://news.ycombinator.com/")  # Navigate to Hacker News
    page.wait_for_load_state("networkidle")  # Wait until network activity settles
    print("Page loaded successfully:", page.title())

Step 3: Work with Dynamic Elements and Extract Data

Next, we’ll use wait_for_selector() to ensure specific elements are ready before interacting with them, then use Playwright’s locator() method to find and extract the top headlines:

    page.wait_for_selector(".storylink")  # Wait for headlines to appear
    headlines = page.locator(".storylink").all_inner_texts()
    print("Top Headlines:")
    for headline in headlines:
        print("-", headline)

This ensures that your script won’t fail due to missing elements still loading.

Step 4: Simulate Form Input and Searching

Hacker News has a search bar that allows us to query for content directly. Let’s simulate searching for the term “AI” using Playwright. After performing the search, we’ll parse the results using BeautifulSoup:

from bs4 import BeautifulSoup

    # Interact with the Hacker News search bar
    search_box = page.locator("input[type='text']")  # Locate the search input box
    search_box.fill("AI")  # Type the search query
    search_box.press("Enter")  # Simulate pressing Enter
    page.wait_for_load_state("networkidle")  # Wait for search results to load

    # Extract the page content and parse with BeautifulSoup
    page_content = page.content()  # Get the full HTML content
    soup = BeautifulSoup(page_content, "html.parser")

    # Extract the first headline from search results
    first_headline = soup.select_one(".Story_title")
    if first_headline:
        print("First search result headline:", first_headline.text)
    else:
        print("No search results found.")

Step 5: Close the Browser

Once scraping is complete, close the browser to free up resources:

    browser.close()
    print("Browser closed.")

By following these steps, you can scrape both dynamic content and simulate search functionality effectively using Playwright and BeautifulSoup!

Using Proxies in Python Playwright

Playwright supports configuring proxies directly in the launch() method. Here are two straightforward ways to use proxies:

1. Simple HTTP Proxy

This is the easiest setup, where you specify a proxy server without any authentication:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(
        proxy={"server": "http://11.11.11.1:9000"}
    )
    page = browser.new_page()
    page.goto("https://news.ycombinator.com/")
    print("Page loaded through proxy!")
    browser.close()

2. Authenticated Proxy

For proxies requiring credentials, you can add a username and password to the proxy configuration:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(
        proxy={
            "server": "http://11.11.11.1:9000",
            "username": "your-username",
            "password": "your-password"
        }
    )
    page = browser.new_page()
    page.goto("https://news.ycombinator.com/")
    print("Page loaded through authenticated proxy!")
    browser.close()
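
Playwright also lets you assign a proxy per browser context instead of per browser, which is handy for rotating IPs without relaunching. Here’s a minimal sketch (the proxy address is a placeholder; note that some Chromium builds also require a launch-level proxy before per-context proxies take effect):

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    # Each context can carry its own proxy settings
    context = browser.new_context(
        proxy={"server": "http://11.11.11.2:9000"}  # placeholder proxy
    )
    page = context.new_page()
    page.goto("https://news.ycombinator.com/")
    context.close()
    browser.close()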

Advanced Playwright Functions and Techniques

Playwright offers several advanced functions to enhance your web scraping capabilities. These include intercepting HTTP requests, evaluating JavaScript, and blocking unnecessary resources from loading. 

Let’s explore how to use these techniques:

Intercept HTTP Requests

Intercepting HTTP requests gives you fine-grained control over network activity. Whether you’re debugging, mocking API responses, or filtering out unwanted requests, here’s how you can use it effectively:

Logging and Filtering Requests

Capture and inspect specific request types or resources:

page.on("request", lambda request: print(f"Request: {request.url}, Type: {request.resource_type}"))
page.route("**/*", lambda route, request: route.continue_() if request.resource_type == "script" else route.abort())

Mocking API Responses

Mocking responses can save time and ensure consistency during testing or rate-limited scraping:

page.route("https://example.com/api/data", lambda route: route.fulfill(
    status=200,
    content_type="application/json",
    body='{"key": "mocked value"}'
))

Modifying Request Headers

Modify or add headers to requests, such as custom tokens for authentication:

page.route("**/*", lambda route, request: route.continue_(
    headers={**request.headers, "Authorization": "Bearer mytoken"}
))

This flexibility allows for tailored HTTP interactions to meet your scraping needs.
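
You can also listen for responses as they arrive. As a small sketch (the content-type check is just one way to filter), the handler below logs every response that declares a JSON payload, which helps reveal the APIs a page calls behind the scenes:

def log_json_responses(response):
    # Flag responses that declare a JSON content type
    if "application/json" in response.headers.get("content-type", ""):
        print("API response from:", response.url)

page.on("response", log_json_responses)
page.goto("https://news.ycombinator.com/")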

Evaluate JavaScript

Evaluating JavaScript is one of Playwright’s most powerful features, allowing you to interact with the browser’s DOM or execute custom scripts to extract data and manipulate the page. Here are a few advanced use cases:

Extracting Multiple Elements Dynamically

You can execute JavaScript to extract data dynamically, such as gathering all links on the page:

from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://news.ycombinator.com/")

        # Extract all hyperlinks dynamically
        links = page.evaluate("Array.from(document.querySelectorAll('a')).map(a => a.href)")
        print("Extracted Links:")
        for link in links:
            print(link)

        browser.close()

if __name__ == "__main__":
    main()

Interacting with Complex DOM Elements

Sometimes you need to interact with deeply nested or dynamically generated elements. Use JavaScript to locate and manipulate these elements:

# Click the first story link dynamically (the anchor inside .titleline)
page.evaluate("document.querySelector('.titleline a').click()")

Modifying Page Content

You can even manipulate the DOM, such as changing text or injecting new elements:

# Change the headline text of the first story
page.evaluate("document.querySelector('.titleline a').textContent = 'Modified Headline'")

Why Use Evaluate?

  • Flexibility: Execute custom scripts tailored to the page’s structure.
  • Dynamic Data: Extract or interact with elements that are loaded via JavaScript.
  • Manipulation: Modify the DOM to suit your scraping or testing needs.
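
evaluate() can also accept an argument passed in from Python, which keeps selectors out of hard-coded script strings. A small sketch, reusing the .titleline selector from earlier:

# Pass a selector from Python into the page's JavaScript context
first_title = page.evaluate(
    "sel => document.querySelector(sel)?.textContent", ".titleline"
)
print("First title:", first_title)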

Block Resources from Loading

When scraping web pages, many resources like images, stylesheets, and scripts are unnecessary and can slow down the process. Playwright allows you to control what gets loaded by routing requests. Let’s break it down:

Block Specific Resource Types

You can block certain types of resources based on their role in the webpage. For example, to block images, videos, and stylesheets:

page.route("**/*", lambda route, request: route.abort() if request.resource_type in ["image", "stylesheet", "media"] else route.continue_())

Here, request.resource_type checks the type of resource being requested. If it matches one of the blocked types, the request is aborted; otherwise, it continues.

Block Requests to Specific Domains

Sometimes, you may want to block all requests to a particular domain, such as ad servers or analytics platforms:

page.route("**/*", lambda route, request: route.abort() if "ads.com" in request.url else route.continue_())

This ensures that unnecessary requests from specific domains do not slow your scraping or inflate your bandwidth usage.

Allow Only Targeted Resources

In some cases, it’s more efficient to allow only specific resources by whitelisting URLs or domains:

page.route("**/*", lambda route, request: route.continue_() if "example.com" in request.url else route.abort())

Only requests to example.com will proceed here, while all others are blocked. This is particularly useful when scraping APIs embedded in a webpage.

Why Use Resource Blocking?

  • Speed: Blocking unnecessary resources speeds up page loads.
  • Efficiency: Reduces bandwidth usage, especially on media-heavy websites.
  • Focus: Ensures your scraper processes only the relevant content.

By mastering these advanced Playwright features, you’ll be equipped to handle even the most complex scraping challenges efficiently.

Common Challenges When Scraping with Playwright & Their Solutions

Scraping with Playwright can be highly effective, but it comes with its own set of challenges. Let’s dive into common challenges and explore practical strategies for overcoming them.

Scaling Browser Operations

Running multiple browser instances simultaneously can quickly become overwhelming. It’s not just about keeping everything running; you must ensure stability and avoid crashes while juggling configurations for different tasks.

  • How to solve it: Tools like Docker can make things much easier by isolating browser instances, and Playwright’s built-in parallelization features are great for handling multiple sessions without hassle (see the async sketch below).
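
As a rough sketch of the parallel approach, Playwright’s async API pairs naturally with asyncio.gather to drive several pages at once (the repeated URL is a placeholder for your own list):

import asyncio
from playwright.async_api import async_playwright

async def scrape_title(context, url):
    # Each task drives its own page inside a shared browser context
    page = await context.new_page()
    await page.goto(url)
    title = await page.title()
    await page.close()
    return title

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        urls = ["https://news.ycombinator.com/"] * 3  # placeholder URL list
        titles = await asyncio.gather(*(scrape_title(context, u) for u in urls))
        print(titles)
        await browser.close()

asyncio.run(main())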

Overcoming IP Restrictions

Frequent or repeated requests from the same IP address are a red flag for most websites. This can lead to rate limits or outright bans, derailing your scraping efforts.

  • How to solve it: A good proxy pool is essential here. Rotate IPs regularly and add delays between requests to mimic actual user behavior, as in the sketch below. Session management can also help maintain cookies and avoid re-authentication.
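
Here’s a minimal sketch of that idea: rotate through a small proxy pool (the addresses are placeholders) and pause a random interval between requests:

import random
import time
from playwright.sync_api import sync_playwright

PROXIES = ["http://11.11.11.1:9000", "http://11.11.11.2:9000"]  # placeholders

with sync_playwright() as pw:
    for url in ["https://news.ycombinator.com/"]:  # your URL list
        # Launch with a randomly chosen proxy for each request
        browser = pw.chromium.launch(proxy={"server": random.choice(PROXIES)})
        page = browser.new_page()
        page.goto(url)
        print(page.title())
        browser.close()
        time.sleep(random.uniform(2, 5))  # randomized, human-like pause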

Adapting to Website Specifics

No two websites are the same. From layout changes to security updates, staying on top of unique configurations for each site can feel like a full-time job.

  • How to solve it: Modularize your scraper configurations so you can quickly adapt to changes without starting from scratch (a minimal example follows below). If a site changes its layout, scripts that can auto-detect and adjust to these changes can save you a ton of time.
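
One lightweight way to do this is to keep per-site settings in a plain dictionary so a layout change only means editing a config entry. A hypothetical sketch (SITE_CONFIGS and its selectors are illustrative):

# Hypothetical per-site configuration, kept separate from scraping logic
SITE_CONFIGS = {
    "hackernews": {
        "url": "https://news.ycombinator.com/",
        "headline_selector": ".titleline",
    },
    # New sites are added here without touching the scraper itself
}

def scrape_site(page, site):
    cfg = SITE_CONFIGS[site]
    page.goto(cfg["url"])
    page.wait_for_selector(cfg["headline_selector"])
    return page.locator(cfg["headline_selector"]).all_inner_texts()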

Managing Resources and Bandwidth

Resource-heavy pages with many images, videos, or dynamic elements can drain bandwidth and slow your scrapers.

  • How to solve it: Block unnecessary assets like images and stylesheets to keep things running smoothly. Load balancing across multiple servers can also help manage the workload effectively.

Navigating Anti-Scraping Defenses

Websites are getting smarter at identifying bots. Tools that monitor user behavior—like mouse movements and navigation speed—can easily flag and block automated scrapers.

  • How to solve it: Simulate human-like behavior. This means adding random delays, mimicking natural interactions like mouse movements, and introducing variability in click patterns, as in the sketch below. Regularly update your scripts to keep up with evolving anti-bot measures.
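
For instance, a small hedged sketch that wanders the mouse through a few random points and varies its pauses before clicking (the coordinates and timings are arbitrary):

import random

# Wander the mouse through a few random points with randomized pauses
for _ in range(3):
    page.mouse.move(random.randint(0, 800), random.randint(0, 600), steps=10)
    page.wait_for_timeout(random.randint(200, 800))  # pause in milliseconds
page.locator(".titleline a").first.click()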

By understanding and tackling these challenges head-on, you can create scraping workflows that are efficient and robust against the most common roadblocks.

Which Is Better? Playwright vs Puppeteer vs Selenium

Regarding browser automation and web scraping, three major players dominate the field: Playwright, Puppeteer, and Selenium. Each tool has its strengths and weaknesses, making it essential to evaluate them based on your project needs.

Playwright

Playwright is a newer tool developed by Microsoft that supports multiple browsers, including Chromium, Firefox, and WebKit. It is known for modern features such as auto-waiting for elements, built-in parallelism, and native support for both headless and headed browsers.

Puppeteer

Google’s Puppeteer is streamlined for Chrome and Chromium, making it lightweight and easy for projects requiring a tightly integrated Chrome experience. It excels in performance for Chrome-specific tasks and is ideal for developers who prefer JavaScript-based solutions.

Selenium

Selenium has been the go-to tool for browser automation for over a decade. With support for multiple browsers and programming languages, it’s an excellent choice for legacy applications, cross-browser testing, and environments requiring broad compatibility.

Here’s a quick comparison of the three tools:

| Feature | Playwright | Puppeteer | Selenium |
| --- | --- | --- | --- |
| Browser Support | Chromium, Firefox, WebKit | Chrome, Chromium | Chrome, Firefox, Edge, Safari |
| Language Support | Python, JavaScript, Java, C# | JavaScript | Python, Ruby, JavaScript, Java, C# |
| Ease of Use | Excellent | Good | Fair |
| Community and Documentation | Small active community with detailed documentation | Large active community with good documentation | Large active community with older documentation |
| Performance | Fast | Fast | Slower |
| Developer Experience | Intuitive and easy to use | Lightweight and easy setup | Can feel clunky and outdated |
| Headless Mode | Built-in and well-optimized | Built-in and fast | Supported but less efficient |

Choosing the Right Tool

  • Go with Playwright if:
    • You need multi-browser support.
    • Your project involves dynamic content that requires auto-waiting for elements.
    • You want built-in features for parallel execution.
  • Go with Puppeteer if:
    • You work exclusively with Chrome/Chromium.
    • You want a lightweight, JavaScript-native library.
    • Your use case focuses on Chrome-specific tasks.
  • Go with Selenium if:
    • You need extensive browser and language support.
    • Your project requires compatibility with older browsers or legacy systems.
    • You’re running cross-platform and cross-browser tests.

Each tool is a strong contender, depending on your requirements. Playwright often edges out the others for high-performance scraping of modern web applications thanks to its advanced features and ease of use. However, Puppeteer remains a top choice for Chrome-focused tasks, and Selenium is unparalleled for extensive compatibility testing.

Rendering JavaScript Sites Without Playwright

While Playwright offers extensive control over browser-based scraping, managing all the technical complexities like proxies, browser instances, and JavaScript rendering can quickly become overwhelming. This is where ScraperAPI steps in to simplify the process. With its render instruction set, ScraperAPI handles dynamic content rendering, JavaScript execution, and anti-scraping measures seamlessly, so you don’t have to.

Here’s how you can scrape Hacker News headlines and perform a search using ScraperAPI’s render instruction set to manage JavaScript rendering.

import requests
from bs4 import BeautifulSoup

# ScraperAPI credentials and endpoint
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.scraperapi.com/"

# Define headers with your API key and rendering settings
headers = {
    'x-sapi-api_key': API_KEY,
    'x-sapi-render': 'true',
    'x-sapi-instruction_set': '[ {"type": "input", "selector": {"type": "css", "value": ".SearchInput"}, "value": "AI "} , {"type": "wait_for_selector","selector": { "type": "css", "value": ".Story_title"}}, {"type": "wait","value": 10}]'
}
payload = {
    'url': "https://hn.algolia.com/"
}
response = requests.get(BASE_URL, params=payload, headers=headers)

soup = BeautifulSoup(response.content, "html.parser")
first_result = soup.select_one(".Story_title")
print("First Search Result:", first_result.text if first_result else "No results found.")

Why Use ScraperAPI?

  • Dynamic Rendering: ScraperAPI automatically handles JavaScript-heavy websites without requiring a browser instance.
  • Simplified Proxy Management: It rotates proxies for you, ensuring requests don’t get blocked.
  • Customizable Instructions: The render instruction set lets you mimic user interactions like clicks or form submissions with simple configurations.
  • Scalability: ScraperAPI scales seamlessly with your workload, letting you focus on data extraction instead of infrastructure.

FAQs about Playwright web scraping

Can Playwright be detected?

Yes, Playwright can be detected, especially by websites with advanced anti-bot systems. Detection happens through browser fingerprints, unorthodox interaction patterns, or automated browsing behavior. To avoid detection, use stealth plugins, rotate user agents, and implement human-like interaction simulations.
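
As a starting point, here’s a minimal sketch of user-agent rotation using a browser context (the user-agent strings are illustrative and should be kept current):

import random
from playwright.sync_api import sync_playwright

# Illustrative user-agent pool; keep these strings up to date
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(user_agent=random.choice(USER_AGENTS))
    page = context.new_page()
    page.goto("https://news.ycombinator.com/")
    print(page.title())
    browser.close()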

Does Playwright have a UI?

Yes, Playwright includes a UI tool called Playwright Inspector. This tool provides features like time travel debugging, a locator picker, and watch mode, making it easier to visually debug and test your scripts than using only the command line.
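
You can open the Inspector by inserting page.pause() where you want a script to stop, or by setting the PWDEBUG=1 environment variable before running it:

page.pause()  # execution stops here and the Playwright Inspector opens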

How to speed up a Playwright scraper?

To improve the speed of a Playwright scraper, use headless mode to reduce resource consumption and block unnecessary assets such as images, videos, and stylesheets. Running multiple browser instances in parallel helps with large-scale scraping, and tuning your waiting conditions (for example, waiting for a specific selector instead of a fixed timeout) ensures faster page interactions.

Which headless browser is best to use for Playwright scraping?

Chromium is the best choice for Playwright scraping due to its speed, reliability, and compatibility with modern websites. While Playwright also supports WebKit and Firefox, Chromium delivers superior performance and ecosystem support for most use cases.

About the author


Ize Majebi

Ize Majebi is a Python developer and data enthusiast who delights in unraveling code intricacies and exploring the depths of the data world. She transforms technical challenges into creative solutions, with a passion for problem-solving and a talent for making the complex feel like a friendly chat. Her writing brings a touch of simplicity to the realms of Python and data.