In this guide, I’ll show you how to use ScraperAPI with Python’s Pyppeteer library for headless browser automation. I’ll walk you through the integration step by step and point out the common pitfalls, so you can get scraping as quickly as possible.
Recommended Method: Use Proxy Configuration
To use Pyppeteer with ScraperAPI correctly, configure the browser to route its traffic through our proxy servers, just as you would with any other proxy.
First, install Pyppeteer:
pip install pyppeteer
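Note that Pyppeteer downloads its own Chromium build the first time you call launch(). If you’d rather fetch it ahead of time, the package ships a helper command for that:

pyppeteer-install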
Here’s how to configure your Pyppeteer browser to route all requests through ScraperAPI:
import asyncio
from pyppeteer import launch

API_KEY = 'YOUR_API_KEY'

async def main():
    # Route all browser traffic through the ScraperAPI proxy port
    browser = await launch({
        'args': [
            '--proxy-server=http://proxy-server.scraperapi.com:8001'
        ]
    })
    page = await browser.newPage()
    # Authenticate against the proxy with your API key as the password
    await page.authenticate({
        'username': 'scraperapi',
        'password': API_KEY
    })
    await page.goto('http://quotes.toscrape.com/')
    # Extract each quote's text and author from the rendered DOM
    quotes = await page.evaluate('''() => {
        return Array.from(document.querySelectorAll('.quote')).map(quote => ({
            text: quote.querySelector('.text').innerText,
            author: quote.querySelector('.author').innerText
        }));
    }''')
    print(quotes)
    await browser.close()

asyncio.run(main())
And that’s it. Every request Pyppeteer makes now goes through our proxy servers, and you can use Pyppeteer exactly as you normally would.
Here is an example result:
[
{
"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
"author": "Albert Einstein"
},
{
"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
"author": "J.K. Rowling"
},
{
"text": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”",
"author": "Albert Einstein"
},
{
"text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",
"author": "Jane Austen"
},
{
"text": "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
"author": "Marilyn Monroe"
},
{
"text": "“Try not to become a man of success. Rather become a man of value.”",
"author": "Albert Einstein"
},
{
"text": "“It is better to be hated for what you are than to be loved for what you are not.”",
"author": "André Gide"
},
{
"text": "“I have not failed. I've just found 10,000 ways that won't work.”",
"author": "Thomas A. Edison"
},
{
"text": "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
"author": "Eleanor Roosevelt"
},
{
"text": "“A day without sunshine is like, you know, night.”",
"author": "Steve Martin"
}
]
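If you want to confirm that traffic really is routed through ScraperAPI, a quick sanity check is to load an IP-echo service and print the address the target site sees. I’m using http://httpbin.org/ip here purely as an example; any similar service works:

import asyncio
from pyppeteer import launch

API_KEY = 'YOUR_API_KEY'

async def check_ip():
    # Same proxy configuration as the example above
    browser = await launch({
        'args': ['--proxy-server=http://proxy-server.scraperapi.com:8001']
    })
    page = await browser.newPage()
    await page.authenticate({'username': 'scraperapi', 'password': API_KEY})
    # httpbin echoes the caller's IP; it should differ from your own
    await page.goto('http://httpbin.org/ip')
    print(await page.evaluate('() => document.body.innerText'))
    await browser.close()

asyncio.run(check_ip())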
Alternative Method: Using The SDK
You can also use our Python SDK for simpler integration:
First, install the ScraperAPI SDK:
pip install scraperapi-sdk
Here’s an example that fetches the rendered HTML through the API and loads it into Pyppeteer for parsing:
import asyncio
from pyppeteer import launch
from scraperapi_sdk import ScraperAPIClient

API_KEY = 'YOUR_API_KEY'
client = ScraperAPIClient(API_KEY)

async def main():
    browser = await launch(headless=True)
    page = await browser.newPage()
    # Fetch the page through the ScraperAPI endpoint, then load the
    # returned HTML into the headless browser for DOM parsing
    html = client.get('http://quotes.toscrape.com/')
    await page.setContent(html)
    quotes = await page.evaluate('''() => {
        return Array.from(document.querySelectorAll('.quote')).map(quote => ({
            text: quote.querySelector('.text').innerText,
            author: quote.querySelector('.author').innerText
        }));
    }''')
    print(quotes)
    await browser.close()

asyncio.run(main())
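The SDK can also forward ScraperAPI parameters with each request. Here’s a minimal sketch, assuming the params keyword from the scraperapi-sdk documentation (check your SDK version), that would replace the client.get() line above:

# Assumption: the SDK forwards API parameters via a `params` dict
html = client.get(
    'http://quotes.toscrape.com/',
    params={'country_code': 'us', 'premium': 'true'}
)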
Using Additional ScraperAPI Functionality
ScraperAPI lets you customize its functionality by adding extra parameters to your requests. In proxy mode these are passed as custom headers; for example, you can enable premium proxies, geotargeting, and sticky sessions like this:
await page.setExtraHTTPHeaders({
    'X-ScraperAPI-Premium': 'true',
    'X-ScraperAPI-Country': 'us',
    'X-ScraperAPI-Session': '123'
})
The API accepts the following parameters via headers:

| Parameter | Header | Description |
| --- | --- | --- |
| `premium` | `X-ScraperAPI-Premium` | Set to `true` to activate premium residential and mobile IPs |
| `country_code` | `X-ScraperAPI-Country` | Set to a country code such as `us` to activate geotargeting |
| `session_number` | `X-ScraperAPI-Session` | Set to a session ID such as `123` to reuse the same proxy across requests |
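Headers only apply to navigations made after they are set, so call setExtraHTTPHeaders() before page.goto(). Here’s a minimal sketch of where the call fits in the proxy setup from earlier:

page = await browser.newPage()
await page.authenticate({'username': 'scraperapi', 'password': API_KEY})
# Set the ScraperAPI headers before navigating so they apply to the request
await page.setExtraHTTPHeaders({
    'X-ScraperAPI-Country': 'us',
    'X-ScraperAPI-Session': '123'
})
await page.goto('http://quotes.toscrape.com/')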
Configuring Concurrency and Retries
To get the most out of your ScraperAPI plan, configure your concurrent browser limit based on your plan’s thread allowance:
import asyncio
from pyppeteer import launch

API_KEY = 'YOUR_API_KEY'
CONCURRENT_BROWSERS = 5  # the Free Plan allows 5 concurrent threads

async def scrape_page(url):
    browser = await launch({
        'args': ['--proxy-server=http://proxy-server.scraperapi.com:8001']
    })
    page = await browser.newPage()
    await page.authenticate({
        'username': 'scraperapi',
        'password': API_KEY
    })
    await page.goto(url)
    quotes = await page.evaluate('''() => {
        return Array.from(document.querySelectorAll('.quote')).map(quote => ({
            text: quote.querySelector('.text').innerText,
            author: quote.querySelector('.author').innerText
        }));
    }''')
    await browser.close()
    return {'url': url, 'quotes': quotes}

async def scrape_multiple_pages(urls):
    # Cap the number of simultaneous browsers at your plan's thread limit
    semaphore = asyncio.Semaphore(CONCURRENT_BROWSERS)

    async def scrape_with_semaphore(url):
        async with semaphore:
            return await scrape_page(url)

    tasks = [scrape_with_semaphore(url) for url in urls]
    results = await asyncio.gather(*tasks)
    print(results)

# Usage
urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]
asyncio.run(scrape_multiple_pages(urls))
Retry Failed Requests
For most sites, over 97% of your requests will be successful on the first try. However, some requests may fail. Implement retry logic to handle these cases:
import asyncio
from pyppeteer import launch

API_KEY = 'YOUR_API_KEY'
MAX_RETRIES = 3

async def scrape_with_retry(url, retries=MAX_RETRIES):
    for i in range(retries):
        browser = await launch({
            'args': ['--proxy-server=http://proxy-server.scraperapi.com:8001']
        })
        try:
            page = await browser.newPage()
            await page.authenticate({
                'username': 'scraperapi',
                'password': API_KEY
            })
            await page.goto(url, {'timeout': 60000})  # 60-second timeout
            quotes = await page.evaluate('''() => {
                return Array.from(document.querySelectorAll('.quote')).map(quote => ({
                    text: quote.querySelector('.text').innerText,
                    author: quote.querySelector('.author').innerText
                }));
            }''')
            return quotes
        except Exception as e:
            print(f"Attempt {i + 1} failed: {e}")
            if i == retries - 1:
                raise
        finally:
            # Always close the browser, even when an attempt fails,
            # so failed retries don't leak Chromium processes
            await browser.close()

# Usage
async def main():
    quotes = await scrape_with_retry('http://quotes.toscrape.com/')
    print(quotes)

asyncio.run(main())
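If failures come in bursts (for example, when a site is rate limiting), it can help to wait between attempts. Exponential backoff is my suggestion here rather than anything ScraperAPI requires; a minimal sketch, added inside the except branch before the next attempt:

# Back off 1s, 2s, 4s, ... between retries (suggestion, not required)
await asyncio.sleep(2 ** i)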
Final Notes
- Pyppeteer + ScraperAPI = scalable, JS-capable scraping with proxy rotation
- Method 1 (proxy mode) is best for interactive/page-script scraping
- Method 2 (SDK) is great for HTML snapshots + fast DOM parsing
- Configure concurrency and retry based on your plan
- Works with JavaScript-heavy sites thanks to Pyppeteer
More Resources: