If you’ve ever built a web scraper, you know the pain: it works great on 1,000 pages, but the moment you scale up to 10,000 or more, everything slows to a crawl. Here’s the good news: there’s a fix.
In this article, you’ll learn everything about:
- What concurrent threads are
- How to set up ScraperAPI’s concurrent threads
- How to use them to scrape web pages faster and more efficiently
So, what are concurrent threads?
If you’ve used ScraperAPI before, you already know the basics—you hit the API to fetch the pages you need. With concurrent threads, you can send multiple requests at the same time. Instead of scraping one page, waiting, and then scraping the next, you can run several requests in parallel and get results way faster.
Let’s say you’re using 5 concurrent threads. That means you’re making 5 requests to ScraperAPI at once, all running in parallel. So, the more threads you use, the more requests you can send at once, and the faster your scraper runs.
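To make that concrete, here’s a minimal sketch of the idea in Python, using ThreadPoolExecutor with 5 workers. The API key and URLs are placeholders, and fetch() is just an illustrative helper:

import requests
from concurrent.futures import ThreadPoolExecutor

API_KEY = "YOUR_SCRAPERAPI_KEY"  # placeholder - replace with your own key
urls = [f"https://example.com/page/{i}" for i in range(1, 11)]  # illustrative URLs

def fetch(url):
    # Each request is routed through the ScraperAPI endpoint
    response = requests.get(
        "http://api.scraperapi.com/",
        params={"api_key": API_KEY, "url": url},
    )
    return url, response.status_code

# 5 worker threads = up to 5 requests in flight at the same time
with ThreadPoolExecutor(max_workers=5) as executor:
    for url, status in executor.map(fetch, urls):
        print(url, status)

With a single thread, those 10 pages would be fetched strictly one after another; with 5 workers, up to 5 of them are being fetched at any given moment.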
Each ScraperAPI plan comes with its own thread limit. For example:
- The Business plan gives you up to 100 concurrent threads
- The Scaling plan bumps that up to 200 threads
However, if your scraping needs go beyond that, we’ve got you covered with our Enterprise plan. With Enterprise, there’s no fixed cap. We work with you to tailor a custom thread limit based on your exact use case so you get the best speed and performance.
How to increase your scraping speed?
Now that we know what concurrent threads are, it’s time to see them in action.
We’ll run a simple experiment to test how performance scales with different thread limits and show just how much speed you can unlock.
First, we’ll create a list of 1,000+ sample URLs. To do that, we’ll crawl https://edition.cnn.com/business/tech and extract links using Scrapy, an open-source crawling framework. This step only exists to get sample URLs for the experiment; in your case, these would be the actual pages you need to scrape.
Once we have the list of URLs, we’ll hit the ScraperAPI endpoint twice:
- First, using 100 concurrent threads.
- Then again, with 500 concurrent threads.
Finally, we’ll measure how long each run takes.
Stage 1: Create a list of sample URLs to scrape
Follow these steps to create a list of URLs from https://edition.cnn.com/business/tech:
Step 1: Open the command prompt or terminal, go to your project folder, and install Scrapy, BeautifulSoup, and Requests (we will need the last two later).
pip install scrapy bs4 requests
Step 2: Start a new Scrapy project.
scrapy startproject cnn_scraper
cd cnn_scraper
Step 3: Go inside the cnn_scraper/spiders folder and create a Python file.
cd cnn_scraper/spiders
touch cnn_spider.py
Step 4: In your IDE, open cnn_scraper/spiders/cnn_spider.py and paste the following code:
import scrapy
from urllib.parse import urljoin
from scrapy.exceptions import CloseSpider


class CnnSpider(scrapy.Spider):
    name = "cnn"
    allowed_domains = ["edition.cnn.com"]
    start_urls = ["https://edition.cnn.com/business/tech"]
    seen_urls = set()

    # Send a browser-like User-Agent with every request
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        # Collect every link on the current page
        links = response.css("a::attr(href)").getall()
        for link in links:
            # Turn relative links into absolute URLs and keep only CNN links
            if link.startswith("/"):
                full_url = urljoin("https://edition.cnn.com", link)
            elif link.startswith("http") and "edition.cnn.com" in link:
                full_url = link
            else:
                continue

            # Only process URLs we haven't seen before
            if full_url not in self.seen_urls:
                self.seen_urls.add(full_url)
                yield {"url": full_url}
                # Follow the link so the crawl keeps discovering new pages
                yield response.follow(full_url, callback=self.parse)

            # Stop once we have enough sample URLs
            if len(self.seen_urls) >= 1000:
                raise CloseSpider("URL limit reached")
In the above code, custom_settings sets the User-Agent header Scrapy sends with each request, making the spider look like a real browser. The parse() method uses Scrapy’s getall() to collect all the links on the current page and turns them into full URLs with urljoin(). The check if full_url not in self.seen_urls ensures each link is only processed once, and the spider stops after collecting 1,000 unique URLs.
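To see what that normalization does, here’s a quick standalone example (the article path is made up for illustration):

from urllib.parse import urljoin

# A relative CNN link becomes an absolute URL
print(urljoin("https://edition.cnn.com", "/2025/01/01/tech/example-article"))
# -> https://edition.cnn.com/2025/01/01/tech/example-article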
Step 5: To run the above code and save the URLs into a JSON file, execute the following command from the cnn_scraper/spiders folder:
scrapy crawl cnn -o urls.json
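Before moving on, you can sanity-check the output with a few lines of Python (a minimal sketch, assuming urls.json sits in the folder you ran the crawl from):

import json

# Load the crawled URLs and inspect the first few entries
with open("urls.json", "r") as f:
    data = json.load(f)

print(len(data), "URLs collected")
print(data[:3])  # each item looks like {"url": "https://edition.cnn.com/..."}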
Stage 2: Let’s scrape the saved URLs using ScraperAPI
Step 1: Create a Python file (I named mine scraper_api.py, but you can pick whatever name works for you) and paste the following code in it:
import requests
import json
import csv
import time
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

API_KEY = 'ScraperAPI API_key'  # replace with your ScraperAPI key
NUM_RETRIES = 3
NUM_THREADS = 100

# Load the URLs collected in Stage 1
with open("path/to/URLs_json_file", "r") as file:
    raw_data = json.load(file)

list_of_urls = [item["url"] for item in raw_data if "url" in item]


def scrape_url(url):
    params = {
        'api_key': API_KEY,
        'url': url
    }

    # Retry up to NUM_RETRIES times on connection errors or unexpected status codes
    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=params)
            if response.status_code in [200, 404]:
                break
        except requests.exceptions.ConnectionError:
            continue
    else:
        # All retries failed
        return {
            'url': url,
            'h1': 'Failed after retries',
            'title': '',
            'meta_description': '',
            'status_code': 'Error'
        }

    if response.status_code == 200:
        # Parse the page and extract the H1, title, and meta description
        soup = BeautifulSoup(response.text, "html.parser")
        h1 = soup.find("h1")
        title = soup.title.string.strip() if soup.title and soup.title.string else "No Title Found"
        meta_tag = soup.find("meta", attrs={"name": "description"})
        meta_description = meta_tag["content"].strip() if meta_tag and meta_tag.has_attr("content") else "No Meta Description"
        return {
            'url': url,
            'h1': h1.get_text(strip=True) if h1 else 'No H1 found',
            'title': title,
            'meta_description': meta_description,
            'status_code': response.status_code
        }
    else:
        return {
            'url': url,
            'h1': 'No H1 - Status {}'.format(response.status_code),
            'title': '',
            'meta_description': '',
            'status_code': response.status_code
        }


start_time = time.time()

# Send the requests to ScraperAPI using NUM_THREADS concurrent threads
with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    scraped_data = list(executor.map(scrape_url, list_of_urls))

elapsed_time = time.time() - start_time
print(f"Using {NUM_THREADS} concurrent threads, scraping completed in {elapsed_time:.2f} seconds.")

# Save the results to CSV
with open("cnn_h1_1000_1_results.csv", "w", newline='', encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "h1", "title", "meta_description", "status_code"])
    writer.writeheader()
    writer.writerows(scraped_data)
The scrape_url(url) function sends a request to ScraperAPI for the given URL. If the connection fails or the response status code is not 200 or 404, it retries up to NUM_RETRIES times. On a 200 OK, it uses BeautifulSoup to parse out the H1, title, and meta description.
The ThreadPoolExecutor(max_workers=NUM_THREADS) block is what sends the requests to ScraperAPI concurrently: executor.map() runs scrape_url() across the whole URL list with up to NUM_THREADS requests in flight at once. Finally, the code saves the scraped data to a CSV file.
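If you’d rather see progress while the run is in flight, a variation on the same idea (not part of the script above) is to use executor.submit() with as_completed() instead of executor.map():

from concurrent.futures import ThreadPoolExecutor, as_completed

# Sketch of an alternative to executor.map(): submit every URL, then
# collect results as each thread finishes so progress can be printed.
# Assumes scrape_url(), list_of_urls, and NUM_THREADS are defined as above.
scraped_data = []
with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    futures = [executor.submit(scrape_url, url) for url in list_of_urls]
    for i, future in enumerate(as_completed(futures), start=1):
        scraped_data.append(future.result())
        if i % 100 == 0:
            print(f"{i}/{len(list_of_urls)} pages scraped")

Unlike executor.map(), results arrive in completion order rather than input order, so sort them afterwards if row order matters in your CSV.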
With NUM_THREADS = 100, scraping the titles took 100.68 seconds.
Running the same code with the thread count changed to 500, the run took just 23.56 seconds.
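Concretely, the only edit between the two runs is the constant at the top of the script:

NUM_THREADS = 500  # was 100 in the first run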
Just like that, we slashed the scraping time from around 100 seconds down to just 23 seconds. That’s more than 4 times faster with 500 threads compared to 100!
To optimize your performance with a concurrency limit tailored to your exact use case, upgrade to our Enterprise plan today.