
Integrating ScraperAPI with Data Cleaning Pipelines


Collecting clean, usable data is the foundation of any successful web scraping project. However, web data is often filled with inconsistencies, duplicates, and irrelevant content, making it hard to work with straight out of the source.

That’s where combining ScraperAPI with data cleaning pipelines comes in. ScraperAPI helps you reliably extract data from websites—even those with complex anti-scraping protections—while Python’s data tools make it easy to clean, structure, and prepare that data for use.

In this guide, you’ll learn how to:

  • Set up ScraperAPI for web scraping
  • Use ETL (Extract, Transform, Load) techniques to clean and organize your data
  • Integrate these tools into a workflow that’s fast, flexible, and scalable

Ready? Let’s get started!

What is ETL?

ETL stands for Extract, Transform, Load. It is a data processing framework used to move data from one or more sources, clean it, and store it in a system where it can be analyzed. This process is essential for handling large volumes of data from various sources, preparing it for reporting and informed decision-making.

The Three Stages of ETL

  • Extract: In this initial phase, raw data is gathered from its source. In our case, that means scraping websites. This can be tricky, as websites often implement anti-scraping measures like IP bans, CAPTCHAs, and dynamic content loading through JavaScript. To manage these challenges and streamline the extraction process, we’ll use ScraperAPI, a tool designed to simplify and automate data collection at scale.
  • Transform: Once data is extracted, it’s often messy or inconsistent. The real cleanup happens in the transformation stage: the data is validated, standardized, and restructured into a usable format. This is a crucial step for ensuring data quality and consistency.
  • Load: Finally, the cleaned and transformed data is loaded into a storage system. Depending on the project, this could be a CSV file, a relational database (like PostgreSQL or MySQL), a NoSQL database (like MongoDB), a data warehouse (like BigQuery, Redshift, or Snowflake), or even a data lake. We’ll keep this tutorial simple and load the data into a CSV file.
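Conceptually, the three stages map onto three functions that feed into one another. Here is a minimal Python skeleton of that flow (the function names and bodies are placeholders, not part of the tutorial code):

def extract():
    """Pull raw data from the source, e.g., scrape HTML through ScraperAPI."""
    ...

def transform(raw_data):
    """Validate, clean, and restructure the raw data."""
    ...

def load(clean_data):
    """Persist the result, e.g., write a CSV file or insert into a database."""
    ...

if __name__ == "__main__":
    load(transform(extract()))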

ScraperAPI and Python for ETL

Web-scraped data is often messy, inconsistent, and unstructured—not yet ready for analysis or decision-making. That’s where ETL becomes essential. It brings structure, cleanliness, and reliability to chaotic web data, making it more valuable.

Let’s break down how this works in the context of scraping real estate listings:

  • Extract: Use ScraperAPI to pull raw HTML from multiple real estate website pages. ScraperAPI handles the toughest parts of web scraping—IP rotation, user-agent spoofing, CAPTCHA solving, and even JavaScript rendering—so you can focus on getting the data instead of fighting anti-bot defenses.
  • Transform: With libraries like BeautifulSoup and Pandas, you can clean and standardize your data for analysis using Python:
    • Parse price fields, stripping currency symbols and converting values to a numeric format.
    • Standardize inconsistent text (e.g., “3 bdr”, “three beds”) into a single numeric format (e.g., 3).
    • Normalize square footage to a consistent unit and data type.
    • Handle missing values for features such as balconies or garages.
    • Identify and remove duplicate listings that may appear due to frequent site updates (a one-line pandas sketch follows this list).
  • Load: Once the data is cleaned and transformed, use Pandas to export it into a structured format like a CSV for reporting or analysis, or load it directly into a database for long-term storage and querying.
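As an example of the duplicate-removal step mentioned above (which the tutorial pipeline below doesn’t implement), de-duplication is a one-liner in pandas; the key columns used here are illustrative:

# Drop rows that repeat the same title and price, keeping the first occurrence
df = df.drop_duplicates(subset=['Title', 'Price'], keep='first')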

With Python and ScraperAPI together, you have a powerful ETL toolkit:

  • ScraperAPI simplifies and hardens the Extract phase.
  • With its rich data handling capabilities, Python covers Transform and Load with flexibility and precision.

This ETL pipeline ensures the data you scrape is accurate, consistent, and ready to use, whether you’re analyzing market trends or building a real estate dashboard.

Project Requirements

Before diving into the integration, make sure you have the following:

1. A ScraperAPI Account: Sign up on the ScraperAPI website to get your API key. ScraperAPI will handle proxy rotation, CAPTCHA solving, and JavaScript rendering, making the extraction phase a breeze. You’ll receive 5,000 free API credits when you sign up for a seven-day trial, starting whenever you’re ready.  

2. A Python Environment: Ensure Python (version 3.7+ recommended) is installed on your system. You’ll also need to install key libraries:

  • requests: For making HTTP requests to ScraperAPI.
  • beautifulsoup4: For parsing HTML and XML content.
  • pandas: For data manipulation and cleaning.
  • python-dotenv: To load your credentials from your .env file and manage your API key securely.
  • lxml (optional but recommended): A fast and efficient XML and HTML parser that BeautifulSoup can use.

You can install them using pip with this command:

pip install requests beautifulsoup4 pandas lxml python-dotenv

3. Basic Web Scraping Knowledge: A foundational understanding of HTML structure, CSS selectors, and how web scraping works will be beneficial.

4. An IDE or Code Editor: Such as VS Code, PyCharm, or Jupyter Notebook for writing and running your Python scripts.

 

TL;DR

For those in a hurry, here’s the full scraper we’re going to build:

import os
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# === Load environment variables from .env file ===
load_dotenv()
SCRAPER_API_KEY = os.getenv('SCRAPER_API_KEY')
IDEALISTA_URL = os.getenv('IDEALISTA_URL')
SCRAPER_API_URL = f"http://api.scraperapi.com/?api_key={SCRAPER_API_KEY}&url={IDEALISTA_URL}"


# === Extract ===
def extract_data(url):
    response = requests.get(url)
    extracted_data = []

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        listings = soup.find_all('article', class_='item')

        for listing in listings:
            title = listing.find('a', class_='item-link').get('title')
            price = listing.find('span', class_='item-price').text.strip()

            item_details = listing.find_all('span', class_='item-detail')
            bedrooms = item_details[0].text.strip() if item_details and item_details[0] else "N/A"
            area = item_details[1].text.strip() if len(item_details) > 1 and item_details[1] else "N/A"

            description = listing.find('div', class_='item-description')
            description = description.text.strip() if description else "N/A"

            tags = listing.find('span', class_='listing-tags')
            tags = tags.text.strip() if tags else "N/A"

            images = [img.get("src") for img in listing.find_all('img')] if listing.find_all('img') else []

            extracted_data.append({
                "Title": title,
                "Price": price,
                "Bedrooms": bedrooms,
                "Area": area,
                "Description": description,
                "Tags": tags,
                "Images": images
            })
    else:
        print(f"Failed to extract data. Status code: {response.status_code}")

    return extracted_data


# === Transform ===
def transform_data(data):
    df = pd.DataFrame(data)

    df['Price'] = (
        df['Price']
        .str.replace('€', '', regex=False)
        .str.replace(',', '', regex=False)
        .str.strip()
        .astype(float)
    )

    def extract_bedrooms(text):
        match = re.search(r'\d+', text)
        return int(match.group()) if match else None

    df['Bedrooms'] = df['Bedrooms'].apply(extract_bedrooms)

    df['Area'] = (
        df['Area']
        .str.replace('m²', '', regex=False)
        .str.replace(',', '.', regex=False)
        .str.strip()
        .astype(float)
    )

    df.dropna(subset=['Price', 'Bedrooms', 'Area'], inplace=True)
    df = df[df['Bedrooms'] == 3]

    return df


# === Load ===
def load_data(df, filename='three_bedroom_houses.csv'):
    df.to_csv(filename, index=False)
    print(f"Saved {len(df)} listings to {filename}")


# === Main pipeline ===
def main():
    print("Starting ETL pipeline for Idealista listings...")

    raw_data = extract_data(SCRAPER_API_URL)
    if not raw_data:
        print("No data extracted. Check your API key or target URL.")
        return

    print(f"Extracted {len(raw_data)} listings.")

    cleaned_data = transform_data(raw_data)
    print(f"{len(cleaned_data)} listings after cleaning and filtering.")

    load_data(cleaned_data)


if __name__ == "__main__":
    main()

Want to see how we built it? Keep reading!

Building a Real Estate ETL Pipeline with ScraperAPI and Python

In this section, we’ll build a working ETL pipeline that scrapes real estate listings from Idealista using ScraperAPI, cleans the data with Python, and saves it in a structured CSV file. We’ll walk through each part of the process—extracting the data, transforming it into a usable format, and loading it for analysis—so you’ll have a complete and reusable workflow by the end.

Step 1: Extracting the Data Using ScraperAPI

Most real estate websites are known for blocking scrapers, which makes it challenging to collect data at any meaningful scale. For that reason, we’ll send our get() requests through ScraperAPI, effectively bypassing Idealista’s anti-scraping mechanisms without complicated workarounds.
For this guide, we’ll update an existing ScraperAPI real estate project to demonstrate the integration. You can find the complete guide on scraping Idealista with Python here.

import requests
from bs4 import BeautifulSoup

scraper_api_key = 'YOUR_SCRAPERAPI_KEY'  # Replace with your ScraperAPI key
idealista_query = "https://www.idealista.com/en/venta-viviendas/barcelona-barcelona/"
scraper_api_url = f'http://api.scraperapi.com/?api_key={scraper_api_key}&url={idealista_query}'

response = requests.get(scraper_api_url)

# Create a list to store the extracted information
extracted_data = []

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract each house listing post
    house_listings = soup.find_all('article', class_='item')

    # Loop through each house listing and extract its information
    for listing in house_listings:
        # Extract the title and price
        title = listing.find('a', class_='item-link').get('title')
        price = listing.find('span', class_='item-price').text.strip()

        # Find all span elements with class 'item-detail'
        item_details = listing.find_all('span', class_='item-detail')

        # Extract bedrooms and area from the item details
        bedrooms = item_details[0].text.strip() if item_details and item_details[0] else "N/A"
        area = item_details[1].text.strip() if len(item_details) > 1 and item_details[1] else "N/A"

        # Extract the description and tags, falling back to "N/A" when missing
        description = listing.find('div', class_='item-description').text.strip() if listing.find('div', class_='item-description') else "N/A"
        tags = listing.find('span', class_='listing-tags').text.strip() if listing.find('span', class_='listing-tags') else "N/A"

        # Extract image URLs
        image_elements = listing.find_all('img')
        images = [img.get("src") for img in image_elements] if image_elements else []

        # Store the extracted information in a dictionary
        listing_data = {
            "Title": title,
            "Price": price,
            "Bedrooms": bedrooms,
            "Area": area,
            "Description": description,
            "Tags": tags,
            "Images": images
        }

        # Append the dictionary to the list
        extracted_data.append(listing_data)

The code above scrapes and parses real estate listings from Idealista using ScraperAPI and BeautifulSoup. It begins by building the ScraperAPI request URL with your API key and the target Idealista URL, then sends an HTTP GET request to it. If the request is successful, the HTML is parsed with BeautifulSoup, and the script locates all <article> elements with the class "item" (which represent property listings). It then loops through each listing to extract key details: title, price, number of bedrooms, area, description, tags, and image URLs.
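A note on request construction: passing the parameters through requests’ params argument keeps the target URL properly encoded, and ScraperAPI’s optional render parameter can be switched on for pages that only load their content via JavaScript (check the current ScraperAPI docs and your plan for rendering support). A sketch:

import requests

scraper_api_key = 'YOUR_SCRAPERAPI_KEY'
idealista_query = "https://www.idealista.com/en/venta-viviendas/barcelona-barcelona/"

# Let requests handle URL-encoding of the target URL;
# 'render' asks ScraperAPI to execute JavaScript before returning the HTML
payload = {
    'api_key': scraper_api_key,
    'url': idealista_query,
    'render': 'true',
}
response = requests.get('http://api.scraperapi.com/', params=payload)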

Step 2: Transforming the Data (Data Cleaning)

After extracting raw data from Idealista, the next step is to clean and prepare it. To make this data more useful, we’ll use pandas, a powerful Python library for data analysis. If you’ve never used pandas before, think of it like Excel—only it’s in Python and is more flexible.
In Step 1, we stored each listing in a dictionary and added those dictionaries to a list called extracted_data. Here’s what that list might look like:

[
    {
        "Title": "Spacious apartment in central Barcelona",
        "Price": "€350,000",
        "Bedrooms": "3 bdr",
        "Area": "120 m²",
        "Description": "...",
        "Tags": "Luxury",
        "Images": [...]
    },
    ...
]

Now we’ll use pandas to convert that list into a structured DataFrame (a table-like object), then clean each column step by step.

import pandas as pd

# Convert raw listing data to a DataFrame
df = pd.DataFrame(extracted_data)

# View the raw data
print(df.head())
  • pd.DataFrame(...) creates a DataFrame from a list of dictionaries. Each dictionary becomes a row; each key becomes a column.
  • .head() shows the first five rows — useful for checking structure and data types.

The price values are strings like "€350,000". We’ll remove symbols and formatting to convert them to numeric values.

df['Price'] = (
    df['Price']
    .str.replace('€', '', regex=False)   # Remove the euro symbol
    .str.replace(',', '', regex=False)   # Remove comma separators
    .str.strip()                         # Remove leading/trailing whitespace
    .astype(float)                       # Convert strings to float
)


print(df['Price'].head()) # Display the first few prices to verify conversion
  • .str.replace(old, new) modifies string values in a column.
  • .str.strip() removes unnecessary spaces from both ends.
  • .astype(float) changes the column type from string to float so we can perform numerical operations later.

Listings may include text like "3 bdr" or "two beds". We’ll extract just the number of bedrooms as an integer using a regex function with .apply().

import re

def extract_bedrooms(text):
    match = re.search(r'\d+', text)  # Find the first sequence of digits
    return int(match.group()) if match else None

df['Bedrooms'] = df['Bedrooms'].apply(extract_bedrooms)


print(df['Bedrooms'].head()) # Display the first few bedroom counts to verify conversion
  • .apply() runs a function on each element in the column.
  • re.search(r'\d+', text) looks for the first group of digits.
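One caveat: the regex above only matches digits, so a value like "two beds" from the earlier example would come back as None and be dropped later. If spelled-out counts matter for your data, a small word-to-number lookup could extend the function; here’s a sketch (the mapping is illustrative and not part of the original pipeline):

WORD_TO_NUMBER = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}

def extract_bedrooms(text):
    match = re.search(r'\d+', text)  # Digits first, e.g., "3 bdr"
    if match:
        return int(match.group())
    for word, number in WORD_TO_NUMBER.items():  # Then spelled-out counts, e.g., "two beds"
        if word in text.lower():
            return number
    return None

df['Bedrooms'] = df['Bedrooms'].apply(extract_bedrooms)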

This cleans and standardizes the bedroom count into integers.

Area values include units like "120 m²". We’ll remove those and convert the column to float.

df['Area'] = (
    df['Area']
    .str.replace('m²', '', regex=False)  # Remove unit
    .str.replace(',', '.', regex=False)  # Convert comma to dot for decimal values
    .str.strip()                         # Clean up whitespace
    .astype(float)                       # Convert to float
)


print(df['Area'].head())  # Display the first few areas to verify conversion

This ensures all values in the “Area” column are consistent numerical types so that we can sort, filter, or calculate metrics like price per square meter.
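For instance, once Price and Area are both numeric, price per square meter is a single vectorized division (the new column name is just an example):

# Derive price per square meter from the cleaned columns
df['PricePerSqm'] = df['Price'] / df['Area']

print(df[['Title', 'PricePerSqm']].head())  # Spot-check the derived metric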

Some listings may be missing essential values. We’ll drop rows with missing data in key columns. You can choose which columns are crucial and should not have any missing values.

df.dropna(subset=['Price', 'Bedrooms', 'Area'], inplace=True)
  • .dropna() removes rows with NaN (missing) values.
  • The subset argument limits this check to specific columns; you can add other columns here if needed.
  • inplace=True modifies the DataFrame directly without needing to reassign it.
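If you want to see how much you’re about to drop, a quick missing-value count per key column helps before calling .dropna():

print(df[['Price', 'Bedrooms', 'Area']].isna().sum())  # Missing values per column
print(len(df))                                         # Row count before dropping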

To work with only listings that have exactly 3 bedrooms (optional):

df = df[df['Bedrooms'] == 3]
  • df[condition] filters rows based on a condition.
  • Here, we’re checking where the “Bedrooms” column equals 3, and updating df to only include those rows.

At this point, your data is structured similarly to this:

Title                          Price (€)   Bedrooms   Area (m²)
“Modern flat in Eixample”      310000.00   3          95.0
“Loft with terrace in Gracia”  275000.00   2          82.0

This cleaned DataFrame is now ready for analysis or export. In the next step, we’ll load it into a CSV file.

Step 3: Loading Cleaned Data into CSV (Storing)

With your data now cleaned and structured in a pandas DataFrame, the final step is to persist it, meaning you save it somewhere so it can be reused, shared, or analyzed later.

The CSV file is the most common and beginner-friendly format for storing tabular data. It’s a simple text file where each row is a line and commas separate each column. Most tools—Excel, Google Sheets, data visualization tools, and programming languages—can open and process CSV files efficiently.

You can save your DataFrame to a CSV with just one line of code:

# Save the cleaned DataFrame to a CSV file
df.to_csv('three_bedroom_houses.csv', index=False)
  • df.to_csv(...) is a pandas method that writes your DataFrame to a CSV file.
  • 'three_bedroom_houses.csv' is the file name that will be created (or overwritten).
  • index=False tells pandas not to write the DataFrame index (row numbers) to the file, which keeps it clean unless you explicitly need it.

Once this is done, you’ll see a new file in your working directory (where your script is running). Here’s what a few lines of that file might look like:

Title,Price,Bedrooms,Area,Description,Tags,Images
"Flat / apartment in calle de Bailèn, La Dreta de l'Eixample, Barcelona",675000.0,3,106.0,"Magnificent and quiet brand new refurbished flat in Eixample.
This ready-to-live-in flat enjoys a fantastic location very close to the popular Paseo Sant Joan and the pedestrian street Consell de Cent. It is a very pleasant urban environment in which to live in the neighbourhood, with numerous services, shops, restau",N/A,"['https://img4.idealista.com/blur/480_360_mq/0/id.pro.es.image.master/dd/d0/85/1326281103.jpg', 'https://st3.idealista.com/b1/b8/d4/bcn-advisors.gif']"

You can open it in:

  • Excel: Just double-click the file.
  • Google Sheets: Upload the file and import it as a spreadsheet.
  • Another Python script: Using pd.read_csv() (see the example below)
  • Visualization tools: Like Power BI, Tableau, or even Jupyter notebooks.
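For example, loading the file back into pandas from another script takes two lines:

import pandas as pd

df = pd.read_csv('three_bedroom_houses.csv')  # Reload the cleaned listings
print(df.head())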

If you’re working with a larger dataset later or need better performance, consider saving to a database. But for now, CSV is ideal.
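If you do outgrow CSV, pandas can write straight to a database. Here’s a minimal sketch using Python’s built-in sqlite3 module (the file and table names are illustrative):

import sqlite3
import pandas as pd

df = pd.read_csv('three_bedroom_houses.csv')  # Or use the DataFrame already in memory

# Write the listings to a local SQLite database table
conn = sqlite3.connect('listings.db')
df.to_sql('three_bedroom_houses', conn, if_exists='replace', index=False)
conn.close()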

Step 4: Finalizing the ETL Pipeline

Now that your scraper works and your data is clean, it’s time to turn your code into a proper ETL pipeline. This makes it easier to maintain, reuse, schedule, or extend. We’ll do two things here:

1. Modularize the script into extract, transform, and load functions

2. Move sensitive info like your ScraperAPI key and target URL to environment variables using the python-dotenv package

This final version is production-friendly, secure, and easy to build on.

First, install python-dotenv if you don’t already have it:

pip install python-dotenv

Next, create a .env file in your project directory and add any sensitive information:

SCRAPER_API_KEY=your_scraperapi_key_here
IDEALISTA_URL=https://www.idealista.com/en/venta-viviendas/barcelona-barcelona/

Here’s your final pipeline script, with the code restructured and organized into separate functions:

import os
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# === Load environment variables from .env file ===
load_dotenv()
SCRAPER_API_KEY = os.getenv('SCRAPER_API_KEY')
IDEALISTA_URL = os.getenv('IDEALISTA_URL')
SCRAPER_API_URL = f"http://api.scraperapi.com/?api_key={SCRAPER_API_KEY}&url={IDEALISTA_URL}"


# === Extract ===
def extract_data(url):
    response = requests.get(url)
    extracted_data = []

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        listings = soup.find_all('article', class_='item')

        for listing in listings:
            title = listing.find('a', class_='item-link').get('title')
            price = listing.find('span', class_='item-price').text.strip()

            item_details = listing.find_all('span', class_='item-detail')
            bedrooms = item_details[0].text.strip() if item_details and item_details[0] else "N/A"
            area = item_details[1].text.strip() if len(item_details) > 1 and item_details[1] else "N/A"

            description = listing.find('div', class_='item-description')
            description = description.text.strip() if description else "N/A"

            tags = listing.find('span', class_='listing-tags')
            tags = tags.text.strip() if tags else "N/A"

            images = [img.get("src") for img in listing.find_all('img')] if listing.find_all('img') else []

            extracted_data.append({
                "Title": title,
                "Price": price,
                "Bedrooms": bedrooms,
                "Area": area,
                "Description": description,
                "Tags": tags,
                "Images": images
            })
    else:
        print(f"Failed to extract data. Status code: {response.status_code}")

    return extracted_data


# === Transform ===
def transform_data(data):
    df = pd.DataFrame(data)

    df['Price'] = (
        df['Price']
        .str.replace('€', '', regex=False)
        .str.replace(',', '', regex=False)
        .str.strip()
        .astype(float)
    )

    def extract_bedrooms(text):
        match = re.search(r'\d+', text)
        return int(match.group()) if match else None

    df['Bedrooms'] = df['Bedrooms'].apply(extract_bedrooms)

    df['Area'] = (
        df['Area']
        .str.replace('m²', '', regex=False)
        .str.replace(',', '.', regex=False)
        .str.strip()
        .astype(float)
    )

    df.dropna(subset=['Price', 'Bedrooms', 'Area'], inplace=True)
    df = df[df['Bedrooms'] == 3]

    return df


# === Load ===
def load_data(df, filename='three_bedroom_houses.csv'):
    df.to_csv(filename, index=False)
    print(f"Saved {len(df)} listings to {filename}")


# === Main pipeline ===
def main():
    print("Starting ETL pipeline for Idealista listings...")

    raw_data = extract_data(SCRAPER_API_URL)
    if not raw_data:
        print("No data extracted. Check your API key or target URL.")
        return

    print(f"Extracted {len(raw_data)} listings.")

    cleaned_data = transform_data(raw_data)
    print(f"{len(cleaned_data)} listings after cleaning and filtering.")

    load_data(cleaned_data)


if __name__ == "__main__":
    main()

With this final step, your scraper is now:

  • Modular and easy to update
  • Secure, with API keys safely stored in environment variables
  • Ready to scale, automate, or plug into larger data workflows

You now have a reusable, scalable workflow for scraping and analyzing real estate listings!

Use Cases for ScraperAPI and Python’s Data Cleaning Integration

Now that you’ve seen how ScraperAPI and Python work together to extract and clean real estate data, let’s explore how this powerful combination can be used across industries. The ETL workflow—Extract, Transform, Load—is flexible and scalable, making it useful for many data-driven projects.

Here are several practical applications where this integration excels:

1. Sentiment analysis: You can look at how language affects buyer interest by scraping property descriptions or user reviews. After cleaning the text with Python, sentiment analysis tools like TextBlob or VADER can score the tone as positive, neutral, or negative. This makes it possible to see whether listings that use appealing terms like “spacious” or “modern” tend to sell faster or command higher prices (a short code sketch follows this list).

2. Trend Monitoring: Running your scraper regularly helps build a dataset that captures how property prices and features change over time. It’s easier to visualize trends and track how specific market segments are evolving by structuring the data around key attributes like location, number of bedrooms, or property type.

3. Competitor Research: Scraping listings from multiple real estate platforms gives you a direct view of how competitors price and position similar properties. With standardized data, you can compare pricing strategies, listing frequency, and included features to identify market gaps or specific areas where your offering could stand out.

4. Community Insights: Collecting data from forums, review sites, or social media conversations can reveal what buyers and renters care about. After cleaning and processing the text, analysis can uncover common priorities, such as proximity to schools, demand for green space, or concerns about noise, which can inform development and marketing decisions.
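To make the first use case concrete, here’s a rough sketch of scoring the Description column with TextBlob (assuming the textblob package is installed and df is the cleaned DataFrame from the pipeline above):

from textblob import TextBlob  # pip install textblob

def description_sentiment(text):
    # Polarity ranges from -1 (very negative) to 1 (very positive)
    return TextBlob(str(text)).sentiment.polarity

df['Sentiment'] = df['Description'].apply(description_sentiment)
print(df[['Title', 'Sentiment']].head())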

Wrapping Up

Integrating ScraperAPI with data-cleaning pipelines creates a powerful setup for working with web data. ScraperAPI takes care of the tricky parts of scraping—like CAPTCHAs, IP blocks, and JavaScript rendering—so you can reliably extract data at scale. On the other side, Python helps you clean and organize that data, making sure it’s accurate, consistent, and ready for analysis. This combination saves time and makes it easier to get real insights from messy, real-world data.

In this tutorial, we walked through the process of:

  • Extracting real estate listings from Idealista using ScraperAPI
  • Transforming the data by standardizing data types, removing unwanted characters and empty values, and filtering for three-bedroom listings
  • Loading the cleaned data into a structured CSV file for easy sharing and analysis

If you’d like to try it for yourself, you can sign up for a free ScraperAPI account and get 5,000 API credits to start scraping right away. It’s a great way to test the waters and see how it fits into your data workflows.

Until next time, happy scraping!

FAQs

Why integrate ScraperAPI into an ETL pipeline?

Integrating ScraperAPI into your ETL pipeline simplifies data extraction by handling anti-scraping mechanisms like IP bans, CAPTCHAs, and JavaScript rendering. This ensures uninterrupted data collection, even from complex or heavily protected websites. ScraperAPI also reduces the need for manual workarounds, allowing you to focus on transforming and analyzing the data.

How do you ensure data accuracy in an ETL pipeline?

To ensure data accuracy in your ETL pipeline, start by validating the extracted data using Python tools like pandas to check for missing values, duplicates, or inconsistent formats. Clean the data by standardizing date formats, currency symbols, and numeric values. Regularly test your scraping logic to ensure it adapts to website structure changes. Always review a sample of the scraped output to manually confirm that the data matches expectations before scaling your pipeline.

What types of data can ScraperAPI extract for ETL pipelines?

ScraperAPI can extract a wide variety of data types for ETL pipelines, including plain text such as product descriptions, blog content, or property listings; numerical data like prices, ratings, and financial figures; media files including images and videos; structured data such as HTML tables, lists, and JSON or XML feeds; and dynamic content loaded via JavaScript or AJAX. This flexibility suits everything from basic web scraping to complex data aggregation projects.

About the author


Ize Majebi

Ize Majebi is a Python developer and data enthusiast who delights in unraveling code intricacies and exploring the depths of the data world. She transforms technical challenges into creative solutions, possessing a passion for problem-solving and a talent for making the complex feel like a friendly chat. Her ability brings a touch of simplicity to the realms of Python and data.
