
How to Use ScraperAPI with HtmlUnit in Java

ScraperAPI is a powerful scraping tool that handles proxies, browsers, and CAPTCHAs automatically. In this guide, you’ll learn how to integrate ScraperAPI with HtmlUnit, a fast and lightweight headless browser for Java.

Getting Started

Before we integrate ScraperAPI, here’s a basic HtmlUnit scraping example:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class BasicHtmlUnit {
   public static void main(String[] args) throws Exception {
       WebClient client = new WebClient(BrowserVersion.CHROME);
       client.getOptions().setCssEnabled(false);
       client.getOptions().setJavaScriptEnabled(false);

       HtmlPage page = client.getPage("https://httpbin.org/ip");
       System.out.println(page.asNormalizedText());
       client.close();
   }
}

This works for basic scraping but does not solve problems like IP bans or CAPTCHAs.

Integration Methods

Recommended: API Endpoint Method

The best way to use ScraperAPI with HtmlUnit is to call the API endpoint directly and pass the target URL as a query parameter. This ensures your request routes through ScraperAPI’s proxy network with built-in CAPTCHA handling.
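
Conceptually, HtmlUnit never talks to the target site directly; it requests a ScraperAPI URL that carries the target as a query parameter. Here is a minimal sketch of the URL construction (YOUR_API_KEY and https://example.com are placeholders):

String apiKey = "YOUR_API_KEY";          // placeholder; use your real key
String target = "https://example.com";   // placeholder target site
// URL-encode the target so any query parameters it carries survive the trip
String endpoint = "http://api.scraperapi.com?api_key=" + apiKey
        + "&url=" + java.net.URLEncoder.encode(target, java.nio.charset.StandardCharsets.UTF_8);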

Required Setup

1. Install Java (if not already installed)
# Ubuntu
sudo apt-get update
sudo apt-get install default-jdk 

# macOS
brew install openjdk@21

Then add Java to your shell config (e.g. .zshrc or .bash_profile). The JAVA_HOME below matches a Temurin 21 install; adjust the path to whatever JDK you actually installed:

export JAVA_HOME="/Library/Java/JavaVirtualMachines/temurin-21.jdk/Contents/Home"
export PATH="$JAVA_HOME/bin:$PATH"

Reload your shell:

source ~/.zshrc
# or
source ~/.bash_profile

Confirm Java is installed:

java -version

2. Install Maven
# Ubuntu
sudo apt update
sudo apt install maven

# MacOS
brew install maven

Check:

mvn -v

3. Set Up Project Structure

Create a folder and initialize the Maven project:

mkdir htmlunit-scraperapi && cd htmlunit-scraperapi

Inside, create the structure:

src/
 main/
   java/
      MarketPrice.java

4. Add Dependencies in pom.xml

At the root of your project folder, create a file pom.xml and paste the following:

<project xmlns="http://maven.apache.org/POM/4.0.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
        http://maven.apache.org/xsd/maven-4.0.0.xsd">
  
   <modelVersion>4.0.0</modelVersion>
   <groupId>com.scraperapi</groupId>
   <artifactId>htmlunit-scraperapi</artifactId>
   <version>1.0-SNAPSHOT</version>

   <dependencies>
       <!-- HtmlUnit for headless browser -->
       <dependency>
           <groupId>net.sourceforge.htmlunit</groupId>
           <artifactId>htmlunit</artifactId>
           <version>2.70.0</version>
       </dependency>

       <!-- Java dotenv to read .env variables -->
       <dependency>
           <groupId>io.github.cdimascio</groupId>
           <artifactId>java-dotenv</artifactId>
           <version>5.2.2</version>
       </dependency>
   </dependencies>

   <build>
       <plugins>
           <!-- Plugin to run Java classes with main method -->
           <plugin>
               <groupId>org.codehaus.mojo</groupId>
               <artifactId>exec-maven-plugin</artifactId>
               <version>3.1.0</version>
               <configuration>
                   <mainClass>MarketPrice</mainClass>
               </configuration>
           </plugin>
       </plugins>
   </build>
</project>

5. Add .env File in Root

In the same folder, create a .env file:

SCRAPERAPI_KEY=your_api_key_here

You can get your API key from your ScraperAPI dashboard.

Full Working Code

Paste this inside MarketPrice.java:

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import io.github.cdimascio.dotenv.Dotenv;

import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class MarketPrice {
   public static void main(String[] args) throws IOException {
       // Load ScraperAPI key from .env
       Dotenv dotenv = Dotenv.load();
       String apiKey = dotenv.get("SCRAPERAPI_KEY");

       if (apiKey == null || apiKey.isEmpty()) {
           System.err.println("SCRAPERAPI_KEY is missing in your .env file.");
           return;
       }

        // Target a real HTML site; URL-encode it so any query parameters
        // in the target URL survive the trip through ScraperAPI
        String targetUrl = "https://quotes.toscrape.com";
        String scraperApiUrl = String.format("http://api.scraperapi.com?api_key=%s&url=%s",
                apiKey, URLEncoder.encode(targetUrl, StandardCharsets.UTF_8));

       // Initialize headless browser
       WebClient webClient = new WebClient(BrowserVersion.CHROME);
       webClient.getOptions().setUseInsecureSSL(true);
       webClient.getOptions().setCssEnabled(false);
       webClient.getOptions().setJavaScriptEnabled(false);
       webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
       webClient.getOptions().setThrowExceptionOnScriptError(false);

       // Fetch and parse page
        HtmlPage page = webClient.getPage(scraperApiUrl);

       DomNodeList<DomNode> quoteBlocks = page.querySelectorAll(".quote");

       System.out.println("\n📌 Scraped Quotes from https://quotes.toscrape.com:\n");

       for (DomNode quote : quoteBlocks) {
           String text = quote.querySelector(".text").asNormalizedText();
           String author = quote.querySelector(".author").asNormalizedText();
           DomNodeList<DomNode> tags = quote.querySelectorAll(".tags .tag");

           System.out.println("📝 Quote: " + text);
           System.out.println("👤 Author: " + author);
           System.out.print("🏷️  Tags: ");
           for (DomNode tag : tags) {
               System.out.print(tag.asNormalizedText() + " ");
           }
           System.out.println("\n------------------------------------------\n");
       }

       webClient.close();
   }
}

Make sure your SCRAPERAPI_KEY is set in the .env file before running.

Not Recommended: Proxy Mode

HtmlUnit allows proxy configuration, but ScraperAPI uses query string authentication, which doesn’t work with HtmlUnit’s proxy model.

Why It Fails

  • ScraperAPI needs the API key in the URL query.
  • HtmlUnit proxy setup expects a static IP or basic authentication.
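
For reference, here is roughly what a proxy-mode attempt would look like, using ScraperAPI's documented proxy endpoint (proxy-server.scraperapi.com:8001, username scraperapi, your API key as the password). Treat it as an illustration only, not a supported setup:

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
import com.gargoylesoftware.htmlunit.WebClient;

public class ProxyModeSketch {
    public static void main(String[] args) {
        String apiKey = "YOUR_API_KEY"; // placeholder

        // NOT RECOMMENDED: route requests through ScraperAPI's proxy port
        WebClient webClient = new WebClient(BrowserVersion.CHROME,
                "proxy-server.scraperapi.com", 8001);
        DefaultCredentialsProvider credentials =
                (DefaultCredentialsProvider) webClient.getCredentialsProvider();
        credentials.addCredentials("scraperapi", apiKey); // username is literally "scraperapi"

        // Requests made via webClient.getPage(...) would now tunnel through
        // the proxy; in practice this pairing is brittle with HtmlUnit.
        webClient.close();
    }
}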

Use the API Endpoint method instead.

Optional Parameters

ScraperAPI supports various options, passed as extra query parameters on the request URL:

&render=true            Load JavaScript
&country_code=us        Use a US IP
&premium=true           Enable CAPTCHA solving
&session_number=123     Maintain session across requests
Parameter | What It Does | When to Use It
render=true | Tells ScraperAPI to execute JavaScript | Use for SPAs and dynamic content
country_code=us | Routes requests through US proxies | Great for geo-blocked content
premium=true | Enables CAPTCHA solving and advanced anti-bot measures | Essential for heavily protected sites
session_number=123 | Maintains the same proxy IP across requests | Use when you need to maintain login sessions

These parameters cover most scraping scenarios. Check the ScraperAPI documentation for additional options.

Example:

String scraperApiUrl = String.format(
        "http://api.scraperapi.com?api_key=%s&url=%s&render=true&country_code=us",
        apiKey, URLEncoder.encode(targetUrl, StandardCharsets.UTF_8));
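
Similarly, to keep the same proxy IP across consecutive requests (useful for login flows), append session_number; the value is an arbitrary integer of your choosing:

String scraperApiUrl = String.format(
        "http://api.scraperapi.com?api_key=%s&url=%s&session_number=123",
        apiKey, URLEncoder.encode(targetUrl, StandardCharsets.UTF_8));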

Best Practices

  • Always store your ScraperAPI key in an environment variable
  • Use render=true when targeting JavaScript-heavy sites
  • Avoid using proxy settings in HtmlUnit
  • Implement retry logic when scraping large datasets (see the sketch after this list)
  • Disable JavaScript/CSS for better performance on static pages
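
For the retry point above, here is a minimal helper you could add to MarketPrice. It is only a sketch: it assumes setThrowExceptionOnFailingStatusCode(false) (as in the code earlier) and retries on network errors or non-200 responses with a simple linear backoff:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;

public class RetryFetch {
    // Retries a request a few times before giving up
    static HtmlPage fetchWithRetries(WebClient client, String url, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                HtmlPage page = client.getPage(url);
                if (page.getWebResponse().getStatusCode() == 200) {
                    return page; // success
                }
                System.err.println("Attempt " + attempt + ": HTTP "
                        + page.getWebResponse().getStatusCode());
            } catch (IOException e) {
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
            }
            Thread.sleep(2000L * attempt); // back off a little longer each time
        }
        return null; // caller decides how to handle total failure
    }
}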

Run the Scraper

Run your MarketPrice.java file using:

mvn compile exec:java -Dexec.mainClass=MarketPrice

Expected Output:

Your terminal should display structured quote data like this:

[Screenshot: terminal output listing each scraped quote with its author and tags]

This confirms ScraperAPI handled the request and routed it through its network.
