Web scraping can be an arduous task at times. You might often feel as though you’re spending too much time trying to keep your scrapers online and unblocked, and not enough time actually getting to grips with the data they collect. Fortunately, for every potential problem there is a solution, and one of the most fundamental tools in any developer’s kit is the humble proxy. Understanding and utilising proxies properly is key to beating the IP blacklist and maintaining scraper uptime. Here are a few examples of different proxies to help you understand what to look for in a proxy service and what to avoid.
A Free Proxy, or Public Proxy (sometimes called an Open Proxy), is a publicly accessible proxy server that anyone can use: the kind you’ll find on a public proxy list from a simple Google search. This is likely most people’s first encounter with a proxy server, and while the promise of anonymous browsing can be enticing, in most cases it is too good to be true. With a public proxy you can never be sure that the server isn’t keeping a log of your activity and connections, which defeats the purpose of masking your IP in the first place. Add to that the client-side scripts that sites often employ to work out a user’s real IP address, and you can find yourself in a situation where running a proxy has done nothing at all to mask your IP.
On top of that, IPs found on public listings are almost always swiftly blacklisted by sites, and these kinds of proxies are often associated with sporadic uptime and unreliability. You’ll quickly realise that a free proxy is no fit at all if you’re trying to make any meaningful progress web scraping.
A Shared Proxy is the next step up in security from a free proxy. Likely the cheapest paid option on the market, a shared proxy does offer anonymity online, but it’s aimed more at the private individual than at someone looking for a lot of IPs to cycle through. In a shared system a pool of clients all have access to the same private proxies. Anonymity is essentially guaranteed, but because you’re sharing the IPs with other users, you might find yourself unpredictably restricted on some sites due to the actions of another user.
If you want to avoid the blacklist, or at least predict when it will happen, you need to eliminate any other activity on the IPs that you use. As such, signing up to a shared proxy for anything other than masking private browsing is asking for trouble.
The next rung up the ladder, a Dedicated Proxy or Private Proxy, is simply a proxy server that is only used by a single individual at any time. The advantage here is reliability and stability: depending on your provider, you’ll be guaranteed either exclusive use of the proxy or that only one user is active on it at any given time. This means you don’t run into the issue that Shared Proxies have, where IPs are blocked or blacklisted seemingly at random because of another user’s activity.
Whilst they offer more stability, we’re still in private browsing territory, as a site can still identify and block a standard IP-forwarding proxy like this with relative ease, especially if it’s making a lot of API requests.
Data Center Proxies are the base level of proxy service that can be useful for someone looking for a higher volume of available IP addresses. These are essentially IP farms located in data centers, which are capable of supplying huge numbers of IP addresses that can be used and discarded quickly. Data centers also tend to have fast networks, so they often boast very quick proxy speeds. Unfortunately, due to the nature of their creation, all the IPs hosted this way share the same sub-network of the data center they came from, and as such can be quite easy for sites to spot and blanket ban.
Ultimately it’s a useful tool, but one that has its clear limitations. The sheer volume alone can be hugely beneficial depending on what you’re using the proxies for, but there are methods available to us that are less prone to sudden bans.
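To see why a shared sub-network makes blanket bans easy, here is a minimal sketch of the kind of check a site could run. It is an illustration only, not a description of how any particular site works: the function names, the /24 prefix and the threshold of 3 are all assumptions, and the sample addresses come from ranges reserved for documentation.

```python
import ipaddress
from collections import Counter


def count_by_subnet(ips, prefix=24):
    """Count how many client addresses fall into each /prefix network."""
    counts = Counter()
    for ip in ips:
        # strict=False accepts a host address and returns its enclosing network
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(network)] += 1
    return counts


def suspicious_subnets(ips, prefix=24, threshold=3):
    """Return networks that account for at least `threshold` of the addresses.

    The threshold is a made-up value for this example.
    """
    counts = count_by_subnet(ips, prefix)
    return sorted(net for net, seen in counts.items() if seen >= threshold)


# Three addresses from one block, one from elsewhere (documentation ranges)
sample = ["203.0.113.5", "203.0.113.17", "203.0.113.200", "198.51.100.9"]
print(suspicious_subnets(sample))  # ['203.0.113.0/24']
```

Once a whole network shows up like this, banning it takes a single rule, which is why a large pool of data center IPs can disappear all at once.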
The Residential Proxy is arguably the most secure form of proxy online, particularly for those of us who want to avoid the blacklists and keep an IP running for as long as possible. Essentially, Residential Proxies are IPs that an ISP assigns to physical devices, so every time one connects to the internet it does so as a genuine device. Because they look like genuine connections, it is incredibly difficult for a site to track and block them, especially if they are used in large networks. That makes them a perfect choice for web scraping, as they offer great uptime and tend to stay off block lists for a long time.
The flipside, however, is that if a residential proxy is used to make lots of erratic requests, a site might flag it and bog it down with CAPTCHA challenges. Additionally, depending on the location of the IP, a user might run into speed issues.
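One way to keep traffic from looking erratic is to pace it. The sketch below is a hypothetical throttle, not a feature of any proxy service: the class name, the interval and the jitter values are assumptions chosen for illustration, and the right pacing will depend on the site you are scraping.

```python
import random
import time


class Throttle:
    """Enforce a minimum gap, plus a little random jitter, between requests."""

    def __init__(self, min_interval=2.0, jitter=0.5):
        self.min_interval = min_interval  # seconds; assumed value
        self.jitter = jitter              # extra random delay, in seconds
        self._last = None                 # monotonic time of the previous call

    def wait(self):
        """Sleep just long enough to respect the gap; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            target = self.min_interval + random.uniform(0, self.jitter)
            remaining = target - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


# Short intervals here only so the demo finishes quickly
throttle = Throttle(min_interval=0.05, jitter=0.02)
for _ in range(3):
    throttle.wait()  # call this immediately before each request
```

The first call returns straight away; every later call waits until the gap since the previous one has passed.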
The final point to make is one of implementation. A proxy is only going to stay active if it can stay under the radar, so you need to switch out the IPs you use frequently to avoid getting anything flagged for suspicious activity. A rotating proxy service does just that: every time you make a request of any kind, it chooses a different proxy IP to make it with. This masks the suspicious nature of heavy traffic and keeps your proxy IPs from being flagged, blacklisted or blocked.
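A commercial rotating proxy service handles this for you, but the idea can be sketched in a few lines. The example below is an assumption-laden illustration rather than any provider’s implementation: the `ProxyRotator` class and the `proxy-*.example` addresses are hypothetical, and it uses simple round-robin selection. It returns a mapping in the shape that the `proxies` argument of the Python `requests` library accepts.

```python
import itertools


class ProxyRotator:
    """Hand out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("need at least one proxy URL")
        self._cycle = itertools.cycle(proxy_urls)

    def next_proxies(self):
        """Return a proxies mapping that uses the next proxy in the pool."""
        url = next(self._cycle)
        # Route both plain and TLS traffic through the same proxy
        return {"http": url, "https": url}


# Placeholder addresses; substitute the pool your provider gives you
PROXY_POOL = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]

rotator = ProxyRotator(PROXY_POOL)
for _ in range(4):
    # With requests installed: requests.get(url, proxies=rotator.next_proxies())
    print(rotator.next_proxies()["http"])
```

After the third request the rotation wraps around to the first proxy again, so a bigger pool means each IP is seen less often by the target site.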
Proxies are far from the only problem to navigate, of course, but we hope that this goes some way to helping you understand the very basics of how proxies work online so you can make more informed choices. If you have a web scraping job you’d like to talk to us about, please contact us and we’ll get back to you within 24 hours. Happy scraping!