Website owners often want to block web scrapers for a variety of reasons, particularly when they have spent a large amount of money generating the data being scraped.

If you can see it, then you can scrape it

The truth is, 'if you can see it, you can scrape it'. Despite all the anti-bot measures put in place, even complex CDNs such as Cloudflare, an advanced scraper can bypass blocking technology and scrape any content a human could see using a normal browser. This is why many companies have resorted to suing those scraping their services rather than attempting to block them with ever more elaborate anti-scraping measures.
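
To illustrate the point, a scraper that drives a real browser receives exactly the rendered page a human visitor would, JavaScript and all. Below is a minimal sketch using Playwright; the target URL is a placeholder:

```python
# A minimal sketch of a browser-based scraper using Playwright.
# Because it drives a real Chromium instance, it executes JavaScript
# and receives the same rendered HTML a human visitor would see.
# The target URL is a placeholder, not a real endpoint.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/some-page")
    html = page.content()  # the fully rendered DOM, after JavaScript runs
    browser.close()

print(html)
```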

You can implement anti-scraping measures with a range of techniques:

  • Requiring a user account to access the data (though a bot can still create accounts, so sign-up itself needs anti-bot protection)
  • Blocking certain user agents or other suspicious headers (see the sketch after this list)
  • Blocking clients that don't execute JavaScript
  • Using CAPTCHAs (server-side verification is sketched below)
  • Blocking or rate limiting IPs that send too many requests in a short period (also covered in the sketch below)
  • Blocking known bad IPs and proxies associated with scraping
  • Using a CDN with built-in anti-bot technology, such as Cloudflare
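
Several of these techniques fit naturally into a single piece of middleware. Below is a minimal sketch of user-agent blocking and per-IP rate limiting, assuming a Flask application; the blocklist, window length, and request threshold are illustrative values, not recommendations:

```python
# A minimal sketch of user-agent blocking and per-IP rate limiting,
# assuming a Flask application. The blocklist and limits below are
# illustrative values, not recommendations.
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED_AGENT_SUBSTRINGS = ("python-requests", "curl", "scrapy")  # illustrative
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 100

request_log: dict[str, deque] = defaultdict(deque)  # client IP -> timestamps

@app.before_request
def block_scrapers():
    # Reject clients with a missing or known-bad User-Agent header.
    agent = (request.headers.get("User-Agent") or "").lower()
    if not agent or any(bad in agent for bad in BLOCKED_AGENT_SUBSTRINGS):
        abort(403)

    # Sliding-window rate limit per client IP.
    now = time.time()
    window = request_log[request.remote_addr]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    if len(window) > MAX_REQUESTS_PER_WINDOW:
        abort(429)  # Too Many Requests

@app.route("/data")
def data():
    return {"rows": []}
```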

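For CAPTCHAs, the challenge widget runs client-side, but the result must be verified server-side before the protected data is returned. The sketch below assumes Google reCAPTCHA, whose siteverify endpoint confirms the token a client submits; RECAPTCHA_SECRET is a placeholder for your own key:

```python
# A sketch of server-side CAPTCHA verification, assuming Google
# reCAPTCHA: the client-side widget produces a token, and the server
# confirms it with Google's siteverify endpoint before serving data.
# RECAPTCHA_SECRET is a placeholder for your own secret key.
import requests

RECAPTCHA_SECRET = "your-secret-key"  # placeholder

def captcha_passed(token: str, client_ip: str) -> bool:
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token, "remoteip": client_ip},
        timeout=10,
    )
    return resp.json().get("success", False)
```
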
The above methods will block many simple scrapers, but they will not stop all advanced ones. The only way to guarantee that no one scrapes your data is to keep it off the internet entirely. You can, however, make scraping incredibly challenging by combining several of the techniques above.