Efficiency

Time and time again I read web scraping tutorials that encourage you to execute JavaScript - by running a headless browser - so that every part of the page is loaded before it reaches your HTML parser.

Of course, sometimes there is no escaping it: you do need to act as a browser, follow links and read content dynamically. But this can often be avoided, and it wastes a truckload of resources!

Generally, if you are scraping data from a single website - i.e. you are writing a scraper to collect a highly specific set of data - you don't need to run JavaScript at all.

You can just send plain old HTTP requests (GET and POST) to the website or its API endpoints.
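As a rough sketch of what "plain old HTTP requests" can look like, here is a GET and a POST built with nothing but Python's standard library. The base URL, paths and parameters are made up for illustration; swap in whatever the site you are scraping actually uses.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, used for illustration only.
BASE_URL = "https://example.com/api"


def build_get(path, params):
    """Build a GET request with the parameters encoded in the query string."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(f"{BASE_URL}{path}?{query}", method="GET")


def build_post(path, payload):
    """Build a POST request that carries a JSON-encoded body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `send(build_get("/items", {"page": 1}))` would then fetch the page in a single round trip, with no browser involved. Building the request separately from sending it also makes it easy to inspect exactly what will go over the wire.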

Many modern web applications consist of a JSON API and a JavaScript front end that consumes it. You can use this to your advantage by interfacing directly with the API endpoints.

This is not only far more efficient (there is no browser to launch and no scripts, images or stylesheets to download and render), you will also often find the data already nicely structured as JSON, which you can plug directly into your program.
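To illustrate the "plug directly into your program" part, here is a minimal sketch that turns a JSON response body into plain Python records. The response shape and the field names (`items`, `id`, `name`, `price`) are assumptions invented for this example; a real API will have its own.

```python
import json

# Hypothetical response body, shaped like something a JSON API might return.
SAMPLE_BODY = """
{
  "items": [
    {"id": 1, "name": "Widget", "price": "9.99"},
    {"id": 2, "name": "Gadget", "price": "24.50"}
  ],
  "next_page": 2
}
"""


def extract_items(body):
    """Parse a JSON body and keep only the fields we care about."""
    data = json.loads(body)
    return [
        {
            "id": item["id"],
            "name": item["name"],
            # Prices arrive as strings in this made-up payload.
            "price": float(item["price"]),
        }
        for item in data.get("items", [])
    ]


for record in extract_items(SAMPLE_BODY):
    print(record["id"], record["name"], record["price"])
```

Compare this with the HTML route, where you would first have to find the right elements with selectors and then clean up the text inside them.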

So stop emulating a browser, and start learning how to interface directly with API endpoints.

A good way to start is to open the Chrome or Firefox developer tools and click on the Network tab. If you are using a dynamic application, you will be able to observe the requests and responses between your browser and the web app's server. Some of these responses probably contain the data you are after. Right-click on a request, copy it as a curl command, and you are already halfway there :-)
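For the other half, one option is to translate the copied curl command into a request your scraper can send. The helper below is a deliberately small sketch, not a full curl parser: `curl_to_request` is a name made up for this example, it only understands the URL, `-X`, `-H` and `-d` / `--data` / `--data-raw`, and it assumes that any other flag (such as `--compressed`) takes no value.

```python
import shlex
import urllib.request


def curl_to_request(command):
    """Convert a simple curl command into a urllib Request (limited sketch)."""
    # Join the backslash-newline continuations found in copied commands.
    tokens = shlex.split(command.replace("\\\n", " "))
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")

    url, method, data, headers = None, None, None, {}
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
        elif token in ("-X", "--request"):
            method = tokens[i + 1]
            i += 2
        elif token in ("-d", "--data", "--data-raw"):
            data = tokens[i + 1].encode("utf-8")
            i += 2
        elif token.startswith("-"):
            i += 1  # Unsupported flag: skipped, assumed to take no value.
        else:
            url = token
            i += 1

    if url is None:
        raise ValueError("no URL found in curl command")
    # With method=None, urllib picks POST when a body is present, else GET.
    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

The returned object can be passed straight to `urllib.request.urlopen`. From there it is usually worth trimming the copied headers down to the few the server actually requires.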

