Useful Web Scraping Linux Tools

When running a web scraper, you probably want to monitor disk usage and system resources (memory/CPU), as well as watch a log file.

You probably also want to run a command for a few hours, or even days, so you should use a terminal multiplexer such as tmux to manage multiple terminal sessions at once. If your connection drops, you can reconnect and reattach to the session later, and you can run several commands side by side.

The following points are explained in more detail on the previous page.

  • To monitor disk usage, use the df -h command.
  • To see current resource usage, use top - or install htop for a more readable format.
  • To watch changes being made to a log file, use tail -f filename.log.
  • You'll need SSH to connect to the server, and probably rsync or scp to copy files to / from it.
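As a sketch of how the disk-usage check might be put to work during a long run, here is a small guard script. The 90% threshold and the suggestion to trim log.file are illustrative assumptions, not part of any standard tool:

```shell
#!/bin/sh
# Hypothetical disk guard: warn before a long-running scraper fills the
# partition. Threshold and log file name are illustrative.
THRESHOLD=90

# Percentage used on the partition holding the current directory.
# df -P forces the portable single-line POSIX output format.
usage=$(df -P . | awk 'NR==2 { gsub(/%/, ""); print $5 }')

if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "disk ${usage}% full - consider trimming log.file or pausing the scraper"
else
    echo "disk ${usage}% full - ok"
fi
```

You could run a script like this from cron, or in a loop in a spare tmux window, instead of eyeballing df -h by hand.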

So, a typical workflow might be:

  1. SSH into your server
  2. Fire up a tmux session: tmux
  3. Run your web scraper ./my_awesome_go_program arg0 arg1 > log.file
  4. Detach from the tmux session: Ctrl-b, then d
  5. View the output of the log file: tail -f log.file
  6. View CPU / RAM usage, and adjust scraper threads as appropriate: htop
  7. Make sure you don't run out of disk space (e.g. if your logging is too verbose and the server has a small disk): df -h
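The run-and-monitor steps above can be sketched end to end. The loop below is a stand-in for the real scraper binary (my_awesome_go_program is project-specific), but the log redirection and inspection commands work the same way:

```shell
# Stand-in for step 3: a placeholder loop instead of the real
# ./my_awesome_go_program, writing progress lines to log.file.
for i in 1 2 3; do
    echo "scraped page $i"
done > log.file

# Step 5 without blocking: tail -f follows the file forever,
# while tail -n just prints the most recent lines and exits.
tail -n 2 log.file

# A quick progress check: count how many lines have been logged so far.
wc -l < log.file
```

In practice you would launch the real binary inside a tmux session and use tail -f log.file to follow it live.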

Check out our Web Scraping Service