Parallel / Concurrent Programming

Parallel programming means running multiple tasks at the same time within a single program.

For example, when web scraping, you may wish to speed things up by sending many requests and processing their responses simultaneously.

This allows you to collect large amounts of data in minutes, not hours - or hours, not days.

For larger scrapes, you may cut the run time from months to just a few days.

Parallel Programming Basics

Generally you will have a section within your code that takes longer than the rest - for example, sending requests and receiving responses.

Let's say you have a list of a million URLs to collect data from, stored in urls[], and a function DoWork(url string) which does the heavy lifting (sending requests, receiving responses, cleaning the data and storing it in a database).

To run these in parallel, you must first set a maximum number of threads. This number is the most threads that will run at any given time, and it must be chosen carefully so as not to overload or crash your computer.

The pseudocode might look something like this:

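max_threads = 100
current_threads = 0
for url in urls {
    // wait while every thread slot is in use
    while current_threads >= max_threads {
        Pause
    }
    current_threads++
    DoWorkInParallel(url) // starts a new thread and returns immediately
}

function DoWorkInParallel(url string) {
    // send requests
    // parse responses
    // when finished, decrement the thread usage variable
    // (current_threads is shared between threads, so guard it with a mutex)
    current_threads--
}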
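
As a rough illustration, the pseudocode above could be sketched in Go along the following lines. This is only one possible approach: it swaps the hand-managed current_threads counter for a buffered channel used as a counting semaphore, and the RunAll helper and the stand-in body of DoWork are assumptions made for the example, not part of the original design.

```go
package main

import (
	"fmt"
	"sync"
)

// DoWork stands in for the heavy lifting (send the request, clean the data,
// store it). In this sketch it only returns the length of the URL.
func DoWork(url string) int {
	return len(url)
}

// RunAll (a hypothetical helper) calls DoWork for every URL while keeping at
// most maxThreads calls in flight. maxThreads must be at least 1.
func RunAll(urls []string, maxThreads int) []int {
	results := make([]int, len(urls))
	sem := make(chan struct{}, maxThreads) // counting semaphore
	var wg sync.WaitGroup

	for i, url := range urls {
		sem <- struct{}{} // blocks ("pauses") while maxThreads workers are busy
		wg.Add(1)
		go func(i int, url string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this worker finishes
			results[i] = DoWork(url) // each goroutine writes only its own index
		}(i, url)
	}

	wg.Wait() // wait for the remaining workers before returning
	return results
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/bb"}
	fmt.Println(RunAll(urls, 100))
}
```

Because the channel send blocks once maxThreads slots are taken, the loop pauses by itself and no shared counter has to be incremented or decremented by hand.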

Mutexes

A mutex (short for "mutual exclusion") is a lock which can only be held by one thread at a time, even when multiple threads want it. It has 'locked' and 'unlocked' states: a thread locks the mutex before touching shared data and unlocks it afterwards, so that the data can be accessed safely from multiple threads.
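
To make this concrete, here is a minimal Go sketch using sync.Mutex from the standard library. The Visited type and the CountUnique helper are hypothetical names invented for this example; the idea is a set of already-scraped URLs that many goroutines update at once.

```go
package main

import (
	"fmt"
	"sync"
)

// Visited is a hypothetical set of URLs shared by many goroutines.
type Visited struct {
	mu   sync.Mutex // guards seen
	seen map[string]bool
}

// MarkNew records url and reports whether this was the first time it was seen.
func (v *Visited) MarkNew(url string) bool {
	v.mu.Lock()         // only one goroutine gets past this point at a time
	defer v.mu.Unlock() // released when the function returns
	if v.seen[url] {
		return false
	}
	v.seen[url] = true
	return true
}

// CountUnique marks every URL from its own goroutine and returns how many
// distinct URLs were recorded.
func CountUnique(urls []string) int {
	v := &Visited{seen: make(map[string]bool)}
	var wg sync.WaitGroup
	for _, url := range urls {
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			v.MarkNew(url)
		}(url)
	}
	wg.Wait()
	return len(v.seen) // safe to read: all goroutines have finished
}

func main() {
	fmt.Println(CountUnique([]string{"a", "b", "a", "c", "b"}))
}
```

Without the lock, concurrent writes to a Go map are not safe, so the mutex is what makes MarkNew usable from many goroutines.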

Common Problems

Deadlocks

A deadlock is a situation where a group of threads are all waiting for each other, so none of them can make progress and nothing ever happens.

This happens when two or more competing threads each hold a resource that another one needs, and each refuses to release its own until it gets the other.
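
One common way to avoid this (not the only one) is to make every thread acquire its locks in the same order. The Go sketch below assumes a hypothetical Account type with one mutex per account; Transfer and RunTransfers are likewise illustrative names. If Transfer simply locked `from` and then `to`, two opposite transfers could each grab one lock and wait forever for the other.

```go
package main

import (
	"fmt"
	"sync"
)

// Account is a hypothetical resource guarded by its own mutex.
type Account struct {
	ID      int
	mu      sync.Mutex
	Balance int
}

// Transfer always locks the account with the lower ID first. Because every
// caller uses the same order, two transfers in opposite directions cannot
// each hold one lock while waiting for the other.
func Transfer(from, to *Account, amount int) {
	if from.ID == to.ID {
		return // same account: nothing to do, and locking twice would hang
	}
	first, second := from, to
	if second.ID < first.ID {
		first, second = second, first
	}
	first.mu.Lock()
	defer first.mu.Unlock()
	second.mu.Lock()
	defer second.mu.Unlock()

	from.Balance -= amount
	to.Balance += amount
}

// RunTransfers moves money back and forth between two accounts from many
// goroutines at once and returns the final balances.
func RunTransfers(n int) (int, int) {
	a := &Account{ID: 1, Balance: 1000}
	b := &Account{ID: 2, Balance: 1000}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); Transfer(a, b, 1) }()
		go func() { defer wg.Done(); Transfer(b, a, 1) }()
	}
	wg.Wait()
	return a.Balance, b.Balance
}

func main() {
	fmt.Println(RunTransfers(100))
}
```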

Starvation

Starvation is a situation where a thread is waiting for a resource, but other threads keep acquiring it first, so the resource stays unavailable to the waiting thread for a long period.

The waiting thread ends up blocked for a long time. To solve this, you may wish to add a time-based priority (i.e. favour older threads) to your program.
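
A simple form of "favour older threads" is a first-come, first-served lock, often called a ticket lock. The Go sketch below is an illustration under that assumption rather than a recommended production design; TicketLock, NewTicketLock and CountWithTicketLock are names made up for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// TicketLock is a sketch of a first-come, first-served lock: each caller
// takes a numbered ticket and is let in strictly in ticket order, so a
// waiting thread cannot be overtaken by newer arrivals.
type TicketLock struct {
	mu      sync.Mutex
	cond    *sync.Cond
	next    uint64 // next ticket to hand out
	serving uint64 // ticket currently allowed to hold the lock
}

func NewTicketLock() *TicketLock {
	t := &TicketLock{}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// Lock takes a ticket and waits until that ticket is being served.
func (t *TicketLock) Lock() {
	t.mu.Lock()
	ticket := t.next
	t.next++
	for t.serving != ticket {
		t.cond.Wait() // releases t.mu while waiting, reacquires on wake-up
	}
	t.mu.Unlock()
}

// Unlock hands the lock to the holder of the next ticket.
func (t *TicketLock) Unlock() {
	t.mu.Lock()
	t.serving++
	t.cond.Broadcast() // wake all waiters; only the next ticket proceeds
	t.mu.Unlock()
}

// CountWithTicketLock increments a shared counter from n goroutines,
// using the ticket lock to protect it.
func CountWithTicketLock(n int) int {
	lock := NewTicketLock()
	counter := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			lock.Lock()
			counter++
			lock.Unlock()
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(CountWithTicketLock(50))
}
```

The trade-off is that strict ordering can cost some throughput, since a thread that is ready to run may still have to wait for its turn.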

Race Conditions

A race condition occurs when the result of a multi-threaded program depends on the timing of threads that access a shared variable without proper synchronisation. For example, if you increment a global variable without using a mutex or similar to lock and unlock around the update, you may end up with a different result each time you run the program.
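
As an example of the "or similar" option, the Go sketch below uses an atomic increment from the standard sync/atomic package instead of a mutex. CountRequests is a hypothetical function written for this illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// CountRequests increments one shared counter from n goroutines and
// returns the final value.
func CountRequests(n int) int64 {
	var total int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// A plain total++ here would be a race: the read, add and write
			// of two goroutines can interleave, and updates can be lost.
			atomic.AddInt64(&total, 1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&total)
}

func main() {
	fmt.Println(CountRequests(1000))
}
```

If you are unsure whether your own code has this kind of bug, running it with Go's built-in race detector (go run -race) can help find unsynchronised accesses.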

