February 10, 2019
At my company Loceye, we are scraping the web all day long to create datasets and rule the world. Or kind of.
Recently I stumbled upon multi-processing and multi-threading. When I started building our scraping scripts with Selenium, I used threads without a second thought. Then it struck me that I had never considered processes at all.
Was my initial code the best I could have in performance?
Yes, it was. Multi-threading is the right answer for running multiple driver instances. If you want to understand why, keep reading.
In Python we have two ways of running pieces of code concurrently: processes and threads. Python handles them quite differently, so each comes with its own kind of benefits.
If we submit “jobs” to different threads, those jobs can be pictured as “sub-tasks” of a single process, and those threads will usually have access to the same memory areas (i.e., shared memory). This approach can easily lead to conflicts in case of improper synchronization, for example, if two threads write to the same memory location at the same time. In our case with Selenium, we usually don’t care about shared memory.
A safer approach (although it comes with additional overhead from the communication between separate processes) is to submit jobs to multiple processes with completely separate memory locations (i.e., distributed memory): every process runs completely independently of the others.
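To make the shared-memory point concrete, here is a minimal sketch (not part of our scraping code) in which several threads update one counter. The function name `count_with_threads` and its parameters are made up for illustration. The `threading.Lock` is the synchronization that prevents the conflicts described above.

```python
import threading


def count_with_threads(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads (hypothetical helper)."""
    shared = {"counter": 0}      # lives in the single process, visible to every thread
    lock = threading.Lock()      # guards the read-modify-write on the counter

    def worker():
        for _ in range(increments):
            with lock:           # only one thread at a time may update the counter
                shared["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                 # wait until every thread has finished
    return shared["counter"]


if __name__ == "__main__":
    print(count_with_threads())  # 4 threads x 10,000 increments
```

Without the lock, the `+= 1` is not guaranteed to be atomic, so updates from different threads could be lost.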
Briefly some bullets:
Now the differences are clear and we can choose the winner: threads, of course. Here is why.
This is a simple scheme of how Selenium works with Web-Drivers.
When we are running multiple Web-Drivers with Selenium and Python we are doing the following.
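We are creating a web-driver object, which communicates with a browser process. This is purely an I/O-related task, and in I/O tasks threads are the winners, because a thread that is waiting on I/O releases Python’s Global Interpreter Lock and lets the other threads run.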
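A rough way to see why threads suit I/O-bound work is to simulate the waiting. In the sketch below, `time.sleep` stands in for waiting on a browser (an assumption for illustration, not real Selenium traffic), and the names `fake_io_task` and `timed_run` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(delay):
    """Pretend to wait on I/O; sleeping releases the GIL just like real waiting does."""
    time.sleep(delay)
    return delay


def timed_run(num_tasks=5, delay=0.2):
    """Run num_tasks simulated I/O waits, one per thread, and time the whole batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        results = list(pool.map(fake_io_task, [delay] * num_tasks))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = timed_run()
    # Run one after another, the five waits would take about 5 * 0.2 = 1.0 s;
    # with one thread per task they overlap.
    print(f"{len(results)} tasks finished in {elapsed:.2f} s")
```

The waits overlap because none of the threads needs the CPU while it sleeps, which is roughly the situation of a Python thread waiting for a browser to respond.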
Let’s take for example a list of 5 URLs.
URLs = [...]
If we choose threads for each URL, we are going to spawn 5 threads inside the same process but also 5 processes by opening 5 Browser instances to control.
Total = 1 process (Python) with 5 threads + 5 processes (Browsers)
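The "1 process with 5 threads" part of that total can be checked with a small sketch. Nothing is downloaded here; the URLs are placeholders and `inspect_workers` is a hypothetical helper that only records where each job runs.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def inspect_workers(urls):
    """Return (url, process id, thread id) for one job per URL."""
    # The barrier keeps every job alive until all have started,
    # so each job is forced onto its own thread.
    barrier = threading.Barrier(len(urls))

    def worker(url):
        barrier.wait(timeout=5)
        return url, os.getpid(), threading.get_ident()

    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return list(pool.map(worker, urls))


if __name__ == "__main__":
    # Placeholder URLs, assumed for illustration only.
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    rows = inspect_workers(urls)
    print("distinct processes:", len({pid for _, pid, _ in rows}))
    print("distinct threads:  ", len({tid for _, _, tid in rows}))
```

All five jobs report the same process id but different thread ids. The browser processes that Selenium would start come on top of this.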
If we had chosen processes for each URL, we would spawn 5 Python processes, each with its own threads, just to execute the Python code, plus another 5 processes by opening 5 browser instances to control. That is pure overhead for our OS. Moreover, don’t forget that each process allocates its own isolated memory space!
Total = 5+1 processes (Python) + 5 processes (Browsers)
Selenium Web-drivers are all about I/O. They are controlling different processes, so their task is not CPU intensive at all. Using multi-processing is an overhead here, hurting our CPU but also eating our RAM.
On the other hand, multi-threading is strongly recommended for I/O-related tasks like Selenium web-drivers, and we can safely use it.
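Putting it together, a thread-per-URL setup could look roughly like the sketch below. It is not our production code: the helper names (`make_chrome_driver`, `fetch_title`, `fetch_titles`) are made up, and it assumes Selenium plus a working Chrome setup are installed. The driver factory is passed in as a parameter so another browser can be swapped in.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def make_chrome_driver():
    """Assumption: selenium and a Chrome driver are available on this machine."""
    from selenium import webdriver  # imported lazily so the sketch loads without selenium
    return webdriver.Chrome()


def fetch_title(url, driver_factory=make_chrome_driver):
    """Open one browser, load one URL, return the page title, always close the browser."""
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # make sure the browser process does not linger


def fetch_titles(urls, driver_factory=make_chrome_driver, max_workers=5):
    """Handle each URL in its own thread; every thread owns its own driver."""
    job = partial(fetch_title, driver_factory=driver_factory)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(job, urls))  # results come back in input order
```

Calling `fetch_titles(URLs)` would then start up to 5 browsers side by side from a single Python process. Each thread creates and quits its own driver, so no driver object is shared between threads.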