At my company Loceye, we are scraping the web all day long to create datasets and rule the world. Or kind of.
Recently I stumbled upon multi-processing and multi-threading. As I started building the scraping scripts with Selenium I was using Threads with no extra thought. And suddenly I was surprised not thinking at all about processes.
Was my initial code the best I could have in performance?TL;DR
Yes, it was. Multi-threading is the answer to multiple Driver instances. If you want to understand why keep reading.
In Python we have two ways of running our piece of code in parallel. Processes and Threads. But there are some key differences in the way Python handles Processes or Threads and therefore there are different kind of benefits after them.
If we submit “jobs” to different threads, those jobs can be pictured as “sub-tasks” of a single process and those threads will usually have access to the same memory areas (i.e., shared memory). This approach can easily lead to conflicts in case of improper synchronization, for example, if processes are writing to the same memory location at the same time. In our case with Selenium, we usually don't care about shared memory.
A safer approach (although it comes with an additional overhead due to the communication overhead between separate processes) is to submit multiple processes to completely separate memory locations (i.e., distributed memory): Every process will run completely independent from each other.
Briefly some bullets:
- Created by the operating system to run programs
- Processes can have multiple threads
- Two processes can execute code simultaneously in the same python program
- Processes have more overhead than threads as opening and closing processes takes more time
- Sharing information between processes is slower than sharing between threads as processes do not share memory space. In python they share information by pickling data structures like arrays which requires IO time.
- Threads are like mini-processes that live inside a process
- They share memory space and efficiently read and write to the same variables
- Two threads cannot execute code simultaneously in the same python program (although there are workarounds*)
Now, the differences are clear and we can choose the winner. Of course threads are. And here is why.
This is a simple scheme of how Selenium works with Web-Drivers.
When we are running multiple Web-Drivers with Selenium and Python we are doing the following.
We are creating a web-driver object, which communicates with a Browser process itself. This is purely a I/O related task, and in I/O tasks
threads are the winners.
Let's take for example a list of 5 URLs.
URLs = [...]
If we choose threads for each URL, we are going to spawn 5 threads inside the same process but also 5 processes by opening 5 Browser instances to control.
Total = 1 process (Python) with 5 threads + 5 processes (Browsers)
If we had chosen processes for each URL, then we were going to spawn 5 processes with multiple threads each, only for the python code execution, but additionally again 5 processes by opening 5 Browser instances to control, which would be a pure overhead for our OS. Moreover don't forget that each process allocates it's own isolated memory space!
Total = 5+1 processes (Python) + 5 processes (Browsers)
Selenium Web-drivers are all about I/O. They are controlling different processes, so their task is not CPU intensive at all. Using multi-processing is an overhead here, hurting our CPU but also eating our RAM.
On the other hand multi-threading is heavily suggested for I/O related task, like Selenium Web-Drivers and we can safely use them.