Understanding Python Multithreading and Multiprocessing by visualization
If you want to know the visualization method, please refer to this article. I will only present the results and my findings. Or you can find the code on my Github.
Why do we need to understand parallel processing?
I think the direct reason is we can use computational resources sufficiently on our machines. It is meant to reduce the overall processing time. But not every task is suitable for parallel processing. Even it is feasible, ways of parallel processing have different timing to apply.
Explain the principle of parallel processing
In Python, we have two ways to use parallel operations. One is multithreading, and the other is multiprocessing. I draw a diagram to display how multithreading and multiprocessing work roughly.
The critical difference between multithreading and multiprocessing
Different ways to increasing computing power
Multiprocessing adds CPUs(or processors) for increasing computing power. And multithreading creates many threads in a single processor to get the same effect.
Multiprocessing process creation is time-consuming work. While in Multithreading, process creation is economical.
Multiprocessing will copy a virtual memory space to each processor. But multithreading does not. It let each thread share a memory space.
Let me show you some cases.
I use two evaluation indexes. One is multithreading or multiprocessing whether can reduce computation time. The other is which way is better for reduce run time. It can help you to understand application timing for multithreading and multiprocessing.
API Calls (download task):
from urllib.request import urlopen
url_N = 10
URL='http://scholar.princeton.edu/sites/default/files/oversize_pdf_test_0.pdf'
urls = [URL for i in range(url_N)]
resp = urlopen(url)
Get time benefit from both multithreading and multiprocessing. The performances seem almost equal, but multithreading doesn’t need to create more processes (use less memory). So multithreading is better.
IO Heavy (write a big string into a file)
TEXT = ''.join(random.choice(string.ascii_lowercase) for i in range(10**7*5))
text_N = 12f = open('output.txt', 'wt', encoding='utf-8')
f.write(text)
f.close()
Multithreading is better than multiprocessing. It seems like multithreading can parallel processing without waiting for other tasks.
Numpy Functions
In this case, multithreading and multiprocessing can both reduce run time. Using multiprocessing, we find that process creation is pretty time-consuming when your NumPy array is enormous. So depend on total run time, multithreading is better than multiprocessing no matter whether you use Numpy to Addition or Dot Product. But when tasks are many, I think multiprocessing might be better.
- Numpy Addition
DIMS = 8000
numpy_add_N = 20
DIMS_ARR = [DIMS for i in range(numpy_add_N)]
a = np.random.rand(DIMS,DIMS)
b = np.random.rand(DIMS,DIMS)
res = a + b
2. Dot Product
DIMS = 1500
dot_N = 10
DIMS_ARR = [DIMS for i in range(dot_N)]
a = np.random.rand(DIMS,DIMS)
b = np.random.rand(DIMS,DIMS)
res = np.dot(a,b)
CPU Intensive
When we meet CPU-intensive cases, let several processors run simultaneously will improve your speed. Even you need to create many processes. Multiprocessing is still better than multithreading. Of course, this is base on the fact that the memory capacity of your current computing environment is not significant.
n = 10**7
ITERS = 10count = 0
for i in range(n):
count += i
Conclusions
Sometimes you can’t easily define what type of your case is. So maybe you can quickly run a sample and find out which parallel processing way is better. I hope you guys can benefit from this article.
In the end, I want to recommend an article. It uses good diagrams to tell readers how multithreading and multiprocessing work. And the complete code will be in my Github. Thanks.