By 苏剑林 | October 27, 2019
Parallel acceleration via multi-processing or multi-threading is no longer a difficult task, and many readers have likely tried it. Generally, we reach this conclusion: the per-process speedup ratio rarely exceeds 1. In other words, when you run a task in parallel with 10 processes, you generally obtain a speedup of less than 10 times, and as the number of processes grows, this ratio tends to drop even further.
Note that we just said "rarely exceeds 1," which implies that, subconsciously, we feel the per-process speedup ratio can be at most 1. Theoretically, that makes sense: could 10 processes really achieve a 20-fold acceleration? Wouldn't that be too good to be true? Yet a few days ago I did encounter an example where the per-process speedup ratio was significantly greater than 1, so I would like to share it here.
My original task was to count word frequencies: I have many articles, each of which needs to be segmented (tokenized), with the results summarized into a word frequency table. A typical implementation looks like this:
tokens = {}
for text in texts:
    for token in tokenize(text):
        tokens[token] = tokens.get(token, 0) + 1
This implementation took about 20 minutes when I was counting the word frequencies for the entire THUCNews dataset.
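As an aside, the same single-pass count can be written more idiomatically with collections.Counter. This is only an illustrative sketch: `count_tokens` is a name of my choosing, and `str.split` stands in for the article's real `tokenize` segmenter.

```python
from collections import Counter

def count_tokens(texts, tokenize):
    """Single-pass token frequency count over all texts."""
    tokens = Counter()
    for text in texts:
        tokens.update(tokenize(text))
    return dict(tokens)

# toy run with a whitespace "tokenizer"
print(count_tokens(["a b a", "b c"], str.split))  # → {'a': 2, 'b': 2, 'c': 1}
```

This is equivalent to the loop above; the performance issue discussed below applies to it just the same, since `Counter` is still one ever-growing dictionary.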
Now, let's compare it with a multi-processing version. I explained the coding techniques for multi-processing in "Python Multi-processing Programming Tips". To make it reusable, I encapsulated it into a function:
from functools import partial
from multiprocessing import Pool

def parallel_apply(func, iterable, workers, **kwargs):
    if workers <= 1:
        return [func(i, **kwargs) for i in iterable]
    p = Pool(workers)
    # Pool.map passes a single positional argument per call,
    # so bind any keyword arguments to func first
    result = p.map(partial(func, **kwargs), iterable)
    p.close()
    p.join()
    return result
Using this function to perform multi-processed word frequency counting, the code looks roughly as follows:
def _tokenize_and_count(texts):
    tokens = {}
    for text in texts:
        for token in tokenize(text):
            tokens[token] = tokens.get(token, 0) + 1
    return tokens

def _total_count(results):
    tokens = {}
    for result in results:
        for token, count in result.items():
            tokens[token] = tokens.get(token, 0) + count
    return tokens
batch_texts = [texts[i:i+1000] for i in range(0, len(texts), 1000)]
results = parallel_apply(_tokenize_and_count, batch_texts, 10)
tokens = _total_count(results)
The entire process: the texts are split into batches of 1,000 each; _tokenize_and_count counts the tokens in one batch; _total_count merges the per-batch results; and parallel_apply runs the counting step across 10 processes.
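Putting the pieces together, here is a self-contained toy version of the same pipeline. A whitespace split stands in for the real segmenter, the batch size is shrunk so it runs on a few sentences, and the function names are illustrative rather than taken from the original code:

```python
from multiprocessing import Pool

def tokenize(text):
    # stand-in for a real word segmenter
    return text.split()

def tokenize_and_count(texts):
    # count token frequencies within one batch
    tokens = {}
    for text in texts:
        for token in tokenize(text):
            tokens[token] = tokens.get(token, 0) + 1
    return tokens

def total_count(results):
    # merge the per-batch frequency dicts
    tokens = {}
    for result in results:
        for token, count in result.items():
            tokens[token] = tokens.get(token, 0) + count
    return tokens

def word_count(texts, workers=1, batch_size=2):
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    if workers <= 1:
        results = [tokenize_and_count(b) for b in batches]
    else:
        with Pool(workers) as p:
            results = p.map(tokenize_and_count, batches)
    return total_count(results)

if __name__ == "__main__":
    print(word_count(["a b a", "b c", "c c a"], workers=2))  # → {'a': 3, 'b': 2, 'c': 3}
```

The single-process and multi-process paths return identical results; only the wall-clock time differs.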
How long did this take? 55 seconds! That is roughly a 20-fold speedup, i.e. a per-process speedup ratio of about 2!
Why is a speedup ratio greater than 1 possible at all? The reason lies in the original single-process implementation: the line tokens[token] = tokens.get(token, 0) + 1 becomes slower and slower as counting progresses, because tokens accumulates more and more entries and its lookup and update operations gradually slow down.
In the multi-processing version, that same line only ever operates on a dictionary built from a batch of at most 1,000 texts, so it stays fast throughout. The final merge also reads and writes tokens repeatedly, but far less often than the original implementation did, so it is fast as well. This is why the multi-processing version can achieve a 20-fold acceleration rather than merely the theoretical 10-fold limit.
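One rough way to probe this effect is to time successive batches of updates into a single growing dictionary. This is only an illustrative sketch (random integer keys rather than real tokens, and a helper name of my choosing); the magnitude of any slowdown varies with hardware and Python version, so no particular numbers are claimed here:

```python
import random
import time

def batch_insert_times(n_batches, batch_size):
    """Time each successive batch of updates into one ever-growing dict."""
    big = {}
    times = []
    for _ in range(n_batches):
        keys = [random.getrandbits(30) for _ in range(batch_size)]
        t0 = time.perf_counter()
        for k in keys:
            big[k] = big.get(k, 0) + 1
        times.append(time.perf_counter() - t0)
    return times

if __name__ == "__main__":
    # compare early and late batch timings on your own machine
    print([round(t, 3) for t in batch_insert_times(10, 200_000)])
```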
Of course, readers may have already sensed that the speedup ratio has not truly exceeded 1; the apparent excess is an artifact of the poorly designed single-process baseline. That baseline can be fixed as follows:
tokens = {}
for i in range(0, len(texts), 1000):
    batch_texts = texts[i:i+1000]
    batch_tokens = _tokenize_and_count(batch_texts)
    for token, count in batch_tokens.items():
        tokens[token] = tokens.get(token, 0) + count
This is the same batch-count-then-merge approach, just in a single process. The implementation looks roundabout and unintuitive, yet it took only about 8 minutes, roughly one-third of the original version's time! Measured against this fairer baseline, the actual per-process speedup ratio is approximately 0.8.
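For completeness, the batch-then-merge idea also reads naturally with collections.Counter, which turns the inner merge loop into a single update call. Again a sketch with names of my choosing, and a whitespace split standing in for the real segmenter:

```python
from collections import Counter

def tokenize(text):
    # stand-in for a real word segmenter
    return text.split()

def count_in_batches(texts, batch_size=1000):
    total = Counter()
    for i in range(0, len(texts), batch_size):
        batch = Counter()
        for text in texts[i:i + batch_size]:
            batch.update(tokenize(text))
        total.update(batch)  # one merge per batch, not one write per token occurrence
    return dict(total)
```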
This article briefly discussed multi-processing in Python, presented an example whose speedup ratio seemingly exceeds 1, and analyzed the cause. From another angle, it is also a reminder for writing similar code: even in a single process, counting in batches and then merging the results is usually more efficient than updating one large structure in a single pass.