By 苏剑林 | February 19, 2017
The Process
In Python, multiprocess computation is generally done with the multiprocessing module, and its most commonly used feature is the process pool (Pool), for example:
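(The original code listing did not survive; the following is a minimal sketch of the Pool.map pattern described above, with a hypothetical worker function f.)

```python
from multiprocessing import Pool

def f(x):
    # a hypothetical CPU-bound task; any picklable module-level function works
    return x * x

if __name__ == '__main__':
    pool = Pool(4)                    # a pool of 4 worker processes
    results = pool.map(f, range(10))  # distribute the tasks, gather the results in order
    pool.close()
    pool.join()
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```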
Writing it this way is concise and clear, which is indeed convenient. Interestingly, you only need to replace multiprocessing with multiprocessing.dummy to change the program from multi-processing to multi-threading.
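A sketch of the same drop-in replacement (again with a hypothetical worker f):

```python
# multiprocessing.dummy exposes the same Pool API, backed by threads instead of processes
from multiprocessing.dummy import Pool

def f(x):
    return x * x

pool = Pool(4)                    # 4 worker threads instead of 4 processes
results = pool.map(f, range(10))
pool.close()
pool.join()
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```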
Objects
Python is an object-oriented programming language, and we often encapsulate certain programs into a class. However, within a class, the above method no longer works. For example:
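(The original listing is likewise missing; below is a minimal reconstruction of the kind of class the text describes, with a hypothetical class A and method names. Note that the article predates widespread Python 3 use: under Python 2, pool.map(self.f, ...) must pickle the bound method and fails with the error quoted next, whereas Python 3 can often pickle bound methods of module-level classes, so the same snippet may run there.)

```python
from multiprocessing import Pool

class A:
    def f(self, x):
        return x * x

    def run(self):
        pool = Pool(4)
        # pool.map must pickle self.f to send it to the worker processes;
        # on Python 2, bound methods are not picklable, hence the error below
        results = pool.map(self.f, range(10))
        pool.close()
        pool.join()
        return results

if __name__ == '__main__':
    print(A().run())
```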
The code looks quite natural, but it throws an error during execution:
cPickle.PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
However, if you replace multiprocessing with multiprocessing.dummy, no error is raised. Simply put, the root cause is still that variables cannot be shared between processes, whereas multiple threads live within the same process and naturally do not have this problem.
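To make that difference concrete, here is a small illustrative sketch (the function g and the list counter are hypothetical): threads mutate the parent's objects in place, while on fork-based platforms such as Linux each worker process only mutates its own copy.

```python
from multiprocessing import Pool as ProcessPool
from multiprocessing.dummy import Pool as ThreadPool

counter = []  # state living in the parent interpreter

def g(i):
    counter.append(i)  # mutate the parent-level list

if __name__ == '__main__':
    tp = ThreadPool(4)
    tp.map(g, range(5))
    tp.close(); tp.join()
    print(len(counter))  # 5: threads share the parent's memory

    counter[:] = []
    pp = ProcessPool(4)
    pp.map(g, range(5))
    pp.close(); pp.join()
    print(len(counter))  # 0 on fork-based systems: each child appended to its own copy
```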
Drawing Inspiration
To work out multiprocessing programming within objects, I made quite a few attempts. Later, I recalled that many modules in gensim support parallelism and decided to imitate them; sure enough, I found ldamulticore.py. After repeatedly comparing it with online materials, I distilled a relatively concise, convenient, and general way of writing it.
Like most multiprocessing code, a Queue object needs to be set up for communication between processes. The difference is that most online tutorials start multiple processes by combining the Process class from multiprocessing with a loop, because passing an ordinary Queue through Pool fails (unless you use multiprocessing.Manager.Queue, refer to this article). gensim, however, uses a trick with Pool that lets it start multiple processes directly. The work of experts is truly different. The reference code is as follows:
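(The reference code itself is not reproduced here; the following is a self-contained sketch of the pattern, with a hypothetical worker f and a squaring task standing in for real work. The key line is Pool(4, f, (in_queue, out_queue)): f is passed as the pool's initializer, so each of the four workers runs it as its main loop, and the queues reach the workers through process creation rather than through task pickling.)

```python
from multiprocessing import Pool, Queue

def f(in_queue, out_queue):
    # worker main loop: each pooled process runs this as its Pool "initializer",
    # pulling tasks from in_queue until it sees the sentinel None
    while True:
        task = in_queue.get()
        if task is None:
            out_queue.put(None)      # echo the sentinel so the parent can count finished workers
            break
        out_queue.put(task * task)   # the actual work goes here

if __name__ == '__main__':
    in_queue, out_queue = Queue(), Queue()
    # the magical 2nd and 3rd parameters of Pool: the initializer and its arguments
    pool = Pool(4, f, (in_queue, out_queue))
    for i in range(20):
        in_queue.put(i)              # queue up the tasks
    for _ in range(4):
        in_queue.put(None)           # one sentinel per worker
    results, finished = [], 0
    while finished < 4:              # retrieve results until every worker reports done
        r = out_queue.get()
        if r is None:
            finished += 1
        else:
            results.append(r)
    pool.terminate()                 # all workers are done; shut the pool down
    print(sorted(results))           # the squares of 0..19
```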
In short, it establishes two Queues: one for queuing up tasks and the other for collecting results. What is quite magical is that Pool actually accepts a second and third parameter: an initializer function and its arguments, which every worker process runs automatically on startup, so the workers run in parallel from the outset. For details, see the official documentation of the Pool constructor.
Note that once the line pool = Pool(4, f, (in_queue, out_queue)) executes, the worker processes start, but the main process does not wait for them to finish; it immediately moves on to the subsequent statements. At this point you could, as usual, call pool.close() and pool.join() to wait for the workers to complete before continuing. The solution used here, however, is to go straight to the statements that retrieve results and let the retrieval loop itself determine when the workers have finished; once they have, the pool is shut down via pool.terminate(). This pattern is basically universal.