By 苏剑林 | September 08, 2021
In this article, we will tackle a programming problem:
How can we perform a global random shuffle of text files totalling hundreds of gigabytes under limited memory?
The background of the problem is quite clear: modern pre-trained models often use dozens or even hundreds of gigabytes of corpora. To enable the model to pre-train better, it is necessary to perform a global random shuffle of the training data. However, for many people, a few hundred GBs of data is often larger than their available RAM. Therefore, how to achieve a global random shuffle within limited memory is a problem worth studying.
Assuming our file is stored line-by-line, meaning one line represents one sample, our goal is to randomly shuffle the file by lines. If we only have one file and its size is significantly smaller than the memory, we can use the shuf command that comes with Linux:
shuf input.txt -o output.txt
The reason for emphasizing that the file size must be significantly smaller than memory is that shuf loads the entire file into memory before shuffling, which requires us to have enough RAM. To address this, an improved version called terashuf (Link) exists. It uses disk space to substitute for memory by splitting files, allowing us to shuffle files larger than the memory.
It seems that terashuf fully meets our needs. Theoretically, it does. However, sometimes we might have many personalized requirements, such as mixing multiple files together for a random shuffle, or splitting the shuffled output into multiple files, and so on. Therefore, it is better if we can implement it ourselves in Python to handle more complex custom requirements.
Now let's look at what the algorithm for global shuffling looks like under limited memory. Broadly speaking, the steps are as follows:
1. Suppose the file has a total of $mn$ lines; split it into $m$ files, with each file containing $n$ lines;
2. Randomly shuffle the interior of each $n$-line file. Since $n$ can be arbitrarily specified, this step can be completed in memory;
3. Read the first line of each file (obtaining $m$ lines of data), shuffle these $m$ lines, and write them to the output file;
4. Repeat step 3 for the $2$nd, $3$rd, $\cdots$, $n$-th lines of each file in turn.
Simply put, you first shuffle vertically and then shuffle horizontally to get a sufficiently shuffled result that approaches a global shuffle, as shown in the figure below:

Left: Original data; Middle: Internal vertical shuffle for each column; Right: Based on the vertical shuffle, each row is horizontally shuffled.
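To make the two stages concrete, here is a minimal Python sketch of the procedure. It is not the author's reference code; the function name `shuffle_large_file`, the temporary directory `tmp_chunks`, and the default chunk size are illustrative assumptions:

```python
import os
import random

def shuffle_large_file(input_path, output_path, n=1_000_000, tmp_dir="tmp_chunks"):
    """Split the input into chunks of n lines, shuffle each chunk in memory,
    then interleave one line per chunk in random order (steps 1-4 above)."""
    os.makedirs(tmp_dir, exist_ok=True)
    chunk_paths = []

    def flush(lines):
        # Step 2: shuffle one chunk in memory and write it to a temporary file.
        random.shuffle(lines)
        path = os.path.join(tmp_dir, "chunk_%05d.txt" % len(chunk_paths))
        with open(path, "w") as f:
            f.writelines(lines)
        chunk_paths.append(path)

    # Step 1: split the input into chunks of at most n lines.
    buffer = []
    with open(input_path) as f:
        for line in f:
            buffer.append(line)
            if len(buffer) == n:
                flush(buffer)
                buffer = []
    if buffer:
        flush(buffer)

    # Steps 3-4: read the k-th line of every chunk, shuffle those m lines, write them out.
    readers = [open(p) for p in chunk_paths]
    with open(output_path, "w") as out:
        for _ in range(n):
            rows = [r.readline() for r in readers]
            rows = [row for row in rows if row]  # the last chunk may be shorter
            random.shuffle(rows)
            out.writelines(rows)
    for r in readers:
        r.close()
```

A call like `shuffle_large_file("input.txt", "output.txt")` mirrors the shuf example above, but the memory footprint stays at roughly $n$ lines during splitting and $m$ lines during merging, rather than the whole file.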
Note that this algorithm only guarantees a result that is as "scrambled" as possible; it cannot make every permutation equally likely. For example, the first two samples of the output can never both be the first two samples of the first file. To truly give every permutation equal probability, at each step we would need to pick a file with probability proportional to its remaining number of lines, rather than reading the $k$-th line of every file before moving on to the $(k+1)$-th. However, sampling by probability at every step raises the cost, and in practice the difference in shuffling quality is negligible, so it is not really necessary.
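For completeness, a hedged sketch of that exact scheme, assuming `readers` is a list of open chunk files and `remaining` holds each chunk's remaining line count (both names are illustrative, not part of the reference code):

```python
import random

def exact_uniform_merge(readers, remaining, out):
    """Pick a chunk with probability proportional to its remaining line count,
    so every permutation of the whole corpus is equally likely."""
    while sum(remaining) > 0:
        # weighted draw of a chunk index; this per-step draw is the extra cost
        i = random.choices(range(len(readers)), weights=remaining, k=1)[0]
        out.write(readers[i].readline())
        remaining[i] -= 1
```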
In practice, when splitting the data into files of $n$ lines each, the last file may end up with fewer than $n$ lines. If we do not care about this small detail, we can still follow the process above: once the last file is exhausted, it simply keeps returning empty lines, and the whole procedure does not error out. Of course, this biases the samples of the last file slightly towards the front of the output. Perfectionist readers who find this unacceptable can introduce rejection sampling when reading the last file, with a rejection probability of $1-\frac{\text{remaining lines in the last file}}{\text{remaining lines in each of the other files}}$.
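One possible way to code that rejection step is sketched below; it is illustrative only, where `n_last` and `n_other` denote the remaining line counts of the last file and of each of the other files:

```python
import random

def read_last_with_rejection(last_reader, n_last, n_other):
    """Skip the short last chunk with probability 1 - n_last / n_other, so its
    lines spread evenly over the whole output instead of clustering at the front."""
    if n_last > 0 and random.random() < n_last / n_other:
        return last_reader.readline()
    return None  # rejected this round: contribute nothing from the last chunk
```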
Reference Implementation (without rejection sampling):
Github: https://github.com/bojone/shuffle
If it is just a combination of merging, shuffling, and splitting operations, it can actually be achieved using shell commands plus terashuf. The approximate command would be:
cat corpus/*.json | TMPDIR=/root/tmp MEMORY=20 ./terashuf | split -l 100000 -a 5 -d - corpus-
By comparison, in the same environment and on files totalling about 280 GB, terashuf takes roughly 2.7 hours for the whole shuffle, while the author's Python code takes about 3.5 hours. The Python version is not that much slower, which is acceptable; after all, terashuf is written in C++, and there is no shame in Python losing to C++.
In most cases, the bottleneck for this global shuffling algorithm is the disk I/O speed. Therefore, multi-processing/multi-threading are generally not particularly helpful.
This article briefly introduced the idea of trading disk space for memory to globally shuffle large files under limited memory, and provided a Python implementation. Readers who need it can adapt the code to more complex requirements.