By 苏剑林 | July 07, 2018
SamplePairing and mixup are two closely related image data augmentation techniques. They look rather unreasonable and are extremely simple to implement, yet the results are impressive: on multiple image classification tasks they have been shown to improve the final accuracy of the classification model.
Readers may well be puzzled by one question: why do such "unreasonable" data augmentation methods work so well? This article aims to show that although they look like data augmentation methods, they are in fact regularization schemes for the model. As the classic line from Stephen Chow's movie From Beijing with Love goes:
On the surface, it looks like a hair dryer, but it is actually a razor.
Data Augmentation
Let's start with data augmentation. Data augmentation rests on the observation that applying simple transformations to the original data usually leaves the corresponding category unchanged, so we can "create" more data from the data we already have. For example, a photo of a dog is still a photo of the same dog after a horizontal flip, a slight rotation, a crop, or a translation. In this way several samples can be derived from a single sample, increasing the size of the training set.
[Figure: a photo of a dog and its slightly rotated copy; the category is unchanged]
Data augmentation originates from our prior knowledge—for instance, we know in advance that ordinary photos possess horizontal flip invariance, so we use horizontal flipping to augment data, thereby telling the model it should be invariant to horizontal flipping. Data augmentation is not absolutely universal; while ordinary photos have horizontal flip invariance, text images do not. This further illustrates that data augmentation is a scheme for integrating our prior knowledge into the model.
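To make this concrete, here is a minimal sketch of such label-preserving augmentation (using NumPy; the array shape and the particular transformations are my own illustrative choices, not tied to any specific library pipeline):

```python
import numpy as np

def augment(image, label):
    """Generate simple label-preserving variants of one image.

    `image` is assumed to be an (H, W, C) array; the label is copied
    unchanged because a flip or a small shift does not alter the category.
    """
    variants = [
        image,                            # original
        image[:, ::-1],                   # horizontal flip
        np.roll(image, shift=4, axis=1),  # small horizontal translation
    ]
    return [(v, label) for v in variants]
```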
SamplePairing
Now let's talk about SamplePairing. SamplePairing was proposed as a data augmentation method, but it is very counter-intuitive:
If the label of sample $x_a$ is $y_a$, then randomly select a sample $x_b$ from the training set, and let the label of $(x_a + x_b)/2$ still be $y_a$.
Yes, you read that correctly, and I did not mistype it; it really is that simple and that counter-intuitive. On PaperWeekly, reader Chen Taihong once commented:
This is the simplest paper I have ever seen in the CNN field.
After seeing this scheme, readers probably have a series of questions, and these questions are precisely what makes SamplePairing counter-intuitive. For example, why is the label of $(x_a + x_b)/2$ taken to be $y_a$ rather than $y_b$? And, by our usual understanding of data augmentation, $(x_a + x_b)/2$ is no longer even a reasonable image—can such "augmentation" still be used?

[Figure: a cat image, a dog image, and their pixel-wise average]
First, regarding the issue of asymmetry: note that when the current sample is $x_a$ and $x_b$ is randomly selected, we set the label of $(x_a + x_b)/2$ to $y_a$. Conversely, when it is $x_b$'s turn, $x_a$ might be randomly selected, and then the label of $(x_a + x_b)/2$ will be $y_b$. So, while it looks asymmetric, SamplePairing is actually symmetric. SamplePairing can be re-described as:
Randomly select two samples $x_a$ and $x_b$ with labels $y_a$ and $y_b$. Randomly pick one of these labels, say $y$, then let the label of $(x_a + x_b)/2$ be $y$.
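In code, this symmetric description boils down to averaging each image with a randomly chosen partner while keeping the first image's label. A minimal sketch (function and variable names are mine, not from the paper):

```python
import numpy as np

def sample_pairing(images, labels):
    """SamplePairing on a batch: average each image with a random partner.

    `images` is an (N, H, W, C) float array, `labels` an (N,) array.
    The mixed sample (x_a + x_b) / 2 keeps the label y_a; the scheme is
    symmetric because every sample also serves as the partner x_b for
    some other sample across iterations.
    """
    partner = np.random.permutation(len(images))  # a random x_b for each x_a
    mixed = (images + images[partner]) / 2.0      # pixel-wise average
    return mixed, labels                          # labels remain y_a
```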
mixup
From the re-described version of SamplePairing, it's not hard to see that if training is sufficient, the model's output for $(x_a + x_b)/2$ theoretically should be half $y_a$ and half $y_b$. That is to say, if $y_a$ and $y_b$ represent the one-hot vectors of the categories, the output for $(x_a + x_b)/2$ would be $(y_a + y_b)/2$.
In that case, why not mix them in a more random way? Suppose $U(\varepsilon)$ is some distribution on $[0,1]$. At each training step we sample $\varepsilon \sim U(\varepsilon)$ and require the output corresponding to $\varepsilon x_a + (1-\varepsilon) x_b$ to be $\varepsilon y_a + (1-\varepsilon) y_b$. In the original paper, $U(\varepsilon)$ is taken to be a Beta distribution, but in my view that introduces extra hyperparameters; it might be better to simply use a uniform distribution.
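A minimal sketch of this mixing step, using a uniform $\varepsilon$ as suggested above rather than the paper's Beta distribution (names and shapes are illustrative assumptions):

```python
import numpy as np

def mixup_batch(images, labels_onehot):
    """mixup on a batch: convex-combine random pairs of inputs and labels.

    `images` is an (N, ...) float array, `labels_onehot` an (N, K) array
    of one-hot labels. A single eps ~ U(0, 1) is drawn per iteration.
    """
    partner = np.random.permutation(len(images))
    eps = np.random.uniform(0.0, 1.0)
    mixed_x = eps * images + (1 - eps) * images[partner]
    mixed_y = eps * labels_onehot + (1 - eps) * labels_onehot[partner]
    return mixed_x, mixed_y
```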
Compared with SamplePairing, mixup is "softer" and more convincing. SamplePairing always feels like a crude, one-size-fits-all cut, whereas mixup is more intuitive: since the inputs are superimposed in the ratio $\varepsilon : (1-\varepsilon)$, it is natural for the outputs to be superimposed in the same way.
Regularization Interpretation
Although mixup feels more reasonable, both methods fail to answer a very important question: after two images are added, the result is no longer a reasonable "image." This is completely different from what we usually call data augmentation, so why is it still effective?
Let's describe this problem mathematically. For a training set of pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, we want to find a model $f$ such that $y = f(x)$. For tasks like image classification, given the strong non-linearity of the problem itself, we generally use very deep networks to fit the data. However, a deeper network also means it is more prone to overfitting the training set.
Suppose the model can already fit $y_a = f(x_a)$ and $y_b = f(x_b)$. mixup says this is not enough: the model must also output $\varepsilon y_a + (1-\varepsilon) y_b$ for the input $\varepsilon x_a + (1-\varepsilon) x_b$:
\[\varepsilon y_a + (1-\varepsilon) y_b = f\big(\varepsilon x_a + (1-\varepsilon) x_b\big)\]
Substituting $y_a, y_b$ with $f(x_a), f(x_b)$, we get:
\[\varepsilon f(x_a) + (1-\varepsilon) f(x_b) = f\big(\varepsilon x_a + (1-\varepsilon) x_b\big)\]
This is actually a functional equation. If $\varepsilon, x_a, x_b$ are arbitrary, the solution to this functional equation is a "linear function." In other words, only a linear function can make the above equation hold identically. Put differently, mixup hopes the model $f$ is a linear function.
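This can be checked numerically: a linear map satisfies the constraint exactly, while a nonlinear one generally does not (a small sketch; the weight matrix and the ReLU toy model are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                   # arbitrary weight matrix
x_a, x_b = rng.normal(size=5), rng.normal(size=5)
eps = 0.3

def linear(x):
    return W @ x                              # a linear model

def relu_net(x):
    return np.maximum(W @ x, 0.0)             # a simple nonlinear model

for f in (linear, relu_net):
    lhs = eps * f(x_a) + (1 - eps) * f(x_b)
    rhs = f(eps * x_a + (1 - eps) * x_b)
    print(f.__name__, np.allclose(lhs, rhs))  # linear: True; relu_net: generally False
```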
We know that a linear function is equivalent to a single-layer neural network without an activation function—arguably the simplest model. Our actual model, however, is a deep network with a huge number of parameters and strong non-linear capabilities, and more parameters lead to easier overfitting. In this light, the meaning of mixup becomes very clear:
mixup acts as a regularization term. It encourages the model to stay as close as possible to a linear function. That is to say, it ensures the model makes predictions as accurately as possible while keeping the model as simple as possible.
Thus, mixup is a powerful model filter:
Among all models with similar performance, choose the one closest to a linear function.
Epilogue
Now we have answered the original question: methods like SamplePairing and mixup only wear the mask of data augmentation. In reality, they use the form of data augmentation to add a regularization term to the model, or in other words, to prune the model.
Therefore, we don't need to dwell on the problem of "how data augmentation can be effective when the added images are no longer reasonable images," because it isn't actually data augmentation.
Finally, I'll conclude this article with a playful adaptation of a poem by Tang Bohu:
Others laugh at my wild augmentations,
I laugh at others for failing to see the truth.
No reasonable images are produced here,
Yet silently, the model is filtered through.