By 苏剑林 | July 01, 2021
Readers who follow new developments in NLP were likely impressed by SimCSE, released in April: by simply "applying Dropout twice" to construct positive pairs for contrastive learning, it achieved across-the-board SOTA results on unsupervised semantic similarity tasks. Coincidentally, the recent paper "R-Drop: Regularized Dropout for Neural Networks" applies the same "Dropout twice" idea to supervised tasks, under the name R-Drop, and almost every experiment shows a significant improvement. Furthermore, the author found in his own experiments that it also performs impressively on semi-supervised tasks.
The simple act of "Dropout twice" surprisingly yields a "decathlon-like" performance. This article introduces R-Drop and shares the author's thoughts on the underlying principles.
SimCSE
In "Are Chinese Tasks Still SOTA? We Added Some Experiments to SimCSE", we already introduced SimCSE. Briefly, SimCSE is a contrastive learning scheme for NLP. The standard contrastive learning workflow involves passing the same sample through different data augmentation methods to obtain a positive sample pair, while all other samples in the batch are treated as negative samples. The loss is then used to reduce the distance between positive samples and increase the distance between negative samples.
The main difficulty lies in the data augmentation methods. For NLP, it is difficult to manually construct data augmentations that guarantee semantic invariance. SimCSE avoids manual data augmentation by using "Dropout twice" to obtain different feature vectors for the same input, treating them as a positive sample pair. Curiously, this simple "Dropout twice" approach seems like a compromise, but ablation studies found it to be superior to almost all other data augmentation methods. It is truly a case of "great truths are often simple."

SimCSE Diagram
In implementation, SimCSE is quite simple. To apply "Dropout twice," you only need to feed the same sample into the model twice and compute the corresponding loss, as shown in the figure above. Because Dropout is random, each copy of the sample receives a different Dropout mask, so simply repeating the samples achieves the "Dropout twice" effect.
R-Drop
From the results, SimCSE aims for Dropout to not have an excessive impact on the model's output—that is, the model output should be robust to Dropout. Clearly, this "Dropout twice" idea can be generalized to general tasks, which is R-Drop (Regularized Dropout).
Classification Problems
In the author's view, R-Drop is highly related to SimCSE and was likely inspired by it. However, the R-Drop paper does not cite SimCSE, which is somewhat mysterious.

R-Drop Diagram
Taking a classification problem as an example, with training data $\{x_i, y_i\}_{i=1}^n$ and model $P_{\theta}(y|x)$, the loss for each sample is typically cross-entropy:
\begin{equation}\mathcal{L}_i = -\log P_{\theta}(y_i|x_i)\end{equation}
Under the "Dropout twice" scenario, we can consider the samples to have passed through two slightly different models, which we denote as $P_{\theta}^{(1)}(y|x)$ and $P_{\theta}^{(2)}(y|x)$. In this case, the R-Drop loss consists of two parts. One part is the standard cross-entropy:
\begin{equation}\mathcal{L}_i^{(CE)} = -\log P_{\theta}^{(1)}(y_i|x_i) -\log P_{\theta}^{(2)}(y_i|x_i)\label{eq:ce}\end{equation}
The other part is the symmetric KL divergence between the two models, which encourages the outputs of different Dropout versions of the model to be as consistent as possible:
\begin{equation}\mathcal{L}_i^{(KL)} = \frac{1}{2}\big[KL\left(P_{\theta}^{(2)}(y|x_i)\big\Vert P_{\theta}^{(1)}(y|x_i)\right) + KL\left(P_{\theta}^{(1)}(y|x_i)\big\Vert P_{\theta}^{(2)}(y|x_i)\right)\big]\label{eq:kl}\end{equation}
The final loss is a weighted sum of the two losses:
\begin{equation}\mathcal{L}_i = \mathcal{L}_i^{(CE)} + \alpha\mathcal{L}_i^{(KL)}\end{equation}
In other words, it adds a regularization term to the standard cross-entropy to strengthen model robustness.
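To make the formulas above concrete, here is a minimal NumPy sketch of the per-sample R-Drop loss (my own illustration, not code from the paper): the two probability vectors stand in for two forward passes of the same sample under different Dropout masks.

```python
import numpy as np

def rdrop_loss(p1, p2, target, alpha=4.0):
    """Per-sample R-Drop loss: cross-entropy for both Dropout passes
    plus the symmetric KL divergence between their predictions."""
    ce = -np.log(p1[target]) - np.log(p2[target])
    kl = 0.5 * (np.sum(p2 * np.log(p2 / p1)) +
                np.sum(p1 * np.log(p1 / p2)))
    return ce + alpha * kl

# Two slightly different predictions for the same sample:
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.6, 0.3, 0.1])
loss = rdrop_loss(p1, p2, target=0)
```

When the two passes agree exactly, the KL term vanishes and the loss reduces to twice the ordinary cross-entropy.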
General Form
Some readers might ask what the $KL$ term should be replaced with for non-classification problems. In fact, the original paper did not experiment with non-classification problems, but we can supplement it here. We can note that:
\begin{equation}-\log P_{\theta}(y_i|x_i) = KL\left(\text{one\_hot}(y_i)\big\Vert P_{\theta}(y|x_i)\right)\end{equation}
Thus, $\mathcal{L}_i$ is simply repeated use of $KL$ divergence. Its general form is:
\begin{equation}\mathcal{L}_i = \mathcal{D}\left(y_i, f_{\theta}^{(1)}(x_i)\right)+\mathcal{D}\left(y_i, f_{\theta}^{(2)}(x_i)\right) + \frac{\alpha}{2} \left[\mathcal{D}\left(f_{\theta}^{(2)}(x_i), f_{\theta}^{(1)}(x_i)\right)+\mathcal{D}\left(f_{\theta}^{(1)}(x_i), f_{\theta}^{(2)}(x_i)\right)\right]\end{equation}
Therefore, for non-classification problems, we replace $\mathcal{D}$ with an appropriate metric (instead of $KL$ divergence).
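As an illustration of the general form (my own sketch, not from the paper), here $\mathcal{D}$ is taken to be squared error, a natural choice for regression:

```python
import numpy as np

def rdrop_general_loss(y, f1, f2, alpha=4.0):
    """General R-Drop loss with D taken to be squared error:
    fit both Dropout passes to the label, and penalize the
    symmetrized discrepancy between the two passes."""
    D = lambda a, b: np.sum((a - b) ** 2)
    fit = D(y, f1) + D(y, f2)
    consistency = 0.5 * (D(f2, f1) + D(f1, f2))
    return fit + alpha * consistency

y = np.array([1.0])
f1, f2 = np.array([0.9]), np.array([1.2])  # two Dropout passes
loss = rdrop_general_loss(y, f1, f2)       # 0.05 + 4 * 0.09 ≈ 0.41
```

With a symmetric $\mathcal{D}$ like squared error, the two consistency terms coincide and the $\frac{1}{2}$ simply averages them.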
Experimental Results
Let's first look at the experimental results of R-Drop.
There are three main hyperparameters for R-Drop: batch_size, $\alpha$, and Dropout probability. The batch_size generally depends on computing power; for individuals, there isn't much room for adjustment. In the original paper, $\alpha$ ranges from $1$ to $5$. In the author's experiments, $\alpha=4$ was used without fine-tuning. For the Dropout probability, as chosen in "Are Chinese Tasks Still SOTA? We Added Some Experiments to SimCSE", a value of 0.3 yielded better results.
Paper Reports
To be honest, the performance of R-Drop reported in the original paper is quite stunning, which is the main reason why the author felt compelled to introduce it. The original paper conducted comparative experiments on various tasks including NLU, NLG, and CV classification. Most results showed "significant improvement."
Official Implementation: https://github.com/dropreg/R-Drop
Below are some screenshots of the experimental results:

Effect of R-Drop on Machine Translation Tasks

Effect of R-Drop on GLUE Tasks
Notably, in machine translation tasks, the simple "Transformer + R-Drop" outperformed other more complex methods:

Comparison of different methods on Machine Translation tasks
The paper also includes experiments on automatic summarization, language models, image classification, etc., as well as ablation studies on hyperparameters. Readers are encouraged to read the original paper for details. Overall, R-Drop's "report card" is indeed worthy of praise.
Personal Attempts
Of course, the author maintains that "a model that hasn't been tested on Chinese tasks has no soul." Usually, I only share my findings after personally trying them on Chinese tasks.
Personal Implementation: https://github.com/bojone/r-drop
On Chinese supervised tasks, the author experimented with two text classification tasks (IFLYTEK and TNEWS from the CLUE benchmark):
\begin{array}{c|cc}
\hline
& \text{IFLYTEK} & \text{TNEWS} \\
\hline
\text{No Adversarial Training} & 60.29\% & 56.58\% \\
\text{With Adversarial Training} & 62.46\% & 57.66\% \\
\text{With Gradient Penalty} & 62.31\% & \textbf{57.81\%} \\
\text{With R-Drop} & \textbf{62.69\%} & 57.51\% \\
\hline
\end{array}
And one text generation task (CSL title generation, referring to "Analysis and Solutions for Exposure Bias in Seq2Seq"):
\begin{array}{c|cccc}
\hline
& \text{Rouge-L} & \text{Rouge-1} & \text{Rouge-2} & \text{BLEU} \\
\hline
\text{baseline} & 63.81 & 65.45 & 54.91 & 45.52 \\
\text{Random Replacement} & 64.44 & 66.09 & 55.56 & 46.1 \\
\text{Gradient Penalty} & 65.41 & 67.29 & 56.64 & 47.37 \\
\text{R-Drop} & \textbf{65.51} & \textbf{67.41} & \textbf{57.12} & \textbf{47.82} \\
\hline
\end{array}
As can be seen, R-Drop results are competitive with the famous regularization techniques "Adversarial Training" and "Gradient Penalty" introduced in "A Brief Discussion on Adversarial Training: Meaning, Methods, and Thoughts".
Implementation Details
Compared to complex regularization methods like adversarial training, the implementation of R-Drop is remarkably low-difficulty. Taking bert4keras as an example, here is a brief introduction to how to convert a standard training script to the R-Drop mode.
First, the data generation part is modified as follows:
class data_generator(DataGenerator):
    """Data Generator
    """
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, (text, label) in self.sample(random):
            token_ids, segment_ids = tokenizer.encode(text, maxlen=maxlen)
            # Repeat the current sample twice
            for i in range(2):
                batch_token_ids.append(token_ids)
                batch_segment_ids.append(segment_ids)
                batch_labels.append([label])
            if len(batch_token_ids) == self.batch_size * 2 or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []
Next, define a new loss function:
from keras.losses import kullback_leibler_divergence as kld

def categorical_crossentropy_with_rdrop(y_true, y_pred):
    """R-Drop loss paired with the generator above.
    loss_kl is divided by 4 to match the formula: one factor of 2 for
    the 1/2 in front of the symmetric KL term, and another because
    loss_ce is averaged over both copies of every sample.
    """
    loss_ce = K.categorical_crossentropy(y_true, y_pred)  # original loss
    loss_kl = kld(y_pred[::2], y_pred[1::2]) + kld(y_pred[1::2], y_pred[::2])
    return K.mean(loss_ce) + K.mean(loss_kl) / 4 * alpha
Finally, enable the model's Dropout and use this data_generator and categorical_crossentropy_with_rdrop to train the model.
Personal Understanding
After reviewing the pleasant experimental results, let's dig into the theory. The original paper provides a theoretical analysis of R-Drop, roughly suggesting that R-Drop promotes assimilation of parameters, thereby acting as regularization. However, the author feels this explanation is not intuitive or fundamental enough. Below, I try to provide alternative perspectives for understanding R-Drop.
Consistency
R-Drop can be seen as an improvement on Dropout. So what's wrong with Dropout? Dropout is a classic example of a method where training and inference are inconsistent. Specifically, during training, Dropout adds multiplicative noise to the inputs of certain layers, changing the model from $f_{\theta}(x)$ to $f_{\theta}(x, \varepsilon)$, where each element of $\varepsilon$ is $0$ with probability $p$ and $1/(1-p)$ with probability $1-p$. The training objective is:
\begin{equation}\mathbb{E}_{(x,y)\sim\mathcal{D}}\mathbb{E}_{\varepsilon}[l(y, f_{\theta}(x,\varepsilon))]\end{equation}
After such training, which model is best to use for prediction? It's uncertain, but if the loss function is the $l_2$ distance, we can derive that the optimal prediction model should be:
\begin{equation}\mathbb{E}_{\varepsilon}[f_{\theta}(x,\varepsilon)]\end{equation}
Derivation: If using $l_2$ loss, the loss for a single sample is:
\begin{equation}\mathbb{E}_{\varepsilon}\left[\Vert y - f_{\theta}(x,\varepsilon)\Vert^2\right] = \Vert y\Vert^2 - 2\langle y,\mathbb{E}_{\varepsilon}\left[f_{\theta}(x,\varepsilon)\right]\rangle + \mathbb{E}_{\varepsilon}\left[\Vert f_{\theta}(x,\varepsilon)\Vert^2\right]\end{equation}
Note that our question is "which function should be used for prediction after the model is trained," so $f_{\theta}(x,\varepsilon)$ is a constant, and $y$ is the variable to optimize. This is simply a quadratic minimization problem, easily solved as $y=\mathbb{E}_{\varepsilon}[f_{\theta}(x,\varepsilon)]$ when the loss is minimized.
Assuming this result generalizes to other loss functions, the formula above tells us that the correct way to predict with a Dropout-trained model is "model ensembling":
Pass the same input through the model multiple times (with Dropout active), then take the average of these predictions as the final result.
But our standard prediction method is clearly not like this; instead, we turn Dropout off and predict deterministically, which amounts to changing the prediction model from "model averaging" to "weight averaging":
\begin{equation}f_{\theta}(x,\mathbb{E}_{\varepsilon}[\varepsilon])=f_{\theta}(x,1)=f_{\theta}(x)\end{equation}
Here $1$ refers to the all-ones vector. Therefore, we train an ensemble of different Dropout versions, but at prediction time we use a single model with Dropout turned off. These two are not necessarily equivalent—this is the training-inference inconsistency problem of Dropout.
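The gap between the two can be seen in a deliberately tiny toy example (my own construction, not any real network): a single ReLU with Dropout at rate $p = 1/2$ on its scalar input, so the noise $\varepsilon$ is $0$ or $2$ with equal probability.

```python
def relu(z):
    return max(z, 0.0)

def f(x, eps):
    # Toy "model": Dropout noise on the input, then a nonlinearity.
    return relu(x * eps - 1.0)

x = 1.0
# Weight averaging: replace eps by its mean E[eps] = 1 (Dropout off).
weight_avg = f(x, 1.0)                          # relu(0) = 0
# Model averaging: average the predictions over both Dropout masks.
model_avg = 0.5 * f(x, 0.0) + 0.5 * f(x, 2.0)   # 0.5 * relu(1) = 0.5
```

The two disagree (0 versus 0.5) precisely because the nonlinearity does not commute with the expectation over $\varepsilon$; R-Drop's consistency term pushes the model toward a regime where such gaps are small.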
Now, R-Drop becomes easy to understand. By adding a regularization term, it strengthens the model's robustness to Dropout, making the outputs under different Dropout masks essentially consistent. This reduces the inconsistency, promoting similarity between "model averaging" and "weight averaging," such that the effect of simply turning off Dropout is equivalent to the result of a multi-Dropout ensemble, thus improving final performance.
Continuity
The beginning of this article mentioned the similarity between R-Drop and SimCSE. In fact, R-Drop is also quite similar to "Virtual Adversarial Training (VAT)" (which the R-Drop paper does not cite either; am I the only one who finds them similar?).
For an introduction to VAT, refer to my previous article "Random Thoughts on Generalization: From Random Noise, Gradient Penalty to Virtual Adversarial Training". Briefly, VAT also uses a regularization term to make the model more robust to perturbations, enhancing the model's own continuity (small changes shouldn't cause large changes in output). The difference lies in how perturbations are added: VAT only adds perturbations to the input and uses adversarial logic to make the perturbations more targeted; R-Drop's perturbations can be applied to every layer and are random.
Some readers might wonder: VAT is primarily for semi-supervised training, so does that mean R-Drop can also be used for semi-supervised training? The original paper didn't experiment with this, but I did. The answer is indeed yes. Similar to VAT, the added KL divergence term in R-Drop does not require labels, so it can be used for unsupervised training, and mixed with labeled data for semi-supervised learning. The performance is quite good. Below are my experimental results:
\begin{array}{c|cc}
\hline
& \text{Val Set} & \text{Test Set}\\
\hline
\text{Non-VAT} & 88.93\% & 89.34\%\\
\text{VAT} & 89.83\% & \textbf{90.37\%}\\
\text{R-Drop} & \textbf{90.37\%} & 90.14\%\\
\hline
\end{array}
As can be seen, R-Drop's semi-supervised performance is not inferior to VAT, and it is easier to implement and faster! It seems VAT might be ready to retire. Intuitively, although R-Drop's perturbations are random, because R-Drop introduces more points of perturbation (every layer), the cumulative perturbation becomes amplified, potentially matching the effect of VAT's adversarially optimized perturbations. This explains why R-Drop can hold its own against VAT.
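A sketch of how the semi-supervised variant can be organized (my own formulation of the idea described above, not code from either paper): cross-entropy is applied only where a label exists, while the symmetric KL consistency term uses every sample. The convention of marking unlabeled samples with label `-1` is a hypothetical choice for this illustration.

```python
import numpy as np

def semi_rdrop_loss(p1, p2, labels, alpha=4.0):
    """p1, p2: (batch, classes) predictions from two Dropout passes;
    labels: (batch,) integer labels, with -1 marking unlabeled samples.
    Cross-entropy uses labeled samples only; the symmetric KL
    consistency term needs no labels, so it uses the whole batch."""
    labeled = labels >= 0
    ce = (-np.log(p1[labeled, labels[labeled]])
          - np.log(p2[labeled, labels[labeled]]))
    kl = 0.5 * (np.sum(p2 * np.log(p2 / p1), axis=1)
                + np.sum(p1 * np.log(p1 / p2), axis=1))
    return np.sum(ce) + alpha * np.sum(kl)

p1 = np.array([[0.7, 0.3], [0.5, 0.5]])
p2 = np.array([[0.6, 0.4], [0.5, 0.5]])
labels = np.array([0, -1])   # second sample is unlabeled
loss = semi_rdrop_loss(p1, p2, labels)
```

The unlabeled sample contributes only through the consistency term, which is exactly what makes the method usable for semi-supervised learning.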
Non-Target Classes
A direct question is: if my model is complex enough, can't the cross-entropy term alone make the model robust to Dropout? What specific difference does the KL divergence term make?
In fact, it cannot. It is important to note that the training objective of cross-entropy is primarily to make the score of the target class greater than the scores of the non-target classes, so that the model predicts the target class correctly (refer to "Generalizing 'Softmax + Cross Entropy' to Multi-label Classification"). In other words, with cross-entropy alone, training at most ensures:
Under different Dropouts, the target class score is always greater than the non-target class scores.
But it does not ensure:
Under different Dropouts, the scores for *every* class are consistent.
Therefore, it doesn't solve the training-inference inconsistency problem. From a formulaic perspective, cross-entropy $\eqref{eq:ce}$ only relates to the target class and doesn't care about the distribution of non-target classes. If the target class is the first class, a prediction of $[0.5, 0.2, 0.3]$ or $[0.5, 0.3, 0.2]$ makes no difference to it. But for the KL divergence term $\eqref{eq:kl}$, scores for every class are involved in the calculation, and there is a non-zero loss between $[0.5, 0.2, 0.3]$ and $[0.5, 0.3, 0.2]$.
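The numeric example can be checked directly (a quick sanity-check script of my own, not from the paper):

```python
import numpy as np

p = np.array([0.5, 0.2, 0.3])   # one Dropout pass
q = np.array([0.5, 0.3, 0.2])   # the other pass, non-target probs swapped
target = 0                      # the first class is the target

# Cross-entropy only sees the target-class probability: identical here.
ce_p = -np.log(p[target])
ce_q = -np.log(q[target])

# The symmetric KL term involves every class, so swapping the
# non-target probabilities yields a strictly positive penalty.
sym_kl = 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```

So the KL term supplies a training signal that cross-entropy is structurally blind to.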
Summary
This article introduced R-Drop, which applies the "Dropout twice" idea to supervised tasks, achieving noticeable improvements across various experiments. Additionally, the author found it performs impressively in semi-supervised tasks. Finally, I shared three perspectives for understanding R-Drop.