By 苏剑林 | November 29, 2021
As everyone knows, a frequently criticized issue with BERT's MLM (Masked Language Model) task is its inconsistency between pre-training and fine-tuning: [MASK] appears during pre-training but is absent during downstream fine-tuning. Many works regard this as a significant factor limiting BERT's fine-tuning performance and have proposed improvements such as XLNet, ELECTRA, and MacBERT. In this article, we analyze this inconsistency of MLM from the perspective of Dropout and propose a simple operation to correct it.
A similar analysis can be applied to the MAE (Masked Autoencoder) model recently proposed by Kaiming He. The results show that MAE indeed possesses better consistency compared to MLM, from which we can derive a regularization technique that may accelerate training speed.
Dropout
First, let's revisit Dropout. From a mathematical standpoint, Dropout is an operation that introduces random noise into a model via a Bernoulli distribution. Let's briefly review the Bernoulli distribution.
Bernoulli Distribution
The Bernoulli distribution is perhaps the simplest probability distribution. It is a binary distribution over the value space $\{0,1\}$: a random variable $\varepsilon$ takes the value 1 with probability $p$ and the value 0 with probability $1-p$, denoted as:
\begin{equation}\varepsilon\sim \text{Bernoulli}(p)\end{equation}
An interesting property of the Bernoulli distribution is that all its moments are $p$, i.e.,
\begin{equation}\mathbb{E}_{\varepsilon}[\varepsilon^n] = p\times 1^n + (1-p)\times 0^n = p\end{equation}
Thus, we know its mean is $p$ and its variance is:
\begin{equation}\mathbb{V}ar_{\varepsilon}[\varepsilon] = \mathbb{E}_{\varepsilon}[\varepsilon^2] - \mathbb{E}_{\varepsilon}[\varepsilon]^2 = p(1-p)\end{equation}
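As a quick sanity check, these moments can be verified empirically with a few lines of NumPy (a minimal sketch; the sample size and seed are arbitrary):
import numpy as np

p = 0.9
rng = np.random.default_rng(0)
eps = rng.binomial(1, p, size=1_000_000)  # samples of ε ~ Bernoulli(p)
print(eps.mean())         # ≈ p, the mean
print((eps ** 2).mean())  # ≈ p as well, since ε^n = ε for 0/1 values
print(eps.var())          # ≈ p * (1 - p), the variance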
Training and Prediction
During the training phase, Dropout sets certain values to zero with probability $1-p$ and divides the remaining values by $p$. Thus, Dropout effectively introduces a random variable $\varepsilon\sim \text{Bernoulli}(p)$, transforming the model from $f(x)$ to $f(x\varepsilon/p)$. While $\varepsilon$ can have multiple components corresponding to independent Bernoulli distributions, in most cases, the result is fundamentally no different from when $\varepsilon$ is a scalar, so we only need to derive for the scalar case.
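In code, the training-time operation is just the following (a minimal NumPy sketch of this scaling, sometimes called "inverted Dropout"; the function names are illustrative):
import numpy as np

def dropout_train(x, keep_prob, rng):
    # Training phase: zero each entry with probability 1 - keep_prob and
    # divide the survivors by keep_prob, i.e. x -> x * eps / p
    eps = rng.binomial(1, keep_prob, size=x.shape)
    return x * eps / keep_prob

def dropout_eval(x):
    # Prediction phase: Dropout is turned off, i.e. eps / p is replaced by 1
    return x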
In "Dropout Twice Again! This Time It Achieved SOTA in Supervised Tasks", we proved that if the loss function is MSE, then the optimal prediction model after training is completed should be:
\begin{equation}\mathbb{E}_{\varepsilon}[f(x\varepsilon/p)]\end{equation}
This means we should make multiple predictions without turning off Dropout and then average the results to serve as the final prediction, i.e., "model averaging." However, this is computationally expensive. In practice, we rarely do this; instead, we usually turn off Dropout, which means changing $\varepsilon/p$ to 1. Since we know:
\begin{equation}f(x)=f(x\,\mathbb{E}_{\varepsilon}[\varepsilon]/p)\end{equation}
turning off Dropout is effectively a "weight averaging" approach (viewing $\varepsilon$ as the model's random weights). In other words, while the theoretical optimal solution is "model averaging," due to computational constraints, we usually approximate it with "weight averaging," which can be seen as a first-order approximation of "model averaging."
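The gap between the two kinds of averaging can be made concrete with a toy model (a sketch only; f here is an arbitrary nonlinearity standing in for a trained network):
import numpy as np

rng = np.random.default_rng(0)
p, x = 0.9, 2.0
f = np.tanh  # stand-in for a trained model f(.)

# "Model averaging": keep Dropout on and average many stochastic predictions
eps = rng.binomial(1, p, size=100_000)
model_avg = f(x * eps / p).mean()

# "Weight averaging": turn Dropout off, i.e. replace eps / p by its mean 1
weight_avg = f(x * 1.0)

print(model_avg, weight_avg)  # similar but not identical, because f is nonlinear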
MLM Model
In this section, we treat the MLM model as a special type of Dropout. This allows us to clearly describe the inconsistency between pre-training and fine-tuning and to derive a simple correction strategy for alleviating it.
Dropout Perspective
For simplicity, let's first analyze a simplified version of MLM: assume that during the pre-training phase, each token remains unchanged with probability $p$ and is replaced by [MASK] with probability $1-p$. Let the embedding of the $i$-th token be $x_i$ and the embedding of [MASK] be $m$. We can similarly introduce a random variable $\varepsilon\sim \text{Bernoulli}(p)$ and denote the MLM model as:
\begin{equation}f(\cdots,x_i,\cdots)\quad\rightarrow\quad f(\cdots,x_i \varepsilon + m(1-\varepsilon),\cdots)\end{equation}
In this way, MLM is essentially the same as Dropout; both introduce random perturbations to the model through a Bernoulli distribution. Now, following the standard usage of Dropout, the prediction model should use "weight averaging," namely:
\begin{equation}f(\cdots,\mathbb{E}_{\varepsilon}[x_i \varepsilon + m(1-\varepsilon)],\cdots) = f(\cdots,x_i p + m (1-p),\cdots)\end{equation}
At this point, the inconsistency of MLM during the fine-tuning phase becomes apparent: if we treat pre-trained MLM as a special Dropout, fine-tuning corresponds to "disabling Dropout." According to standard practice, we should change each token's embedding to $x_i p + m (1-p)$, but in reality, we do not; we retain the original $x_i$.
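The following sketch spells out the three inputs involved, at the embedding level (pure NumPy; x, m and the keep probability p are placeholders for illustration):
import numpy as np

rng = np.random.default_rng(0)
dim, p = 4, 0.9
x = rng.normal(size=dim)  # embedding of the i-th token
m = rng.normal(size=dim)  # embedding of [MASK]

# Pre-training: the token is kept with probability p, masked with probability 1-p
eps = rng.binomial(1, p)
pretrain_input = x * eps + m * (1 - eps)

# "Weight averaging" would use the expectation of the above at fine-tuning time
consistent_finetune_input = x * p + m * (1 - p)

# What standard fine-tuning actually feeds in: the unmodified x
actual_finetune_input = x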
Correcting Embedding
According to BERT's default settings, during MLM training, 15% of tokens are selected for prediction. Among these 15%, there is an 80% chance of being replaced by [MASK], a 10% chance of remaining unchanged, and a 10% chance of being replaced by a random token. Based on the previous analysis, after MLM pre-training is complete, we should adjust the Embedding as follows:
\begin{equation}\text{Embedding[i]} \leftarrow 0.85\times \text{Embedding[i]} + 0.15\times\left(\begin{array}{l}0.8\times \text{Embedding[m]} \,+\\ 0.1 \times \text{Embedding[i]} \,+ \\ 0.1\times \text{Avg[Embedding]}\end{array}\right)
\end{equation}
Where $\text{Embedding[m]}$ is the embedding of [MASK], and $\text{Avg[Embedding]}$ is the average embedding of all tokens. In bert4keras, the reference code is as follows:
embeddings = model.get_weights()[0] # Generally, the first weight is the Token Embedding
v1 = embeddings[tokenizer._token_mask_id][None] # Embedding of [MASK]
v2 = embeddings.mean(0)[None] # Average Embedding
embeddings = 0.85 * embeddings + 0.15 * (0.8 * v1 + 0.1 * embeddings + 0.1 * v2) # Weighted average
K.set_value(model.weights[0], embeddings) # Re-assign
Does this modification provide the expected improvement? I compared the experimental results of BERT and RoBERTa on CLUE before and after the modification (baseline code refers to "bert4keras in hand, baseline I have: CLUE Benchmark Code"). The conclusion is: "No significant change."
Reading this, you might feel disappointed: does this mean everything said earlier was in vain? I believe that such a manual correction can indeed alleviate the inconsistency (otherwise we would effectively be denying the validity of Dropout's weight averaging itself). The fact that the results showed no improvement suggests that the inconsistency problem may not be as severe as we imagined, at least for the CLUE tasks. A similar result appeared with MacBERT, which replaces [MASK] with similar words precisely to correct this inconsistency; yet when I tested MacBERT with the same baseline code, it showed no significant difference from RoBERTa. Therefore, it may be that only on specific tasks, or with larger mask ratios, does correcting this inconsistency become clearly necessary.
MAE Model
Many readers may have heard of the MAE (Masked Autoencoder) model recently proposed by Kaiming He. It introduces the MLM task into image pre-training in a simple and efficient way and has achieved effective improvements. In this section, we will see that MAE can also be understood as a special kind of Dropout, which leads to a new method for preventing overfitting.
Dropout Perspective
As shown in the figure below, MAE divides the model into an encoder and a decoder, characterized by a "deep encoder, shallow decoder." It places [MASK] only in the decoder, while the encoder does not process [MASK]. Consequently, the sequence the encoder processes becomes shorter. Crucially, MAE uses a 75% mask ratio, meaning the encoder's sequence length is only 1/4 of the usual length. Combined with the "deep encoder, shallow decoder" characteristic, the overall pre-training speed of the model is increased by more than three times!

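As a rough back-of-the-envelope estimate (my own arithmetic, ignoring the decoder, embeddings and constant factors): if the encoder keeps only a fraction $r$ of the tokens, the parts of a Transformer layer that are linear in sequence length shrink by $r$ while the attention part, which is quadratic, shrinks by $r^2$, so the encoder cost becomes roughly
\begin{equation}r\cdot C_{\text{linear}} + r^2\cdot C_{\text{attn}}\end{equation}
With $r = 1/4$ this lies somewhere between $1/4$ and $1/16$ of the original encoder cost, which is consistent with an overall speed-up of three times or more once the shallow decoder is added back.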
We can also implement the MAE model from another perspective: MAE's removal of [MASK] from the encoder is equivalent to the remaining tokens not interacting with the masked tokens. For a Transformer model, token interaction comes from Self Attention. Thus, we can still maintain the original input but mask the corresponding columns in the Attention matrix. As shown in the figure, if the $i$-th token is masked, it is essentially equivalent to forcing all elements in the $i$-th column of the Attention matrix to 0:

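This column-masking equivalence is easy to check numerically (a sketch with a single random attention matrix; softmax is defined inline):
import numpy as np

rng = np.random.default_rng(0)
n, i = 5, 2
scores = rng.normal(size=(n, n))  # pre-softmax attention scores

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# View 1: remove the i-th token from the keys, as MAE's encoder effectively does
A1 = softmax(np.delete(scores, i, axis=1))

# View 2: keep the full input, zero the i-th column of A, then renormalize rows
A = softmax(scores)
A[:, i] = 0
A2 = A / A.sum(axis=1, keepdims=True)

# The rows agree (the masked token's own output is simply not used downstream)
print(np.allclose(np.delete(A2, i, axis=1), A1))  # True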
Of course, from a practical standpoint this approach wastes computation, but it helps us derive an interesting theoretical result. Suppose there are $n$ input tokens and the original (post-softmax) Attention matrix is $A$. Define $M_i$ as the $n\times n$ matrix whose $i$-th column is 0 and whose remaining entries are 1, and define a random matrix $\tilde{M}_i$ that equals the all-ones matrix with probability $p$ and $M_i$ with probability $1-p$. The MAE model can then be written as:
\begin{equation}f(\cdots,A,\cdots)\quad\rightarrow\quad f(\cdots,\text{Norm}(A\otimes \tilde{M}_1\otimes \tilde{M}_2\otimes \cdots\otimes \tilde{M}_n),\cdots)\end{equation}
Here $\text{Norm}$ refers to re-normalizing the matrix by row; $\otimes$ represents element-wise multiplication; when there are multiple Attention layers, all layers share the same $\tilde{M}_1, \tilde{M}_2, \cdots, \tilde{M}_n$.
In this way, we have converted MAE into a special kind of Attention Dropout. Then, following the standard practice of "disabling Dropout" during fine-tuning, the corresponding model should be:
\begin{align}
&\,f(\cdots,\text{Norm}(A\otimes \mathbb{E}[\tilde{M}_1\otimes \tilde{M}_2\otimes \cdots\otimes \tilde{M}_n]),\cdots)\nonumber\\
=&\,f(\cdots,\text{Norm}(A\otimes \mathbb{E}[\tilde{M}_1]\otimes \mathbb{E}[\tilde{M}_2]\otimes \cdots\otimes \mathbb{E}[\tilde{M}_n]),\cdots)\nonumber\\
=&\,f(\cdots,\text{Norm}(Ap),\cdots)\nonumber\\
=&\,f(\cdots,A,\cdots)
\end{align}
The second equality is because $\mathbb{E}[\tilde{M}_i]$ is a matrix where the $i$-th column is $p$ and the rest are 1, so $\mathbb{E}[\tilde{M}_1]\otimes \mathbb{E}[\tilde{M}_2]\otimes \cdots\otimes \mathbb{E}[\tilde{M}_n]$ is effectively an all-$p$ matrix. Multiplying $A$ by this is equivalent to multiplying $A$ directly by the constant $p$. The third equality holds because multiplying all elements by the same constant does not affect the row-wise normalization results.
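Both facts used in this derivation, namely that the element-wise product of the $\mathbb{E}[\tilde{M}_i]$ is the all-$p$ matrix and that row-wise normalization absorbs the constant $p$, are easy to verify numerically (a sketch; A is a random row-stochastic matrix standing in for a softmaxed Attention matrix):
import numpy as np

n, p = 4, 0.25
rng = np.random.default_rng(0)

# E[M̃_i]: the i-th column is p, everything else is 1
prod = np.ones((n, n))
for i in range(n):
    M = np.ones((n, n))
    M[:, i] = p
    prod *= M
print(np.allclose(prod, p))  # True: the product is the all-p matrix

# Row-normalizing p * A gives back A, so "disabling Dropout" changes nothing
A = rng.dirichlet(np.ones(n), size=n)  # random matrix with rows summing to 1
Ap = A * prod
print(np.allclose(Ap / Ap.sum(axis=1, keepdims=True), A))  # True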
From this result, we see that for MAE, "disabling Dropout" results in the same model as the original. This indicates that MAE, compared to the original MLM model, offers not only a speed improvement but also better consistency between pre-training and fine-tuning.
Preventing Overfitting
Conversely, since MAE can also be viewed as a type of Dropout, and Dropout helps prevent overfitting, can we use the MAE approach as a regularization method to prevent overfitting? As shown in the figure below, during the training phase, we can randomly drop some tokens while maintaining the original positions of the remaining tokens. We will tentatively call this "DropToken":

The reasoning is as follows. Conventional Dropout is often intuitively described as sampling a sub-network for training, but that sub-network is purely conceptual; in practice Dropout does not reduce computation and even slightly increases training time. DropToken, by explicitly shortening the sequences, can genuinely increase training speed, so if it proves effective it would be a very practical technique. Furthermore, some readers may have tried data augmentation by randomly deleting words; the difference with DropToken is that although some tokens are deleted, the remaining tokens keep their original positions, something the Transformer architecture itself makes possible.
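To make the idea concrete, here is a minimal sketch of what DropToken could look like on the input side, assuming a Transformer that accepts explicit position ids; the function and variable names are illustrative and not taken from bert4keras or any other library:
import numpy as np

def drop_token(token_ids, drop_ratio, rng):
    # Randomly drop tokens but keep the original position ids of the survivors,
    # so the model still knows where each remaining token sat in the sequence.
    keep = rng.random(len(token_ids)) >= drop_ratio
    keep[0] = True  # e.g. always keep [CLS]; an illustrative choice
    kept_ids = [t for t, k in zip(token_ids, keep) if k]
    kept_positions = [i for i, k in enumerate(keep) if k]
    return kept_ids, kept_positions

# Usage (training only): feed kept_ids together with kept_positions to the model;
# at prediction time the full, unmodified sequence is used.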
Several comparative experiments on CLUE, using BERT-base as the baseline model (subscripts indicate the drop ratio), showed mixed results. Except for IFLYTEK, which showed clear benefits, the other tasks were hit-or-miss (as is often the case with regularization methods). The optimal drop ratio was found to be between 0.1 and 0.15:
\begin{array}{c}
\text{CLUE Classification Task Comparison Experiments (Validation Set)} \\
{\begin{array}{|c|cccccc|}
\hline
& \text{IFLYTEK} & \text{TNEWS} & \text{AFQMC} & \text{OCNLI} & \text{WSC} & \text{CSL} \\
\hline
\text{BERT}_{0.00} & 60.06 & 56.80 & 72.41 & 73.93 & 78.62 & 83.93 \\
\text{BERT}_{0.10} & 60.56 & 57.00 & 72.61 & 73.76 & 77.30 & 83.33\\
\text{BERT}_{0.15} & 60.10 & 56.68 & 72.50 & 74.54 & 77.30 & 83.30\\
\text{BERT}_{0.25} & 61.29 & 56.88 & 72.34 & 73.09 & 73.68 & 83.37\\
\text{BERT}_{0.50} & 61.45 & 57.02 & 69.76 & 70.68 & 69.41 & 82.56\\
\hline
\end{array}}
\end{array}
Conclusion
This article explored the MLM and MAE models from a Dropout perspective. Both can be viewed as special forms of Dropout. From this viewpoint, we obtained a technique to correct MLM inconsistency and a DropToken technique similar to MAE for preventing overfitting.