By 苏剑林 | June 17, 2021
In the previous article, "Trading Time for Performance: Keras Gradient Accumulation Optimizer," we introduced "gradient accumulation," a technique for achieving the effect of a large batch size under limited GPU memory. Generally speaking, gradient accumulation is suitable for scenarios where the total loss decomposes across samples: each sample's loss is computed independently, and the total loss is simply the sum or average of the individual losses. However, not all tasks satisfy this condition. For instance, in the recently popular contrastive learning, each sample's loss also depends on the other samples in the batch. So, in the context of contrastive learning, can we still use gradient accumulation to achieve the effect of a large batch size? This article analyzes that question.
In general, the loss for contrastive learning can be written as:
\begin{equation}\mathcal{L}=-\sum_{i,j=1}^b t_{i,j}\log p_{i,j} = -\sum_{i,j=1}^b t_{i,j}\log \frac{e^{s_{i,j}}}{\sum\limits_{k=1}^b e^{s_{i,k}}}=-\sum_{i,j=1}^b t_{i,j}s_{i,j} + \sum_{i=1}^b \log\sum_{j=1}^b e^{s_{i,j}}\label{eq:loss}\end{equation}Here $b$ is the batch size; $t_{i,j}$ is a pre-given label satisfying $t_{i,j}=t_{j,i}$; as a matrix, it is one-hot, with exactly one 1 in each row (and each column) and 0 elsewhere, so the last equality uses $\sum_j t_{i,j}=1$. $s_{i,j}$ is the similarity between samples $i$ and $j$, satisfying $s_{i,j}=s_{j,i}$. There is usually also a temperature parameter, but we assume it has been absorbed into $s_{i,j}$ to simplify notation. The model parameters, denoted $\theta$, enter through $s_{i,j}$.
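To make the notation concrete, here is a minimal numerical sketch of Equation $\eqref{eq:loss}$ in PyTorch; the framework choice, the pairing pattern in $t$, and all sizes are illustrative assumptions rather than anything from the original setup:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, d = 4, 8
h = F.normalize(torch.randn(b, d), dim=1)   # stand-in encoder outputs h_i
s = h @ h.T                                 # s_{i,j}; temperature assumed absorbed
# symmetric one-hot labels: sample 2k is paired with sample 2k+1
t = torch.zeros(b, b)
t[0, 1] = t[1, 0] = t[2, 3] = t[3, 2] = 1.0
loss = -(t * s.log_softmax(dim=1)).sum()    # Eq. (1): row-wise cross-entropy
```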
It can be verified that, in general:
\begin{equation}-\sum_{i,j=1}^{2b} t_{i,j}\log p_{i,j} \neq -\sum_{i,j=1}^{b} t_{i,j}\log p_{i,j}-\sum_{i,j=b+1}^{2b} t_{i,j}\log p_{i,j}\end{equation}since each $p_{i,j}$ on the right is normalized over only its own half-batch, while on the left it is normalized over all $2b$ samples. Therefore, simply accumulating the gradients of small-batch contrastive losses is not equivalent to contrastive learning with a large batch size. A similar problem arises in models with Batch Normalization (BN).
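This is easy to check numerically. Reusing the tensors from the sketch above, each half-batch loss renormalizes the probabilities over only its own half, so the two sides generally differ:

```python
def ce_loss(s, t):
    # cross-entropy form of Eq. (1); rows renormalize over whatever is passed in
    return -(t * s.log_softmax(dim=1)).sum()

big = ce_loss(s, t)                                                     # one batch of 4
halves = ce_loss(s[:2, :2], t[:2, :2]) + ce_loss(s[2:, 2:], t[2:, 2:])  # two batches of 2
print(big.item(), halves.item())  # generally unequal
```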
Note that we just said conventional simple gradient accumulation is not equivalent, but it is possible that a slightly more complex accumulation scheme exists. To find out, let's analyze the gradient of Equation $\eqref{eq:loss}$:
\begin{equation}\begin{aligned} \nabla_{\theta}\mathcal{L} =&\, -\sum_{i,j=1}^b t_{i,j}\nabla_{\theta}s_{i,j} + \sum_{i=1}^b \nabla_{\theta}\log\sum_{j=1}^b e^{s_{i,j}} \\ =&\, -\sum_{i,j=1}^b t_{i,j}\nabla_{\theta}s_{i,j} + \sum_{i,j=1}^b p_{i,j}\nabla_{\theta} s_{i,j} \\ =&\,\nabla_{\theta}\sum_{i,j=1}^b \left(p_{i,j}^{(sg)} - t_{i,j}\right)s_{i,j} \end{aligned}\end{equation}where the superscript $(sg)$ means that $p_{i,j}$ is treated as a constant when differentiating with respect to $\theta$; this is the stop_gradient operator in deep learning frameworks. The equation shows that, for any gradient-based optimizer, taking Equation $\eqref{eq:loss}$ as the loss is completely equivalent to taking $\sum\limits_{i,j=1}^b \left(p_{i,j}^{(sg)} - t_{i,j}\right)s_{i,j}$ as the loss, because the resulting gradients are exactly the same.
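This equivalence can be verified mechanically. In the sketch below, a toy linear encoder (purely an assumption for illustration) stands in for the model, and `detach()` plays the role of stop_gradient; the surrogate loss yields exactly the same gradient as Equation $\eqref{eq:loss}$:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, d = 4, 8
x = torch.randn(b, d)
W = torch.randn(d, d, requires_grad=True)   # theta: a toy linear encoder
t = torch.zeros(b, b)
t[0, 1] = t[1, 0] = t[2, 3] = t[3, 2] = 1.0

def sims(W):
    h = F.normalize(x @ W, dim=1)
    return h @ h.T                          # s_{i,j} = <h_i, h_j>

# gradient of the original loss, Eq. (1)
loss1 = -(t * sims(W).log_softmax(dim=1)).sum()
g1, = torch.autograd.grad(loss1, W)

# gradient of the surrogate loss, with p_{i,j} under stop_gradient (detach)
s = sims(W)
p_sg = s.softmax(dim=1).detach()            # p_{i,j}^{(sg)}
loss2 = ((p_sg - t) * s).sum()
g2, = torch.autograd.grad(loss2, W)

print(torch.allclose(g1, g2, atol=1e-6))    # True: the gradients coincide
```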
Next, consider the calculation of $\nabla_{\theta}s_{i,j}$. Generally, $s_{i,j}$ takes the form of a vector inner product, i.e., $s_{i,j}=\langle h_i, h_j\rangle$, where the parameters $\theta$ live inside $h_i$ and $h_j$. In this case:
\begin{equation}\nabla_{\theta}s_{i,j}=\langle \nabla_{\theta}h_i, h_j\rangle + \langle h_i, \nabla_{\theta}h_j\rangle = \nabla_{\theta}\left(\langle h_i, h_j^{(sg)}\rangle + \langle h_i^{(sg)}, h_j\rangle\right)\end{equation}Therefore, $s_{i,j}$ in the loss can be replaced by $\langle h_i, h_j^{(sg)}\rangle + \langle h_i^{(sg)}, h_j\rangle$ without changing the outcome:
\begin{equation}\begin{aligned} \nabla_{\theta}\sum_{i,j=1}^b \left(p_{i,j}^{(sg)} - t_{i,j}\right)s_{i,j} =&\, \nabla_{\theta}\sum_{i,j=1}^b \left(p_{i,j}^{(sg)} - t_{i,j}\right)\left(\langle h_i, h_j^{(sg)}\rangle + \langle h_i^{(sg)}, h_j\rangle\right)\\ =&\, 2\nabla_{\theta}\sum_{i,j=1}^b \left(\overline{p_{i,j}^{(sg)}} - t_{i,j}\right)\langle h_i, h_j^{(sg)}\rangle\\ =&\,\nabla_{\theta}\sum_{i=1}^b \left\langle h_i, 2\sum_{j=1}^b\left(\overline{p_{i,j}^{(sg)}} - t_{i,j}\right)h_j^{(sg)}\right\rangle \end{aligned}\label{eq:g}\end{equation}where $2\overline{p_{i,j}^{(sg)}}=p_{i,j}^{(sg)} + p_{j,i}^{(sg)}$. The second equality comes from swapping the summation indices $i,j$ in the $\langle h_i^{(sg)}, h_j\rangle$ term, which leaves the sum unchanged because $t_{i,j}=t_{j,i}$ and the inner product is symmetric.
Equation $\eqref{eq:g}$ actually provides the final solution, which can be divided into two steps. The first step is the calculation of the vector:
\begin{equation}\tilde{h}_i = 2\sum_{j=1}^b\left(\overline{p_{i,j}^{(sg)}} - t_{i,j}\right)h_j^{(sg)}\label{eq:h}\end{equation}This step requires no gradient computation; it is a pure inference pass, so its batch size can be relatively large. The second step is to feed $\tilde{h}_i$ back into the model as a fixed "label" and optimize the model with $\langle h_i, \tilde{h}_i\rangle$ as the loss for each individual sample. This step does require gradients, but the loss is now a plain sum of per-sample terms, so conventional gradient accumulation applies.
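In code, the first step is a single gradient-free matrix product. A sketch reusing the toy encoder (`x`, `W`, `t`) from the previous snippet:

```python
# Eq. (6): label vectors from a plain inference pass, no gradients kept
with torch.no_grad():
    h = F.normalize(x @ W, dim=1)       # h_j^{(sg)}
    p = (h @ h.T).softmax(dim=1)        # p_{i,j}
    p_bar = (p + p.T) / 2               # symmetrized probabilities
    h_tilde = 2 * (p_bar - t) @ h       # h~_i = 2 * sum_j (p_bar_{ij} - t_{ij}) h_j
```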
Suppose the maximum batch size for backpropagation is $b$, and the maximum batch size for (gradient-free) forward propagation is $nb$. To achieve the effect of contrastive learning with a batch size of $nb$ through gradient accumulation, the workflow is as follows (a code sketch follows the list):

1. Forward propagate all $nb$ samples without gradients to obtain $\{h_i\}_{i=1}^{nb}$;
2. Compute $\{p_{i,j}\}_{i,j=1}^{nb}$ from the $h_i$, then compute the label vectors $\{\tilde{h}_i\}_{i=1}^{nb}$ via Equation $\eqref{eq:h}$ (with $b$ replaced by $nb$);
3. Split the $nb$ samples into $n$ sub-batches of size $b$; for each sub-batch, run a forward pass with gradients, take $\langle h_i, \tilde{h}_i\rangle$ as the per-sample loss, and accumulate the gradients;
4. Update the parameters with the accumulated gradient.
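Here is a self-contained sketch of the four steps on the same kind of toy linear encoder (all names, sizes, and the pairing pattern are illustrative assumptions). It also checks, at the end, that the accumulated gradient matches the gradient of the full $nb$-sample loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, b, d = 4, 8, 16                            # n accumulation steps of backward batch size b
nb = n * b
x = torch.randn(nb, d)                        # one "large batch" of inputs
W = torch.randn(d, d, requires_grad=True)     # toy encoder parameters theta

# symmetric one-hot labels: sample 2k is paired with sample 2k+1
t = torch.zeros(nb, nb)
idx = torch.arange(0, nb, 2)
t[idx, idx + 1] = 1.0
t[idx + 1, idx] = 1.0

def encode(x, W):
    return F.normalize(x @ W, dim=1)

# Steps 1-2: one gradient-free forward pass over all nb samples, then the labels
with torch.no_grad():
    h_all = encode(x, W)
    p = (h_all @ h_all.T).softmax(dim=1)
    h_tilde = 2 * ((p + p.T) / 2 - t) @ h_all          # Eq. (6)

# Step 3: conventional gradient accumulation over n sub-batches of size b
W.grad = None
for k in range(n):
    sl = slice(k * b, (k + 1) * b)
    loss = (encode(x[sl], W) * h_tilde[sl]).sum()      # sum_i <h_i, h~_i>
    loss.backward()                                    # gradients add up in W.grad

# Step 4 would hand W.grad to the optimizer; here we verify it instead
h = encode(x, W)
full_loss = -(t * (h @ h.T).log_softmax(dim=1)).sum()  # Eq. (1) on all nb samples
g_full, = torch.autograd.grad(full_loss, W)
print(torch.allclose(W.grad, g_full, atol=1e-5))       # True
```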
Overall, in terms of computation, this requires one extra forward pass compared to conventional gradient accumulation. Of course, if the maximum batch size for a single forward pass still doesn't meet our needs, the forward pass can also be calculated in batches, since we only need to compute and store each $\{h_i\}_{i=1}^{nb}$, and $\{p_{i,j}\}_{i,j=1}^{nb}$ can be calculated based on $\{h_i\}_{i=1}^{nb}$.
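Reusing `encode`, `x`, `W`, `n`, and `b` from the sketch above, chunking the gradient-free forward pass is a one-liner:

```python
with torch.no_grad():
    h_all = torch.cat([encode(x[k * b:(k + 1) * b], W) for k in range(n)], dim=0)
    # p_{i,j} and h~_i are then computed from h_all exactly as before
```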
Finally, it is worth emphasizing that the above workflow is equivalent to large-batch training only at the level of optimization: the gradient of $\langle h_i, \tilde{h}_i\rangle$ equals the gradient of the original loss, but its value does not equal the value of the original loss. Therefore, $\langle h_i, \tilde{h}_i\rangle$ cannot be used as a metric to evaluate the model: it need not decrease monotonically, it need not be non-negative, and it may not even correlate strictly with the original loss.
The above workflow shares a problem with the "recomputation" technique introduced in "A Keras Version of Memory-Saving Recomputation Techniques": it is incompatible with Dropout. Each parameter update involves multiple forward passes, and each forward pass samples a different Dropout mask. This means the $h_i$ used to compute the label vectors $\tilde{h}_i$ is not the same as the $h_i$ used to compute the gradient, so the gradient actually computed is not quite the intended one.
There is no good solution to this; the simplest and most effective remedy is to remove Dropout from the model. For CV (computer vision) this is generally not a problem, as CV models have largely moved away from Dropout. For NLP, the first casualty that comes to mind is SimCSE: it cannot use this gradient accumulation scheme, because Dropout is the very foundation of SimCSE.
This article analyzed a gradient accumulation method for contrastive learning. The conclusion is that contrastive learning can use gradient accumulation, at the cost of one extra forward pass per update and the removal of Dropout from the model. The same approach can be used to analyze how models with BN can use gradient accumulation; interested readers may wish to try it.