By 苏剑林 | August 24, 2021
As we know, gradient accumulation is a common technique for achieving large-batch training under limited GPU memory. In a previous article, "Trading Time for Quality: Keras Gradient Accumulation Optimizer" (Chinese), we briefly introduced how gradient accumulation is implemented. The general idea is to add a new set of parameters that caches the gradients, then use the cached gradients to update the model. Unfortunately, this extra set of parameters brings extra memory consumption. While thinking about optimizers these past few days, I suddenly realized that gradient accumulation can be built directly into optimizers with momentum! Following this idea, I worked through some derivations and ran experiments, eventually reaching an interesting but somewhat counter-intuitive conclusion: updating the parameters fewer times may actually improve the model's final performance!
Before the formal discussion, we define the function
\begin{equation}\chi_{t/k} = \left\{ \begin{aligned}&1,\quad t \equiv 0\,(\text{mod}\, k) \\ &0,\quad t \not\equiv 0\,(\text{mod}\, k) \end{aligned}\right.\end{equation}That is to say, $t$ is an integer, and when it is a multiple of $k$, $\chi_{t/k}=1$; otherwise, $\chi_{t/k}=0$. This is essentially an indicator function for whether $t$ is divisible by $k$. In the following discussions, we will repeatedly use this function.
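As a quick illustration, this indicator can be written as a one-line helper (a trivial Python sketch; the name `chi` is our own):

```python
def chi(t, k):
    """Indicator function: 1 when the integer t is divisible by k, else 0."""
    return 1 if t % k == 0 else 0

# With k = 3: chi(3, 3) == 1 and chi(6, 3) == 1, while chi(4, 3) == 0
```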
Now, let's discuss Stochastic Gradient Descent with Momentum (SGDM). Its general form is:
\begin{equation}\left\{\begin{aligned} m_t =&\, \beta m_{t-1} + \left(1 - \beta\right) g_t\\ \theta_t =&\, \theta_{t-1} - \alpha_t m_t \end{aligned}\right.\end{equation}Here $g_t$ is the gradient at step $t$, $\theta$ represents the parameters, and $\beta$ is the sliding coefficient (momentum decay). Suppose we accumulate gradients for $k$ steps; this means that parameters are updated only once every $k$ steps, and each update uses the average of the gradients from those $k$ steps, namely:
\begin{equation}\left\{\begin{aligned} m_{kt} =&\, \beta m_{k(t-1)} + \left(1 - \beta\right) \frac{1}{k}\sum_{i=1}^k g_{k(t-1) + i}\\ \theta_{kt} =&\, \theta_{k(t-1)} - \alpha_{kt} m_{kt} \end{aligned}\right.\end{equation}We can decompose the update of the momentum $m$:
\begin{equation}\begin{aligned} m_{k(t-1)+1} &\,= \beta m_{k(t-1)} + \frac{1}{k}\left(1 - \beta\right)g_{k(t-1) + 1}\\ m_{k(t-1)+2} &\,= m_{k(t-1) + 1} + \frac{1}{k}\left(1 - \beta\right)g_{k(t-1) + 2} \\ m_{k(t-1)+3} &\,= m_{k(t-1) + 2} + \frac{1}{k}\left(1 - \beta\right)g_{k(t-1) + 3} \\ &\,\,\,\vdots \\ m_{kt} &\,= m_{kt-1} + \frac{1}{k}\left(1 - \beta\right)g_{kt} \\ \end{aligned}\label{eq:m-part}\end{equation}Or write it as a general formula:
\begin{equation}m_t = \big[(\beta - 1)\chi_{(t - 1)/k} + 1\big] m_{t-1} + \frac{1}{k}\left(1 - \beta\right) g_t\end{equation}Similarly, the update of parameters $\theta$ can also be decomposed:
\begin{equation}\begin{aligned} \theta_{k(t-1)+1} &\,= \theta_{k(t-1)}\\ \theta_{k(t-1)+2} &\,= \theta_{k(t-1) + 1}\\ &\,\,\,\vdots \\ \theta_{kt-1} &\,= \theta_{kt-2}\\ \theta_{kt} &\,= \theta_{kt-1} - \alpha_{kt} m_{kt} \\ \end{aligned}\end{equation}Or written as a general formula:
\begin{equation}\theta_t = \theta_{t-1} - \chi_{t/k} \alpha_t m_t \end{equation}Therefore, for SGD with momentum, accumulating gradients over $k$ steps only requires updating in the following manner:
\begin{equation}\left\{\begin{aligned} m_t =&\, \big[(\beta - 1)\chi_{(t - 1)/k} + 1\big] m_{t-1} + \frac{1}{k}\left(1 - \beta\right) g_t\\ \theta_t =&\, \theta_{t-1} - \chi_{t/k} \alpha_t m_t \end{aligned}\right.\end{equation}This does not require introducing a new set of parameters.
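The exact equivalence between this rewriting and explicit $k$-step accumulation can be checked with a toy scalar example (a sketch under our own function names; NumPy is used only to generate reproducible random "gradients"):

```python
import numpy as np

def sgdm_accum(grads, k, beta=0.9, alpha=0.1, theta0=1.0):
    """SGDM with k-step accumulation folded into the momentum (no extra cache)."""
    m, theta = 0.0, theta0
    for t, g in enumerate(grads, start=1):
        chi_prev = 1.0 if (t - 1) % k == 0 else 0.0     # chi_{(t-1)/k}
        m = ((beta - 1.0) * chi_prev + 1.0) * m + (1.0 - beta) * g / k
        if t % k == 0:                                   # chi_{t/k}
            theta -= alpha * m
    return theta

def sgdm_explicit(grads, k, beta=0.9, alpha=0.1, theta0=1.0):
    """Reference: vanilla SGDM applied to explicitly averaged k-step gradients."""
    m, theta = 0.0, theta0
    for j in range(0, len(grads), k):
        m = beta * m + (1.0 - beta) * sum(grads[j:j + k]) / k
        theta -= alpha * m
    return theta

rng = np.random.default_rng(0)
grads = rng.normal(size=12).tolist()
assert abs(sgdm_accum(grads, 3) - sgdm_explicit(grads, 3)) < 1e-10
```

Both routines perform one parameter update per block of $k$ gradients, and their trajectories coincide up to floating-point rounding.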
For Adam, the update formulas are as follows:
\begin{equation}\left\{\begin{aligned} m_t =&\, \beta_1 m_{t-1} + \left(1 - \beta_1\right) g_t\\ v_t =&\, \beta_2 v_{t-1} + \left(1 - \beta_2\right) g_t^2\\ \hat{m}_t =&\, m_t\left/\left(1 - \beta_1^t\right)\right.\\ \hat{v}_t =&\, v_t\left/\left(1 - \beta_2^t\right)\right.\\ \theta_t =&\, \theta_{t-1} - \alpha_t \hat{m}_t\left/\sqrt{\hat{v}_t + \epsilon}\right. \end{aligned}\right.\end{equation}The handling of the first moment $m$ is consistent with the aforementioned SGDM, so the key lies in the second moment $v$. According to the definition, when accumulating gradients for $k$ steps, the update formula for $v$ is:
\begin{equation} v_{kt} = \beta_2 v_{k(t-1)} + \left(1 - \beta_2\right) \left(\frac{1}{k}\sum_{i=1}^k g_{k(t-1) + i}\right)^2\end{equation}Unfortunately, due to the existence of the square, $v$ cannot be decomposed in the same way as Equation $\eqref{eq:m-part}$. Therefore, strictly speaking, it is impossible to implement gradient accumulation for Adam without adding a set of cache parameters. However, if we assume that the "square of the average" can be estimated by the "average of the squares," i.e., assuming
\begin{equation}\left(\frac{1}{k}\sum_{i=1}^k g_{k(t-1) + i}\right)^2\sim \frac{1}{k}\sum_{i=1}^k g_{k(t-1) + i}^2\end{equation}Here $\sim$ represents a relatively tight linear correlation, but not necessarily an approximate equality; for example, they could differ by a constant multiple. Then we can still modify the formula for $v$ just like we did for $m$:
\begin{equation}v_t = \big[(\beta_2 - 1)\chi_{(t - 1)/k} + 1\big] v_{t-1} + \frac{1}{k}\left(1 - \beta_2\right) g_t^2\end{equation}Combining these, we get the modified Adam formula:
\begin{equation}\left\{\begin{aligned} m_t =&\, \big[(\beta_1 - 1)\chi_{(t - 1)/k} + 1\big] m_{t-1} + \frac{1}{k}\left(1 - \beta_1\right) g_t\\ v_t =&\, \big[(\beta_2 - 1)\chi_{(t - 1)/k} + 1\big] v_{t-1} + \frac{1}{k}\left(1 - \beta_2\right) g_t^2\\ \hat{m}_t =&\, m_t\left/\left(1 - \beta_1^{t/k}\right)\right.\\ \hat{v}_t =&\, v_t\left/\left(1 - \beta_2^{t/k}\right)\right.\\ \theta_t =&\, \theta_{t-1} - \chi_{t/k}\alpha_t \hat{m}_t\left/\sqrt{\hat{v}_t + \epsilon}\right. \end{aligned}\right.\end{equation}Through experiments, I found that Adam modified in this way indeed achieves similar effects to gradient accumulation.
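A single step of this modified Adam can be sketched as follows (illustrative NumPy code; `adam_accum_step` is our own name, and a real implementation would live inside the framework's optimizer class):

```python
import numpy as np

def adam_accum_step(theta, m, v, g, t, k,
                    beta1=0.9, beta2=0.999, alpha=1e-3, eps=1e-8):
    """One step of Adam with k-step gradient accumulation built in.

    t is the 1-based step counter; parameters move only when t % k == 0."""
    chi_prev = 1.0 if (t - 1) % k == 0 else 0.0          # chi_{(t-1)/k}
    m = ((beta1 - 1.0) * chi_prev + 1.0) * m + (1.0 - beta1) * g / k
    v = ((beta2 - 1.0) * chi_prev + 1.0) * v + (1.0 - beta2) * g**2 / k
    if t % k == 0:                                        # chi_{t/k}
        m_hat = m / (1.0 - beta1 ** (t // k))
        v_hat = v / (1.0 - beta2 ** (t // k))
        theta = theta - alpha * m_hat / np.sqrt(v_hat + eps)
    return theta, m, v

# Parameters stay frozen between multiples of k, then take one averaged step:
theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 4):
    theta, m, v = adam_accum_step(theta, m, v, np.ones(2), t, k=3)
```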
In general, the two modified versions of the optimizers above primarily involve two changes:
1. Modifying the update formulas for $m$ and $v$;
2. Updating the parameters only once every $k$ steps (equivalently, changing the learning rate from $\alpha_t$ to $\chi_{t/k}\alpha_t$).
The update formulas for $m, v$ were changed to (without loss of generality, taking $m$ as an example):
\begin{equation}m_t = \big[(\beta - 1)\chi_{(t - 1)/k} + 1\big] m_{t-1} + \frac{1}{k}\left(1 - \beta\right) g_t\end{equation}If we want to map this back to the original format $m_t = \beta m_{t-1} + (1 - \beta) g_t$, we might guess that the above iteration is equivalent to replacing $\beta$ with $\tilde{\beta}=1 - \frac{1}{k}(1-\beta)$. In fact, this is indeed the case. We can prove that the exponential moving average momentum after replacing $\beta$ with $\tilde{\beta}=1 - \frac{1}{k}(1-\beta)$ is approximately equal to the exponential moving average momentum of $k$-step accumulated gradients with the original $\beta$:
\begin{equation}\begin{aligned} m_{kt} =&\, \tilde{\beta}m_{kt-1} + (1 - \tilde{\beta})g_{kt} \\ =&\, \tilde{\beta}^2 m_{kt-2} + \tilde{\beta}(1 - \tilde{\beta})g_{kt - 1} + (1 - \tilde{\beta})g_{kt}\\ =&\,\cdots\\ =&\, \tilde{\beta}^k m_{k(t-1)} + (1 - \tilde{\beta})\sum_{i=1}^k \tilde{\beta}^{i-1} g_{kt-i+1} \end{aligned}\end{equation}Since $\tilde{\beta}$ is typically very close to 1, we approximately assume that $\tilde{\beta}^{i-1}\approx 1, \forall i=1,2,\cdots,k$, and
\begin{equation}\tilde{\beta}^k = (1 - (1 - \tilde{\beta}))^k \approx 1 - k(1 - \tilde{\beta}) = \beta\end{equation}So we have the approximation:
\begin{equation}\begin{aligned} m_{kt} \approx &\, \beta m_{k(t-1)} + (1 - \tilde{\beta})\sum_{i=1}^k g_{kt-i+1} \\ = &\, \beta m_{k(t-1)} + (1 - \beta) \frac{1}{k}\sum_{i=1}^k g_{kt-i+1} \end{aligned}\end{equation}This is the form of gradient accumulation.
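A toy numerical check of this approximation (arbitrary constants, with random numbers standing in for gradients):

```python
import numpy as np

k, beta = 4, 0.9
beta_tilde = 1 - (1 - beta) / k   # the substituted sliding coefficient

rng = np.random.default_rng(1)
grads = rng.normal(loc=1.0, scale=0.1, size=40)

# (a) every-step EMA with beta_tilde
m_fast = 0.0
for g in grads:
    m_fast = beta_tilde * m_fast + (1 - beta_tilde) * g

# (b) k-step accumulated EMA with the original beta
m_slow = 0.0
for j in range(0, len(grads), k):
    m_slow = beta * m_slow + (1 - beta) * grads[j:j + k].mean()

# The two momenta track each other closely, as the derivation predicts.
```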
So, if you are facing performance issues caused by a small batch_size but don't want to implement gradient accumulation yourself or write a custom optimizer, you can try the following:
1. Change all $\beta$ parameters to $1-\frac{1}{k}(1-\beta)$;
2. Multiply the learning rate by $\chi_{t/k}$, so that parameters are updated only once every $k$ steps.
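These two operations can be packaged as a small helper (a sketch; `accumulation_hparams` is our own name, and in practice the gated learning rate would be wired into whatever scheduler your framework provides):

```python
def accumulation_hparams(beta, alpha, k):
    """Implement the recipe above: shrink 1 - beta by a factor of k, and
    gate the learning rate so parameters move only once every k steps."""
    beta_new = 1 - (1 - beta) / k

    def lr(t):  # t is the 1-based step counter
        return alpha if t % k == 0 else 0.0

    return beta_new, lr

beta_new, lr = accumulation_hparams(beta=0.9, alpha=1e-3, k=4)
# beta_new is approximately 0.975; lr is zero except at steps 4, 8, 12, ...
```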
Interestingly, I found that point 2 is more important than point 1: even if we don't modify $\beta$ and simply change the parameter update frequency to once every $k$ steps, the results improve (provided that the small batch size is the current bottleneck of the model). This is the "counter-intuitive" phenomenon mentioned in the title: without changing much, simply updating fewer times actually leads to better results.
This article described a discovery made while debugging models: gradient accumulation is hidden inside the momentum terms of optimizers. A simple analysis and some experiments then showed that a near-equivalent gradient accumulation effect can be achieved merely by adjusting the sliding coefficients and the learning rate, which led to the counter-intuitive finding that "fewer updates can result in better performance."