Gradient Accumulation Hidden in Momentum: Updating Fewer Steps Might Lead to Better Results?

By 苏剑林 | August 24, 2021

As we know, gradient accumulation is a common technique for training with a large effective batch size under limited GPU memory. In a previous article, "Trading Time for Quality: Keras Gradient Accumulation Optimizer" (Chinese), we briefly introduced how gradient accumulation is implemented: the general idea is to add a new set of parameters that caches gradients, then update the model with the cached gradients. Unfortunately, this extra set of parameters costs extra memory. While thinking about optimizers over the past few days, I suddenly realized that gradient accumulation can be built directly into optimizers with momentum! Following this idea, I did some derivations and experiments, eventually reaching an interesting but somewhat counter-intuitive conclusion: updating the parameters fewer times may actually improve the model's final performance!

SGDM

Before the formal discussion, we define the function

\begin{equation}\chi_{t/k} = \left\{ \begin{aligned}&1,\quad t \equiv 0\,(\text{mod}\, k) \\ &0,\quad t \not\equiv 0\,(\text{mod}\, k) \end{aligned}\right.\end{equation}

That is to say, for integer $t$, $\chi_{t/k}=1$ when $t$ is a multiple of $k$, and $\chi_{t/k}=0$ otherwise; it is simply an indicator of whether $t$ is divisible by $k$. We will use this function repeatedly in what follows.
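As a quick illustrative sketch (the function name `chi` is mine, not from the post), the indicator can be written as:

```python
def chi(t: int, k: int) -> int:
    """Indicator: 1 if t is a multiple of k, else 0."""
    return 1 if t % k == 0 else 0

# For k = 4, chi fires only on steps 4, 8, ...
print([chi(t, 4) for t in range(1, 9)])  # [0, 0, 0, 1, 0, 0, 0, 1]
```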

Now, let's discuss Stochastic Gradient Descent with Momentum (SGDM). Its general form is:

\begin{equation}\left\{\begin{aligned} m_t =&\, \beta m_{t-1} + \left(1 - \beta\right) g_t\\ \theta_t =&\, \theta_{t-1} - \alpha_t m_t \end{aligned}\right.\end{equation}

Here $g_t$ is the gradient at step $t$, $\theta$ denotes the parameters, and $\beta$ is the sliding (momentum decay) coefficient. Accumulating gradients for $k$ steps means updating the parameters only once every $k$ steps, with each update using the average of the gradients from those $k$ steps, namely:

\begin{equation}\left\{\begin{aligned} m_{kt} =&\, \beta m_{k(t-1)} + \left(1 - \beta\right) \frac{1}{k}\sum_{i=1}^k g_{k(t-1) + i}\\ \theta_{kt} =&\, \theta_{k(t-1)} - \alpha_{kt} m_{kt} \end{aligned}\right.\end{equation}

We can decompose the update of the momentum $m$:

\begin{equation}\begin{aligned} m_{k(t-1)+1} &\,= \beta m_{k(t-1)} + \frac{1}{k}\left(1 - \beta\right)g_{k(t-1) + 1}\\ m_{k(t-1)+2} &\,= m_{k(t-1) + 1} + \frac{1}{k}\left(1 - \beta\right)g_{k(t-1) + 2} \\ m_{k(t-1)+3} &\,= m_{k(t-1) + 2} + \frac{1}{k}\left(1 - \beta\right)g_{k(t-1) + 3} \\ &\,\,\,\vdots \\ m_{kt} &\,= m_{kt-1} + \frac{1}{k}\left(1 - \beta\right)g_{kt} \\ \end{aligned}\label{eq:m-part}\end{equation}

Or write it as a general formula:

\begin{equation}m_t = \big[(\beta - 1)\chi_{(t - 1)/k} + 1\big] m_{t-1} + \frac{1}{k}\left(1 - \beta\right) g_t\end{equation}

Similarly, the update of parameters $\theta$ can also be decomposed:

\begin{equation}\begin{aligned} \theta_{k(t-1)+1} &\,= \theta_{k(t-1)}\\ \theta_{k(t-1)+2} &\,= \theta_{k(t-1) + 1}\\ &\,\,\,\vdots \\ \theta_{kt-1} &\,= \theta_{kt-2}\\ \theta_{kt} &\,= \theta_{kt-1} - \alpha_{kt} m_{kt} \\ \end{aligned}\end{equation}

Or written as a general formula:

\begin{equation}\theta_t = \theta_{t-1} - \chi_{t/k} \alpha_t m_t \end{equation}

Therefore, for SGD with momentum, accumulating gradients for $k$ steps only requires updating in the following manner:

\begin{equation}\left\{\begin{aligned} m_t =&\, \big[(\beta - 1)\chi_{(t - 1)/k} + 1\big] m_{t-1} + \frac{1}{k}\left(1 - \beta\right) g_t\\ \theta_t =&\, \theta_{t-1} - \chi_{t/k} \alpha_t m_t \end{aligned}\right.\end{equation}

This does not require introducing a new set of parameters.
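The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's code; the function name `sgdm_accum_step` and its default hyperparameters are my own choices:

```python
import numpy as np

def sgdm_accum_step(theta, m, g, t, k, alpha=0.1, beta=0.9):
    """One step of SGDM with k-step gradient accumulation folded into
    the momentum buffer; no separate gradient cache is needed."""
    chi_prev = 1.0 if (t - 1) % k == 0 else 0.0  # chi_{(t-1)/k}
    chi_curr = 1.0 if t % k == 0 else 0.0        # chi_{t/k}
    m = ((beta - 1.0) * chi_prev + 1.0) * m + (1.0 - beta) / k * g
    theta = theta - chi_curr * alpha * m         # moves only on every k-th step
    return theta, m

# Sanity check against the direct accumulated update (starting from m_0 = 0):
k, alpha, beta = 4, 0.1, 0.9
grads = np.random.default_rng(0).normal(size=(k, 3))
theta, m = np.zeros(3), np.zeros(3)
for t in range(1, k + 1):
    theta, m = sgdm_accum_step(theta, m, grads[t - 1], t, k, alpha, beta)
m_direct = (1.0 - beta) * grads.mean(axis=0)     # beta * m_0 + (1-beta) * mean(g)
assert np.allclose(m, m_direct)
assert np.allclose(theta, -alpha * m_direct)
```

Note that $\theta$ stays untouched on the first $k-1$ steps and moves exactly by the accumulated-gradient update on step $k$, matching the decomposition above.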

Adam

For Adam, the update formulas are as follows:

\begin{equation}\left\{\begin{aligned} m_t =&\, \beta_1 m_{t-1} + \left(1 - \beta_1\right) g_t\\ v_t =&\, \beta_2 v_{t-1} + \left(1 - \beta_2\right) g_t^2\\ \hat{m}_t =&\, m_t\left/\left(1 - \beta_1^t\right)\right.\\ \hat{v}_t =&\, v_t\left/\left(1 - \beta_2^t\right)\right.\\ \theta_t =&\, \theta_{t-1} - \alpha_t \hat{m}_t\left/\sqrt{\hat{v}_t + \epsilon}\right. \end{aligned}\right.\end{equation}

The handling of the first moment $m$ is consistent with the aforementioned SGDM, so the key lies in the second moment $v$. According to the definition, when accumulating gradients for $k$ steps, the update formula for $v$ is:

\begin{equation} v_{kt} = \beta_2 v_{k(t-1)} + \left(1 - \beta_2\right) \left(\frac{1}{k}\sum_{i=1}^k g_{k(t-1) + i}\right)^2\end{equation}

Unfortunately, because of the square, $v$ cannot be decomposed the way $m$ was in Equation $\eqref{eq:m-part}$. Strictly speaking, then, it is impossible to implement gradient accumulation for Adam without adding a set of cache parameters. However, if we assume that the "square of the average" can be estimated by the "average of the squares," i.e., assuming

\begin{equation}\left(\frac{1}{k}\sum_{i=1}^k g_{k(t-1) + i}\right)^2\sim \frac{1}{k}\sum_{i=1}^k g_{k(t-1) + i}^2\end{equation}

Here $\sim$ represents a relatively tight linear correlation, but not necessarily an approximate equality; for example, they could differ by a constant multiple. Then we can still modify the formula for $v$ just like we did for $m$:

\begin{equation}v_t = \big[(\beta_2 - 1)\chi_{(t - 1)/k} + 1\big] v_{t-1} + \frac{1}{k}\left(1 - \beta_2\right) g_t^2\end{equation}

Combining these, we get the modified Adam formula:

\begin{equation}\left\{\begin{aligned} m_t =&\, \big[(\beta_1 - 1)\chi_{(t - 1)/k} + 1\big] m_{t-1} + \frac{1}{k}\left(1 - \beta_1\right) g_t\\ v_t =&\, \big[(\beta_2 - 1)\chi_{(t - 1)/k} + 1\big] v_{t-1} + \frac{1}{k}\left(1 - \beta_2\right) g_t^2\\ \hat{m}_t =&\, m_t\left/\left(1 - \beta_1^{t/k}\right)\right.\\ \hat{v}_t =&\, v_t\left/\left(1 - \beta_2^{t/k}\right)\right.\\ \theta_t =&\, \theta_{t-1} - \chi_{t/k}\alpha_t \hat{m}_t\left/\sqrt{\hat{v}_t + \epsilon}\right. \end{aligned}\right.\end{equation}

Through experiments, I found that Adam modified in this way indeed achieves similar effects to gradient accumulation.
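Under the same conventions, the modified Adam step can be sketched as follows (again a sketch with names of my own choosing, not the author's implementation; note that the bias correction uses $t/k$ because only $t/k$ actual parameter updates have happened by step $t$):

```python
import numpy as np

def adam_accum_step(theta, m, v, g, t, k,
                    alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with k-step gradient accumulation folded into m and v."""
    chi_prev = 1.0 if (t - 1) % k == 0 else 0.0  # chi_{(t-1)/k}
    chi_curr = 1.0 if t % k == 0 else 0.0        # chi_{t/k}
    m = ((beta1 - 1.0) * chi_prev + 1.0) * m + (1.0 - beta1) / k * g
    v = ((beta2 - 1.0) * chi_prev + 1.0) * v + (1.0 - beta2) / k * g**2
    if chi_curr:
        n = t // k                               # number of updates so far
        m_hat = m / (1.0 - beta1 ** n)
        v_hat = v / (1.0 - beta2 ** n)
        theta = theta - alpha * m_hat / np.sqrt(v_hat + eps)
    return theta, m, v

# With a constant gradient, the update at the end of the first cycle
# behaves like vanilla Adam's first step: roughly -alpha * sign(g).
g = np.array([0.5, -2.0])
theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 5):                            # one accumulation cycle, k = 4
    theta, m, v = adam_accum_step(theta, m, v, g, t, k=4)
assert np.allclose(theta, -1e-3 * g / np.sqrt(g**2 + 1e-8))
```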

Reflections and Conclusion

In summary, the two modified optimizers above involve two main changes:
1. Modifying the update formulas for $m$ and $v$;
2. Parameters are changed to update once every $k$ steps (or the learning rate is changed from $\alpha_t$ to $\chi_{t/k}\alpha_t$).

The update formulas for $m, v$ were changed to (without loss of generality, taking $m$ as an example):

\begin{equation}m_t = \big[(\beta - 1)\chi_{(t - 1)/k} + 1\big] m_{t-1} + \frac{1}{k}\left(1 - \beta\right) g_t\end{equation}

If we want to map this back to the original form $m_t = \beta m_{t-1} + (1 - \beta) g_t$, we might guess that the iteration above is equivalent to replacing $\beta$ with $\tilde{\beta}=1 - \frac{1}{k}(1-\beta)$. This is indeed the case: we can show that an exponential moving average with $\tilde{\beta}$ applied to every gradient is approximately equal to the exponential moving average of $k$-step accumulated gradients with the original $\beta$:

\begin{equation}\begin{aligned} m_{kt} =&\, \tilde{\beta}m_{kt-1} + (1 - \tilde{\beta})g_{kt} \\ =&\, \tilde{\beta}^2 m_{kt-2} + \tilde{\beta}(1 - \tilde{\beta})g_{kt - 1} + (1 - \tilde{\beta})g_{kt}\\ =&\,\cdots\\ =&\, \tilde{\beta}^k m_{k(t-1)} + (1 - \tilde{\beta})\sum_{i=1}^k \tilde{\beta}^{i-1} g_{kt-i+1} \end{aligned}\end{equation}

Since $\tilde{\beta}$ is typically very close to 1, we approximately assume that $\tilde{\beta}^{i-1}\approx 1, \forall i=1,2,\cdots,k$, and

\begin{equation}\tilde{\beta}^k = (1 - (1 - \tilde{\beta}))^k \approx 1 - k(1 - \tilde{\beta}) = \beta\end{equation}

So we have the approximation:

\begin{equation}\begin{aligned} m_{kt} \approx &\, \beta m_{k(t-1)} + (1 - \tilde{\beta})\sum_{i=1}^k g_{kt-i+1} \\ = &\, \beta m_{k(t-1)} + (1 - \beta) \frac{1}{k}\sum_{i=1}^k g_{kt-i+1} \end{aligned}\end{equation}

This is the form of gradient accumulation.
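This approximation is easy to check numerically. A sketch (all names mine), comparing an EMA with $\tilde{\beta}$ over every gradient against an EMA with $\beta$ over $k$-step gradient averages:

```python
import numpy as np

k, beta = 8, 0.9
beta_tilde = 1.0 - (1.0 - beta) / k

# Gradients with a nonzero mean, so both momenta settle near 1.
grads = 1.0 + 0.1 * np.random.default_rng(1).normal(size=200)

m_tilde = 0.0   # EMA with beta_tilde over every single gradient
m_accum = 0.0   # EMA with beta over k-step averaged gradients
for i in range(0, len(grads), k):
    block = grads[i:i + k]
    for g in block:
        m_tilde = beta_tilde * m_tilde + (1.0 - beta_tilde) * g
    m_accum = beta * m_accum + (1.0 - beta) * block.mean()

assert abs(m_tilde - m_accum) < 0.05   # the two EMAs track each other closely
```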

So, if you are facing performance issues caused by a small batch_size but don't want to implement gradient accumulation yourself or modify the optimizer, you can try the following operations:
1. Change all $\beta$ parameters to $1-\frac{1}{k}(1-\beta)$;
2. Multiply the learning rate by $\chi_{t/k}$, so that parameters are updated only once every $k$ steps.
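The two-step recipe translates into a pair of one-line transformations; a hypothetical sketch (helper names `shifted_beta` and `masked_lr` are mine):

```python
def shifted_beta(beta, k):
    """Step 1: replace beta with 1 - (1 - beta)/k."""
    return 1.0 - (1.0 - beta) / k

def masked_lr(alpha, k):
    """Step 2: a schedule that is alpha on every k-th step and 0
    otherwise, so the parameters move only once per k steps."""
    return lambda t: alpha if t % k == 0 else 0.0

lr = masked_lr(1e-3, 4)
print([lr(t) for t in range(1, 6)])  # [0.0, 0.0, 0.0, 0.001, 0.0]
```

In a real framework, step 2 would be implemented as a custom learning-rate schedule keyed on the global step counter.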

Interestingly, I found that point 2 is more important than point 1: even if we don't modify $\beta$ and simply change the parameter update frequency to once every $k$ steps, the results improve (provided that the small batch size is the current bottleneck of the model). This is the "counter-intuitive" phenomenon mentioned in the title: without changing much, simply updating fewer times actually leads to better results.

Summary

This article introduced a discovery made while debugging models: gradient accumulation is hidden within the momentum of optimizers. A simple analysis and some experiments then showed that a near-equivalent gradient accumulation effect can be achieved just by adjusting the sliding coefficients and the learning rate, which led to the counter-intuitive finding that "fewer updates can yield better performance."