By 苏剑林 | December 12, 2025
Two years ago, I planned to start a "Scientific Alchemy" series, with the intention of systematically organizing classical theoretical results for optimizers. However, after writing the first article, "Making Alchemy More Scientific (I): Convergence of Average Loss in SGD," the project was shelved until now. The main reason was my feeling that the conditions these classical optimization conclusions depend on are too strict and far removed from practical applications. Especially in the LLM era, the reference value of these conclusions seems even more limited, so I lacked the motivation to continue writing.
However, while recently reflecting on Scaling Law-related issues, I discovered that these theoretical results are not as "useless" as I imagined; they can provide beneficial theoretical insights for some empirical results. Therefore, I am restarting this series to move the project forward and "repay" the "debts" owed previously.
Regarding notation, we will continue using the terminology from the first article, so we will not repeat the introductions here. The main conclusion of the first article was that under appropriate assumptions, SGD satisfies:
\begin{equation}\frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t,\boldsymbol{\theta}^*)\leq \frac{R^2}{2T\eta_T} + \frac{G^2}{2T}\sum_{t=1}^T\eta_t\label{leq:avg-1}\end{equation}where $R$ and $G$ are constants independent of the optimization trajectory, and the "appropriate assumptions" include:
1. $\boldsymbol{\Theta}$ is a bounded convex set, $R=\max\limits_{\boldsymbol{\theta}_1,\boldsymbol{\theta}_2\in \boldsymbol{\Theta}}\Vert\boldsymbol{\theta}_1-\boldsymbol{\theta}_2\Vert < \infty$;
2. For any $\boldsymbol{x}$, $L(\boldsymbol{x},\boldsymbol{\theta})$ is a convex function of $\boldsymbol{\theta}$ on $\boldsymbol{\Theta}$;
3. For any $\boldsymbol{\theta}\in \boldsymbol{\Theta}$ and any $\boldsymbol{x}$, $\Vert\nabla_{\boldsymbol{\theta}}L(\boldsymbol{x},\boldsymbol{\theta})\Vert\leq G < \infty$;
4. The learning rate $\eta_t$ is monotonically decreasing in $t$ (i.e., $\eta_t\geq \eta_{t+1}$).
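For later reference, the convexity in assumption 2 enters the proof through its standard first-order characterization: for any $\boldsymbol{\theta},\boldsymbol{\varphi}\in \boldsymbol{\Theta}$ and any $\boldsymbol{x}$,
\begin{equation}L(\boldsymbol{x},\boldsymbol{\varphi}) \geq L(\boldsymbol{x},\boldsymbol{\theta}) + \nabla_{\boldsymbol{\theta}}L(\boldsymbol{x},\boldsymbol{\theta})\cdot(\boldsymbol{\varphi} - \boldsymbol{\theta})\end{equation}or, after rearranging, $(\boldsymbol{\theta} - \boldsymbol{\varphi})\cdot\nabla_{\boldsymbol{\theta}}L(\boldsymbol{x},\boldsymbol{\theta}) \geq L(\boldsymbol{x},\boldsymbol{\theta}) - L(\boldsymbol{x},\boldsymbol{\varphi})$, which is the form we will invoke below.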
The most "awkward" part here is likely the "boundedness" in the first point regarding the "bounded convex set." Since the optimization trajectory of SGD cannot in principle be guaranteed to be bounded, we have to modify SGD to Projected SGD to ensure boundedness:
\begin{equation}\boldsymbol{\theta}_{t+1} = \Pi_{\boldsymbol{\Theta}}\big(\boldsymbol{\theta}_t - \eta_t \boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\big)\in \boldsymbol{\Theta}\label{eq:sgd-p}\end{equation}where $\Pi_{\boldsymbol{\Theta}}$ denotes the projection onto $\boldsymbol{\Theta}$. Of course, boundedness is not an issue in practice; an appropriate boundedness constraint can even improve the stability of the optimization algorithm. However, from the perspective of "theoretical purity," adding an extra constraint is always somewhat uncomfortable. Therefore, in this article we will first remove the boundedness condition and obtain a more concise and enlightening proof.
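As a concrete illustration (mine, not from the original article), the sketch below implements update \eqref{eq:sgd-p} in NumPy, assuming for simplicity that $\boldsymbol{\Theta}$ is an L2 ball of radius $r$ so that the projection has a closed form; the toy least-squares loss, the sampler, and all constants are placeholders. Passing `r=None` skips the projection and recovers the plain SGD analyzed in the rest of this article.

```python
import numpy as np

def project_onto_ball(theta, r):
    """Pi_Theta for Theta = {theta : ||theta|| <= r}: rescale theta if it lies outside the ball."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def sgd(grad_fn, sample_fn, theta1, eta_fn, T, r=None):
    """T steps of theta_{t+1} = Pi_Theta(theta_t - eta_t * g(x_t, theta_t)); r=None means no projection."""
    theta = np.array(theta1, dtype=float)
    for t in range(1, T + 1):
        x_t = sample_fn()                 # draw the sample x_t
        g_t = grad_fn(x_t, theta)         # stochastic gradient g(x_t, theta_t)
        theta = theta - eta_fn(t) * g_t
        if r is not None:
            theta = project_onto_ball(theta, r)
    return theta

# Toy usage: L(x, theta) = 0.5 * (a . theta - b)^2 with x = (a, b) and b = a . theta* + noise.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])

def sample():
    a = rng.normal(size=2)
    return a, a @ theta_star + 0.1 * rng.normal()

def grad(x, theta):
    a, b = x
    return (a @ theta - b) * a  # gradient of 0.5 * (a . theta - b)^2 w.r.t. theta

print(sgd(grad, sample, [0.0, 0.0], lambda t: 0.1 / np.sqrt(t), T=10000, r=10.0))  # approaches theta_star
```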
Incidentally, the proofs in this article and in the subsequent ones mainly draw on the post "Last Iterate of SGD Converges (Even in Unbounded Domains)" from the blog Parameter-free. That blog is a classic on optimization theory, with many expository articles on optimizers, quite a few of them presenting the author's own original results; it is well worth reading.
Just like the proof in the previous article, our starting point is the identity:
\begin{equation}\begin{aligned} \Vert\boldsymbol{\theta}_{t+1} - \boldsymbol{\varphi}\Vert^2=&\, \Vert\boldsymbol{\theta}_t - \eta_t \boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)- \boldsymbol{\varphi}\Vert^2 \\ =&\, \Vert\boldsymbol{\theta}_t - \boldsymbol{\varphi}\Vert^2 - 2\eta_t (\boldsymbol{\theta}_t- \boldsymbol{\varphi})\cdot\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) + \eta_t^2\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 \end{aligned}\end{equation}Here $\boldsymbol{\varphi}$ is an arbitrary vector. In the previous article we directly set $\boldsymbol{\varphi}=\boldsymbol{\theta}^*$, but here we keep it free to take any value. Starting from this identity, the previous article divided both sides by $2\eta_t$ and then summed over $t$, a route whose subsequent steps rely on the boundedness condition to go through; in this article we will not divide by $2\eta_t$, but instead sum both sides over $t$ directly, so that the intermediate terms $\Vert\boldsymbol{\theta}_t - \boldsymbol{\varphi}\Vert^2$ telescope away:
\begin{equation}\begin{aligned} 2\sum_{t=1}^T \eta_t (\boldsymbol{\theta}_t- \boldsymbol{\varphi})\cdot\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) =&\, \Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2 - \Vert\boldsymbol{\theta}_{T+1} - \boldsymbol{\varphi}\Vert^2 + \sum_{t=1}^T \eta_t^2 \Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 \\ \leq&\, \Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2 + G^2\sum_{t=1}^T \eta_t^2 \end{aligned}\end{equation}Using the convexity of $L$, we know that $(\boldsymbol{\theta}_t- \boldsymbol{\varphi})\cdot\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) \geq L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})$. Substituting this into the left side of the above equation and rearranging gives:
\begin{equation}\sum_{t=1}^T \eta_t [L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})]\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2}{2} + \frac{G^2}{2}\sum_{t=1}^T \eta_t^2\label{leq:avg-2-mid1}\end{equation}This is already very close to the result we want, and no boundedness of the domain was used in the derivation; all we need is that the distance between the initialization $\boldsymbol{\theta}_1$ and the reference point $\boldsymbol{\varphi}$ is finite.
Moving forward, if we could further assume that $L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})$ is non-negative, then combined with the monotonic decrease of $\eta_t$ we would have $\eta_t [L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})] \geq \eta_T [L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})]$, and substituting this into the previous inequality would give:
\begin{equation}\frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t,\boldsymbol{\varphi})\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2}{2 T \eta_T} + \frac{G^2}{2T}\sum_{t=1}^T \frac{\eta_t^2}{\eta_T}\end{equation}The problem is that even if we set $\boldsymbol{\varphi}$ to be the global optimum $\boldsymbol{\theta}^*$, we cannot guarantee that $L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})$ is non-negative, because the optimal point is optimal for the entire population but not necessarily for individual samples. To avoid this trouble, we notice that the right side of \eqref{leq:avg-2-mid1} is independent of $\boldsymbol{x}_t$, while the left side is dependent on $\boldsymbol{x}_t$. Therefore, by taking the expectation on both sides, the inequality still holds:
\begin{equation}\sum_{t=1}^T \eta_t \mathbb{E}[L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})]\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2}{2} + \frac{G^2}{2}\sum_{t=1}^T \eta_t^2\label{leq:avg-2-mid2}\end{equation}Here $\mathbb{E}$ is $\mathbb{E}_{\boldsymbol{x}_1,\boldsymbol{x}_2,\cdots,\boldsymbol{x}_T}$, which means "repeating SGD infinitely many times with different sampling sequences and averaging over the resulting optimization trajectories." Next, since $\boldsymbol{\varphi}$ is an arbitrary vector independent of the data, we have $\mathbb{E}[L(\boldsymbol{x}_t,\boldsymbol{\varphi})] = \mathbb{E}_{\boldsymbol{x}_t}[L(\boldsymbol{x}_t,\boldsymbol{\varphi})] = L(\boldsymbol{\varphi})$, which is exactly our optimization goal. As for $L(\boldsymbol{x}_t,\boldsymbol{\theta}_t)$, note that $\boldsymbol{\theta}_t$ depends only on $\boldsymbol{x}_1,\cdots,\boldsymbol{x}_{t-1}$, so we have:
\begin{equation}\mathbb{E}[L(\boldsymbol{x}_t,\boldsymbol{\theta}_t)] = \mathbb{E}_{\boldsymbol{x}_1,\cdots,\boldsymbol{x}_t}[L(\boldsymbol{x}_t,\boldsymbol{\theta}_t)] = \mathbb{E}_{\boldsymbol{x}_1,\cdots,\boldsymbol{x}_{t-1}}[\mathbb{E}_{\boldsymbol{x}_t}[L(\boldsymbol{x}_t,\boldsymbol{\theta}_t)]] = \mathbb{E}_{\boldsymbol{x}_1,\cdots,\boldsymbol{x}_{t-1}}[L(\boldsymbol{\theta}_t)] = \mathbb{E}[L(\boldsymbol{\theta}_t)]\end{equation}Due to the data-independence of $\boldsymbol{\varphi}$, $L(\boldsymbol{\varphi})$ can also be written as $\mathbb{E}[L(\boldsymbol{\varphi})]$. Substituting these into Equation \eqref{leq:avg-2-mid2} yields:
\begin{equation}\sum_{t=1}^T \eta_t \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\varphi})]\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2}{2} + \frac{G^2}{2}\sum_{t=1}^T \eta_t^2\label{leq:avg-2-mid3}\end{equation}Before proceeding further, I would like to say a few more words: why do we introduce this expectation, and why can we do so?
The answer to "Why" is very simple—to allow the derivation to continue. For example, we want the non-negativity of $L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})$, which cannot be guaranteed. However, after adding the expectation, we can obtain the non-negativity of $L(\boldsymbol{\theta}_t) - L(\boldsymbol{\varphi})$ as long as $\boldsymbol{\varphi}$ is taken as the theoretical optimum. Generally speaking, the main purpose of this step is to eliminate the dependence on the data variable $\boldsymbol{x}_t$ through expectation, thereby simplifying subsequent derivations.
As for "Why can we," some readers might think that we don't actually repeat training many times and average them, so why care about such an expectation result? To answer this, we first need to think: what do we essentially want? In fact, we only want a convergence conclusion for SGD to some extent. Since the objective function is restricted to convex functions, we hope this conclusion is mathematically rigorous.
For a complex stochastic process, the mathematical expectation is often the most basic quantity we can compute, and sometimes the only one. Therefore, taking expectations on both sides does move the conclusion somewhat away from practice, but it still yields a convergence result worth referring to. As is well known, practice is usually more complex, so as long as a theoretical result retains some connection to practice, describes certain phenomena, or generates some insight, it is valuable.
Coming back to the main point: Now that we have Equation \eqref{leq:avg-2-mid3}, we set $\boldsymbol{\varphi}$ as the theoretical optimum $\boldsymbol{\theta}^*$ to obtain the non-negativity of $L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)$. Then, continuing to assume the monotonic decrease of $\eta_t$, we can replace $\eta_t$ with $\eta_T$, and finally divide both sides by $T\eta_T$ to get:
\begin{equation}\frac{1}{T}\sum_{t=1}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)] \leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2T\eta_T} + \frac{G^2}{2T}\sum_{t=1}^T \frac{\eta_t^2}{\eta_T}\label{leq:avg-2}\end{equation}Comparing Equation \eqref{leq:avg-2} with Equation \eqref{leq:avg-1}, the dependencies on $T$ and $\eta_t$ are similar: $R^2$ in the first term on the right is replaced by $\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2$, which is smaller; while the summation of $\eta_t$ in the second term on the right is replaced by a summation of $\eta_t^2/\eta_T$, which is usually larger. With one term shrinking and the other growing, how different are the two bounds in the end? Let us look at two examples. The first is a constant learning rate $\eta_t = \frac{\alpha}{\sqrt{T}}$; in this case, conclusions \eqref{leq:avg-1} and \eqref{leq:avg-2} are essentially the same. Taking \eqref{leq:avg-2} as an example, substitution gives:
\begin{equation}\frac{1}{T}\sum_{t=1}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)] \leq \frac{1}{2\sqrt{T}}\left(\frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{\alpha} + G^2\alpha\right)\end{equation}This means both can achieve a convergence rate of $\mathcal{O}(1/\sqrt{T})$. The second example is a dynamic learning rate $\eta_t = \frac{\alpha}{\sqrt{t}}$. We already proved in the previous article that substituting this into conclusion \eqref{leq:avg-1} also yields a convergence rate of $\mathcal{O}(1/\sqrt{T})$. However, if substituted into \eqref{leq:avg-2}, the result is:
\begin{equation}\frac{1}{T}\sum_{t=1}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)] \leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2 \alpha \sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}}\sum_{t=1}^T \frac{1}{t} \sim \mathcal{O}\left(\frac{\ln T}{\sqrt{T}}\right)\end{equation}Therefore, the cost of generalizing to unbounded domains is that the convergence rate is slightly relaxed from $\mathcal{O}(1/\sqrt{T})$ to $\mathcal{O}(\ln T/\sqrt{T})$. This degree of relaxation makes no difference for practical purposes. Furthermore, this result is only for $\eta_t = \frac{\alpha}{\sqrt{t}}$; by setting $\eta_t = \frac{\alpha}{\sqrt{t\ln t}}$, it can be pushed closer to $\mathcal{O}(\sqrt{\ln T/T})$, but this is essentially just a mathematical exercise with no fundamental difference.
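To make the comparison concrete, here is a small numerical check (my own illustration, with placeholder constants $\Vert\boldsymbol{\theta}_1-\boldsymbol{\theta}^*\Vert = G = \alpha = 1$): it evaluates the right-hand side of \eqref{leq:avg-2} for the constant schedule $\eta_t = \alpha/\sqrt{T}$ and for $\eta_t = \alpha/\sqrt{t}$, showing the extra $\ln T$ factor of the latter.

```python
import numpy as np

def rhs_avg2(etas, d1=1.0, G=1.0):
    """Right-hand side of the unbounded-domain bound:
    d1^2 / (2 T eta_T) + G^2 / (2 T) * sum_t eta_t^2 / eta_T."""
    T, eta_T = len(etas), etas[-1]
    return d1**2 / (2 * T * eta_T) + G**2 * np.sum(etas**2) / (2 * T * eta_T)

alpha = 1.0
for T in [10**2, 10**4, 10**6]:
    t = np.arange(1, T + 1)
    constant = np.full(T, alpha / np.sqrt(T))   # eta_t = alpha / sqrt(T): bound ~ 1/sqrt(T)
    decaying = alpha / np.sqrt(t)               # eta_t = alpha / sqrt(t): bound ~ ln(T)/sqrt(T)
    print(f"T={T:>7}  constant: {rhs_avg2(constant):.4f}  "
          f"1/sqrt(t): {rhs_avg2(decaying):.4f}  ln(T)/sqrt(T): {np.log(T)/np.sqrt(T):.4f}")
```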
In fact, there is an even simpler way to handle Equation \eqref{leq:avg-2-mid3}: directly divide both sides by $\sum_{t=1}^T \eta_t$ and then substitute $\boldsymbol{\varphi}=\boldsymbol{\theta}^*$ to get:
\begin{equation}\frac{\sum_{t=1}^T \eta_t \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)]}{\sum_{t=1}^T \eta_t}\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2\sum_{t=1}^T \eta_t} + \frac{G^2}{2}\frac{\sum_{t=1}^T \eta_t^2}{\sum_{t=1}^T \eta_t}\label{leq:avg-3}\end{equation}In this way, we no longer need to assume that the learning rate is monotonically decreasing, which makes this a convergence conclusion under very mild and general conditions. It describes the convergence of the weighted average of the losses along the optimization trajectory. Some readers might ask: why a weighted average? The answer remains the same ("we only want some form of convergence conclusion for SGD"): from the perspective of approximating the optimal loss value, does it really matter whether the average is weighted or not?
To some extent, Equation \eqref{leq:avg-3} brings us more insights than Equation \eqref{leq:avg-2}. For example, to make the right side converge as quickly as possible, the sum of learning rates $\sum_{t=1}^T \eta_t$ should be as large as possible, while the sum of squares $\sum_{t=1}^T \eta_t^2$ should be as small as possible. If we want to converge to the optimum as $T\to\infty$, the learning rate should satisfy:
\begin{equation}\sum_{t=1}^{\infty} \eta_t = \infty \qquad\text{and}\qquad \lim_{T\to\infty}\frac{\sum_{t=1}^T \eta_t^2}{\sum_{t=1}^T \eta_t} = 0\end{equation}This gives us the two classic conditions for learning rate strategies. Furthermore, Equation \eqref{leq:avg-3} does not explicitly depend on the terminal learning rate $\eta_T$, thus allowing $\eta_T\to 0$ and enabling us to schedule the learning rate more flexibly. In short, Equation \eqref{leq:avg-3} gives us a more "composed" feeling and is aesthetically more pleasing.
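As a quick sanity check (again my own illustration, with an arbitrary base rate of $0.1$), one can track the partial sum $\sum_{t\leq T}\eta_t$ and the ratio $\sum_{t\leq T}\eta_t^2 \big/ \sum_{t\leq T}\eta_t$ for a few common schedules: the ratio tends to $0$ for $\eta_t \propto 1/\sqrt{t}$ and $\eta_t \propto 1/t$, stays at the constant value for a fixed learning rate, while $\eta_t \propto 1/t^2$ fails the first condition because its sum converges.

```python
import numpy as np

T = 10**6
t = np.arange(1, T + 1)
schedules = {
    "constant 0.1": np.full(T, 0.1),
    "0.1/sqrt(t)":  0.1 / np.sqrt(t),
    "0.1/t":        0.1 / t,
    "0.1/t^2":      0.1 / t**2,
}
for name, eta in schedules.items():
    s, s2 = eta.sum(), (eta**2).sum()
    # Condition 1: s should keep growing with T; condition 2: s2 / s should tend to 0.
    print(f"{name:13s}  sum eta = {s:12.2f}   (sum eta^2)/(sum eta) = {s2 / s:.2e}")
```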
In this article, we restarted the "Scientific Alchemy" series and generalized the previous article's convergence conclusion for SGD from bounded domains to unbounded domains, obtaining richer results along the way.