By 苏剑林 | December 26, 2025
In the previous article "Make Alchemy More Scientific (Part 3): Last-Iterate Convergence of SGD", we successfully converted the convergence conclusion from average loss to last-iterate loss, obtaining a convergence rate of \(O(\ln T / \sqrt{T})\). Upon careful reflection, however, this result doesn't quite match intuition: experience suggests the last-iterate loss should be the one closer to the optimum. If the average loss can converge at a rate of \(O(1/\sqrt{T})\), why would the last iterate converge more slowly?
The latest progress on this issue comes from "Optimal Linear Decay Learning Rate Schedules and Further Refinements". The paper first generalizes the key identity proven previously and then points out the importance of the learning rate schedule for last-iterate convergence, thereby accelerating the last-iterate loss convergence to \(O(1/\sqrt{T})\).
The results of the original paper are very rich. We will introduce them across multiple articles, and this piece will primarily follow the logic of the previous one to provide an initial introduction. To convert the average loss convergence conclusion into a last-iterate one, the key identity introduced in the previous article was:
\begin{equation} q_T = \frac{1}{T}\sum_{t=1}^T q_t + \sum_{k=1}^{T-1} \frac{1}{k(k+1)} \sum_{t=T-k}^T (q_t - q_{T-k}) \label{1} \end{equation}In this article, we generalize it into a weighted average version: define \(w_{k:T} \triangleq \sum_{t=k}^T w_t\), then we have:
\begin{equation} q_T = \frac{1}{w_{1:T}} \sum_{t=1}^T w_t q_t + \sum_{k=1}^{T-1} \left( \frac{1}{w_{k+1:T}} - \frac{1}{w_{k:T}} \right) \sum_{t=k}^T w_t (q_t - q_k) \label{eq:qt-g} \end{equation}The proof logic is essentially the same. Let \(\lambda_k = \sum_{t=T-k+1}^T w_t\) and \(S_k = \frac{1}{\lambda_k} \sum_{t=T-k+1}^T w_t q_t\), then
\begin{equation} \begin{aligned} \lambda_k S_k &= \lambda_{k+1} S_{k+1} - w_{T-k} q_{T-k} \\ &= \lambda_k S_{k+1} + w_{T-k}(S_{k+1} - q_{T-k}) \\ &= \lambda_k S_{k+1} + \frac{w_{T-k}}{\lambda_{k+1}} \sum_{t=T-k}^T w_t (q_t - q_{T-k}) \end{aligned} \label{3} \end{equation}Dividing both sides by \(\lambda_k\) and then summing over \(k=1 \sim T-1\), we get
\begin{equation} S_1 = S_T + \sum_{k=1}^{T-1} \frac{w_{T-k}}{\lambda_k \lambda_{k+1}} \sum_{t=T-k}^T w_t (q_t - q_{T-k}) \label{4} \end{equation}Noting that \(\frac{w_{T-k}}{\lambda_k \lambda_{k+1}} = \frac{1}{\lambda_k} - \frac{1}{\lambda_{k+1}}\), and then substituting the definitions of \(S_1\) and \(S_T\), we get
\begin{equation} q_T = \frac{1}{\sum_{t=1}^T w_t} \sum_{t=1}^T w_t q_t + \sum_{k=1}^{T-1} \left( \frac{1}{\sum_{t=T-k+1}^T w_t} - \frac{1}{\sum_{t=T-k}^T w_t} \right) \sum_{t=T-k}^T w_t (q_t - q_{T-k}) \label{5} \end{equation}Finally, by replacing \(T-k\) with \(k\), we obtain equation \eqref{eq:qt-g}.
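Identity \eqref{eq:qt-g} is purely algebraic, so it is easy to sanity-check numerically. Below is a minimal Python sketch (the weights \(w_t\) and values \(q_t\) are arbitrary random numbers; the 1-based indices of the math map to 0-based arrays):

```python
# Numerical sanity check of the weighted identity (2)
import numpy as np

rng = np.random.default_rng(0)
T = 50
w = rng.uniform(0.1, 1.0, size=T)   # w[t-1] = w_t > 0
q = rng.normal(size=T)              # q[t-1] = q_t

def wsum(a, b):                     # w_{a:b} = sum_{t=a}^{b} w_t
    return w[a-1:b].sum()

avg = (w * q).sum() / wsum(1, T)
corr = sum(
    (1/wsum(k+1, T) - 1/wsum(k, T)) * sum(w[t-1] * (q[t-1] - q[k-1]) for t in range(k, T+1))
    for k in range(1, T)
)
assert np.isclose(q[-1], avg + corr)    # q_T = weighted average + correction
```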
Next, we still start from the core inequality of the second article "Make Alchemy More Scientific (Part 2): Extending Conclusions to Unbounded Domains":
\begin{equation} \sum_{t=1}^T \eta_t E[L(\theta_t) - L(\varphi)] \leq \frac{\|\theta_1 - \varphi\|^2}{2} + \frac{G^2}{2} \sum_{t=1}^T \eta_t^2 \label{leq:avg-2-mid3} \end{equation}Following the logic of the "Preparation" section in "Make Alchemy More Scientific (Part 3): Last-Iterate Convergence of SGD", we change the starting point to \(k\) and substitute \(\varphi = \theta_k\), which makes the distance term vanish. This time, however, we neither assume that \(\eta_t\) is monotonically decreasing nor divide both sides by \(\eta_T\); we directly obtain:
\begin{equation} \sum_{t=k}^T \eta_t E[L(\theta_t) - L(\theta_k)] \leq \frac{G^2}{2} \sum_{t=k}^T \eta_t^2 \label{leq:avg-2-mid4} \end{equation}Substituting \(w_t = \eta_t\) and \(q_t = E[L(\theta_t)] - L(\theta^*)\) into identity \eqref{eq:qt-g}, we get:
\begin{equation} \begin{aligned} E[L(\theta_T) - L(\theta^*)] &= \underbrace{\frac{1}{\eta_{1:T}} \sum_{t=1}^T \eta_t E[L(\theta_t) - L(\theta^*)]}_{\eqref{leq:avg-2-mid3}} + \sum_{k=1}^{T-1} \left( \frac{1}{\eta_{k+1:T}} - \frac{1}{\eta_{k:T}} \right) \underbrace{\sum_{t=k}^T \eta_t E[L(\theta_t) - L(\theta_k)]}_{\eqref{leq:avg-2-mid4}} \\ &\leq \frac{\|\theta_1 - \theta^*\|^2}{2 \eta_{1:T}} + \frac{G^2}{2 \eta_{1:T}} \sum_{t=1}^T \eta_t^2 + \frac{G^2}{2} \sum_{k=1}^{T-1} \left( \frac{1}{\eta_{k+1:T}} - \frac{1}{\eta_{k:T}} \right) \sum_{t=k}^T \eta_t^2 \end{aligned} \label{leq:avg-2-mid5} \end{equation}Utilizing \(\sum_{k=1}^{T-1} \sum_{t=k}^T = \sum_{t=1}^T \sum_{k=1}^{\min(t, T-1)}\) for the second term:
\begin{equation} \begin{aligned} \sum_{k=1}^{T-1} \left( \frac{1}{\eta_{k+1:T}} - \frac{1}{\eta_{k:T}} \right) \sum_{t=k}^T \eta_t^2 &= \sum_{t=1}^T \eta_t^2 \sum_{k=1}^{\min(t, T-1)} \left( \frac{1}{\eta_{k+1:T}} - \frac{1}{\eta_{k:T}} \right) \\ &= \sum_{t=1}^T \eta_t^2 \left( \frac{1}{\eta_{\min(t+1, T):T}} - \frac{1}{\eta_{1:T}} \right) \end{aligned} \label{9} \end{equation}Substituting this back into equation \eqref{leq:avg-2-mid5}, the \(-\frac{1}{\eta_{1:T}}\) part exactly cancels the \(\frac{G^2}{2\eta_{1:T}} \sum_{t=1}^T \eta_t^2\) term, giving:
\begin{equation} E[L(\theta_T) - L(\theta^*)] \leq \frac{\|\theta_1 - \theta^*\|^2}{2 \eta_{1:T}} + \frac{G^2}{2} \sum_{t=1}^T \frac{\eta_t^2}{\eta_{\min(t+1, T):T}} \label{leq:last-2} \end{equation}This is the enhanced version of the last-iterate convergence result. It depends neither on the learning rate being monotonically decreasing nor on dividing both sides by \(\eta_T\), and thus leaves more room for learning rate scheduling. The previous article amounts to a rougher version in which \(\eta_{1:T}\) and \(\eta_{\min(t+1, T):T}\) are replaced by their lower bounds \(T\eta_T\) and \(\max(1, T-t)\eta_T\) (valid when \(\eta_t\) is monotonically decreasing). This result is not really new: it is essentially equivalent to Theorem 10 in Appendix F of the original paper.
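Since both the intermediate inequality \eqref{leq:avg-2-mid4} and the bound \eqref{leq:last-2} are now in fully computable form, we can sanity-check them on a toy problem. The sketch below uses the noise-free convex example \(L(\theta) = |\theta|\), for which \(G = 1\), \(\theta^* = 0\), the expectations reduce to plain values, and both inequalities hold pathwise:

```python
# End-to-end check of (7) and (10) on the noise-free problem L(theta) = |theta|
import numpy as np

T, alpha = 200, 0.1
t = np.arange(1, T + 1)
eta = alpha * (1 - t / (T + 1))          # any nonnegative schedule would do

theta = np.empty(T)
theta[0] = 1.0
for s in range(T - 1):                   # theta_{t+1} = theta_t - eta_t * g_t
    theta[s + 1] = theta[s] - eta[s] * np.sign(theta[s])
L = np.abs(theta)                        # L(theta_t), with L(theta*) = 0
G, D = 1.0, abs(theta[0])                # D = ||theta_1 - theta*||

for k in range(T):                       # inequality (7), for every start k
    assert (eta[k:] * (L[k:] - L[k])).sum() <= 0.5 * G**2 * (eta[k:]**2).sum() + 1e-12

tail = np.cumsum(eta[::-1])[::-1]        # tail[t] = eta_{t+1:T} (0-based t)
denom = np.append(tail[1:], eta[-1])     # eta_{min(t+1,T):T}
bound = D**2 / (2 * tail[0]) + 0.5 * G**2 * np.sum(eta**2 / denom)
assert L[-1] <= bound + 1e-12            # the last-iterate bound (10)
```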
In this section, we will see that under appropriate settings, equation \eqref{leq:last-2} can achieve a convergence rate of \(O(1/\sqrt{T})\). Here, "appropriate settings" mainly refers to the learning rate schedule. Unlike the constant learning rates used previously, or schedules like \(\alpha/\sqrt{t}\) and \(\alpha/t\) that never actually reach zero, this time we choose "Linear Decay":
\begin{equation} \eta_t = \alpha \left( 1 - \frac{t}{T+1} \right) \label{eq:liner-decay} \end{equation}This learning rate function deserves a dedicated line for emphasis because it is one of the best practices for learning rate strategies. For example, "Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs" claims it is even superior to Cosine Decay. This reflects how our discussion is getting closer to practical scenarios.
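In code, this schedule is a one-liner. Here is a sketch of how one might wire it into PyTorch (the model, loss, and \(T\) below are placeholders; note that PyTorch counts scheduler steps from \(0\), so its step \(s\) corresponds to \(t = s + 1\) in our notation):

```python
# Linear decay (11) via LambdaLR; lr plays the role of alpha
import torch

T = 1000                                              # total steps, fixed in advance
model = torch.nn.Linear(10, 1)                        # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda s: 1 - (s + 1) / (T + 1))

for s in range(T):
    opt.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()   # placeholder loss
    loss.backward()
    opt.step()
    sched.step()    # lr used at this step was alpha * (1 - (s+1)/(T+1)) = eta_{s+1}
```

Back to the theory: plugging \eqref{eq:liner-decay} into \eqref{leq:last-2} only requires some elementary sums.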
Calculating term by term:
\begin{align} \eta_{1:T} &= \sum_{\tau=1}^T \alpha \left( 1 - \frac{\tau}{T+1} \right) = \frac{\alpha T}{2} \label{12} \\ \eta_{t+1:T} &= \sum_{\tau=t+1}^T \alpha \left( 1 - \frac{\tau}{T+1} \right) = \frac{\alpha (T-t)(T+1-t)}{2(T+1)} \label{13} \\ \frac{\eta_t^2}{\eta_{t+1:T}} &= \frac{2 \alpha (T+1-t)}{(T-t)(T+1)} \leq \frac{4 \alpha}{T+1} \label{14} \\ \sum_{t=1}^T \frac{\eta_t^2}{\eta_{\min(t+1, T):T}} &= \eta_T + \sum_{t=1}^{T-1} \frac{\eta_t^2}{\eta_{t+1:T}} \leq \frac{\alpha}{T+1} + \sum_{t=1}^{T-1} \frac{4 \alpha}{T+1} \leq 4 \alpha \label{15} \end{align}Substituting these results into \eqref{leq:last-2} yields:
\begin{equation} E[L(\theta_T) - L(\theta^*)] \leq \frac{\|\theta_1 - \theta^*\|^2}{\alpha T} + 2 G^2 \alpha \label{16} \end{equation}By taking \(\alpha = \frac{\|\theta_1 - \theta^*\|}{G \sqrt{2T}}\), we minimize the right-hand side, achieving an \(O(1/\sqrt{T})\) last-iterate convergence rate. It should be noted that without introducing stronger assumptions, \(O(1/\sqrt{T})\) cannot be improved; this is guaranteed by an information-theoretic lower bound (refer to "Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization"). Thus, this is already the theoretically optimal convergence rate.
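All of the elementary calculations above, including the choice of \(\alpha\), are easy to verify numerically. A short sketch (the constants \(\alpha\), \(D\), \(G\), \(T\) are illustrative only):

```python
# Checking the closed forms (12)-(13), the bound (15), and the optimal alpha in (16)
import numpy as np

alpha, T = 0.3, 200
eta = alpha * (1 - np.arange(1, T + 1) / (T + 1))            # linear decay (11)

assert np.isclose(eta.sum(), alpha * T / 2)                  # (12)
for t in range(1, T):                                        # (13), for each t
    assert np.isclose(eta[t:].sum(), alpha * (T-t) * (T+1-t) / (2 * (T+1)))
total = eta[-1] + sum(eta[t-1]**2 / eta[t:].sum() for t in range(1, T))
assert total <= 4 * alpha                                    # (15)

D, G, T = 2.0, 1.5, 10_000                                   # now tune alpha in (16)
rhs = lambda a: D**2 / (a * T) + 2 * G**2 * a
a_star = D / (G * np.sqrt(2 * T))                            # claimed minimizer
grid = a_star * np.linspace(0.5, 2.0, 1001)
assert np.isclose(grid[np.argmin(rhs(grid))], a_star, rtol=1e-2)
assert np.isclose(rhs(a_star), 2 * np.sqrt(2) * D * G / np.sqrt(T))  # O(1/sqrt(T))
```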
Some readers might wonder: how can we tell that a linear decay learning rate can achieve optimal convergence speed? Or more generally, how do we find \(\eta_1, \eta_2, \dots, \eta_T \geq 0\) such that the right-hand side of inequality \eqref{leq:last-2} is minimized? This is a multivariate function minimization problem. The standard approach is to take partial derivatives, but the exact solution is complex and difficult to analyze for behavior.
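That said, the brute-force route is at least easy to set up numerically, which gives a useful cross-check. A sketch using scipy (with \(D = G = 1\) and a small \(T\) purely for illustration; the bounds handle positivity):

```python
# Numerically minimizing the RHS of (10) over the schedule
import numpy as np
from scipy.optimize import minimize

D = G = 1.0
T = 30

def rhs(eta):
    tail = np.cumsum(eta[::-1])[::-1]            # eta_{t+1:T}
    denom = np.append(tail[1:], eta[-1])         # eta_{min(t+1,T):T}
    return D**2 / (2 * tail[0]) + 0.5 * G**2 * np.sum(eta**2 / denom)

res = minimize(rhs, x0=np.full(T, 0.1), bounds=[(1e-6, None)] * T, method='L-BFGS-B')
print(np.round(res.x, 4))   # should trace out a near-linearly-decaying schedule
```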
Here we consider a continuous relaxation and use the Calculus of Variations to give a quick heuristic derivation (assuming the reader is familiar with variational methods). Let \(S_t = \eta_{\min(t+1, T):T}\). Then for \(t < T\) we have \(\eta_t = S_{t-1} - S_t \approx -\dot{S}_t\). Approximating \(\eta_t\) by \(-\dot{S}_t\) and sums by integrals, the original problem becomes, approximately, minimizing the functional:
\begin{equation} \frac{c}{S_0} + \int_0^T \frac{(\dot{S}_t)^2}{S_t} dt \label{eq:func-min} \end{equation}Here \(c = \|\theta_1 - \theta^*\|^2 / G^2\) collects the constants (after dividing the objective by \(G^2/2\)), and by definition \(S_T = 0\). Treating \(S_0\) as fixed first, this becomes a variational problem with fixed boundaries. Substituting into the Euler-Lagrange equation:
\begin{equation} \frac{d}{dt} \frac{2 \dot{S}_t}{S_t} = -\frac{(\dot{S}_t)^2}{S_t^2} \label{18} \end{equation}This is easily solved as \(S_t = S_0 (1 - t/T)^2\). Substituting into \eqref{eq:func-min} gives \(c/S_0 + 4S_0/T\). The minimum is reached at \(S_0 = \sqrt{cT/4}\). Thus the optimal solution is \(S_t = \sqrt{cT/4}(1 - t/T)^2\). Differentiating gives \(\eta_t = -\dot{S}_t = \sqrt{c/T}(1 - t/T)\). This yields the linear decay, including the constant factor \(\propto 1/\sqrt{T}\).
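Readers who want to double-check the calculus can let sympy do it (a minimal sketch; \(S_0\) and \(T\) are positive symbols and \(S_t = S_0(1 - t/T)^2\) is the candidate solution):

```python
# Symbolic verification of the Euler-Lagrange step
import sympy as sp

t, T, S0 = sp.symbols('t T S0', positive=True)
S = S0 * (1 - t / T)**2

# residual of (18): d/dt(2*S'/S) + (S')**2 / S**2 should vanish identically
el = sp.diff(2 * sp.diff(S, t) / S, t) + sp.diff(S, t)**2 / S**2
assert sp.simplify(el) == 0

# the integrand simplifies to the constant 4*S0/T**2, so the integral is 4*S0/T
integrand = sp.simplify(sp.diff(S, t)**2 / S)
assert sp.simplify(sp.integrate(integrand, (t, 0, T)) - 4 * S0 / T) == 0
```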
This result is still an approximation. We would need to plug it back into the original discrete format for a rigorous proof. In doing so, \(\eta_t \propto 1 - t/T\) would run into the trouble of a zero denominator, so slight adjustments lead to the final form in \eqref{eq:liner-decay}.
In the above derivations and conclusions, there are several key points worth noting. In a sense, they are milestones in the convergence theory of stochastic optimization.
First, if the learning rate is set to a constant, the conclusion of \eqref{leq:last-2} is consistent with the previous article. We have already proven that it can achieve at most a convergence rate of \(O(\ln T / \sqrt{T})\), which is not optimal. Meanwhile, the linear decay learning rate in \eqref{eq:liner-decay} can reach \(O(1/\sqrt{T})\). On one hand, this shows the necessity of learning rate decay for last-iterate convergence; on the other hand, it provides theoretical support for the linear decay strategy.
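The gap is easy to see numerically by evaluating the right-hand side of \eqref{leq:last-2} for both schedules (with \(D = G = 1\) and \(\alpha = 1/\sqrt{T}\), purely for illustration):

```python
# RHS of (10): constant schedule vs. linear decay, same alpha
import numpy as np

def bound(eta, D=1.0, G=1.0):
    tail = np.cumsum(eta[::-1])[::-1]        # eta_{t+1:T}
    denom = np.append(tail[1:], eta[-1])     # eta_{min(t+1,T):T}
    return D**2 / (2 * tail[0]) + 0.5 * G**2 * np.sum(eta**2 / denom)

for T in (10**3, 10**4, 10**5):
    alpha = 1 / np.sqrt(T)
    t = np.arange(1, T + 1)
    print(T, bound(np.full(T, alpha)), bound(alpha * (1 - t / (T + 1))))
# the constant-schedule bound scales like ln(T)/sqrt(T), the linear-decay one
# like 1/sqrt(T), so the ratio between the two columns grows like ln(T)
```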
It is not difficult to show that in the first three articles, the best convergence rates were achieved under a constant learning rate, but one whose value depends on the total number of training steps \(T\), such as \(\alpha/\sqrt{T}\). Many works regard this as a drawback and prefer schedules like \(\alpha/\sqrt{t}\) or \(\alpha/t\), which require no prior knowledge of \(T\). These are "anytime" learning rate strategies: you can stop, resume, or keep training for any number of steps.
However, the practical performance of such strategies is usually not ideal. The new conclusions of this article show that, after switching to the last-iterate loss, a new pattern emerges: the fastest convergence comes neither from a \(T\)-dependent constant learning rate nor from a \(T\)-independent dynamic one, but from a schedule that is both at once, such as linear decay. Cosine-type decay, also common in practice, fits the same mold. Their shared traits: a smaller endpoint, dependence on \(T\), and dynamic variation.
In other words, there is no "set it and forget it" learning rate strategy. Fine-tuning the learning rate strategy based on the number of training steps is necessary to obtain the best last-iterate convergence result. This actually aligns well with current Scaling Law practices. For example, Step Law finds that the optimal learning rate and optimal Batch Size should be finely adjusted based on the data volume. Note that once the data volume and Batch Size are given, the training steps \(T\) are determined, so these can also be said to be functions of training steps \(T\).
In a future article, we will also briefly discuss the connection between this series of conclusions and today's Scaling Laws. Stay tuned.
In this article, we generalized the core identity from the previous post and obtained the theoretical optimal last-iterate loss convergence rate. Interestingly, the learning rate strategy that achieved this result was not a constant learning rate, nor the traditional inverse step or inverse square root of step rates, but rather linear decay, which is much closer to our daily practice. Next, we will continue to explore the profound implications behind this conclusion.