Making Alchemy More Scientific (Part 3): Final Loss Convergence of SGD

By 苏剑林 | December 16, 2025

We already have two articles discussing the convergence properties of SGD. However, they only cover convergence of the average loss, which guarantees that we can approach the optimal loss value but not that we can find its location $\theta^*$. This is a significant gap between current theoretical conclusions and practice: intuitively, the weights $\theta_T$ at the end of training should be close to the theoretical optimum $\theta^*$, and we want to know whether theory supports this.

Therefore, in this article, we will transform the convergence results of the average loss into convergence results for the final loss, providing a preliminary theoretical understanding of how far $\theta_T$ is from $\theta^*$.

Finding the Location

We start from the article "Making Alchemy More Scientific (Part 2): Extending the Conclusion to Unbounded Domains". Its core result is the inequality:

\begin{equation} \sum_{t=1}^T \eta_t \mathbb{E}[L(\theta_t) - L(\varphi)] \le \frac{\|\theta_1 - \varphi\|^2}{2} + \frac{G^2}{2} \sum_{t=1}^T \eta_t^2 \label{eq:avg-2-mid3-orig} \end{equation}

Assuming $\eta_t$ is monotonically non-increasing (so $\eta_t \ge \eta_T$), substituting $\varphi = \theta^*$ (which makes each term $\mathbb{E}[L(\theta_t) - L(\theta^*)]$ nonnegative), replacing $\eta_t$ on the left side with $\eta_T$, and dividing both sides by $T\eta_T$, we obtain one of the conclusions from the previous post:

\begin{equation} \frac{1}{T} \sum_{t=1}^T \mathbb{E}[L(\theta_t) - L(\theta^*)] \le \frac{\|\theta_1 - \theta^*\|^2}{2T\eta_T} + \frac{G^2}{2T} \frac{\sum_{t=1}^T \eta_t^2}{\eta_T} \label{eq:avg-2} \end{equation}

As mentioned at the beginning, this is only a convergence result for loss values. We are more interested in finding the actual location of convergence. To this end, a simple approach is to utilize the convexity of $L$ and Jensen's inequality to obtain:

\begin{equation} \frac{1}{T} \sum_{t=1}^T \mathbb{E}[L(\theta_t)] = \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^T L(\theta_t)\right] \ge \mathbb{E}\left[L\left(\frac{1}{T} \sum_{t=1}^T \theta_t\right)\right] \end{equation}

Defining $\bar{\theta}_T = \frac{1}{T} \sum_{t=1}^T \theta_t$, we then have:

\begin{equation} \mathbb{E}[L(\bar{\theta}_T) - L(\theta^*)] \le \frac{\|\theta_1 - \theta^*\|^2}{2T\eta_T} + \frac{G^2}{2T} \frac{\sum_{t=1}^T \eta_t^2}{\eta_T} \end{equation}

This means that the loss value $L(\bar{\theta}_T)$ at the centroid $\bar{\theta}_T$ of the training trajectory $\theta_1, \theta_2, \dots, \theta_T$ converges in expectation to $L(\theta^*)$, which in turn implies that $\bar{\theta}_T$ converges to $\theta^*$ (since a strictly convex function has a unique minimizer). This explains why averaging over the training trajectory, such as taking moving averages of the weights or the Merge step in WSM (Warmup-Stable and Merge), can yield better weights.
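
To make this concrete, here is a minimal runnable sketch of trajectory averaging: plain SGD on a noisy quadratic toy objective, with the centroid maintained as a cheap running mean. The toy objective, noise level, and schedule are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

# A minimal sketch of trajectory averaging: SGD on a noisy quadratic,
# with the centroid kept as an O(1)-memory running mean. Everything
# here (objective, noise, schedule) is an illustrative assumption.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])   # minimizer of the expected loss
theta = np.zeros(2)                  # theta_1
theta_bar = theta.copy()             # running mean of the visited iterates

T = 100_000
for t in range(1, T + 1):
    eta_t = 0.5 / np.sqrt(t)                              # eta_t = alpha / sqrt(t)
    grad = (theta - theta_star) + rng.standard_normal(2)  # stochastic gradient
    theta = theta - eta_t * grad
    theta_bar += (theta - theta_bar) / t                  # incremental centroid update

print("last iterate error:", np.linalg.norm(theta - theta_star))
print("centroid error    :", np.linalg.norm(theta_bar - theta_star))
```

In a typical run the centroid lands noticeably closer to $\theta^*$ than the last iterate, which is exactly the effect the bound above describes.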

Preparations

Computing $\bar{\theta}_T$ gives one way to locate $\theta^*$, but it doesn't fully answer the question raised at the beginning: we want to know whether the last iterate $\theta_T$ itself converges to $\theta^*$. Next, we follow the line of thought from "Last Iterate of SGD Converges (Even in Unbounded Domains)" to transform the average loss convergence into final loss convergence.

Before the formal proof, some preparation is needed. The first task is to extend equation \eqref{eq:avg-2-mid3-orig}. From its proof, we know the lower limit of the summation is in fact arbitrary: we can replace the starting point $\theta_1$ with any $\theta_{T-k}$ and the inequality still holds. However, $\theta_{T-k}$ now depends on the data $x_1, x_2, \dots, x_{T-k-1}$, so we must take an expectation on the right side:

\begin{equation} \sum_{t=T-k}^T \eta_t \mathbb{E}[L(\theta_t) - L(\varphi)] \le \frac{\mathbb{E}[\|\theta_{T-k} - \varphi\|^2]}{2} + \frac{G^2}{2} \sum_{t=T-k}^T \eta_t^2 \end{equation}

Again assuming the monotonicity of $\eta_t$, replacing $\eta_t$ on the left with $\eta_T$, and dividing both sides by $\eta_T$, we get:

\begin{equation} \sum_{t=T-k}^T \mathbb{E}[L(\theta_t) - L(\varphi)] \le \frac{\mathbb{E}[\|\theta_{T-k} - \varphi\|^2]}{2\eta_T} + \frac{G^2}{2} \frac{\sum_{t=T-k}^T \eta_t^2}{\eta_T} \label{eq:last-mid1} \end{equation}

Here, $\varphi$ is any vector independent of the data. However, this "data independence" is relative: looking back at the proof, when the summation starts at $t = T-k$, $\varphi$ is allowed to depend on $x_1, x_2, \dots, x_{T-k-1}$ at most. In particular, $\theta_{T-k}$ satisfies this condition, so substituting $\varphi = \theta_{T-k}$ gives:

\begin{equation} \sum_{t=T-k}^T \mathbb{E}[L(\theta_t) - L(\theta_{T-k})] \le \frac{G^2}{2} \frac{\sum_{t=T-k}^T \eta_t^2}{\eta_T} \label{eq:last-mid2} \end{equation}

This is an important intermediate conclusion for later.

Key Identity

To transform the average loss conclusion into one about the final loss, we also need a crucial identity:

\begin{equation} q_T = \frac{1}{T} \sum_{t=1}^T q_t + \sum_{k=1}^{T-1} \frac{1}{k(k+1)} \sum_{t=T-k}^T (q_t - q_{T-k}) \label{eq:qt} \end{equation}

This identity ingeniously links the final value to the average value. I spent a few days trying to find an intuitive understanding of it but failed, so I can only present its proof step by step. The idea is to consider the cumulative averages of $q_t$ taken from the end backward. Define $S_k = \frac{1}{k} \sum_{t=T-k+1}^T q_t$, the average of the last $k$ values. We can write:

\begin{equation} \begin{aligned} k S_k &= (k+1) S_{k+1} - q_{T-k} \\ &= k S_{k+1} + (S_{k+1} - q_{T-k}) \\ &= k S_{k+1} + \frac{1}{k+1} \sum_{t=T-k}^T (q_t - q_{T-k}) \end{aligned} \end{equation}

Dividing both sides by $k$ and summing over $k = 1, \dots, T-1$ (the $S$ terms telescope) gives:

\begin{equation} S_1 = S_T + \sum_{k=1}^{T-1} \frac{1}{k(k+1)} \sum_{t=T-k}^T (q_t - q_{T-k}) \end{equation}

Finally, substituting back the definitions $S_1 = q_T$ and $S_T = \frac{1}{T} \sum_{t=1}^T q_t$ yields equation \eqref{eq:qt}. The core of the entire derivation is using the "cumulative average" as a natural transition from the final value $q_T$ to the average value $\frac{1}{T} \sum_{t=1}^T q_t$. In the original blog, this formula appeared in a slightly different inequality form, but I believe the identity is more fundamental, and the subsequent proof only requires the equality form.
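
Since \eqref{eq:qt} is purely algebraic, it can be sanity-checked numerically for an arbitrary sequence; a quick sketch (the random $q_t$ are an illustrative choice):

```python
import numpy as np

# Numerical sanity check of the identity
#   q_T = (1/T) * sum_{t=1}^T q_t
#         + sum_{k=1}^{T-1} 1/(k(k+1)) * sum_{t=T-k}^T (q_t - q_{T-k})
# for a random sequence q_1..q_T (q[0] is unused padding).
rng = np.random.default_rng(1)
T = 50
q = rng.standard_normal(T + 1)

rhs = q[1:T + 1].mean()
for k in range(1, T):
    rhs += sum(q[t] - q[T - k] for t in range(T - k, T + 1)) / (k * (k + 1))

assert np.isclose(q[T], rhs)  # agreement up to floating-point error
print(q[T], rhs)
```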

Completing the Proof

Now we can complete the proof in one go. Define $q_t = \mathbb{E}[L(\theta_t) - L(\theta^*)]$. Substituting this into identity \eqref{eq:qt} gives:

\begin{equation} \mathbb{E}[L(\theta_T) - L(\theta^*)] = \underbrace{\frac{1}{T} \sum_{t=1}^T \mathbb{E}[L(\theta_t) - L(\theta^*)]}_{\eqref{eq:avg-2}} + \underbrace{\sum_{k=1}^{T-1} \frac{1}{k(k+1)} \sum_{t=T-k}^T \mathbb{E}[L(\theta_t) - L(\theta_{T-k})]}_{\eqref{eq:last-mid2}} \end{equation}

Substituting inequalities \eqref{eq:avg-2} and \eqref{eq:last-mid2} respectively, we get:

\begin{equation} \mathbb{E}[L(\theta_T) - L(\theta^*)] \le \frac{\|\theta_1 - \theta^*\|^2}{2T\eta_T} + \frac{G^2}{2T} \frac{\sum_{t=1}^T \eta_t^2}{\eta_T} + \frac{G^2}{2} \sum_{k=1}^{T-1} \frac{1}{k(k+1)} \frac{\sum_{t=T-k}^T \eta_t^2}{\eta_T} \label{eq:last-mid3} \end{equation}

For the last term, we have:

\begin{equation} \sum_{k=1}^{T-1} \frac{1}{k(k+1)} \frac{\sum_{t=T-k}^T \eta_t^2}{\eta_T} = \sum_{t=1}^T \frac{\eta_t^2}{\eta_T} \sum_{k=\max(1, T-t)}^{T-1} \frac{1}{k(k+1)} = \sum_{t=1}^T \frac{\eta_t^2}{\eta_T} \left( \frac{1}{\max(1, T-t)} - \frac{1}{T} \right) \label{eq:last-mid4} \end{equation}
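
This exchange of the summation order is easy to verify numerically for a concrete schedule; a quick sketch with the illustrative choice $\eta_t = 1/\sqrt{t}$:

```python
import numpy as np

# Check the summation exchange: both sides evaluated for an
# illustrative schedule eta_t = 1/sqrt(t), stored as eta[t-1].
T = 100
eta = 1.0 / np.sqrt(np.arange(1, T + 1))

lhs = sum(sum(eta[t - 1] ** 2 for t in range(T - k, T + 1)) / (k * (k + 1))
          for k in range(1, T)) / eta[T - 1]
rhs = sum(eta[t - 1] ** 2 * (1.0 / max(1, T - t) - 1.0 / T)
          for t in range(1, T + 1)) / eta[T - 1]

assert np.isclose(lhs, rhs)
print(lhs, rhs)
```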

Substituting back into equation \eqref{eq:last-mid3} gives:

\begin{equation} \begin{aligned} \mathbb{E}[L(\theta_T) - L(\theta^*)] &\le \frac{\|\theta_1 - \theta^*\|^2}{2T\eta_T} + \frac{G^2}{2} \sum_{t=1}^T \frac{\eta_t^2/\eta_T}{\max(1, T-t)} \\ &= \frac{\|\theta_1 - \theta^*\|^2}{2T\eta_T} + \frac{G^2\eta_T}{2} + \frac{G^2}{2} \sum_{t=1}^{T-1} \frac{\eta_t^2/\eta_T}{T-t} \end{aligned} \label{eq:last-1} \end{equation}

This gives us our final conclusion. Since we exchanged the summation order through identity \eqref{eq:last-mid4} and simplified beforehand, this result is more concise and general than the one in the original blog "Last Iterate of SGD Converges (Even in Unbounded Domains)".
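
The right side of \eqref{eq:last-1} is also straightforward to evaluate numerically for any schedule, which makes it easy to compare schedules before doing the asymptotics below. A sketch, where `D` and `G` stand for the assumed constants $\|\theta_1 - \theta^*\|$ and the gradient bound:

```python
import numpy as np

def last_iterate_bound(eta, D, G):
    """Right side of the final bound, for eta[t-1] = eta_t, t = 1..T.

    D stands for ||theta_1 - theta*|| and G bounds the stochastic
    gradient norm; both are assumed known here for illustration.
    """
    T = len(eta)
    tail = sum(eta[t - 1] ** 2 / (T - t) for t in range(1, T))  # t = 1..T-1
    return (D ** 2 / (2 * T * eta[-1]) + G ** 2 * eta[-1] / 2
            + G ** 2 * tail / (2 * eta[-1]))

T, D, G = 10_000, 1.0, 1.0
print(last_iterate_bound(np.full(T, 1e-3), D, G))                    # constant eta
print(last_iterate_bound(0.1 / np.sqrt(np.arange(1, T + 1)), D, G))  # eta_t = alpha/sqrt(t)
```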

Two Examples

It is not difficult to see that the right side of \eqref{eq:last-1} has essentially the same form as \eqref{eq:avg-2}, which suggests that the final loss and the average loss should have similar convergence rates. As before, we examine the behavior of the final conclusion through two examples: a constant learning rate and a dynamic learning rate. First, for the constant learning rate $\eta_t = \eta$:

\begin{equation} \begin{aligned} \mathbb{E}[L(\theta_T) - L(\theta^*)] &\le \frac{\|\theta_1 - \theta^*\|^2}{2T\eta} + \frac{G^2\eta}{2} + \frac{G^2\eta}{2} \sum_{t=1}^{T-1} \frac{1}{T-t} \\ &\le \frac{\|\theta_1 - \theta^*\|^2}{2T\eta} + \frac{G^2\eta}{2} (2 + \ln T) \end{aligned} \end{equation}

Setting $\eta = \|\theta_1 - \theta^*\| / (G \sqrt{T(2 + \ln T)})$ minimizes the rightmost expression, giving a convergence rate of $O(\sqrt{\ln T / T})$. This is slightly slower than the average loss convergence rate: in the previous post, we proved that under a constant learning rate, the average loss can converge at rate $O(1/\sqrt{T})$. Of course, this is an asymptotic difference; in practice, the extra $\sqrt{\ln T}$ factor may well be negligible.
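
The stated step size comes from minimizing $\frac{\|\theta_1 - \theta^*\|^2}{2T\eta} + \frac{G^2\eta}{2}(2 + \ln T)$ over $\eta$; a quick numerical confirmation with illustrative constants:

```python
import numpy as np

# Sketch: grid-search the simplified constant-eta bound and compare
# with the theoretical minimizer (D, G, T are illustrative).
T, D, G = 10_000, 1.0, 1.0
bound = lambda eta: D ** 2 / (2 * T * eta) + G ** 2 * eta * (2 + np.log(T)) / 2

etas = np.logspace(-5, 0, 4001)
eta_grid = etas[np.argmin(bound(etas))]
eta_theory = D / (G * np.sqrt(T * (2 + np.log(T))))
print(eta_grid, eta_theory)  # agree up to grid resolution
```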

Next, consider the dynamic learning rate $\eta_t = \alpha / \sqrt{t}$. Substituting into equation \eqref{eq:last-1} gives:

\begin{equation} \begin{aligned} \mathbb{E}[L(\theta_T) - L(\theta^*)] &\le \frac{\|\theta_1 - \theta^*\|^2}{2\alpha \sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}} \sum_{t=1}^{T-1} \frac{T}{t(T-t)} \\ &= \frac{\|\theta_1 - \theta^*\|^2}{2\alpha \sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}} \sum_{t=1}^{T-1} \left( \frac{1}{t} + \frac{1}{T-t} \right) \\ &= \frac{\|\theta_1 - \theta^*\|^2}{2\alpha \sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}} + \frac{G^2\alpha}{\sqrt{T}} \sum_{t=1}^{T-1} \frac{1}{t} \\ &\le \frac{\|\theta_1 - \theta^*\|^2}{2\alpha \sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}} + \frac{G^2\alpha}{\sqrt{T}} (1 + \ln T) \\ &= O\left(\frac{\ln T}{\sqrt{T}}\right) \end{aligned} \end{equation}

As with the average loss convergence in unbounded domains from the previous article, the convergence rate under the dynamic learning rate $\eta_t = \alpha / \sqrt{t}$ is $O(\ln T / \sqrt{T})$, though the constant here is somewhat larger.
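
As a final sanity check, the closed-form bound above, normalized by $\ln T / \sqrt{T}$, should approach a constant as $T$ grows; a sketch with illustrative constants:

```python
import numpy as np

# Sketch: the dynamic-schedule bound divided by ln(T)/sqrt(T) tends to
# a constant (here G^2 * alpha = 1) as T grows; convergence is slow.
D, G, alpha = 1.0, 1.0, 1.0
for T in [10 ** 3, 10 ** 5, 10 ** 7, 10 ** 9]:
    b = (D ** 2 / (2 * alpha) + G ** 2 * alpha / 2
         + G ** 2 * alpha * (1 + np.log(T))) / np.sqrt(T)
    print(T, b / (np.log(T) / np.sqrt(T)))
```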

Summary

In this article, we extended the SGD convergence conclusions from the average loss to the final loss, characterizing how close the loss value at the end of training is to the theoretical optimum. This setting is more closely aligned with actual training practice.