By 苏剑林 | Feb 23, 2026
In the previous article "MoE World Tour: 6. Balancing via Optimal Allocation", we achieved load balancing by solving the following optimal allocation problem:
\[\max_{x_{i,j} \in \{0,1\}} \sum_{i,j} x_{i,j}s_{i,j} \quad \text{s.t.} \sum_{j} x_{i,j}=k, \sum_{i} x_{i,j} = \frac{mk}{n}\]
Where $\sum_{j} x_{i,j}=k$ indicates that each token activates exactly $k$ experts, and $\sum_{i} x_{i,j} = mk/n$ indicates that each expert is activated exactly $mk/n$ times. However, upon careful reflection, the former constraint is not strictly necessary for training or inference. What we truly need is the latter, which implies "on average, each token activates $k$ experts" and ensures load balancing across experts. This is sufficient to achieve the goals of MoE. Therefore, this article considers a more simplified problem:
\begin{equation}\max_{x_{i,j} \in \{0,1\}} \sum_{i,j} x_{i,j}s_{i,j} \quad \text{s.t.} \sum_{i} x_{i,j} = \frac{mk}{n}\label{eq:target}\end{equation}
Dynamic Activation
Assuming the reader is already familiar with the previous article, the derivations here will be relatively brief. If any steps are confusing, you can refer back to the previous post. Just like before, we first consider the relaxed version of objective \eqref{eq:target}:
\[\max_{x_{i,j} \in [0,1]} \sum_{i,j} x_{i,j}s_{i,j} \quad \text{s.t.} \sum_{i} x_{i,j} = \frac{mk}{n}\]
Then consider its equivalent max-min form:
\[\max_{x_{i,j} \in [0,1]} \min_{\beta_j} \sum_{i,j} x_{i,j}s_{i,j} - \sum_j \beta_j \left(\sum_i x_{i,j} - \frac{mk}{n}\right)\]
As before, the order of max and min can be exchanged (strong duality holds for this linear program). Exchanging them and rearranging yields:
\begin{equation}\min_{\beta_j} \max_{x_{i,j} \in [0,1]} \sum_{i,j} x_{i,j}(s_{i,j} - \beta_j) + \frac{mk}{n} \sum_j \beta_j\label{eq:dual}\end{equation}
The inner maximization can be solved in closed form, giving:
\[\begin{cases} x_{i,j}^* = 1, & s_{i,j} - \beta_j > 0 \\ x_{i,j}^* = 0, & s_{i,j} - \beta_j < 0 \\ x_{i,j}^* \in [0,1], & s_{i,j} - \beta_j = 0 \end{cases}\]
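The closed-form solution above can be checked numerically: for any fixed $\beta$, the elementwise rule $x^* = \mathbb{1}[s - \beta > 0]$ attains the maximum of $\sum_i x_i(s_i - \beta)$ over $x \in [0,1]^m$. A minimal sketch comparing it against random feasible points:

```python
# Sketch: for a fixed beta, the inner maximum over x in [0,1]^m is attained
# elementwise at x=1 where s - beta > 0 and x=0 where s - beta < 0.
import numpy as np

rng = np.random.default_rng(0)
s, beta = rng.normal(size=100), 0.3

x_star = (s - beta > 0).astype(float)   # closed-form maximizer
best = np.dot(x_star, s - beta)         # equals sum(max(0, s - beta))

for _ in range(1000):
    x = rng.uniform(0, 1, size=100)     # random feasible point
    assert np.dot(x, s - beta) <= best + 1e-12
```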
This tells us that any expert where $s_{i,j} - \beta_j > 0$ will be activated. The number of activated experts is not fixed, which formally matches the intuition designed in "MoE World Tour: 4. Invest More Where It's Hard", except here it is derived from the more fundamental objective \eqref{eq:target}.
One-Step Solution
Substituting $x_{i,j}^*$ back into equation \eqref{eq:dual}, since $x_{i,j}^*(s_{i,j} - \beta_j) = \max(0, s_{i,j} - \beta_j)$, the optimization target for $\beta_j$ simplifies to:
\begin{equation}\min_{\beta_j} \sum_{i,j} \max(0, s_{i,j} - \beta_j) + \frac{mk}{n} \sum_j \beta_j\label{eq:beta-obj}\end{equation}
The terms for each $\beta_j$ are independent, so it can be decomposed into $n$ independent sub-optimization problems. This means we can temporarily drop the subscript $j$:
\[\min_{\beta} \frac{mk}{n}\beta + \sum_i \max(0, s_i - \beta)\]
Arrange all $s_i$ in descending order as $s_{\sigma_1} \ge s_{\sigma_2} \ge \dots \ge s_{\sigma_m}$. Assuming $s_{\sigma_l} \ge \beta \ge s_{\sigma_{l+1}}$, we can remove the $\max$ operator:
\[\frac{mk}{n}\beta + \sum_{i=1}^l (s_{\sigma_i} - \beta) = \begin{cases} \sum_{i=1}^{mk/n} s_{\sigma_i} + \underbrace{\sum_{i=mk/n+1}^l (s_{\sigma_i} - \beta)}_{\ge 0, \text{ if } l \ge mk/n} \\ \sum_{i=1}^{mk/n} s_{\sigma_i} - \underbrace{\sum_{i=l+1}^{mk/n} (s_{\sigma_i} - \beta)}_{\le 0, \text{ if } l \le mk/n} \end{cases}\]
This shows that moving to either $l > mk/n$ or $l < mk/n$ can only increase the objective. Therefore, the minimum is achieved at $l = mk/n$, meaning $\beta^*$ lies anywhere between the $(mk/n)$-th and $(mk/n+1)$-th largest elements of $s_i$. By convention, we take the $(mk/n+1)$-th largest element. Restoring the index $j$: for each $j$, $\beta_j^*$ is the $(mk/n+1)$-th element when the $s_{i,j}$ are sorted in descending order, in other words the "$1-k/n$ quantile."
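The one-step solution is straightforward to express in numpy: sort each expert's scores in descending order along the token axis, take the $(mk/n+1)$-th largest as $\beta_j$, and threshold. A minimal sketch (variable names are illustrative):

```python
# One-step Quantile Balancing: beta_j is the (mk/n+1)-th largest score
# of expert j, i.e. the 1 - k/n quantile along the token axis.
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 8, 4, 1
cap = m * k // n                    # mk/n = 2 activations per expert
s = rng.normal(size=(m, n))         # router scores, one column per expert

s_sorted = -np.sort(-s, axis=0)     # descending sort per expert
beta = s_sorted[cap]                # (cap+1)-th largest element of each column

x = (s > beta).astype(int)          # activate where s_ij - beta_j > 0
print(x.sum(axis=0))                # each expert is activated exactly cap times
```

With continuous scores (no ties), exactly `cap` entries per column lie strictly above the `(cap+1)`-th largest, so the balance constraint holds exactly.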
Note that since the only variable to solve for is $\boldsymbol{\beta}$, there are no alternating iteration steps: a single Quantile calculation yields the exactly balanced optimal solution! We still call this "**Quantile Balancing (QB)**."
Practical Essentials
If the reader has read "MoE World Tour: 6. Balancing via Optimal Allocation", they will realize that this article is essentially a streamlined version of it. However, simple does not mean simplistic; it is actually more elegant and efficient than the Top-k form of MoE. First, it activates experts based on whether $s_{i,j} - \beta_j > 0$, which is more efficient than selecting Top-k. Second, it obtains the optimal balance solution via a single Quantile step, which is more efficient and elegant.
Of course, a single-step optimal solution might overfit the current batch of training data. To avoid this, an improved approach is to perform an EMA (Exponential Moving Average) between the current batch's optimal solution and historical solutions. Note that we must still be careful of the "information leakage" trap: use the old $\beta$ to activate experts first, then update $\beta$.
The algorithm flow is as follows (with $\lambda=0.9$ in the author's experiments):
Quantile Balancing (QB) Dynamic Activation Version
Input: Score matrix $\boldsymbol{s}\in\mathbb{R}^{m\times n}$, previous step $\boldsymbol{\beta}\in\mathbb{R}^n$, decay rate $\lambda$
Output: Allocation scheme $\boldsymbol{x}\in\{0,1\}^{m\times n}$, new $\boldsymbol{\beta}\in\mathbb{R}^n$
1: $x_{i,j}=1$ if $s_{i,j} - \beta_j > 0$ else $0$
2: $\boldsymbol{\beta} \leftarrow \lambda\boldsymbol{\beta} + (1-\lambda)\text{des\_sort}(\boldsymbol{s}, \text{axis=0})_{[mk/n:mk/n+1]}$
3: Output $\boldsymbol{x}, \boldsymbol{\beta}$
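The three algorithm steps above can be sketched as a single function; `qb_step` is a hypothetical name, and `lam` plays the role of the decay rate $\lambda$:

```python
# One training-step sketch of the QB dynamic-activation algorithm:
# activate with the OLD beta (avoiding information leakage), then EMA-update it.
import numpy as np

def qb_step(s, beta, k, lam=0.9):
    """s: (m, n) score matrix; beta: (n,) thresholds from the previous step."""
    m, n = s.shape
    cap = m * k // n                        # mk/n activations per expert
    x = (s - beta > 0).astype(int)          # step 1: old beta decides activation
    q = -np.sort(-s, axis=0)[cap]           # (mk/n+1)-th largest score per expert
    new_beta = lam * beta + (1 - lam) * q   # step 2: EMA toward the batch optimum
    return x, new_beta                      # step 3
```

Note the ordering: `x` is computed before `beta` sees the current batch, which is exactly the leakage-avoidance rule stated above.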
In practice, finding the $1-k/n$ quantile of $\boldsymbol{s}$ can be expensive, as it requires finding the $(mk/n+1)$-th largest element among $m$ elements, where $m$ equals "global samples $\times$ sequence length." Under parallel training strategies and gradient accumulation, an exact global implementation is often impractical. Thus, we use a compromise: divide the samples into the largest acceptable mini-batches, calculate $\boldsymbol{\beta}$ for each mini-batch, and average the results.
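The mini-batch compromise can be sketched as follows; `batched_beta` is a hypothetical helper, and it assumes each mini-batch size times $k$ is divisible by $n$:

```python
# Compromise: compute the quantile beta on each mini-batch separately,
# then average across mini-batches as the final estimate.
import numpy as np

def batched_beta(s, k, num_batches):
    betas = []
    for sb in np.array_split(s, num_batches, axis=0):
        cap = sb.shape[0] * k // sb.shape[1]       # per-batch mk/n
        betas.append(-np.sort(-sb, axis=0)[cap])   # per-batch (mk/n+1)-th largest
    return np.mean(betas, axis=0)                  # average over mini-batches
```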
Expert Choice
Actually, there is a very simple way to understand this dynamic QB: "**Expert Choice**":
If you want each expert to be activated exactly $mk/n$ times, why not let the experts choose the tokens? That is, each expert selects its Top-$mk/n$ tokens. Isn't this just finding the $mk/n$ largest elements along $\text{axis=0}$?
However, naive Expert Choice has a fatal flaw: it requires comparing all tokens globally, leading to cross-sample and cross-sequence selection. This violates causality (seeing future information during training) and breaks training-inference consistency (inference results depend on batch size). This is why the method has struggled to gain traction previously.
The ingenuity of this article lies in transforming it into a Bias form: using the $(mk/n+1)$-th largest element (equivalent to the $1-k/n$ quantile) of each expert as a threshold $\boldsymbol{\beta}$. We reformulate Expert Choice as Token Choice—from "Expert selects Top-$mk/n$" to "Activate if $s_{i,j} - \beta_j > 0$". Crucially, by delaying the update of $\boldsymbol{\beta}$ until after the activation decision, we perfectly avoid the information leakage problem.
It is surprising that this Bias form, which fixes the flaws of Expert Choice, is naturally contained within the dual problem of the original linear programming—it has a truly pleasing aesthetic quality.
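The equivalence between Expert Choice and the Bias form can be verified numerically: each expert picking its Top-$mk/n$ tokens yields the same assignment as thresholding at $\beta_j$, the $(mk/n+1)$-th largest score. A small check (assuming no tied scores):

```python
# Check: Expert Choice (each column keeps its top-cap tokens) coincides
# with thresholding at beta_j = the (cap+1)-th largest score of expert j.
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 8, 4, 1
cap = m * k // n
s = rng.normal(size=(m, n))

# Expert Choice: expert j selects its top-cap tokens
x_ec = np.zeros((m, n), dtype=int)
rows = np.argsort(-s, axis=0)[:cap]
for j in range(n):
    x_ec[rows[:, j], j] = 1

# Bias form: activate where s_ij exceeds the (cap+1)-th largest score
beta = -np.sort(-s, axis=0)[cap]
x_qb = (s > beta).astype(int)

assert (x_ec == x_qb).all()
```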
Initialization Strategy
To prevent information leakage during training, we must use the old $\boldsymbol{\beta}$ to decide activation first, then update $\boldsymbol{\beta}$. This makes the first few training steps sensitive to the initialization of $\boldsymbol{\beta}$. For instance, if $\boldsymbol{s}$ is all positive but $\boldsymbol{\beta}$ is initialized to zero, every token will activate every expert in the first step, causing computation to explode.
In "MoE World Tour: 4", we provided a simulation script to estimate a suitable initialization. But here, knowing the optimal $\boldsymbol{\beta}$ is the "$1-k/n$ quantile," we can make reasonable assumptions about the initial scores and derive an analytical formula for initialization. Specifically, if we assume the initial logits of the Router follow $\mathcal{N}(0, \sigma^2)$, then the $1-k/n$ quantile is:
\begin{equation}\sigma \cdot \Phi^{-1}\left(1 - \frac{k}{n}\right)\end{equation}
Where $\Phi^{-1}$ is the quantile function (inverse CDF) of the standard normal distribution. This provides the optimal $\boldsymbol{\beta}$ for that logit distribution.
Furthermore, if the Router uses an element-wise, monotonically increasing activation function like Sigmoid, the ranking remains unchanged, so simply apply the activation to the formula above. For Softmax, it's slightly more complex as it involves exponentiation and normalization. Assuming the normalization denominator is approximately constant, we can exponentiate the result and estimate the denominator by sampling:
\begin{equation}\frac{\exp\left[\sigma \cdot \Phi^{-1}\left(1 - \frac{k}{n}\right)\right]}{\sum_{i=1}^n \exp\left[\sigma \cdot \Phi^{-1}\left(1 - \frac{i}{n+1}\right)\right]}\end{equation}
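The two initialization formulas can be computed with the standard library's inverse normal CDF. A sketch under the stated assumption that router logits follow $\mathcal{N}(0, \sigma^2)$; `beta_init` is a hypothetical name:

```python
# Analytic initialization for beta, assuming router logits ~ N(0, sigma^2).
# NormalDist().inv_cdf is the quantile function Phi^{-1} of the standard normal.
import math
from statistics import NormalDist

def beta_init(sigma, k, n, activation="sigmoid"):
    z = sigma * NormalDist().inv_cdf(1 - k / n)   # logit-space 1 - k/n quantile
    if activation == "sigmoid":
        # element-wise monotone activation: just push the quantile through it
        return 1 / (1 + math.exp(-z))
    if activation == "softmax":
        # approximate the (assumed constant) normalizer by summing exp at
        # the evenly spaced quantiles 1 - i/(n+1), i = 1..n
        denom = sum(math.exp(sigma * NormalDist().inv_cdf(1 - i / (n + 1)))
                    for i in range(1, n + 1))
        return math.exp(z) / denom
    return z                                      # identity / linear router
```

With the estimate below, one would call this as e.g. `beta_init(tilde_sigma * math.sqrt(d), k, n)`.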
As for estimating $\sigma$: if the router input dimension is $d$ with RMS Norm, and weights follow $\mathcal{N}(0, \tilde{\sigma}^2)$, then Router Logits approximately follow a normal distribution with mean 0 and variance $\tilde{\sigma}^2 d$, i.e., $\sigma \approx \tilde{\sigma}\sqrt{d}$. These are standard results for initialization [Reference].
Gradient Descent
Finally, if one wishes to avoid Quantile calculations entirely, gradient descent is an option. Let the loss function of objective \eqref{eq:beta-obj} be $\ell$:
\begin{equation}\min_{\beta_j} \underbrace{\sum_{i,j} \max(0, s_{i,j} - \beta_j) + \frac{mk}{n} \sum_j \beta_j}_{\text{denoted as } \ell}\end{equation}
It is differentiable, and the gradient is:
\begin{equation}\frac{\partial\ell}{\partial \beta_j} = \frac{mk}{n} - \sum_{i=1}^m \chi(s_{i,j} - \beta_j > 0)\end{equation}
Where $\chi$ is the indicator function. Calculating this gradient is cheaper than a global Quantile. With the gradient, we can perform gradient descent, specifically SignSGD to align with the Loss-Free MoE approach:
\begin{equation}\beta_j \leftarrow \beta_j - \gamma \text{sign}\left(\frac{\partial \ell}{\partial \beta_j}\right)\end{equation}
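The SignSGD update is a one-liner once the gradient is written out; a sketch with an illustrative function name, where `gamma` is the step size $\gamma$:

```python
# SignSGD on the dual objective: grad_j = mk/n - #{i : s_ij - beta_j > 0},
# then beta_j <- beta_j - gamma * sign(grad_j). Cheaper than a global quantile.
import numpy as np

def signsgd_beta_update(s, beta, k, gamma=1e-3):
    m, n = s.shape
    grad = m * k / n - (s - beta > 0).sum(axis=0)   # d loss / d beta_j
    return beta - gamma * np.sign(grad)
```

Intuitively: an over-activated expert (count above $mk/n$) gets its threshold raised, an under-activated one gets it lowered.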
This can replace the Quantile update step for $\boldsymbol{\beta}$. Since this is an iterative algorithm, EMA can be omitted. In fact, this is exactly formula (9) proposed in "MoE World Tour: 4", now re-derived from the perspective of "Optimal Allocation + Dual Objective + Gradient Descent."
Summary
Continuing from the Quantile Balancing (QB) explored in the previous post, by removing the "each token must activate exactly $k$ experts" constraint, we have significantly simplified the solution. A single Quantile step achieves load balancing. Tokens decide which experts to activate simply by checking the sign of the biased score, eliminating the overhead of Top-k sorting.