MoE Tour: 3. A Different Approach to Allocation

By 苏剑林 | March 5, 2025

In this article, we continue our exploration of the load balancing problem in Mixture-of-Experts (MoE). In the previous post, "MoE Tour: 2. Concern is Not with Scarcity, but with Inequality," we primarily discussed the approach of promoting load balance through Auxiliary Loss (Aux Loss). While Aux Loss is simple and intuitive, it has a significant drawback—the weight is difficult to tune. If the weight is too low, it fails to promote balance; if it is too high, it easily harms the Language Model (LM) Loss. Consequently, the industry has been constantly searching for alternative solutions.

The method we want to share today is called the "Loss-Free" scheme, proposed by DeepSeek in "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts." Compared to DeepSeek's many dazzling open-source works, this paper might seem less prominent. However, in my view, its potential academic influence may far surpass other works because the proposed method is not only simple and effective but also highly universal, making it an instant classic.

General Idea

Faced with load imbalance, the idea of Aux Loss is to guide the Router to produce balanced scores through an additional loss. In contrast, the idea of Loss-Free is to change the allocation method itself—that is, it does not change the Router's existing scores, but rather changes the $\mathop{\text{argtop}}_k \boldsymbol{\rho}$ allocation mechanism.

There have been previous efforts in this direction. For example, in 2021, Facebook proposed BASE Layer, which treats Expert allocation as a linear assignment problem. That is, it treats load balance as a constraint and seeks to find an allocation result where the total Router score is as high as possible under that constraint, which can be solved using the Hungarian algorithm or similar methods. However, such schemes require knowing the scores of all tokens simultaneously. For auto-regressive LLMs, this is only applicable during training; inference still has to rely on $\mathop{\text{argtop}}_k \boldsymbol{\rho}$, leading to training-inference inconsistency. Furthermore, due to the limitations of current solvers, it is typically only applicable to the $k=1$ scenario.

In comparison, the Loss-Free approach is remarkably simple and effective. It notes the fact that we can always introduce a bias term $\boldsymbol{b}$ such that the allocation $\mathop{\text{argtop}}_k \boldsymbol{\rho} + \boldsymbol{b}$ is balanced. Therefore, it modifies the MoE form to:

\[\boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho}} \rho_i \boldsymbol{e}_i\qquad\to\qquad \boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho} + \boldsymbol{b}} \rho_i \boldsymbol{e}_i\]

Here, $\boldsymbol{b}$ is an input-independent vector determined during the training process. Once training is complete, it remains fixed, and therefore can be used during the inference stage as well. In other words, training and inference share identical forms. Note that the value multiplied by $\boldsymbol{e}_i$ is still $\rho_i$ and not $\rho_i + b_i$. This means $\boldsymbol{b}$ only participates in the allocation process and does not participate in the MoE forward computation. Thus, we have no special requirements regarding the positivity or negativity of $\boldsymbol{b}$ or $\boldsymbol{\rho} + \boldsymbol{b}$.
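To make this concrete, here is a minimal NumPy sketch of the biased routing for a single token. The names (`moe_forward`, `W_r`, `experts`) are my own, and Sigmoid is used as the router activation for illustration; the point is that $\boldsymbol{b}$ enters only the selection, never the gate.

```python
import numpy as np

def moe_forward(x, W_r, experts, b, k=2):
    """Minimal sketch of biased top-k routing: the bias b only steers
    which experts are selected; the value multiplied onto each expert
    output is still rho_i, not rho_i + b_i."""
    rho = 1 / (1 + np.exp(-x @ W_r))              # router scores (sigmoid here)
    topk = np.argsort(rho + b)[-k:]               # selection uses rho + b
    y = sum(rho[i] * experts[i](x) for i in topk) # gating uses rho alone
    return y, topk
```

Because `b` is the same vector for every input, the exact same routine serves both training and inference.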

Handcrafting Gradients

How do we train $\boldsymbol{b}$? We know the optimization direction for $\boldsymbol{b}$ is naturally to promote load balance. To this end, following the notation of the previous article, we first define $\boldsymbol{f}=[f_1, f_2, \cdots, f_n]$:

\begin{equation}f_i = \left\{\begin{aligned}1/k, \quad i\in \mathop{\text{argtop}}_k \boldsymbol{\rho}+\boldsymbol{b} \\ 0, \quad i\not\in \mathop{\text{argtop}}_k \boldsymbol{\rho}+\boldsymbol{b}\end{aligned}\right.\end{equation}

And $\boldsymbol{F}=\mathbb{E}[\boldsymbol{f}]$. Here $\boldsymbol{F}$ naturally represents the current load distribution of experts under the bias $\boldsymbol{b}$. Next, we define the uniform distribution as $\boldsymbol{Q}=(1/n, 1/n, \cdots, 1/n)$. Load balancing then becomes equivalent to minimizing:

\begin{equation}\mathcal{L}_{\text{aux}} = \frac{1}{2}\Vert\boldsymbol{F} - \boldsymbol{Q}\Vert^2 = \frac{1}{2}\sum_{i=1}^n (F_i - 1/n)^2\end{equation}

This objective is non-differentiable. However, based on our experience from the previous post, we know that the STE (Straight-Through Estimator) can handle this. The key to STE is finding a differentiable quantity that shares the same trend of increase and decrease as $\boldsymbol{F}$ to serve as a smooth approximation. Here, our only optimization parameter is $\boldsymbol{b}$, and it possesses exactly the property we desire (increasing $b_i$ raises the probability of expert $i$ being selected, and thus raises $F_i$). So the answer reveals itself:

\begin{equation}\mathcal{L}_{\text{aux}} = \frac{1}{2}\Vert\boldsymbol{b} + \text{sg}[\boldsymbol{F}-\boldsymbol{b}] - \boldsymbol{Q}\Vert^2 = \frac{1}{2}\sum_{i=1}^n (b_i + \text{sg}[F_i - b_i] - 1/n)^2\end{equation}

Its gradient is:

\begin{equation}\nabla_{\boldsymbol{b}}\mathcal{L}_{\text{aux}} = \frac{1}{2}\nabla_{\boldsymbol{b}}\Vert\boldsymbol{b} + \text{sg}[\boldsymbol{F}-\boldsymbol{b}] - \boldsymbol{Q}\Vert^2 = \boldsymbol{F} - \boldsymbol{Q}\end{equation}

So, updating $\boldsymbol{b}$ using Stochastic Gradient Descent (SGD) gives:

\begin{equation}\boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha (\boldsymbol{F} - \boldsymbol{Q})\end{equation}

Where $\alpha$ is the learning rate for $\boldsymbol{b}$. However, the rule eventually chosen by Loss-Free is slightly different; it chooses Sign Gradient Descent (SignSGD):

\begin{equation}\boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q})\label{eq:aux-loss-free}\end{equation}

This result is also easy to understand: if $F_i$ is larger than $1/n$, then decrease $b_i$ a little; otherwise, increase $b_i$ a little.
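The whole update can be sketched in a few lines of NumPy. This is my own schematic (the function name and the batched score layout are assumptions, with Sigmoid-style scores in $(0,1)$): given a batch of router scores, it forms the load statistics $\boldsymbol{F}$ under the current bias and applies the SignSGD rule.

```python
import numpy as np

def update_bias(rho, b, k, alpha=0.001):
    """One Loss-Free step over a batch of router scores rho
    (shape [tokens, experts]): compute the load statistics F under
    the current bias, then apply b <- b - alpha * sign(F - Q)."""
    T, n = rho.shape
    sel = np.argsort(rho + b, axis=-1)[:, -k:]   # top-k under the bias
    f = np.zeros((T, n))
    np.put_along_axis(f, sel, 1.0 / k, axis=-1)  # f_i = 1/k on selected experts
    F = f.mean(axis=0)                           # current load distribution
    Q = np.full(n, 1.0 / n)                      # uniform target
    return b - alpha * np.sign(F - Q)
```

Iterating this on a deliberately skewed router quickly pulls the overloaded expert's bias down and the idle experts' biases up.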

Improved Version

In addition to SignSGD, I found that applying RMS Norm (i.e., Normalized SGD) directly to $\boldsymbol{F} - \boldsymbol{Q}$ often achieves better balancing effects with the same $\alpha$:

\begin{equation}\boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha\frac{\boldsymbol{F} - \boldsymbol{Q}}{\text{RMS}(\boldsymbol{F} - \boldsymbol{Q})}\end{equation}

Where RMS stands for "Root Mean Square," defined as:

\begin{equation}\text{RMS}(\boldsymbol{F} - \boldsymbol{Q}) = \sqrt{\frac{1}{n}\sum_{i=1}^n (F_i - Q_i)^2}\end{equation}

It is not hard to see that both $\mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q})$ and the normalized $\frac{\boldsymbol{F} - \boldsymbol{Q}}{\text{RMS}(\boldsymbol{F} - \boldsymbol{Q})}$ have an RMS of 1, so they are roughly on the same scale, allowing the same $\alpha$ to be used.

In short, the problem with $\mathop{\text{sign}}$ is that it uses the same update magnitude regardless of how close $F_i$ is to the target $Q_i$. This causes $F_i$ that are already close to $Q_i$ to oscillate and deviate from balance. RMS Norm, however, preserves the relative relationship between $F_i - Q_i$, making the update magnitude more adaptive. Theoretically, this is more conducive to promoting balance, and in practice, it usually performs better.
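A sketch of this variant (the function name and the small epsilon guard against an exactly balanced $\boldsymbol{F}$ are my own additions):

```python
import numpy as np

def update_bias_rms(F, b, alpha=0.001):
    """Normalized-SGD variant: the update keeps the relative sizes of
    F_i - Q_i but is rescaled to RMS 1, so the same alpha as the
    sign-based rule applies."""
    n = len(F)
    g = F - 1.0 / n                       # F - Q with Q uniform
    rms = np.sqrt(np.mean(g ** 2))
    return b - alpha * g / (rms + 1e-12)  # epsilon guards the balanced case
```

Experts whose loads are already near $1/n$ thus receive proportionally tiny corrections instead of full-size sign flips.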

In the Same Lineage

When the original paper introduced Loss-Free, it did not include the Aux Loss derivation provided above. Instead, it directly presented the update rule in Equation \eqref{eq:aux-loss-free}, giving the impression that they "handcrafted" the gradient $\mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q})$ for $\boldsymbol{b}$. This is where the name "Loss-Free" originates.

However, from the derivation given in this post, we can see that the update rule \eqref{eq:aux-loss-free} can be derived entirely from an Aux Loss perspective; the two are deeply connected. The most direct benefit of Loss-Free would seem to be that there is no Aux Loss weight to tune, although it still carries a learning-rate hyperparameter $\alpha$. The original paper has already searched out the default value $\alpha=0.001$, but it cannot be denied that the hyperparameter exists.

In my opinion, the essential innovation of Loss-Free is not the absence of an Aux Loss, but the isolation of the optimization parameters of the Aux Loss and the LM Loss, thereby achieving both load balance and model capability. The most critical step is noticing the fact that "a single bias term is sufficient to achieve load balance." Then, the Aux Loss is allowed to optimize only the newly introduced bias $\boldsymbol{b}$, while the LM Loss optimizes the remaining parameters, minimizing the negative impact of the Aux Loss on the LM Loss.

In contrast, conventional Aux Loss schemes require all parameters to promote load balance, while LM Loss also optimizes all parameters. The optimization directions of the two might not be fully compatible, making it relatively more difficult to find an optimal balance point. Thus, Loss-Free's isolation of the two loss functions' optimization parameters based on "one bias term is enough" is a brilliant solution to the load balancing problem.

Related Details

Although Loss-Free is simple and clear enough, some details should be noted when using it.

First, for each batch of data, we should first update the model parameters based on the LM Loss, and then update $\boldsymbol{b}$ according to Equation \eqref{eq:aux-loss-free}. This is because the update of $\boldsymbol{b}$ depends on the aggregate statistics $\boldsymbol{F}$ of all tokens in the batch. If we update $\boldsymbol{b}$ before the rest of the model parameters, there is technically a risk of leaking future information. Although it seems unlikely that a single vector $\boldsymbol{b}$ would leak much information, the risk exists, and it is better to avoid it.
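The ordering can be made explicit in a schematic step, assuming a stub `model_update` standing in for the LM-Loss optimizer step (all names here are hypothetical):

```python
import numpy as np

def train_step(batch_rho, b, model_update, k=2, alpha=0.001):
    """Schematic per-batch ordering: route and update the model with
    the current b first; only afterwards refresh b from this batch's
    load statistics, so b never reflects the batch it just routed."""
    T, n = batch_rho.shape
    sel = np.argsort(batch_rho + b, axis=-1)[:, -k:]  # route with current b
    model_update(sel)                        # 1) LM-Loss step (stub here)
    f = np.zeros((T, n))
    np.put_along_axis(f, sel, 1.0 / k, axis=-1)
    F = f.mean(axis=0)
    return b - alpha * np.sign(F - 1.0 / n)  # 2) bias step, after the model step
```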

Second, we mentioned that the original paper tuned $\alpha=0.001$, but this result might be coupled with the choice of using Sigmoid as the Router activation function $\boldsymbol{\rho}$. The reason is simple: after Sigmoid, each $\rho_i$ is relatively independent and within $(0,1)$. $\alpha=0.001$ implies an update magnitude of about one-thousandth per step. If Softmax, ReLU, or other activation functions are used, $\alpha$ might need to be retuned.

To address this problem, I suggest decoupling the activation functions used for the Gate and the Bias, i.e.:

\begin{equation}\boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho} + \boldsymbol{b}} \rho_i \boldsymbol{e}_i\qquad\to\qquad \boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho}^{(\sigma)} + \boldsymbol{b}} \rho_i^{(h)} \boldsymbol{e}_i\end{equation}

Where $\boldsymbol{\rho}^{(\sigma)} = \sigma(\boldsymbol{x}\boldsymbol{W}^{(R)})$ and $\boldsymbol{\rho}^{(h)} = h(\boldsymbol{x}\boldsymbol{W}^{(R)})$. Here $\sigma(\cdot)$ is the Sigmoid function, and $h(\cdot)$ is any monotonic function with a non-negative range. Simply put, the Sigmoid-activated score is what is added to $\boldsymbol{b}$, allowing us to reuse $\alpha=0.001$. As for the Gate multiplied by the Expert, we can use other activation functions as long as their monotonicity is consistent with Sigmoid.
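A sketch of this decoupling (the function name is mine, and `h = np.exp` is merely one admissible monotone, non-negative choice, not a recommendation from the paper):

```python
import numpy as np

def decoupled_route(logits, b, k, h=np.exp):
    """Selection always goes through Sigmoid (so the tuned
    alpha = 0.001 still applies), while the gate may use a separate
    monotone, non-negative activation h."""
    sigma = 1 / (1 + np.exp(-logits))  # rho^(sigma): selection scores
    sel = np.argsort(sigma + b)[-k:]   # bias acts on the Sigmoid scores only
    gate = h(logits)[sel]              # rho^(h): gate values for the experts
    return sel, gate
```

Since $\sigma$ and $h$ are both monotone in the logits, they rank the experts identically; only the magnitudes fed to the bias and to the gate differ.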

Additionally, because the update rule \eqref{eq:aux-loss-free} includes the $\text{sign}$ function, it is possible for the absolute value of $b_i$ to exceed 1 or for the overall absolute values to keep increasing. These are normal and will not affect the model's performance. In fact, $\boldsymbol{b}$ has a redundant degree of freedom, since the result of $\mathop{\text{argtop}}_k \boldsymbol{\rho} + \boldsymbol{b}$ remains unchanged if a constant is added to all $b_i$. This extra degree of freedom can be used for other interesting things (to be discussed in a later installment).

Further Reflections

Beyond MoE load balancing, the idea of Loss-Free can be applied to many similar problems. For example, Codebook Collapse in VQ-VAE can be solved using the same approach. Compared to the previously introduced "Rotation Trick" or "Linear Transformation Trick," it appears more natural and universal. In fact, the evaluation at the beginning of this article—that "the potential academic influence of Loss-Free may far surpass other works"—is precisely based on this consideration of its universality.

Setting aside specific application contexts and looking at it mathematically, the contribution of Loss-Free can be understood as providing a way to solve the assignment problem using gradient descent. A classic linear assignment problem can be expressed as:

\begin{equation}\min_f \sum_{i=1}^n c_{i, f(i)}\end{equation}

Where $c_{i,j}$ is a given cost function, and $f$ is a bijection from $\{1,2,\cdots,n\}$ to itself. In the context of this post, $c_{i,j}$ corresponds to the scores of $n$ tokens and $n$ experts, and the sought-after $f$ is exactly a balanced allocation scheme. The general idea for solving such problems is to search for the optimal solution within the space of solutions that satisfy the constraints. Loss-Free, on the other hand, works backward. It first constructs an optimal solution that might not satisfy the constraints:

\begin{equation}f(i) = \mathop{\text{argmin}}_j c_{i,j}\end{equation}

This solution is definitely optimal in terms of score, but it is not necessarily a bijection (which, in this context, is equivalent to load imbalance). Thus, we introduce a bias:

\begin{equation}f(i) = \mathop{\text{argmin}}_j \left(c_{i,j} + b_j\right)\end{equation}

$b_j$ is initialized to zero and then updated following Equation \eqref{eq:aux-loss-free}, with the direction reversed since we now have an argmin rather than an argtop. The update rule, simply put, is: if $j$ appears too frequently, increase the corresponding $b_j$; otherwise, decrease it, until a bijection is achieved.
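As a toy demonstration (function name, step size, and iteration cap are my own choices), the following NumPy sketch drives the biased argmin toward a bijection:

```python
import numpy as np

def bias_assign(c, alpha=0.01, steps=10000):
    """Toy demo: drive f(i) = argmin_j (c[i,j] + b[j]) toward a
    bijection. Because argmin favors LOW values, an over-chosen
    column j gets b[j] raised (sign flipped relative to the
    argtop/argmax case) and an unchosen column gets b[j] lowered."""
    n = c.shape[0]
    b = np.zeros(n)
    for _ in range(steps):
        f = np.argmin(c + b, axis=1)          # current greedy assignment
        counts = np.bincount(f, minlength=n)  # how often each column is chosen
        if np.all(counts == 1):               # bijection reached
            break
        b += alpha * np.sign(counts - 1)      # adjust column biases
    return f, b
```

For example, with $c = \begin{pmatrix}0 & 1\\ 0.2 & 0.3\end{pmatrix}$ both rows initially pick column 0; the bias then raises $b_0$ and lowers $b_1$ until the second row moves over, yielding the bijection $f = (0, 1)$.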

Summary

This article introduced the Loss-Free method for the MoE load balancing problem proposed by DeepSeek. Its core lies in achieving load balance by introducing a simple bias term. This post further explored its connection with Aux Loss and its application potential in similar mathematical problems.