MoE Grand Tour: 4. More Resources for Difficulty

By 苏剑林 | March 28, 2025

In the previous two articles, we discussed load balancing. In "MoE Grand Tour: 3. A Different Way to Allocate", while introducing the Loss-Free scheme, I left a cliffhanger: the Bias term it introduces has a redundant degree of freedom, which can be used for other interesting things. In this article, we will discuss exactly that.

We know that MoE selects the top $k$ most matching experts for each token for computation, thereby increasing parameter capacity while saving computational costs. However, upon closer reflection, this strategy actually has obvious room for improvement: intuitively, every token has a different level of difficulty. A more reasonable scheme would be to allocate more computational resources to difficult tokens and fewer resources to simple tokens, which might maximize performance under the same limited resources.

The extra degree of freedom in the Bias mentioned earlier happens to provide a simple way to achieve this goal.

Design Philosophy

First, let's review the basic form of MoE: \begin{equation}\boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho}} \rho_i \boldsymbol{e}_i\end{equation} Load imbalance is a common problem in MoE training. To address this, researchers proposed Aux Loss, which we introduced in "MoE Grand Tour: 2. Not Worried about Scarcity but Inequality". Additionally, in "MoE Grand Tour: 3. A Different Way to Allocate", we introduced the Loss-Free scheme proposed by DeepSeek, which modifies MoE to: \begin{equation}\boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho} + \boldsymbol{b}} \rho_i \boldsymbol{e}_i\end{equation} Then, by adjusting the newly introduced Bias term $\boldsymbol{b}$, load balancing is achieved. To allow each token to select a dynamic number of experts, the approach I propose is to slightly modify the Loss-Free form: \begin{equation}\boldsymbol{y} = \sum_{i\in \mathop{\text{argwhere}} \boldsymbol{\rho} + \boldsymbol{b} > 0} \rho_i \boldsymbol{e}_i\end{equation} That is, any expert satisfying $\rho_i + b_i > 0$ is selected. Thus, the number of experts selected for each token is naturally dynamic, and the need for sorting is eliminated, making it simpler in some ways.

Optimization Goals

The optimization goal for $\boldsymbol{b}$ consists of two parts: first, like Loss-Free, it must achieve load balancing; second, it must control the average number of experts selected per token to be $k$, which we can call budget control. Otherwise, simply setting $b_i = \infty$ would select all experts, which is not what we want.

Load balancing still follows the Loss-Free training method. Define the notation $\boldsymbol{f} = [f_1, f_2, \cdots, f_n]$: \begin{equation}f_i = \left\{\begin{aligned}1, \quad \rho_i + b_i > 0 \\ 0, \quad \rho_i + b_i \leq 0\end{aligned}\right.\end{equation} Then let $\tilde{\boldsymbol{F}}=\mathbb{E}[\boldsymbol{f}]$. Consequently, $\boldsymbol{F} = \tilde{\boldsymbol{F}}/|\tilde{\boldsymbol{F}}|$ is the current expert distribution, where $|\tilde{\boldsymbol{F}}|$ is the sum of the components of $\tilde{\boldsymbol{F}}$. The update formula proposed by Loss-Free is: \begin{equation}\boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q})\label{eq:aux-loss-free}\end{equation} where $\boldsymbol{Q}=(1/n, 1/n, \cdots, 1/n)$ is the target uniform distribution. As mentioned several times, $\boldsymbol{b}$ has a redundant degree of freedom, evidenced by the fact that adding the same constant to all components of $\boldsymbol{b}$ does not change the sorting order. In this way, we can change the update rule \eqref{eq:aux-loss-free} to: \begin{equation}\boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \left[\mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q}) - \overline{\mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q})}\right]\label{eq:aux-loss-free-2}\end{equation} Here, a bar over a vector represents the mean of all its components, which is a scalar. A vector minus a scalar means subtracting the scalar from each component. Thus, the resulting $\boldsymbol{b}$ will necessarily satisfy $\overline{\boldsymbol{b}}=0$, without changing the effect of load balancing. Consequently, we can leave this degree of freedom of $\overline{\boldsymbol{b}}$ for budget control.

How to understand this? Obviously, if we add the same positive number to all $b_i$, the probability of satisfying $\rho_i + b_i > 0$ will increase, thereby increasing the total budget. So the approach is simple: first calculate the current average budget, which happens to be $|\tilde{\boldsymbol{F}}|$. If it is greater than $k$, decrease $\boldsymbol{b}$ slightly; otherwise, increase it. Integrated into formula \eqref{eq:aux-loss-free-2}, we get: \begin{equation}\boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \left[\mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q}) - \overline{\mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q})} + \mathop{\text{sign}}(|\tilde{\boldsymbol{F}}|- k)\right]\label{eq:aux-loss-free-3}\end{equation} If we only want to ensure the budget does not exceed $k$, rather than necessarily equaling $k$, we can change the last term to only subtract when $|\tilde{\boldsymbol{F}}| > k$: \begin{equation}\boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \left[\mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q}) - \overline{\mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q})} + \mathop{\text{sign}}(\max(|\tilde{\boldsymbol{F}}|- k, 0))\right]\label{eq:aux-loss-free-4}\end{equation}

Attempting Simplification

Readers might have noticed that I intentionally separated "load balancing" and "budget control" into two terms in formula \eqref{eq:aux-loss-free-3}. However, if we look closer, we might find that $\mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q})$ actually already contains information about both balancing and budget. If it's unbalanced, different components have different signs; if the budget is wrong, more components will be positive (or negative).

So, can we just use formula \eqref{eq:aux-loss-free} directly? Let's check. If expert $i$ is used too much ($F_i > Q_i = 1/n$), $b_i$ will decrease, reducing its probability of being selected. This is balancing. If the average number of experts selected is too large, the average value of $F_i$ will naturally increase, so on average, $b_i$ will decrease, which in turn reduces the total budget. This is budget control.

So it seems formula \eqref{eq:aux-loss-free} can already achieve both goals. However, the update rate for balancing and budget control is the same. Considering that if the budget isn't right, "balancing" is somewhat meaningless, it might be better to adjust the budget more aggressively. Therefore, a more general simplified version is: \begin{equation}\boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \left[\mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q}) + \lambda \mathop{\text{sign}}(|\tilde{\boldsymbol{F}}|- k)\right]\label{eq:aux-loss-free-5}\end{equation} where $\lambda \geq 0$ is a hyperparameter. When $\lambda=0$, it returns to formula \eqref{eq:aux-loss-free}. Formula \eqref{eq:aux-loss-free-5} basically covers all needs.

Initial Method

The "Loss-Free with dynamic expert count" scheme mentioned in this article relies on the absolute value of $\rho_i + b_i > 0$ as the threshold. In the beginning, because the Router hasn't been trained yet, the distribution of $\boldsymbol{\rho}$ is very messy. If $\boldsymbol{b}$ is initialized to 0, it's very likely that zero or almost all experts are selected. Therefore, it's necessary to select a suitable initial value for $\boldsymbol{b}$ before training starts, ensuring that the average budget is around $k$ and the load is roughly balanced.

Balancing is easy: just set $b_i$ to be the same. The focus is on finding a scalar $b$ such that the average budget is $k$. This can be done by sampling some data and using a binary search: \begin{equation}b = \mathop{\text{argmin}}_b \left|\mathbb{E}[\text{number of } i \text{ such that } \rho_i + b > 0] - k\right|\end{equation} A simple reference code for implementation is as follows:

import torch

def b_init(n_experts, k, batch_size, eps=1e-4):
    rho = torch.randn(batch_size, n_experts).sigmoid()
    b1, b2 = -1.0, 0.0
    for _ in range(20):
        b = (b1 + b2) / 2
        if (rho + b > 0).sum(1).mean() if -eps < (rho + b > 0).sum(1).mean() - k < eps:
            break
        if (rho + b > 0).sum(1).mean() > k:
            b2 = b
        else:
            b1 = b
    return b

b_init(32, 4, 1024, 6e-3)

The code assumes Sigmoid activation, so the search range is $[-1, 0]$. If using other activation functions, please adjust accordingly. However, the suggestion here is the same as in "MoE Grand Tour: 3. A Different Way to Allocate": the $\boldsymbol{\rho}$ added to $\boldsymbol{b}$ can consistently use Sigmoid activation, while the $\boldsymbol{\rho}$ multiplied by Experts can use other activations.

Related Work

Before this article, there have been some works attempting dynamic expert count designs for MoE. Below is a list of some works I found, with some brief critiques from a personal aesthetic perspective.

Elementary approaches include AdaMoE and MoE++. They mix some low-compute experts into the expert pool, such as empty experts, identity experts, or constant experts, while encouraging load balancing. When a token selects these simple experts, it is equivalent to selecting fewer standard experts, thus indirectly achieving dynamic counts. The advantage is that this can reuse existing Top-$k$ MoE infrastructure, but it also lacks some flexibility.

Another simple idea is changing Top-$k$ selection to Top-$p$, derived from "Harder Tasks Need More Experts: Dynamic Routing in MoE Models". This transition seems natural but has many problems, such as the inability to accurately control the average budget (because the Top-$p$ proportion becomes very large when $\boldsymbol{\rho}$ approaches a uniform distribution). Thus, the original paper added an entropy loss to make $\boldsymbol{\rho}$ stay away from a uniform distribution. Overall, the problems it introduces feel more significant than the benefits.

A unique approach is Ada-K Routing, which adds a module to predict the number of experts to activate and uses Reinforcement Learning (RL) for training. This is theoretically sound, but introducting RL undoubtedly increases training complexity. DA-MoE uses Attention scores to identify important tokens and allocate more experts to them, but this feels less fundamental, because MoE isn't in principle limited to FFN layers; if applied to Attention itself, there would be no Attention scores to use.

The work most similar in form to the method in this article might be ReMoE. It also bases expert selection on a zero threshold but chooses an Aux Loss approach to achieve load balancing and budget control, while mixing in manually designed gradients for weighting the Aux Loss. Overall, it feels a bit "hand-crafted." This article, however, extends the Loss-Free philosophy, utilizing the extra degree of freedom in $\boldsymbol{b}$ to regulate this threshold, thereby achieving dynamic expert counts with minimal changes.

Summary

This article proposes an MoE design that dynamically selects the number of experts. The main idea is to slightly modify the Loss-Free MoE form and then adjust the update rule of the Bias term, using its extra degree of freedom to simultaneously achieve load balancing and budget control.