By 苏剑林 | August 16, 2025
Today's post is something light, based on an identity I noticed over the past couple of days. The identity itself is quite simple, but at first glance it feels a bit unexpected, so I am recording it here.
We know that $\relu(x) = \max(x, 0)$, and it is easy to prove the following identity:
\begin{equation}x = \relu(x) - \relu(-x)\end{equation}If $x$ is a vector, this equation is even more intuitive: $\relu(x)$ extracts the positive components of $x$, and $-\relu(-x)$ extracts the negative components of $x$; adding the two together yields the original vector.
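A minimal NumPy sketch (illustrative only, not code from this post) makes the component-wise picture concrete:

```python
# Check x = relu(x) - relu(-x) on a vector with mixed signs.
import numpy as np

def relu(x):
    return np.maximum(x, 0)

x = np.array([1.5, -2.0, 0.0, 3.0, -0.5])
print(relu(x))    # [1.5, 0., 0., 3., 0.]    -> positive components
print(-relu(-x))  # [0., -2., 0., 0., -0.5]  -> negative components
assert np.allclose(relu(x) - relu(-x), x)
```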
The next question is: do activation functions like GeLU and Swish satisfy similar identities? At first glance, it seems they do not, but in fact, they do! We even have a more general conclusion:
Let $\phi(x)$ be any odd function, and let $f(x)=\frac{1}{2}(\phi(x) + 1)x$. Then the following always holds:
\begin{equation}x = f(x) - f(-x)\end{equation}
Proving this conclusion is also quite easy (expand $f(x) - f(-x)$ and use $\phi(-x) = -\phi(x)$; the $\phi$ terms cancel, leaving exactly $x$), so I won't elaborate further here. For Swish, i.e. $f(x) = x\,\mathop{\text{sigmoid}}(x)$, this gives $\phi(x) = \tanh(\frac{x}{2})$, and for GeLU, i.e. $f(x) = x\,\Phi(x)$ with $\Phi$ the standard normal CDF, it gives $\phi(x)=\mathop{\text{erf}}(\frac{x}{\sqrt{2}})$; both are odd functions, so both activations satisfy the same identity.
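To see this concretely, here is a small NumPy/SciPy sketch (illustrative only) that plugs the two choices of $\phi$ above into $f(x)=\frac{1}{2}(\phi(x)+1)x$ and checks $x = f(x) - f(-x)$ numerically:

```python
# Check x = f(x) - f(-x) for f(x) = 0.5*(phi(x) + 1)*x with phi odd,
# using the Swish and GeLU choices of phi given above.
import numpy as np
from scipy.special import erf

def f(x, phi):
    return 0.5 * (phi(x) + 1) * x

x = np.linspace(-3.0, 3.0, 7)
phis = {
    "Swish": lambda t: np.tanh(t / 2),       # phi for Swish: x * sigmoid(x)
    "GeLU":  lambda t: erf(t / np.sqrt(2)),  # phi for GeLU:  x * Phi(x)
}
for name, phi in phis.items():
    assert np.allclose(f(x, phi) - f(-x, phi), x), name
```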
The identity above can be written in matrix form as:
\begin{equation}x = f(x) - f(-x) = f(x[1, -1])\begin{bmatrix}1 \\ -1\end{bmatrix}\end{equation}where $f$ acts elementwise. This suggests that when ReLU, GeLU, Swish, etc., are used as activation functions, a two-layer neural network has the capacity to degenerate exactly into a single linear layer. In other words, such networks can adaptively adjust the model's effective depth, which shares a similar logic with how ResNet works. This might be one of the reasons why these activation functions perform better than traditional Tanh, Sigmoid, etc.
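As an illustration (weights fixed by hand to match the identity above, not a trained network), the following NumPy sketch shows a two-layer block with a width-2 hidden layer and GeLU activation reproducing its input exactly:

```python
# A hand-constructed two-layer block that degenerates into the identity map.
import numpy as np
from scipy.special import erf

def gelu(x):
    # f(x) = 0.5*(erf(x/sqrt(2)) + 1)*x, i.e. x * Phi(x)
    return 0.5 * (erf(x / np.sqrt(2)) + 1) * x

W1 = np.array([[1.0, -1.0]])    # first layer:  x -> [x, -x]
W2 = np.array([[1.0], [-1.0]])  # second layer: [f(x), f(-x)] -> f(x) - f(-x)

x = np.random.randn(4, 1)       # a batch of scalar inputs
y = gelu(x @ W1) @ W2           # two-layer forward pass
assert np.allclose(y, x)        # the nonlinearity has been cancelled out
```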