Narrow Streams of Flow: Invertible ResNet—The Ultimate Brute-Force Aesthetics

By 苏剑林 | March 21, 2019

Today, we introduce a very "brute-force" model: Invertible ResNet. Why can a model be described as "brute-force"? Naturally, it’s because it truly is: it synthesizes many mathematical techniques to forcibly (under certain constraints) transform a conventional ResNet model into an invertible one!

(Image: Comparison between a standard ResNet and an Invertible ResNet. The Invertible ResNet allows lossless, reversible information flow, whereas the standard ResNet "collapses" at certain points.)

The model comes from the paper "Invertible Residual Networks", which was previously reported by Synced (Machine Heart). In this article, let's take a simple look at its principles and content.

Bit by Bit on Invertible Models

Why study Invertible ResNet models? What are the benefits? Has no one studied them before?

Benefits of Invertibility

What does invertibility mean? It means the model is information-lossless, which suggests it could be used to build better classification networks, and it can be used to build generative models trained directly by maximum likelihood. Moreover, thanks to the powerful fitting capability of ResNet, it might perform even better than previous models such as Glow. In short, if a model is invertible, its inverse is cheap to compute, and its fitting capacity is strong, then it has a wide range of applications (classification, density estimation, generation, and so on).

The Invertible ResNet introduced here basically satisfies these requirements: it is relatively simple to invert and essentially does not change the fitting capability of ResNet. Therefore, I believe it deserves the title of a "beautiful" model.

Limitations of Old Models

Research on invertible models has existed for a long time; they are known as "flow-based models." Representative models include NICE, RealNVP, and Glow. I have written articles introducing them, and there are also some autoregressive flow models. Besides generative models, there is also research on using invertible models for classification tasks, with representative works being RevNet and i-RevNet.

The design philosophy of these flow models is basically the same: through clever design, each layer's inverse transformation is simple, and the Jacobian matrix is triangular, making the Jacobian determinant easy to compute. While such models are theoretically elegant and beautiful, there is a serious problem: because the inverse must be simple and the Jacobian determinant easy to compute, the non-linear transformation capability of each layer is very weak. In fact, in models like Glow, only half of the variables are transformed in each layer. Thus, to ensure sufficient fitting capacity, the model must be stacked very deep (for example, for 256x256 face generation, the Glow model stacks about 600 convolutional layers with 200 million parameters), resulting in massive computational costs.

Hard-Clashing with Residual Models

This Invertible ResNet is different from previous flow models. It simply adds some constraints to the ordinary ResNet structure to make the model invertible, essentially retaining the basic structure and most of the fitting capability of ResNet. Consequently, our previous experience in designing ResNets can still be applied, and the model no longer needs to pile up convolutions desperately like Glow.

Of course, this comes at a price. Because there is no special design, we need quite "brute-force" methods to obtain its inverse function and Jacobian determinant. Therefore, this Invertible ResNet is beautiful, but also very brute-force—it can be called "The Ultimate Brute-Force Aesthetics."

The "Three Elements" of Invertibility

The basic module of a ResNet model is:

\begin{equation}y = x + g(x)\triangleq F(x)\label{eq:resnet}\end{equation}

In other words, while we originally wanted to use a neural network to fit $y$, it has now changed to using a neural network to fit $y-x$, where both $x$ and $y$ are vectors (tensors). The benefit of this is that gradients are less likely to vanish, deep networks can be trained, and information can be transmitted through multiple channels. "Invertible" means $x+g(x)$ is a one-to-one mapping, meaning for every $x$ there is only one $y$, and vice versa. In theory, we can solve for the inverse function $x=h(y)$.

We won't go into more background, but one thing should be noted: the general ResNet used for classification allows the dimension to change (for example through $1 \times 1$ convolutions), whereas the ResNet considered here does not change dimensions, meaning $x$ and $y$ have the same size.
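For concreteness, a dimension-preserving residual block of the form $y = x + g(x)$ might look like the following minimal sketch in PyTorch (illustrative layer sizes and names, not from the paper; the constraint that will be imposed on $g$ comes later):

\begin{verbatim}
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A dimension-preserving residual block y = x + g(x).

    Minimal sketch: g is a small fully connected branch with illustrative
    sizes; in Invertible ResNet its weights would additionally be
    spectrally normalized so that Lip(g) < 1 (see below).
    """
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(dim, hidden), nn.ELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return x + self.g(x)
\end{verbatim}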

For a model claiming to be "invertible," it must answer three questions:

  1. When is it invertible?
  2. What is the inverse function?
  3. How is the Jacobian determinant calculated?

From a difficulty perspective, these three questions are progressively more challenging. Of course, if you only care about classification, fulfilling the first point is sufficient; if you care about reconstructing images, you need the second; if you want to use it for maximum likelihood training of generative models like Glow, you need the third.

Next, we will follow the logic of the original paper to solve these three questions (three "main courses"), creating a feast of brute force.

When is it Invertible?

The first course is relatively easy to digest among the three—though only "relatively," as it still utilizes foundational knowledge from functional analysis.

Since $\eqref{eq:resnet}$ is the basic module of ResNet, we only need to ensure each module is invertible. A sufficient condition for $\eqref{eq:resnet}$ to be invertible is:

\begin{equation}\text{Lip}(g) < 1\end{equation}

where

\begin{equation}\text{Lip}(g) \triangleq \max_{x_1\neq x_2}\frac{\Vert g(x_1) - g(x_2)\Vert_2}{\Vert x_1 - x_2\Vert_2}\end{equation}

is the Lipschitz norm of the function $g$. That is to say, if the Lipschitz norm of $g$ is less than 1, $\eqref{eq:resnet}$ is guaranteed to be invertible.
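To see informally why this condition rules out two inputs colliding: if $x_1 \neq x_2$ but $x_1+g(x_1) = x_2+g(x_2)$, then $x_1 - x_2 = g(x_2) - g(x_1)$, and hence

\begin{equation}\Vert x_1 - x_2\Vert_2 = \Vert g(x_2) - g(x_1)\Vert_2 \leq \text{Lip}(g)\,\Vert x_1 - x_2\Vert_2 < \Vert x_1 - x_2\Vert_2\end{equation}

which is a contradiction, so $x+g(x)$ is injective; how to actually recover $x$ from $y$ will be addressed below.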

So when is the Lipschitz norm of $g$ less than 1? Since $g$ is a neural network—whether convolutional or fully connected—it is composed of matrix operations and activation functions. A representative structure is:

\begin{equation}\text{Activation}(Wx+b)\end{equation}

By the chain rule, a sufficient condition for "the Lipschitz norm of $g$ is less than 1" is that "the Lipschitz norm of the activation does not exceed 1" and "the Lipschitz norm of $Wx+b$ is less than 1." Since the activation acts elementwise as a scalar function, "Lipschitz norm not exceeding 1" simply means its derivative does not exceed 1, which the currently common activation functions (sigmoid, tanh, relu, elu, swish, etc.) all satisfy, so we need not worry about it. "The Lipschitz norm of $Wx+b$ is less than 1," in turn, means that the Lipschitz norm of the matrix $W$ is less than 1.

The Lipschitz norm of a matrix $W$ is just its "spectral norm," denoted $\text{Lip}(W)$ or $\Vert W\Vert_2$; we have already discussed spectral norms in "Lipschitz Constraints in Deep Learning: Generalization and Generative Models." Putting the two together, the conclusion is: by applying spectral normalization to every kernel weight $W$ of the model $g$ and then multiplying by a coefficient $c$ between 0 and 1 (i.e., $W \leftarrow c W / \Vert W\Vert_2$), we can make $x+g(x)$ invertible.
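As a rough illustration, the spectral norm can be estimated by power iteration and the weight rescaled toward a target bound $c < 1$. The following is a minimal PyTorch sketch with illustrative names, not the paper's code (frameworks also ship ready-made utilities such as torch.nn.utils.spectral_norm for the normalization step):

\begin{verbatim}
import torch

def spectral_rescale(W, c=0.9, n_iters=5):
    """Rescale a 2D weight matrix so its spectral norm is roughly c < 1.

    Minimal sketch: convolution kernels would first be reshaped to
    (out_channels, -1); c plays the role of the target Lip(g) bound.
    """
    # Power iteration to estimate the largest singular value of W.
    u = torch.randn(W.shape[0])
    for _ in range(n_iters):
        v = torch.mv(W.t(), u)
        v = v / (v.norm() + 1e-12)
        u = torch.mv(W, v)
        u = u / (u.norm() + 1e-12)
    sigma = torch.dot(u, torch.mv(W, v))  # approx. ||W||_2
    return c * W / sigma
\end{verbatim}

In practice this rescaling would be applied to the weights of every layer of $g$ and re-estimated as the weights change during training.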

What is the Inverse Function?

Why does this make the model invertible? The proof answers the second question at the same time: we construct the inverse function directly, and the condition for invertibility emerges from that construction.

Suppose $y=x+g(x)$ is invertible. We want to find a way to solve for the inverse function $x=h(y)$, which is essentially solving a system of nonlinear equations. For simplicity, we consider the iteration:

\begin{equation}x_{n+1}=y - g(x_n)\label{eq:nidiedai}\end{equation}

Obviously, the iterative sequence $\{x_n\}$ is related to $y$. Once $\{x_n\}$ converges to a fixed function

\begin{equation}\lim_{n\to\infty} x_n(y) = \hat{h}(y)\end{equation}

we have $\hat{h}(y)=y-g\left(\hat{h}(y)\right)$, which means $\hat{h}(y)$ is the $x=h(y)$ we wish to find.

In other words, if the iteration $\eqref{eq:nidiedai}$ converges, the result is the inverse function of $x+g(x)$. So we just need to figure out when $\eqref{eq:nidiedai}$ converges. This is where the condition $\text{Lip}(g) < 1$ comes in. We have:

\begin{equation}\forall x_1,x_2,\quad\Vert g(x_1) - g(x_2)\Vert_2\leq \text{Lip}(g)\Vert x_1 - x_2\Vert_2\end{equation}

Thus:

\begin{equation}\begin{aligned}\Vert x_{n+1} - x_{n}\Vert_2&=\Vert g(x_{n}) - g(x_{n-1})\Vert_2\\ &\leq \text{Lip}(g)\Vert x_{n} - x_{n-1}\Vert_2\\ & = \text{Lip}(g)\Vert g(x_{n-1}) - g(x_{n-2})\Vert_2\\ &\leq \text{Lip}(g)^2\Vert x_{n-1} - x_{n-2}\Vert_2\\ &\dots\\ &\leq \text{Lip}(g)^n\Vert x_{1} - x_{0}\Vert_2\\ \end{aligned}\end{equation}

It can be seen that a sufficient condition for $\Vert x_{n+1} - x_{n}\Vert_2 \to 0$ is $\text{Lip}(g) < 1$.

Note: Merely stating $\Vert x_{n+1} - x_{n}\Vert_2 \to 0$ does not prove that the sequence $\{x_n\}$ converges. For example, the sequence $\{\ln n\}$ satisfies this condition but diverges. Therefore, to prove that $\{x_n\}$ converges when $\text{Lip}(g) < 1$, we need to do a bit more work. Since the argument is rather mathematical, it is given below as a remark that uninterested readers can skip.

For any positive integer $k$, we consider $\Vert x_{n+k} - x_{n}\Vert_2$:

\begin{equation}\begin{aligned}\Vert x_{n+k} - x_{n}\Vert_2&\leq\Vert x_{n+k} - x_{n+k-1}\Vert_2+\dots+\Vert x_{n+2} - x_{n+1}\Vert_2+\Vert x_{n+1} - x_{n}\Vert_2\\ &\leq \left(\text{Lip}(g)^{n+k-1}+\dots+\text{Lip}(g)^{n+1}+\text{Lip}(g)^{n}\right)\Vert x_{1} - x_{0}\Vert_2\\ & = \frac{1 - \text{Lip}(g)^k}{1 - \text{Lip}(g)}\cdot\text{Lip}(g)^{n}\Vert x_{1} - x_{0}\Vert_2\\ & \leq \frac{\text{Lip}(g)^n}{1 - \text{Lip}(g)}\Vert x_{1} - x_{0}\Vert_2 \end{aligned}\label{eq:cauchy}\end{equation}

We see that we have obtained an upper bound for $\Vert x_{n+k} - x_{n}\Vert_2$ that only depends on $n$ and can be arbitrarily small. That is, for any $\varepsilon > 0$, we can find an $n$ such that for any positive integer $k$, $\Vert x_{n+k} - x_{n}\Vert_2 < \varepsilon$. Such a sequence is called a Cauchy sequence, and it necessarily converges. Thus, we have finally proven the convergence of $\{x_n\}$.

Incidentally, in $\eqref{eq:cauchy}$, letting $k \to \infty$, we get:

\begin{equation}\left\Vert x^* - x_{n}\right\Vert_2 \leq \frac{\text{Lip}(g)^n}{1 - \text{Lip}(g)}\Vert x_{1} - x_{0}\Vert_2\end{equation}

This means the convergence rate of the iteration is governed by $\text{Lip}(g)^n$ (here $x^*$ denotes the limit of the sequence). Naturally, the smaller $\text{Lip}(g)$, the faster the convergence; however, the smaller $\text{Lip}(g)$, the weaker the model's fitting capacity. In the original paper, $\text{Lip}(g)$ is taken between 0.5 and 0.9.

To put it grandly, this is essentially the "Banach Fixed Point Theorem" from functional analysis, also known as the "Contraction Mapping Theorem" (because $\text{Lip}(g)$ is less than 1, $g$ is called a contraction mapping).

So we have now answered why $\text{Lip}(g) < 1$ makes $x+g(x)$ invertible, and we have also obtained a way to compute the inverse function: after the normalization that makes $x+g(x)$ invertible, its inverse is the fixed point of the iteration $x_{n+1}=y - g(x_n)$ in $\eqref{eq:nidiedai}$. In numerical computation, one simply iterates for a number of steps until the required accuracy is reached.
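In code, the inversion is just this fixed-point loop. Below is a minimal PyTorch sketch (illustrative names and stopping criteria, not the authors' implementation), where g is assumed to be any map with $\text{Lip}(g) < 1$, such as a spectrally normalized residual branch:

\begin{verbatim}
import torch

def invert_residual_block(g, y, n_iters=100, tol=1e-6):
    """Invert y = x + g(x) by the fixed-point iteration x <- y - g(x).

    Minimal sketch: assumes Lip(g) < 1 so that the iteration converges.
    """
    x = y.clone()  # x_0 = y is a convenient starting point
    for _ in range(n_iters):
        x_next = y - g(x)
        if (x_next - x).norm() < tol:
            return x_next
        x = x_next
    return x
\end{verbatim}

For example, with g = lambda x: 0.5 * torch.tanh(x), whose Lipschitz constant is 0.5, a handful of iterations already recovers x from y = x + g(x) to high precision.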

Finally, we have finished the second course.

How to Calculate the Jacobian Determinant?

Now we arrive at the most "hardcore" of the three questions: How to calculate the Jacobian determinant? To solve it, the authors synthesized mathematical tools from analysis, matrix theory, probability theory, and statistical sampling. It is the pinnacle of "brute force" and the "hardest course."

First, why calculate the Jacobian determinant? As mentioned earlier, this is only necessary for generative models. For details, please refer to this site's earliest introduction to flow models: "Narrow Streams of Flow: Basic Concepts and Implementation of NICE." Next, we know the Jacobian determinant is the determinant of the Jacobian matrix, so we need to calculate the Jacobian matrix:

\begin{equation}J_F\triangleq \frac{\partial}{\partial x}(x+g(x))= I + \frac{\partial g}{\partial x}\triangleq I + J_g\end{equation}

A quick reminder: although I've been lazy and not used bold type, $g$ outputs a vector and $x$ is a vector. $\partial g / \partial x$ means taking the partial derivative of every output component with respect to every input component, which gives a matrix (the Jacobian matrix).

Then, the Jacobian determinant is $\det(J_F)=\det (I+J_g)$. However, in practice for generative models, what we really need is the "logarithm of the absolute value of the Jacobian determinant," i.e.,

\begin{equation}\ln |\det(J_F)| = \ln |\det(I + J_g)|\equiv \ln \det(I + J_g)\end{equation}

The last identity holds because $\det(I + J_g)$ is guaranteed to be positive. This is provable but a detail; we won't dwell on it. Readers can check the references provided by the authors.

What then? Calculate the Jacobian determinant directly by definition? No, the computational cost would be too high, and calculating the derivative of the determinant during backpropagation would be even more complex. The authors came up with a tedious but effective solution using the identity (refer to "Appreciation of the identity det(exp(A)) = exp(Tr(A))"):

\begin{equation}\ln\det(\boldsymbol{B}) = \text{Tr}(\ln (\boldsymbol{B}))\end{equation}

We get:

\begin{equation}\ln\det(I + J_g) = \text{Tr}(\ln (I+J_g))\end{equation}

If we can find $\ln (I+J_g)$, we can just take the trace (the sum of diagonal elements). How to find $\ln (I+J_g)$? Referencing the same article, we use a brute-force power series expansion:

\begin{equation}\ln (I + J_g) = \sum_{n=1}^{\infty}(-1)^{n-1}\frac{J_g^n}{n}\label{eq:duishujishu}\end{equation}

Note that the condition for this series to converge is $\Vert J_g\Vert_2 < 1$, which corresponds exactly to $\text{Lip}(g) < 1$—the very constraint of Invertible ResNet. It is perfectly consistent.

Now $\ln (I + J_g)$ has become an infinite series. If we truncate it after $N$ terms, the error is of order $\text{Lip}(g)^N$, so the number of terms needed depends on $\text{Lip}(g)$. Thus, we can write:

\begin{equation}\text{Tr}(\ln (I + J_g)) = \sum_{n=1}^{N}(-1)^{n-1}\frac{\text{Tr}(J_g^n)}{n}+\mathcal{O}\left(\text{Lip}(g)^N\right)\label{eq:det2tr}\end{equation}
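As a small sanity check of this truncation (a numpy sketch with illustrative values, not the paper's code), one can compare the truncated series against the exact log-determinant for a random matrix whose spectral norm is below 1; forming $J^n$ explicitly is only feasible here because the matrix is tiny:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d = 5
J = rng.standard_normal((d, d))
J *= 0.5 / np.linalg.norm(J, 2)      # rescale so ||J||_2 = 0.5 < 1

exact = np.log(np.linalg.det(np.eye(d) + J))

approx, Jn = 0.0, np.eye(d)
for n in range(1, 11):               # truncate at N = 10 terms
    Jn = Jn @ J                      # J^n
    approx += (-1) ** (n - 1) * np.trace(Jn) / n

print(exact, approx)                 # agree up to an error of order 0.5^10
\end{verbatim}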

Is the problem solved? Not quite: the formula above requires $J_g^n$, the $n$-th power of a matrix (just think of the cost of even a couple of matrix multiplications). So the authors thought: since the analytical tools are exhausted, let's turn to probability and statistics.

Suppose $p(u)$ is a multivariate distribution with a mean of 0 and an identity covariance matrix (standard normal distribution fits this). For any matrix $A$, we have:

\begin{equation}\text{Tr}(A)=\mathbb{E}_{u\sim p(u)}\big[u^{\top}Au\big]\end{equation}

Using the "zero mean, identity covariance" property, this can be proven directly by definition without much difficulty. Then, the authors proposed a method that seems "brazen yet reasonable": for each iteration, I will randomly pick only one vector $u$ from $p(u)$ and treat $u^{\top}Au$ as the $\text{Tr}(A)$, i.e.,

\begin{equation}\text{Tr}(\ln (I + J_g)) \approx \sum_{n=1}^{N}(-1)^{n-1}\frac{u^{\top} J_g^n u}{n},\quad u\sim p(u)\label{eq:tr-caiyang}\end{equation}
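Before asking whether a single sample can possibly suffice, here is a quick numpy check (illustrative values) that the trace identity above does hold when averaged over many probe vectors:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
u = rng.standard_normal((100000, 6))   # many standard-normal probes
estimate = np.mean(np.einsum('bi,ij,bj->b', u, A, u))
print(np.trace(A), estimate)           # approximately equal
\end{verbatim}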

Readers might object: shouldn't we average over many vectors? Is randomly picking just one okay? It actually is, for the following reasons:

  1. Our optimization is based on Stochastic Gradient Descent (SGD), which inherently involves noise/error. Picking just one $u$ also introduces error, and by picking different $u_1, u_2$ at each iteration, the errors cancel out to some extent.
  2. More importantly, we are calculating the log-determinant of the Jacobian merely as an additional loss to prevent the model from collapsing. Simply put, it can be viewed as a regularization term, and since it is regularization, a bit of error is acceptable.

Thus, the calculation of $\text{Tr}(J_g^n)$ was solved by the authors with this crude (yet effective) scheme. Note that:

\begin{equation}u^{\top} J_g^n u=u^{\top} J_g(\dots(J_g(J_g u)))\end{equation}

Therefore, there is no need to compute $J_g^n$ explicitly; each step only requires the product of a matrix and a vector, and the intermediate result from step $n$ can be reused at step $n+1$. Consequently, the computational cost is greatly reduced.

In summary: by expanding the logarithm of the Jacobian matrix as a power series $\eqref{eq:duishujishu}$, converting the determinant into a trace $\eqref{eq:det2tr}$, and estimating the trace by probabilistic sampling $\eqref{eq:tr-caiyang}$, the Jacobian determinant can be computed efficiently.
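Putting the three pieces together, here is a minimal PyTorch sketch of the single-sample estimator $\eqref{eq:tr-caiyang}$ (illustrative names and hyperparameters, not the authors' released code). Autograd vector-Jacobian products are used so that $J_g^n$ is never formed explicitly:

\begin{verbatim}
import torch

def logdet_estimator(g, x, n_terms=8):
    """Single-sample estimate of ln det(I + J_g) at x.

    Minimal sketch: g is the residual branch with Lip(g) < 1, x a single
    (batch-free) input vector, n_terms the truncation order N.
    """
    x = x.detach().requires_grad_(True)  # treat x as a fixed input here
    gx = g(x)
    u = torch.randn_like(x)   # one probe vector u ~ N(0, I) per call
    w = u                     # w will hold u^T J_g^n (as a 1D tensor)
    logdet = 0.0
    for n in range(1, n_terms + 1):
        # Vector-Jacobian product: w <- w^T J_g, computed by autograd.
        (w,) = torch.autograd.grad(gx, x, grad_outputs=w,
                                   retain_graph=True, create_graph=True)
        logdet = logdet + (-1) ** (n - 1) * torch.dot(w, u) / n
    return logdet
\end{verbatim}

In maximum-likelihood training, an estimate like this would supply the $\ln|\det J_F|$ term of the objective for each residual block; since only matrix-vector products appear, each term costs roughly one extra backward pass through $g$.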

Overview of Experimental Results

I was initially drawn in by the beautiful concept of an "Invertible ResNet," but by this point I have to admit I'm a little intimidated: it truly lives up to the name of "hard-clashing with ResNet." I had originally wanted to implement an MNIST generation example for fun, but after seeing how many techniques are involved and how "brute-force" it all is, I gave up.

So, let's look at the experimental results from the original paper.

Toy Dataset

First is a synthetic toy dataset: points with certain patterns are constructed, and a generative model is used to fit their distribution (GAN papers often run such experiments). As seen in the figure below, Invertible ResNet performs much better than Glow. Ultimately, I believe this is because Invertible ResNet is a very symmetric model without much built-in bias, whereas Glow is biased: it requires us to shuffle the inputs and split them in half, and the two halves are then treated differently, which introduces asymmetry.

(Image: Invertible ResNet Experiment: Toy Dataset)

Classification Tasks

As mentioned at the start, since it is a variant of ResNet, its basic use is classification. The table below shows that using Invertible ResNet for classification yields very good results; the existence of Lipschitz constraints does not significantly impact classification performance (where $c$ in the table is $\text{Lip}(g)$).

(Table: Invertible ResNet Experiment: Classification Performance)

Generative Model Experiments

Flow-based models still lag far behind GANs at generating complex images, but quantitative comparisons among flow models are still meaningful. The figure below shows that Invertible ResNet also performs well as a flow-based generative model.

(Image: Invertible ResNet Experiment: Generative Model Performance)

Time to Wrap Up

Forcing a clash with ResNet like this is truly a laborious task. Just interpreting it has made me tired; I truly admire the authors' mathematical prowess. Of course, the authors succeeded in the end, and the joy of success must be very fulfilling. Overall, the entire work is brute-force, but upon careful savoring, there is no sense of disharmony; rather, there is a natural sense of beauty within, not just a simple piling of mathematical formulas.

One issue is that the whole "clashing" process is fairly complex, so widespread adoption will require good encapsulation; otherwise it is likely to deter people. Another issue is that flow models cannot reduce dimensionality if they are to remain invertible, and without dimensionality reduction a high computational cost is inevitable; this is an inherent tension.

An interesting thought: for cases with dimension reduction, could we create a model similar to the "pseudoinverse" of a matrix to achieve an effect similar to Invertible ResNet? Non-square matrices can also have determinants (for example, "Revisiting the Determinant of Non-Square Matrices"), so dimension-reducing transformations should also be able to produce a log-Jacobian determinant. It seems many aspects could be generalized.

Where will the flow model series go next? Let’s wait and see.


Citation:

\begin{verbatim}
@online{kexuefm-6482,
        author={苏剑林},
        year={2019},
        month={Mar},
        url={\url{https://kexue.fm/archives/6482}},
}
\end{verbatim}