Cleverly Stopping Gradients: Implementing GAN Models with a Single Loss

By 苏剑林 | February 22, 2019

We know that for ordinary models, we simply build the architecture, define the loss, and hand it to the optimizer for training. GANs are different: they generally involve two different losses that must be optimized alternately. The current mainstream approach is to train the discriminator and the generator at a 1:1 alternating frequency (each updated once per step, with different learning rates if necessary, i.e., TTUR). Alternating optimization means we have to transfer the data twice (from memory to GPU) and run forward and backward propagation twice.

If we could combine these two steps into one single optimization step, it would definitely save time. This is known as the synchronous training of GANs.

(Note: This article is not introducing a new type of GAN, but rather a new way to write GANs. This is a programming problem, not an algorithmic one~)

In TensorFlow

If you are using TensorFlow, implementing synchronous training is not difficult because we have already defined the training ops for the discriminator and the generator (let's assume they are D_solver and G_solver). We can then simply execute:

sess.run([D_solver, G_solver], feed_dict={...})

This works because we can obtain the parameters of the discriminator and the generator separately and directly control what sess.run executes.
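For example, in TensorFlow 1.x the setup might look like the following minimal sketch (the losses d_loss and g_loss, the placeholders x and z, the session sess, and the variable scope names are assumptions, defined elsewhere):

import tensorflow as tf

# Collect each network's variables by scope (scope names are assumed).
d_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='discriminator')
g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='generator')

# Each training op only updates its own variable list.
D_solver = tf.train.AdamOptimizer(2e-4).minimize(d_loss, var_list=d_vars)
G_solver = tf.train.AdamOptimizer(2e-4).minimize(g_loss, var_list=g_vars)

# A single sess.run then performs both updates on the same batch.
sess.run([D_solver, G_solver], feed_dict={x: real_batch, z: noise_batch})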

A More Universal Method

But what if we are using Keras? Keras encapsulates the training workflow, and generally we cannot manipulate it at such a fine granularity. Below, therefore, we introduce a universal trick: define only a single loss, hand it to the optimizer, and GAN training is achieved. Along the way, this trick shows how to manipulate losses more flexibly in order to control gradients.

Discriminator Optimization

Let's take the hinge loss of a GAN as an example. Its form is:

\begin{equation}\begin{aligned}D =& \mathop{\text{argmin}}_D \mathbb{E}_{x\sim p(x)}\big[\max\big(0, 1 + D(x)\big)\big]+\mathbb{E}_{z\sim q(z)}\big[\max\big(0, 1 - D(G(z))\big)\big]\\ G =& \mathop{\text{argmin}}_G \mathbb{E}_{z\sim q(z)}\big[D(G(z))\big] \end{aligned}\end{equation}

Note that $\mathop{\text{argmin}}_D$ implies that $G$ should be fixed, because $G$ itself has optimizable parameters. If it weren't fixed, it would be $\mathop{\text{argmin}}_{D,G}$.

To fix $G$, besides the method of "removing $G$'s parameters from the optimizer," we can also use stop_gradient to manually fix it:

\begin{equation}D,G = \mathop{\text{argmin}}_{D,G} \mathbb{E}_{x\sim p(x)}\big[\max\big(0, 1 + D(x)\big)\big]+\mathbb{E}_{z\sim q(z)}\big[\max\big(0, 1 - D(G_{ng}(z))\big)\big]\label{eq:dg-d}\end{equation}

Here,

\begin{equation}G_{ng}(z)=\text{stop\_gradient}(G(z))\end{equation}

As a result, although Eq. \eqref{eq:dg-d} nominally allows both $D$'s and $G$'s weights to be updated, continually optimizing Eq. \eqref{eq:dg-d} only changes $D$, while $G$ stays fixed. This is because the optimizer is gradient-descent-based and $G$'s gradient has been stopped; equivalently, $G$'s gradient is forced to $0$, so its update is always $0$.
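To make this concrete, here is a minimal sketch of Eq. \eqref{eq:dg-d} with the Keras backend; G and D are assumed to be already-built generator and discriminator models, and x_real and z the real-sample and noise tensors:

from keras import backend as K

x_fake = G(z)                        # generator output
x_fake_ng = K.stop_gradient(x_fake)  # no gradient flows back into G from here

# Discriminator hinge loss with the generator output gradient-stopped:
# only D's weights receive gradients from this term.
d_loss = K.mean(K.relu(1. + D(x_real))) + K.mean(K.relu(1. - D(x_fake_ng)))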

Generator Optimization

Now that the optimization of $D$ is solved, what about $G$? stop_gradient makes it easy to fix an inner part (like $G(z)$ inside $D(G(z))$), but optimizing $G$ requires fixing the outer $D$, and there is no direct function for that. However, do not be discouraged; we can use a mathematical trick to get around it.

First, we must be clear: we want the gradient of $G$ inside $D(G(z))$, and we do not want the gradient of $D$. If we directly take the gradient of $D(G(z))$, we get gradients for both $D$ and $G$. What if we take the gradient of $D(G_{ng}(z))$? We only get the gradient of $D$, because $G$ has been stopped. Now, here is the important part: by subtracting these two, don't we get purely the gradient of $G$!

\begin{equation}D,G = \mathop{\text{argmin}}_{D,G} \mathbb{E}_{z\sim q(z)}\big[D(G(z)) - D(G_{ng}(z))\big]\label{eq:dg-g}\end{equation}

Now, by optimizing Eq. \eqref{eq:dg-g}, $D$ will not change, but $G$ will change.

Note: This formulation should not be understood through the chain rule, but through the inherent meaning of stop_gradient itself. For $L(D,G)$, regardless of the relationship between $G$ and $D$, the full gradient is $(\nabla_D L, \nabla_G L)$. When $G$'s gradient is stopped, it's as if $G$'s gradient is forced to $0$; that is, the gradient of $L(D,G_{ng})$ is actually $(\nabla_D L, 0)$. Therefore, the gradient of $L(D,G) - L(D,G_{ng})$ is $(\nabla_D L, \nabla_G L) - (\nabla_D L, 0) = (0, \nabla_G L)$.

It is worth mentioning that the value of this expression is identically $0$: the two terms are numerically equal, so their difference is naturally $0$. Its gradient, however, is not. In other words, this is a loss that is identically zero but whose gradient is not identically zero.
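Continuing the same sketch as above, the generator objective of Eq. \eqref{eq:dg-g} is a one-liner; its value is always zero, but its gradient flows only into G:

# Value is always 0 (the two terms are numerically equal),
# yet the gradient w.r.t. G's weights equals that of D(G(z)).
g_loss = K.mean(D(x_fake) - D(x_fake_ng))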

Synthesizing a Single Loss

Alright, now Eq. \eqref{eq:dg-d} and Eq. \eqref{eq:dg-g} both allow $D$ and $G$ to be optimized simultaneously, and both are $\text{argmin}$ problems. Therefore, we can merge these two steps into a single loss:

\begin{equation}\begin{aligned}D,G = \mathop{\text{argmin}}_{D,G}&\,\mathbb{E}_{x\sim p(x)}\big[\max\big(0, 1 + D(x)\big)\big]+\mathbb{E}_{z\sim q(z)}\big[\max\big(0, 1 - D(G_{ng}(z))\big)\big]\\ &\, + \lambda\, \mathbb{E}_{z\sim q(z)}\big[D(G(z)) - D(G_{ng}(z))\big]\label{eq:dg-dg}\end{aligned}\end{equation}

By writing out this loss, we can simultaneously complete the optimization of the discriminator and the generator without alternating training. The effect is basically equivalent to 1:1 alternating training. The role of introducing $\lambda$ is equivalent to making the ratio of the learning rates between the discriminator and the generator $1:\lambda$.
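For illustration, here is a minimal sketch of how Eq. \eqref{eq:dg-dg} might be assembled into a single trainable Keras model. The image and noise shapes, the prebuilt G and D models, and the optimizer settings are assumptions; the author's full reference implementation is linked below.

from keras import backend as K
from keras.layers import Input, Lambda
from keras.models import Model

lam = 1.0  # plays the role of lambda: the effective learning-rate ratio for G

x_real = Input(shape=(64, 64, 3))  # assumed image shape
z = Input(shape=(128,))            # assumed noise dimension

x_fake = G(z)                                # G, D: prebuilt Keras models
x_fake_ng = Lambda(K.stop_gradient)(x_fake)  # gradient-stopped copy of G(z)

d_real = D(x_real)
d_fake = D(x_fake)
d_fake_ng = D(x_fake_ng)

train_model = Model([x_real, z], [d_real, d_fake, d_fake_ng])

# Single combined loss: hinge loss for D plus the zero-valued term for G.
d_loss = K.mean(K.relu(1. + d_real)) + K.mean(K.relu(1. - d_fake_ng))
g_loss = K.mean(d_fake - d_fake_ng)
train_model.add_loss(d_loss + lam * g_loss)
train_model.compile(optimizer='adam')

# One call now updates the discriminator and the generator together, e.g.:
# train_model.train_on_batch([real_batch, noise_batch], None)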

Reference Code: https://github.com/bojone/gan/blob/master/gan_one_step_with_hinge_loss.py

Summary

This article introduced a small trick for implementing GANs, which allows us to write only a single model and a single loss to achieve GAN training. It essentially uses stop_gradient to manually control gradients, a technique that could also be useful in other tasks.

So, from now on, I will use this method to write GANs—it saves effort and time. Of course, theoretically, this method might consume more GPU memory, which can be seen as sacrificing space for time.