By 苏剑林 | Feb 13, 2020
In the 2018 article "An Introduction to f-GAN: A Production Workshop for GAN Models," I introduced f-GAN and described it as a "production workshop" for GAN models. As the name suggests, this refers to its ability to construct many different forms of GAN models following a fixed procedure. A few days ago, I saw a new paper on arXiv titled "Designing GANs: A Likelihood Ratio Approach" (hereafter referred to as Designing GANs or the original paper). I found it to be doing the same thing as f-GAN, but taking a completely different path (though ultimately arriving at the same destination). The entire paper is quite interesting, so I've decided to share it here.
From "An Introduction to f-GAN: A Production Workshop for GAN Models," we know that the first step of f-GAN is to find a function $f$ that satisfies the following conditions:
1. $f$ is a mapping from the non-negative real numbers to the real numbers ($\mathbb{R}_{\geq 0} \to \mathbb{R}$);
2. $f(1) = 0$;
3. $f$ is a convex function.
Once such a function is found, a probability f-divergence can be constructed. Then, a technique called "convex conjugation" is used to transform the f-divergence into another form (the form with a $\max$, generally called the dual form). By minimizing this divergence, a min-max process is obtained, which gives birth to a GAN model. By the way, f-GAN represents a series of GAN models, but it does not include WGAN. However, the derivation of WGAN actually follows similar steps, except it uses the Wasserstein distance as the probability metric, and the method of transforming Wasserstein distance into its dual form is different. For details, refer to "From Wasserstein Distance and Duality Theory to WGAN."
f-GAN is logically sound, but according to the steps it provides, we always need to find an f-divergence first and then transform it into the dual form. The question is: since we only need its dual form, why not analyze it directly in the dual space? I previously raised this question in the article "How about a GAN that doesn't use L-constraints and won't suffer from vanishing gradients?". At that time, I only discussed the proof of probability divergence in the dual space and did not give a method for constructing probability divergence. Designing GANs precisely supplements this point.
In this section, we will explore the ideas and methods in Designing GANs. Unlike the somewhat redundant textbook-style derivation in the original paper, this article will derive the results of Designing GANs through a step-by-step reverse induction process, which I believe is easier to understand. Interestingly, understanding the entire derivation process actually requires only very basic calculus knowledge.
Let's take a probability divergence called Total Variation as an example to get a preliminary sense of the key points of analyzing and deriving probability divergence in the dual space.
First, we have
\begin{equation}|p-q|=\max_{t\in[-1, 1]} (p - q)t=\max_{t\in[-1, 1]} pt - qt\label{eq:tv-base}\end{equation}So for probability distributions $p(x), q(x)$, we also have
\begin{equation}|p(x)-q(x)|=\max_{t(x)\in[-1, 1]} p(x)t(x) - q(x)t(x)\end{equation}Integrating both sides (let's not worry about the interchangeability of integration and $\max$ for now):
\begin{equation}\begin{aligned}\int|p(x)-q(x)|dx=&\, \max_{t(x)\in[-1, 1]} \int \big[p(x)t(x) - q(x)t(x)\big]dx\\ =&\, \max_{t(x)\in[-1, 1]} \mathbb{E}_{x\sim p(x)}[t(x)] - \mathbb{E}_{x\sim q(x)}[t(x)] \end{aligned}\end{equation}The term $\int|p(x)-q(x)|dx$ is called the Total Variation between two probability distributions. Thus, we have derived the dual form of Total Variation through this process. If we fix $p(x)$ and want $q(x)$ to approach it, we can do so by minimizing the Total Variation over $q(x)$, i.e.,
\begin{equation}\min_{q(x)}\int|p(x)-q(x)|dx = \min_{q(x)}\max_{t(x)\in[-1, 1]} \mathbb{E}_{x\sim p(x)}[t(x)] - \mathbb{E}_{x\sim q(x)}[t(x)]\label{eq:tv-gan}\end{equation}This yields a GAN form.
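As a quick sanity check (my own illustration, not part of the original derivation), the scalar identity \eqref{eq:tv-base} that the whole example rests on can be verified numerically by brute-force maximization over $t$:

```python
import numpy as np

# Numerical check of |p - q| = max_{t in [-1, 1]} (p - q) * t for a few scalar pairs.
ts = np.linspace(-1.0, 1.0, 2001)              # grid over the feasible range of t
for p, q in [(0.3, 0.7), (0.9, 0.1), (0.5, 0.5)]:
    dual = np.max((p - q) * ts)                # brute-force inner maximization
    assert np.isclose(dual, abs(p - q)), (p, q, dual)
print("pointwise dual identity verified on the test pairs")
```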
Reviewing the entire process, putting aside the a priori knowledge of Total Variation, we can find that the core of the entire process is actually Equation \eqref{eq:tv-base}. With Equation \eqref{eq:tv-base}, everything that follows is natural. So what are the characteristics of Equation \eqref{eq:tv-base}? In fact, it can be generalized as:
Goal 1: Find functions $\phi(t), \psi(t)$ and a certain range $\Omega$ such that \begin{equation}d(p, q) = \max_{t\in\Omega} p\phi(t)+q\psi(t)\end{equation} and $d(p,q)\geq 0$, and $d(p,q)=0 \Leftrightarrow p=q$.
With such $\phi(t), \psi(t)$, we can derive a GAN similar to \eqref{eq:tv-gan}:
\begin{equation}\min_{q(x)}\int d(p(x), q(x)) dx = \min_{q(x)}\max_{t(x)\in \Omega} \mathbb{E}_{x\sim p(x)}[\phi(t(x))] + \mathbb{E}_{x\sim q(x)}[\psi(t(x))]\label{eq:gan}\end{equation}Note that in "Goal 1", $p, q$ are just non-negative real numbers, and $\phi(t), \psi(t)$ are just scalar functions, so the entire goal is purely a single-variable function extremum problem, which should be said to be quite simplified. Furthermore, let $r=q/p \in [0, +\infty)$. It can be converted into a simpler "Goal 2":
Goal 2: Find functions $\phi(t), \psi(t)$ and a certain range $\Omega$ such that \begin{equation}d(r) = \max_{t\in\Omega} \phi(t)+r\psi(t)\end{equation} and the minimum value of $d(r)$ is attained at $r=1$.
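To get a feel for Goal 2, here is a small numerical illustration of mine using the Total Variation choice from earlier: taking $\phi(t)=t$, $\psi(t)=-t$, $\Omega=[-1,1]$ gives $d(r)=|1-r|$, whose minimum is indeed attained at $r=1$.

```python
import numpy as np

# Goal 2 for the Total Variation choice phi(t) = t, psi(t) = -t, Omega = [-1, 1]:
# d(r) = max_{t in [-1, 1]} t + r * (-t) = |1 - r|, minimized at r = 1.
ts = np.linspace(-1.0, 1.0, 2001)
rs = np.linspace(0.0, 4.0, 401)
d = np.array([np.max(ts + r * (-ts)) for r in rs])   # inner maximization over t
assert np.allclose(d, np.abs(1 - rs))                # closed form |1 - r|
print("argmin of d(r):", rs[np.argmin(d)])           # prints 1.0
```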
Now let's examine Goal 2. For simplicity, we assume that $\phi(t), \psi(t)$ are smooth functions. Thus, the maximum of $\phi(t)+r\psi(t)$ can be solved by differentiation. In fact, such a design is already representative enough; when some points are not smooth, we can approximate them with smooth functions and then take the limit, such as $\text{sign}(x)=\lim_{k\to+\infty}\tanh(kx)$.
Based on this assumption, to find the maximum of $\phi(t)+r\psi(t)$, we first need to differentiate and set it to 0:
\begin{equation}\phi'(t)+r\psi'(t)=0 \quad\Rightarrow\quad r = -\frac{\phi'(t)}{\psi'(t)} \triangleq \omega^{-1}(t)\label{eq:max0}\end{equation}Here we assume this equation has a unique solution, denoted $t=\omega(r)$, so $-\frac{\phi'(t)}{\psi'(t)}$ is precisely $\omega^{-1}(t)$, the inverse function of $\omega(r)$ (a reminder: the inverse function, not the reciprocal). Moreover, since $r\in[0, +\infty)$, we have $t\in \Omega = \omega([0,+\infty))$; that is, the range $\Omega$ of $t$ is the image of $[0,+\infty)$ under $\omega$. In addition, since $\omega(r)$ is invertible, it must be either strictly monotonically increasing or strictly monotonically decreasing. Without loss of generality, we assume $\omega(r)$ is strictly monotonically increasing, so $\omega^{-1}(t)$ is also strictly monotonically increasing.
A derivative of 0 only indicates that $t$ is an extremum point, but it doesn't guarantee it's a maximum point. Let's determine the conditions for it to be a maximum value. We have:
\begin{equation}\phi'(t)+r\psi'(t) = \big(r-\omega^{-1}(t)\big)\psi'(t)\end{equation}Note that $r-\omega^{-1}(t)$ is strictly monotonically decreasing, so it can only have one zero point, and it starts positive and becomes negative. To make the entire derivative function have this property, we set $\psi'(t) \triangleq \rho(t) > 0$, meaning $\rho(t)$ is always positive. In this way, the derivative of $\phi(t)+r\psi(t)$ is continuous and goes from positive to negative, so $\phi(t)+r\psi(t)$ has only one extremum point, which is the maximum point.
Let's summarize the results so far: we assume $\omega(r)$ is strictly monotonically increasing, and we assume $\rho(t)$ is always positive for $t\in \Omega$, and the following relationships are satisfied:
\begin{equation}\left\{\begin{aligned}\phi'(t)=&\, -\omega^{-1}(t)\rho(t)\\ \psi'(t)=&\, \rho(t)\end{aligned} \right.\end{equation}In this case, $\phi(t)+r\psi(t)$ has a unique stationary point $t=\omega(r)$, and it is the global maximum point. This completes half of Goal 2. Now let us check whether $d(r)=\phi(\omega(r))+r\psi(\omega(r))$ satisfies the remaining requirement, namely that the minimum value of $d(r)$ is attained at $r=1$.
Continuing the differentiation:
\begin{equation}d'(r)=\big[\phi'(\omega(r))+r\psi'(\omega(r))\big]\omega'(r)+\psi(\omega(r))=\psi(\omega(r))\label{eq:d-r}\end{equation}where the second equality holds because the part in square brackets is 0 according to Equation \eqref{eq:max0}. Now only the term $\psi(\omega(r))$ remains, and noting that we assumed $\psi'(t)=\rho(t) > 0$, $\psi(t)$ is strictly monotonically increasing. At the same time, we also assumed $\omega(r)$ is strictly monotonically increasing, so the composite function $\psi(\omega(r))$ is also strictly monotonically increasing.
Now we add another assumption: $\psi(\omega(1))=0$, which means $d'(1)=0$, so $r=1$ is an extremum point of $d(r)$. Since $\psi(\omega(r))$ is continuous and strictly monotonically increasing, $d'(r)$ goes from negative to positive. Therefore, $r=1$ is a local minimum point of $d(r)$, and in fact its global minimum point.
At this point, our derivation is complete. The conditions we obtained are:
Conclusion 1: If $\omega(r)$ is strictly monotonically increasing, $\Omega = \omega([0, +\infty))$, $\rho(t)$ is always positive for $t\in \Omega$, and the following relationships are satisfied: \begin{equation}\left\{\begin{aligned}\phi'(t)=&\, -\omega^{-1}(t)\rho(t)\\ \psi'(t)=&\, \rho(t)\end{aligned} \right.\end{equation} along with the condition $\psi(\omega(1))=0$, the functions $\phi(t), \psi(t)$ satisfying these conditions can be used to construct a GAN model as shown in \eqref{eq:gan}.
For example, let's verify the original version of GAN:
\begin{equation}\begin{aligned}&\, \min_{q(x)}\max_{t(x)} \mathbb{E}_{x\sim p(x)}[\log (1-\sigma(t(x)))] + \mathbb{E}_{x\sim q(x)}[\log \sigma(t(x))]\\ =&\, \min_{q(x)}\max_{t(x)} \mathbb{E}_{x\sim p(x)}\left[-\log \left(1+e^{t(x)}\right)\right] + \mathbb{E}_{x\sim q(x)}\left[-\log \left(1+e^{-t(x)}\right)\right] \end{aligned}\label{eq:ori-gan}\end{equation}where $\sigma(\cdot)$ is the sigmoid activation function. The above is a binary cross-entropy where true samples are labeled 0 and fake samples are labeled 1. The right side of the equality is the simplified result. From this perspective, $\phi(t)=-\log \left(1+e^t\right)$, $\psi(t)=-\log \left(1+e^{-t}\right)$. We make a small adjustment and let $\psi(t)=\log 2-\log \left(1+e^{-t}\right)$; obviously, this does not affect the original optimization problem. Now we have $\rho(t)=\psi'(t)=\frac{1}{1+e^t}$, which is clearly always greater than 0, and $\omega^{-1}(t)=-\phi'(t)/\psi'(t)=e^t$, which means $t=\omega(r)=\log r$. This is also clearly strictly monotonically increasing. Finally, verifying $\psi(\omega(1))$, we find it indeed equals 0.
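This verification is easy to reproduce symbolically. The following sketch (my own, using sympy) recovers $\rho(t)$ and $\omega^{-1}(t)$ and checks that $\psi(\omega(1))=0$ and $d'(1)=0$:

```python
import sympy as sp

# phi(t) = -log(1 + e^t), psi(t) = log 2 - log(1 + e^{-t}) (the shifted version used above)
t = sp.Symbol('t', real=True)
r = sp.Symbol('r', positive=True)
phi = -sp.log(1 + sp.exp(t))
psi = sp.log(2) - sp.log(1 + sp.exp(-t))

rho = sp.simplify(sp.diff(psi, t))                 # equals 1/(1 + e^t), always positive
omega_inv = sp.simplify(-sp.diff(phi, t) / rho)    # equals e^t, hence omega(r) = log r
print(rho, omega_inv)

print(psi.subs(t, sp.log(1)))                      # psi(omega(1)) = 0
d = phi.subs(t, sp.log(r)) + r * psi.subs(t, sp.log(r))
print(sp.simplify(sp.diff(d, r).subs(r, 1)))       # d'(1) = 0, so the minimum is at r = 1
```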
From this example, we can derive two corollaries:
1. The condition $\psi(\omega(1))=0$ is not absolutely necessary, because even if $\psi(\omega(1))=0$ is not satisfied at first, we can add a constant to $\psi(t)$ so that it satisfies $\psi(\omega(1))=0$, and adding a constant will not change the original optimization problem;
2. If we swap the labels, letting real samples be labeled 1 and fake samples 0, then the properties obtained are exactly opposite, meaning the calculated $\rho(t)$ is always negative and $\omega(r)$ is strictly monotonically decreasing. This indicates that Conclusion 1 given in this article is a sufficient condition rather than a necessary condition for forming a GAN.
Different forms of $\rho(t), \omega(r)$ can lead to the same GAN model. For example, if we choose $t=\omega(r)=\frac{r}{r+1}$, then $r=\omega^{-1}(t)=\frac{t}{1-t}$, and if we choose $\rho(t)=\frac{1}{t}$, then:
\begin{equation}\left\{\begin{aligned}\phi'(t)=&\, -\frac{t}{1-t}\times\frac{1}{t}\\ \psi'(t)=&\, \frac{1}{t}\end{aligned} \right.\end{equation}Integrating gives $\phi(t)=\log(1-t)$ and $\psi(t) = \log t$. Furthermore, note that $\Omega = \omega([0, +\infty)) = [0, 1)$, and $\rho(t)=\frac{1}{t}$ excludes $t=0$, so the feasible domain is $(0,1)$ (in fact, for experiments, the boundary points can be ignored). That is, the derived GAN is:
\begin{equation}\min_{q(x)}\max_{t(x)\in (0,1)} \mathbb{E}_{x\sim p(x)}[\log (1-t(x))] + \mathbb{E}_{x\sim q(x)}[\log t(x)]\end{equation}This is equivalent to the original GAN, just without explicitly writing the activation function that makes $t(x)\in (0,1)$.
Let's calculate another example. Choose $t=\omega(r)=\frac{1}{2}\log r$, i.e., $r=\omega^{-1}(t)=e^{2t}$, and choose $\rho(t)=e^{-t}$. We can calculate $\phi(t)=-e^t$ and $\psi(t)=-e^{-t}$. Thus, we get a GAN variant:
\begin{equation}\min_{q(x)}\max_{t(x)} \mathbb{E}_{x\sim p(x)}\left[-e^{t(x)}\right] + \mathbb{E}_{x\sim q(x)}\left[-e^{-t(x)}\right]\end{equation}The original paper also uses the above conclusion to derive many other, more exotic GAN variants. Interested readers can consult the original paper; I won't repeat the derivations here.
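As an aside, here is a minimal PyTorch sketch (my own, not code from the paper) of how the exponential variant above translates into training losses. `D` stands for any discriminator network whose raw scalar output plays the role of $t(x)$, and `real`/`fake` are hypothetical batches of data and generated samples:

```python
import torch

# Losses for the variant above: phi(t) = -e^t, psi(t) = -e^{-t}.
def discriminator_loss(D, real, fake):
    t_real, t_fake = D(real), D(fake.detach())
    # D maximizes E_p[-e^t] + E_q[-e^{-t}]; return the negative so it can be minimized
    return torch.exp(t_real).mean() + torch.exp(-t_fake).mean()

def generator_loss(D, fake):
    # G minimizes the q-dependent part of the same objective, i.e. E_q[-e^{-t}]
    return -torch.exp(-D(fake)).mean()
```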
Is there any connection between the method in this article and the previous f-GAN? Are there any extensions to this method? Here are my own answers.
In the derivation above, the $\max$ step yields $d(r)$, whose minimum point is $r=1$. Reviewing Goal 1 and substituting $d(r)$ back into Equation \eqref{eq:gan}, we find that the optimization objective is actually:
\begin{equation}\int p(x) d\left(\frac{q(x)}{p(x)}\right)dx\label{eq:f-gan}\end{equation}Does it look familiar? Yes, it looks like the definition of f-divergence. And reviewing the derivation at \eqref{eq:d-r}, we know the derivative of $d(r)$ is strictly monotonically increasing, which indicates that $d(r)$ is a convex function. So the above equation is indeed an f-divergence! In other words, although the authors of this paper took a seemingly completely different path, their results can actually be derived from f-GAN results; they didn't bring anything new.
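As a quick check of this claim (again my own sketch): for the original-GAN example above, the recovered $d(r)$ is indeed convex, so the objective really is an f-divergence up to a constant.

```python
import sympy as sp

# d(r) = phi(log r) + r * psi(log r) for the original GAN, with the shifted psi from before
r = sp.Symbol('r', positive=True)
d = -sp.log(1 + r) + r * (sp.log(2) - sp.log(1 + 1 / r))
print(sp.simplify(sp.diff(d, r, 2)))   # 1/(r*(r + 1)) > 0 for r > 0, hence d is convex
```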
So is this paper completely equivalent to f-GAN? Unfortunately, not yet: the results of the original paper are only a subset of f-GAN's. In other words, f-GAN can derive every GAN variant that the original paper can derive, but the original paper may not be able to derive every variant that f-GAN can.
Reviewing the entire derivation process, the core idea is to lift a discrepancy between numbers ("points") directly to a discrepancy between functions, for example generalizing $|p-q|=0 \Leftrightarrow p=q$ at the beginning to $\int |p(x)-q(x)|dx=0 \Leftrightarrow p(x)=q(x)$. Thanks to this idea, the entire derivation can be carried out with single-variable calculus alone. The problem is that a divergence can satisfy $\int d(p(x),q(x))dx=0 \Leftrightarrow p(x)=q(x)$ without the pointwise statement $d(p,q)=0 \Leftrightarrow p=q$ holding. For example, for the KL divergence, $\int p(x)\log \frac{p(x)}{q(x)}dx = 0$ implies $p(x)=q(x)$, but $p\log\frac{p}{q}=0$ does not imply $p=q$. Therefore, the original paper cannot derive, at the very least, the GAN corresponding to the KL divergence.
From this point of view, is the original paper worthless? If we only look at the "product," it is indeed worthless because everything it can produce, f-GAN can produce. But we shouldn't only care about the "product"; sometimes we should also care about the "production process." In fact, I believe the academic value of the original paper lies in providing a reference method for analyzing and discovering GANs directly in the dual space, adding another perspective for our understanding of GAN models.
Whether it is f-GAN or the original paper, the losses of the generator and discriminator of the derived GAN models are in the same form, just in different directions. But in fact, in the GAN variants we currently use more, the losses of the generator and discriminator are not the same. For example, a more commonly used form of the original GAN is:
\begin{equation}\begin{aligned}&\, \max_{t(x)} \mathbb{E}_{x\sim p(x)}[\log (1-\sigma(t(x)))] + \mathbb{E}_{x\sim q(x)}[\log \sigma(t(x))]\\ &\, \min_{q(x)} \mathbb{E}_{x\sim q(x)}[-\log (1-\sigma(t(x)))] \end{aligned}\label{eq:ori-gan-2}\end{equation}Similar examples include LSGAN, Hinge GAN, and so on. So restricting ourselves to variants whose generator and discriminator losses share the same form is still not enough.
In fact, this paper could have gone one step further and obtained more results than f-GAN. Unfortunately, the authors seem to have boxed themselves in and did not notice this. Achieving it is very simple: in the process above, the step $\max_{t\in\Omega} \phi(t)+r\psi(t)$ gives the solution $t = \omega(r)$, which we then substituted back into $\phi(t)+r\psi(t)$ to obtain $d(r)$. But we do not have to substitute it back into the original expression; we can substitute it into any expression of the form $\alpha(t)+r\beta(t)$. By Equation \eqref{eq:f-gan} and the requirements for an f-divergence listed at the beginning, it suffices that $d(r)=\alpha(\omega(r))+r\beta(\omega(r))$ is a convex function (the condition $d(1)=0$ can then be achieved by a translation), or, following the earlier reasoning, any function whose minimum is attained at $r=1$. Putting everything together, we obtain:
Conclusion 2: Suppose $\omega(r)$ is strictly monotonically increasing, $\Omega = \omega([0, +\infty))$, $\rho(t)$ is always positive for $t\in \Omega$, the following relationships are satisfied: \begin{equation}\left\{\begin{aligned}\phi'(t)=&\, -\omega^{-1}(t)\rho(t)\\ \psi'(t)=&\, \rho(t)\end{aligned} \right.\end{equation} and the functions $\alpha(t), \beta(t)$ are chosen such that $d(r) = \alpha(\omega(r)) + r\beta(\omega(r))$ is a convex function, or such that $d(r)$ attains its minimum value at $r=1$. Then the functions $\phi(t), \psi(t), \alpha(t), \beta(t)$ satisfying these conditions can be used to construct the following GAN model (in the $\min_{q(x)}$ step, the $\alpha(t)$ part, which does not depend on $q(x)$, has been omitted): \begin{equation}\begin{aligned}\max_{t(x)\in \Omega} &\, \mathbb{E}_{x\sim p(x)}[\phi(t(x))] + \mathbb{E}_{x\sim q(x)}[\psi(t(x))]\\ \min_{q(x)}&\, \mathbb{E}_{x\sim q(x)}[\beta(t(x))] \end{aligned}\end{equation}
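To make Conclusion 2 concrete, here is a minimal PyTorch sketch of mine (with hypothetical names) that turns a choice of $\phi, \psi, \beta$ into discriminator and generator losses; choosing $\beta=\psi$ recovers the same-form GAN of Conclusion 1.

```python
import torch

# A generic builder for the GAN losses of Conclusion 2; phi, psi, beta are callables
# acting on the raw scalar output t(x) of the discriminator network D.
def make_losses(phi, psi, beta):
    def discriminator_loss(D, real, fake):
        # D maximizes E_p[phi(t)] + E_q[psi(t)]; return the negative for a minimizer
        return -(phi(D(real)).mean() + psi(D(fake.detach())).mean())

    def generator_loss(D, fake):
        # G minimizes E_q[beta(t)]; the alpha(t) term does not depend on q and is dropped
        return beta(D(fake)).mean()

    return discriminator_loss, generator_loss

# Example: beta = psi recovers the same-form GAN of Conclusion 1,
# instantiated here with the exponential variant from earlier.
d_loss, g_loss = make_losses(phi=lambda t: -torch.exp(t),
                             psi=lambda t: -torch.exp(-t),
                             beta=lambda t: -torch.exp(-t))
```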
Conclusion 2 gives considerable freedom in constructing GANs: because it allows the generator and discriminator losses to differ, it can produce many examples that f-GAN cannot.
For example, in the calculation of the original GAN we had $t = \omega(r) = \log r$, and $r \log r$ is a convex function, so we can let $\alpha(t) = 0, \beta(t) = t$ (note that $\alpha(t)$ may be 0 while $\beta(t)$ may not; why?), which gives $d(r) = r \log r$. The resulting GAN is:
\begin{equation}\begin{aligned}\max_{t(x)} &\, \mathbb{E}_{x\sim p(x)}[\log (1-\sigma(t(x)))] + \mathbb{E}_{x\sim q(x)}[\log \sigma(t(x))]\\ \min_{q(x)}&\, \mathbb{E}_{x\sim q(x)}[t(x)] \end{aligned}\end{equation}This is a quite useful GAN variant. Also, since $r \log r$ is convex, $(1+r) \log(1+r)$ is also convex. We can then let $\alpha(t) = \beta(t) = \log(1+e^t)$ (which equals $\log(1+r)$ since $r=e^t$), corresponding exactly to $d(r) = (1+r) \log(1+r)$. And since $\log(1+e^t) = -\log(1-\sigma(t))$, the corresponding GAN is exactly \eqref{eq:ori-gan-2}, i.e., the more commonly used version of the original GAN, which works better in practice than \eqref{eq:ori-gan}.
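A quick symbolic check of this correspondence (my own sketch):

```python
import sympy as sp

# alpha(t) = beta(t) = log(1 + e^t) with t = omega(r) = log r gives d(r) = (1 + r) log(1 + r)
t = sp.Symbol('t', real=True)
r = sp.Symbol('r', positive=True)
alpha = beta = sp.log(1 + sp.exp(t))
d = (alpha + r * beta).subs(t, sp.log(r))
print(sp.simplify(d - (1 + r) * sp.log(1 + r)))   # 0: matches the claimed d(r)
print(sp.simplify(sp.diff(d, r, 2)))              # 1/(r + 1) > 0, so d(r) is convex
```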
Let's give another example. Let $t = \omega(r) = \frac{a + b r}{1 + r}$ with $b > a$. Then $\Omega = (a, b)$ (again ignoring the boundary points) and $r = \omega^{-1}(t) = \frac{t-a}{b-t}$. We take $\rho(t) = 2(b-t)$, which is always positive on $\Omega$. This yields $\phi(t) = -(t-a)^2, \psi(t) = -(t-b)^2$. Next we take $d(r) = \frac{(r-1)^2}{r+1}$, which clearly attains its minimum at $r=1$. Letting $\alpha(t) = \beta(t)$ and reverse-engineering the form of $\beta(t)$ from $d(r) = (1+r)\beta(\omega(r))$ yields $\beta(\omega(r)) = \left(\frac{r-1}{r+1}\right)^2$. Substituting $r = \frac{t-a}{b-t}$ gives $\beta(t) = \left(\frac{2}{b-a}t - \frac{a+b}{b-a}\right)^2$. For simplicity, we can take $b-a = 2$, so that $\beta(t) = \left(t - \frac{a+b}{2}\right)^2$. The final GAN form is:
\begin{equation}\begin{aligned}\max_{t(x)\in (a,b)} &\, \mathbb{E}_{x\sim p(x)}\left[-(t(x)-a)^2\right] + \mathbb{E}_{x\sim q(x)}\left[-(t(x)-b)^2\right]\\ \min_{q(x)}&\, \mathbb{E}_{x\sim q(x)}\left[\left(t(x)-\frac{a+b}{2}\right)^2\right] \end{aligned}\end{equation}This is actually LSGAN. Readers might be confused: LSGAN does not have the restriction $t(x) \in (a, b)$, so why is it here? In fact, this restriction can be removed, because even without it the optimal solution still falls within $(a, b)$.
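As before, the derivation can be checked symbolically. Here is a sketch of mine with the concrete choice $a=-1$, $b=1$ (so that $b-a=2$):

```python
import sympy as sp

t = sp.Symbol('t', real=True)
r = sp.Symbol('r', positive=True)
a, b = -1, 1
omega = (a + b * r) / (1 + r)            # t = omega(r)
omega_inv = (t - a) / (b - t)            # r = omega^{-1}(t)
rho = 2 * (b - t)                        # positive on Omega = (a, b)

phi = sp.integrate(-omega_inv * rho, t)  # -(t - a)^2 up to an additive constant
psi = sp.integrate(rho, t)               # -(t - b)^2 up to an additive constant
beta = (t - sp.Rational(a + b, 2))**2    # generator loss (t - (a + b)/2)^2, here just t^2

d = sp.simplify((beta + r * beta).subs(t, omega))   # alpha = beta, so d(r) = (1 + r) * beta(omega(r))
print(sp.simplify(d - (r - 1)**2 / (r + 1)))        # 0: matches the claimed d(r)
print(sp.expand(phi), sp.expand(psi))               # -t**2 - 2*t and -t**2 + 2*t (constants dropped)
```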
Obviously, these GAN variants obtained based on the generalized Conclusion 2 are very valuable, and these GAN variants cannot be obtained by f-GAN. Therefore, if the original paper could supplement this part of generalization, it would look very beautiful.
This article shared a paper that designs GAN models directly in the dual space, analyzed its connection with f-GAN, and then I generalized the results of the original paper to allow it to design more diverse GAN models.
Finally, readers might wonder: f-GAN already made so many GANs, and now this paper makes so many more GANs, but in practice, we only use a few anyway. What's the value of making so many? This question is a matter of opinion; this type of work should be more about methodological value, and its practical application value might not be that great. However, I would rather say:
I didn't say it has any value either; I just found it quite interesting~