Fitting One-Dimensional Probability Density Functions with Fourier Series

By 苏剑林 | March 07, 2024

In "Musings on Multi-modal Approaches (Part 1): Lossless Input", we mentioned that the fundamental difficulty of image generation is that there is no universal fitter for continuous probability densities. Of course, one cannot say there are none at all; for example, Gaussian Mixture Models (GMM) can theoretically fit any probability density, and even GANs can essentially be understood as GMMs mixing an infinite number of Gaussian models. However, although GMM's theoretical capacity is sufficient, its Maximum Likelihood Estimation is very difficult, especially because it is usually not suitable for gradient-based optimizers, which limits its application scenarios.

Recently, a new paper from Google, "Fourier Basis Density Model", proposed a new solution for the one-dimensional case—using Fourier series for fitting. The analysis process in the paper is quite interesting, and the construction form is very clever, making it well worth studying.

Problem Description

Some readers might question: what is the value of studying only the one-dimensional case? Indeed, if only image generation scenarios are considered, its value might be limited. However, one-dimensional probability density estimation has application value of its own, for example in lossy data compression, making it a subject worth researching. Furthermore, even if we need to study multi-dimensional probability densities, they can be converted into multiple one-dimensional conditional probability densities through autoregression. Finally, the analysis and construction process itself is very much worth contemplating, so even as an exercise in mathematical analysis, it is quite beneficial.

Back to the main topic. One-dimensional probability density estimation refers to the following problem: given $n$ real numbers $x_1, x_2, \dots, x_n$ sampled from the same distribution, estimate the probability density function of that distribution. Methods for this problem can be divided into "non-parametric estimation" and "parametric estimation." Non-parametric estimation primarily refers to Kernel Density Estimation (KDE), which, in the same spirit as "Using the Dirac Function to Construct Smooth Approximations of Non-smooth Functions", essentially uses the Dirac function for smooth approximation:
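\begin{equation}p(x) = \int \delta(x-y)\,p(y)\,dy \approx \frac{1}{n}\sum_{i=1}^{n}\delta(x - x_i)\label{eq:delta}\end{equation}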

However, $\delta(x - y)$ is not a conventional smooth function, so we need to find a smooth approximation for it, such as $\frac{e^{-(x-y)^2/2\sigma^2}}{\sqrt{2\pi}\sigma}$, which is the probability density function of a normal distribution. As $\sigma \to 0$, it becomes $\delta(x-y)$. Substituting this into the above equation results in "Gaussian Kernel Density Estimation."
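That is, the Gaussian kernel density estimate is
\begin{equation}p(x)\approx \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\,e^{-(x-x_i)^2/2\sigma^2}\end{equation}
where $\sigma$ acts as a bandwidth that controls the degree of smoothing.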

As can be seen, kernel density estimation essentially involves memorizing the data, so it can be inferred that its generalization ability is limited. Additionally, its complexity is proportional to the amount of data, making it impractical when the data scale is large. Therefore, we need "parametric estimation," which involves constructing a family of probability density functions $p_{\theta}(x)$ that are non-negative, normalized, and have a fixed number of parameters $\theta$, and then solving via maximum likelihood:
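\begin{equation}\theta^* = \mathop{\arg\max}_{\theta}\,\frac{1}{n}\sum_{i=1}^{n}\log p_{\theta}(x_i)\end{equation}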

The key to this problem is how to construct a family of probability density functions $p_{\theta}(x)$ with sufficient fitting capacity.

Existing Methods

The classic method for parametric probability density estimation is the Gaussian Mixture Model (GMM). Its starting point is also Equation $\eqref{eq:delta}$ plus a Gaussian kernel, but instead of traversing all data as centers, it finds a finite number $K$ of means and variances through training to construct a model whose complexity is independent of the data volume:
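\begin{equation}p_{\theta}(x) = \sum_{i=1}^{K} w_i\,\frac{e^{-(x-\mu_i)^2/2\sigma_i^2}}{\sqrt{2\pi}\sigma_i},\qquad w_i\geq 0,\quad \sum_{i=1}^{K} w_i = 1\end{equation}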

GMM is also commonly understood as a clustering method, an extension of K-Means, where $\mu_1, \mu_2, \dots, \mu_K$ are the cluster centers; unlike K-Means, it additionally carries the variances $\sigma_1^2, \sigma_2^2, \dots, \sigma_K^2$. Thanks to the theoretical guarantee of Equation $\eqref{eq:delta}$, GMM can in principle fit any complex distribution when $K$ is large enough and the $\sigma_i$ are small enough, so its theoretical capacity is sufficient. The problem with GMM is that the factor $e^{-(x-\mu_i)^2/2\sigma_i^2}$ decays too quickly, leading to severe gradient vanishing. Consequently, it usually has to be solved with the EM algorithm (refer to "The Three Senses of Capsules: Matrix Capsules and EM Routing") rather than a gradient-based optimizer, which limits its applications. Furthermore, even with the EM algorithm, one usually only finds a sub-optimal solution.

The original paper also mentions a method called DFP (Deep Factorized Probability), designed specifically for one-dimensional probability density estimation, originating from the paper "Variational image compression with a scale hyperprior". DFP utilizes the monotonicity of the cumulative probability function, i.e., the following integral:
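\begin{equation}\Phi_{\theta}(x) = \int_{-\infty}^{x} p_{\theta}(t)\,dt\end{equation}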

must be monotonically increasing. If we can first construct a monotonically increasing function $\Phi_{\theta}(x): \mathbb{R}\mapsto[0,1]$, then its derivative will be a valid probability density function. How to ensure the output of a model increases monotonically with its input? DFP uses a multi-layer neural network to build the model and ensures: 1. all activation functions are monotonically increasing; 2. all multiplicative weights are non-negative. Under these two constraints, the model must exhibit monotonicity. Finally, adding a Sigmoid controls the output range to $[0,1]$.
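As a small illustration of this construction (my own sketch, not the architecture of the original paper; the layer widths and the softplus parametrization are arbitrary choices), such a monotone network can be written in a few lines of JAX, with the density recovered by automatic differentiation:

```python
# A minimal sketch of the monotonicity trick behind DFP (not the original architecture):
# softplus keeps every multiplicative weight non-negative, tanh is a monotonically
# increasing activation, and a final sigmoid squashes the output into [0, 1].
import jax
import jax.numpy as jnp

def init_params(key, widths=(1, 8, 8, 1)):
    keys = jax.random.split(key, len(widths) - 1)
    return [(0.1 * jax.random.normal(k, (m, n)), jnp.zeros(n))
            for k, m, n in zip(keys, widths[:-1], widths[1:])]

def monotone_cdf(params, x):
    """A monotonically increasing map Phi_theta: R -> [0, 1]."""
    h = jnp.reshape(x, (-1, 1))
    for i, (W, b) in enumerate(params):
        h = h @ jax.nn.softplus(W) + b   # non-negative multiplicative weights
        if i < len(params) - 1:
            h = jnp.tanh(h)              # monotonically increasing activation
    return jax.nn.sigmoid(h[:, 0])       # squash the output into (0, 1)

params = init_params(jax.random.PRNGKey(0))
xs = jnp.linspace(-3.0, 3.0, 5)
print(monotone_cdf(params, xs))          # values increase with x
# The density is the derivative of the CDF, obtained by automatic differentiation:
density = jax.vmap(jax.grad(lambda t: monotone_cdf(params, t)[0]))
print(density(xs))                       # non-negative values
```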

The principle of DFP is simple and intuitive, but similar to flow-based models, this layer-by-layer constraint raises concerns about lost fitting capacity. Furthermore, since the model is built entirely from neural networks, according to conclusions such as "Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks", neural networks prioritize learning low-frequency signals during training. This may result in poor fitting of probability density peaks, i.e., over-smoothed fitting results. However, in many scenarios, the ability to fit peaks is one of the important indicators for evaluating a probability modeling method.

Enter Fourier

As indicated by the paper title "Fourier Basis Density Model" (hereafter referred to as FBDM), the proposed new method uses Fourier series to fit probability densities. Truth be told, the idea of using a Fourier series for fitting is not hard to come up with; the difficulty is a key non-negativity constraint that is not easy to enforce by construction, and FBDM works through all the relevant details successfully, which is truly admirable.

For simplicity, we set the domain of $x$ to $[-1,1]$. Since any real number on $\mathbb{R}$ can be mapped to $[-1,1]$ through transformations like $\tanh$, this setting does not lose generality. That is to say, the probability density function $p(x)$ we seek is a function defined on $[-1,1]$. We can write it as a Fourier series:
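\begin{equation}p(x) = \sum_{n=-\infty}^{\infty} c_n\,e^{i\pi n x},\qquad c_n = \frac{1}{2}\int_{-1}^{1} p(x)\,e^{-i\pi n x}\,dx\end{equation}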

However, what we are doing now is the reverse: we do not start from a known $p(x)$ and compute its Fourier series; instead, we only have samples of $p(x)$. So we posit a truncated Fourier series $f_{\theta}(x)$ of the following form as the model and find the coefficients $c_n$ using an optimizer:
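\begin{equation}f_{\theta}(x) = \sum_{n=-N}^{N} c_n\,e^{i\pi n x}\label{eq:fourier-series}\end{equation}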

As is well known, as a reasonable probability density function, $f_{\theta}(x)$ must satisfy at least the following two conditions:
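\begin{equation}f_{\theta}(x)\geq 0,\qquad \int_{-1}^{1} f_{\theta}(x)\,dx = 1\end{equation}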

The problem is that if $c_n$ are arbitrary real/complex numbers, then even the reality of Equation $\eqref{eq:fourier-series}$ cannot be guaranteed, let alone non-negativity. Thus, using Fourier series for fitting is not hard to think of; the difficulty lies in setting the form of $c_n$ such that the corresponding output is necessarily non-negative (as we will see later, under Fourier series, non-negativity is the hardest part, while normalization is actually simple).

Ensuring Non-negativity

In this section, we look at the most difficult part of the paper: the construction of non-negativity. The difficulty is not that the derivation is complicated, but that the construction takes real ingenuity.

First, we know that reality is a prerequisite for non-negativity, and ensuring reality is relatively simple:
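if $f_{\theta}(x)$ is to be real, we need $f_{\theta}(x) = f_{\theta}^*(x)$, and
\begin{equation}f_{\theta}^*(x) = \sum_{n=-N}^{N} c_n^*\,e^{-i\pi n x} = \sum_{n=-N}^{N} c_{-n}^*\,e^{i\pi n x}\end{equation}
so comparing the coefficients of $e^{i\pi n x}$ gives $c_n = c_{-n}^*$, i.e., $c_n^* = c_{-n}$.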

It can be proven conversely that this condition is also sufficient. Therefore, guaranteeing reality only requires the constraint $c_n^* = c_{-n}$. This is relatively easy to implement and means that when the series in Equation $\eqref{eq:fourier-series}$ serves as a probability density function, there are only $N+1$ independent parameters $c_0, c_1, \dots, c_N$.

Regarding the non-negativity constraint, the original paper simply throws out a reference to "Herglotz’s theorem" and provides the answer directly, with almost no derivation. I searched for Herglotz’s theorem and found few introductions, so I had to try to bypass Herglotz’s theorem and understand the paper's answer in my own way.

Consider any finite index set $\mathbb{K}\subseteq\mathbb{N}$ and an arbitrary complex sequence $\{u_k\,|\,k\in\mathbb{K}\}$ indexed by it. We have:
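\begin{equation}\sum_{n,m\in\mathbb{K}} u_n^*\,c_{n-m}\,u_m = \frac{1}{2}\int_{-1}^{1} p(x)\sum_{n,m\in\mathbb{K}} u_n^*\,u_m\,e^{-i\pi(n-m)x}\,dx = \frac{1}{2}\int_{-1}^{1} p(x)\left|\sum_{n\in\mathbb{K}} u_n\,e^{i\pi n x}\right|^2 dx\geq 0\end{equation}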

The final $\geq 0$ depends on $p(x) \geq 0$, so we have derived a necessary condition for $f_{\theta}(x) \geq 0$: $\sum_{n,m\in \mathbb{K}} u_n^* c_{n-m} u_m \geq 0$. If we arrange all $c_{n-m}$ into a large matrix (a Toeplitz matrix), then in linear algebra terms, this condition is that any submatrix formed by the intersection of rows and columns corresponding to $\mathbb{K}$ must be a positive semi-definite matrix in complex space. It can also be proven that this condition is sufficient; that is, an $f_{\theta}(x)$ satisfying this condition must be greater than or equal to 0.

So the problem turns into finding a complex sequence $\{c_n\}$ such that the corresponding Toeplitz matrix $\{c_{n-m}\}$ is a positive semi-definite matrix. While this transformation seems to make the problem more complex, readers familiar with time series might know a ready-made construction: "autocorrelation coefficients." For any complex sequence $\{a_k\}$, the autocorrelation coefficient is defined as:
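\begin{equation}r_n = \sum_{k} a_k^*\,a_{k+n}\end{equation}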

It can be proven that $r_{n-m}$ is necessarily positive semi-definite:
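\begin{equation}\sum_{n,m\in\mathbb{K}} u_n^*\,r_{n-m}\,u_m = \sum_{n,m\in\mathbb{K}}\sum_{k} u_n^*\,a_k^*\,a_{k+n-m}\,u_m = \sum_{j}\left(\sum_{n\in\mathbb{K}} u_n\,a_{j-n}\right)^{\!*}\left(\sum_{m\in\mathbb{K}} u_m\,a_{j-m}\right) = \sum_{j}\left|\sum_{n\in\mathbb{K}} u_n\,a_{j-n}\right|^2\geq 0\end{equation}
where the second equality substitutes $j = k + n$.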

Therefore, taking $c_n$ in the form of $r_n$ is a feasible solution. To ensure that $c_n$ has only $N+1$ independent parameters, we agree that $a_k=0$ when $k < 0$ or $k > N$. We then get:
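\begin{equation}c_n = \sum_{k=0}^{N-n} a_k^*\,a_{k+n},\qquad n = 0, 1, \dots, N\end{equation}
together with $c_{-n} = c_n^*$, which the autocorrelation satisfies automatically (since $r_{-n} = r_n^*$), so the reality condition also holds; moreover, $c_n = 0$ for $n > N$, matching the truncation.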

This constructs the corresponding $c_n$ such that the resulting Fourier series in Equation $\eqref{eq:fourier-series}$ is necessarily non-negative, satisfying the requirement for the probability density function.

General Results

Thus far, the most difficult part of the whole problem—"non-negativity"—has been solved. The remaining normalization is simple because:
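\begin{equation}\int_{-1}^{1} f_{\theta}(x)\,dx = \sum_{n=-N}^{N} c_n\int_{-1}^{1} e^{i\pi n x}\,dx = 2c_0\end{equation}
since only the $n=0$ term survives the integration.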

So the normalization factor is $2c_0$! Therefore, we only need to set $p_{\theta}(x)$ as:
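\begin{equation}p_{\theta}(x) = \frac{f_{\theta}(x)}{2c_0} = \frac{1}{2c_0}\sum_{n=-N}^{N} c_n\,e^{i\pi n x},\qquad c_0 = \sum_{k=0}^{N}|a_k|^2\end{equation}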

This is a valid probability density candidate function. Of course, currently this distribution is only defined on $[-1,1]$; we need to extend it to the entire $\mathbb{R}$. This is not difficult: we first pick a transformation that compresses $\mathbb{R}$ into $[-1,1]$, and then work out the form of the probability density after the transformation. To this end, we can first estimate the mean $\mu$ and variance $\sigma^2$ from the original samples, standardize via $\frac{y-\mu}{\sigma}$ to obtain a distribution with zero mean and unit variance, and then compress it into $[-1,1]$ through $x=\tanh\left(\frac{y-\mu}{\sigma}\right)$. The corresponding new probability density function is:
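\begin{equation}q_{\theta}(y) = p_{\theta}\!\left(\tanh\frac{y-\mu}{\sigma}\right)\cdot\frac{1}{\sigma}\left(1-\tanh^2\frac{y-\mu}{\sigma}\right)\end{equation}
where the second factor is the Jacobian $\frac{dx}{dy}$ of the transformation.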

From an end-to-end learning perspective, we can directly substitute the original data into the log-likelihood of $q_{\theta}(y)$ for optimization (instead of optimizing after compression), and even treat $\mu, \sigma$ as trainable parameters to be adjusted together.

Finally, to prevent overfitting, we also need a regularization term. The purpose of regularization is to make the fitted distribution slightly smoother so that it does not overfit local details. To this end, we take the squared norm of the derivative of $f_{\theta}(x)$ as the regularization term:
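\begin{equation}\int_{-1}^{1}\left|f_{\theta}'(x)\right|^2 dx = \int_{-1}^{1}\left|\sum_{n=-N}^{N} i\pi n\,c_n\,e^{i\pi n x}\right|^2 dx = 2\pi^2\sum_{n=-N}^{N} n^2\,|c_n|^2\end{equation}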

As seen from the final form, it increases the penalty weight for high-frequency coefficients, which helps prevent the model from overfitting high-frequency details and improves generalization.

Extended Thoughts

At this point, we have completed all the theoretical derivations of FBDM. The rest is naturally conducting experiments. We will not repeat that part; just look at the results in the original paper. Note that all coefficients and operations of FBDM are in the complex domain. Forcing them to be real would significantly complicate the resulting forms, so for simplicity, they should be implemented directly based on the complex operations of the framework used (the original paper uses Jax).
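As a reference for what such an implementation might look like, here is a minimal jax.numpy sketch of the density construction described above (my own illustration, not the authors' code; the function names and toy parameters are arbitrary):

```python
# A minimal sketch (not the authors' implementation) of the FBDM density on [-1, 1].
# The trainable parameters are the complex numbers a_0, ..., a_N; everything else
# is derived from them following the construction above.
import jax.numpy as jnp

def fourier_coeffs(a):
    """c_n = sum_k conj(a_k) * a_{k+n} for n = 0, ..., N (autocorrelation of a)."""
    N = a.shape[0] - 1
    return jnp.stack([jnp.sum(jnp.conj(a[:N + 1 - n]) * a[n:]) for n in range(N + 1)])

def log_density(x, a):
    """log p_theta(x) for an array x in [-1, 1], with p_theta = f_theta / (2 c_0)."""
    c = fourier_coeffs(a)                 # c_0, ..., c_N; the negative side is c_{-n} = conj(c_n)
    n = jnp.arange(1, c.shape[0])
    # f_theta(x) = c_0 + 2 Re( sum_{n>=1} c_n e^{i pi n x} ), real and non-negative by construction
    f = c[0].real + 2.0 * jnp.sum((c[1:] * jnp.exp(1j * jnp.pi * n * x[..., None])).real, axis=-1)
    return jnp.log(f) - jnp.log(2.0 * c[0].real)

a = jnp.array([1.0 + 0.0j, 0.3 - 0.2j, 0.1 + 0.4j])   # toy parameters, N = 2
xs = jnp.linspace(-1.0, 1.0, 5)
print(jnp.exp(log_density(xs, a)))                    # non-negative; integrates to 1 over [-1, 1]
```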

Before looking at the experimental results, let's think about the evaluation metrics. In simulation experiments, we usually know the probability density expression of the true distribution, so the most direct metric is the KL divergence/Wasserstein distance between the true distribution and the fitted distribution. In addition, when fitting probability densities we usually care more about how well the "peaks" are fitted. Assuming the probability density is smooth, it may have multiple local maximum points, which we refer to as peaks or "modes". In many scenarios, the ability to accurately locate more modes is more important than the overall distribution metric. For example, the basic idea of lossy compression is to retain only these modes to describe the distribution.

From the experimental results in the original paper, it can be seen that with the same number of parameters, FBDM performs better in terms of both KL divergence and mode recovery:

[Figure: Comparison of the fitting results of GMM, DFP, and FBDM]

I suspect the reason for this advantage is that the complex exponential in FBDM is essentially a combination of trigonometric functions; unlike the negative exponential in GMM, it does not suffer from severe vanishing gradients, so gradient-based optimizers have a higher probability of finding a good solution. From this perspective, FBDM can also be understood as a "Softmax" for continuous probability densities: both use $\exp$ as the basis for building a probability distribution, except one uses real exponents and the other uses imaginary exponents.

Compared to GMM, FBDM naturally has some disadvantages, such as being less intuitive, difficult to sample from, and difficult to generalize to high dimensions (though these disadvantages are also present in DFP). GMM is very intuitive, being a weighted average of a finite number of normal distributions. Sampling can be achieved through a hierarchical sampling process of category sampling followed by normal sampling. In contrast, FBDM does not have such an intuitive scheme; it seems sampling can only be done via the inverse cumulative probability function. That is, first find the cumulative probability function:
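\begin{equation}\Phi(y) = \int_{-\infty}^{y} q_{\theta}(t)\,dt = \int_{-1}^{\tanh\left(\frac{y-\mu}{\sigma}\right)} p_{\theta}(x)\,dx\end{equation}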

Actually, the integration is quite simple:
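writing $s = \tanh\left(\frac{y-\mu}{\sigma}\right)$, we have
\begin{equation}\Phi(y) = \frac{1}{2c_0}\left[c_0\,(s+1) + \sum_{0<|n|\leq N}\frac{c_n}{i\pi n}\left(e^{i\pi n s} - (-1)^n\right)\right]\end{equation}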

Then $y=\Phi^{-1}(\varepsilon), \varepsilon\sim U[0,1]$ can achieve sampling. Regarding high-dimensional generalization, normal distributions inherently have high-dimensional forms, so generalizing GMM to any dimension is easy. But if FBDM is to be directly generalized to $D$ dimensions, then there will be $(N+1)^D$ parameters, which is obviously too complex. Or, like Decoder-Only LLMs, use autoregression to convert it into multiple one-dimensional conditional probability density estimations. In short, it's not that there are no ways, but one has to deal with more twists and turns.

Summary

This article introduced a new idea for modeling one-dimensional probability density functions with Fourier series. The key is to constrain the Fourier series, which is natively complex-valued, to be a non-negative function through an ingenious construction of the coefficients. The whole derivation is quite elegant and well worth studying.