By 苏剑林 | August 26, 2019
A few days ago, I saw an article pushed by Jiqizhixin (Machine Heart) titled "Completely Solving the Gradient Explosion Problem: A New Method for Training ResNet Without Backpropagation". Setting aside the clickbait style of the media headline and focusing on the content: the article introduces the results of the paper "The HSIC Bottleneck: Deep Learning without Back-Propagation", which proposes an algorithm for training neural networks through the HSIC Bottleneck.
To be honest, I haven't fully understood this paper yet, as it contains too many concepts that are new to me. However, the notion of "HSIC" in the paper caught my interest, and after some study I have a basic grasp of what it means and where it comes from. This article therefore tries to give as intuitive (though perhaps not entirely precise) an understanding of HSIC as possible.
HSIC stands for "Hilbert-Schmidt independence criterion." Like mutual information, it can be used to measure the independence between two variables.
We know that the basic form of mutual information is: \begin{equation}I(X,Y)=\iint p(x,y)\log \frac{p(x, y)}{p(x)p(y)}dxdy\label{eq:i}\end{equation} If $I(X,Y)=0$, it means that $p(x, y)\equiv p(x)p(y)$, which implies the two variables are independent; otherwise, they are correlated. However, the term $\log \frac{p(x, y)}{p(x)p(y)}$ means we need to estimate probability densities in some way.
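As a quick sanity check of \eqref{eq:i} (this example is mine, not from the paper above): for a zero-mean bivariate Gaussian with correlation coefficient $\rho$, the integral has a well-known closed form, \begin{equation}I(X,Y)=-\frac{1}{2}\log\left(1-\rho^2\right),\end{equation} which is zero exactly when $\rho=0$, i.e., exactly when the two Gaussian variables are independent.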
The role of HSIC is similar to mutual information, but unlike mutual information, it does not require estimating the probability density of the two variables. Instead, it transforms the problem directly into a sampling form.
Long-time readers of this blog know that "mutual information" is a frequently occurring concept here. We can use mutual information for new word discovery (e.g., "New Word Discovery Based on Segmentation"), and for unsupervised learning (e.g., "Mutual Information in Deep Learning: Extracting Features Unsupervised"). The importance of mutual information is evident. If there is an indicator that can replace mutual information and is even more convenient, it is definitely something I must learn.
Generally, we define the problem as follows:
Given data $(x_1, y_1),(x_2, y_2),\dots,(x_n,y_n)\sim p(x, y)$, determine whether $p(x, y)$ is identically equal to $p(x)p(y)$, i.e., whether $x$ and $y$ are independent.
Strictly speaking, for continuous variables, "identically equal" here means "equal almost everywhere," but we will not strictly enforce that terminology here.
For the sake of formal description, let $x\in X, y\in Y$ and $f(x),g(y)\in \mathbb{R}$. Note that $x$ and $y$ might be variables with completely different meanings. For example, $x$ might be "Monday" and $y$ might be "at work"; $p(x,y)$ would then be the probability that "today is Monday AND I am at work today." Because of this, $X$ and $Y$ might be two completely different domains.
The basic logic is to calculate the mutual information in \eqref{eq:i}, but in many problems, we cannot estimate probabilities or probability densities well. One possible solution is to transform it into a dual problem using an adversarial-like approach to learn mutual information (the infomax logic), but this method can be unstable and is affected by the sampling scheme. The best solution would be to have an indicator similar to the "correlation coefficient" that allows us to explicitly calculate and optimize it.
HSIC was designed for this purpose!
Here we aim to introduce the concept of HSIC as clearly as possible. However, "as clear as possible" does not mean "as short as possible." In fact, the following section will still be quite long and contain many mathematical formulas. However, compared to standard tutorials that immediately introduce Hilbert spaces, reproducing kernels, and various operators, this introduction should be friendly to readers unfamiliar with those concepts.
HSIC notes that:
$p(x, y)\equiv p(x)p(y)$ if and only if for any $f,g$, the expression \begin{equation}\begin{aligned}C[f,g]=&\iint p(x,y)f(x)g(y)dxdy - \iint p(x)p(y)f(x)g(y)dxdy\\ =&\mathbb{E}_{(x,y)\sim p(x,y)}[f(x)g(y)]-\mathbb{E}_{x\sim p(x)}[f(x)]\mathbb{E}_{y\sim p(y)}[g(y)]\end{aligned}\end{equation} is equal to 0.
This conclusion is clearly not hard to understand. What's interesting is that the right side of the second equals sign is in the form of an expectation (sampling), meaning we have transformed the indicator into a sampling form, avoiding direct probability density estimation.
Thus, we have a way to judge independence: select "enough" pairs of $f, g$, then calculate \begin{equation}L_H=\sum_{f,g} \big(C[f,g]\big)^2\label{eq:l0}\end{equation} and see how close $L_H$ is to 0. Conversely, if in an optimization problem we want features $x$ and $y$ to be as independent as possible, we can add $L_H$ to the loss function.
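To make \eqref{eq:l0} concrete before any kernels enter the picture, here is a toy sketch (my own illustration; the test functions and sample size are arbitrary choices) that estimates $C[f,g]$ from samples for a handful of hand-picked $f, g$ and sums the squares:
import numpy as np

def C_hat(f, g, x, y):
    # sample estimate of C[f, g] = E[f(x)g(y)] - E[f(x)]E[g(y)]
    return np.mean(f(x) * g(y)) - np.mean(f(x)) * np.mean(g(y))

fs = gs = [lambda t: t, lambda t: t**2, np.sin, np.cos]  # hand-picked test functions
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y_dep = x + 0.1 * rng.standard_normal(1000)   # strongly dependent on x
y_ind = rng.standard_normal(1000)             # independent of x

for y in (y_dep, y_ind):
    # crude L_H over the chosen function pairs; should be much larger in the dependent case
    print(sum(C_hat(f, g, x, y)**2 for f in fs for g in gs))
With only finitely many hand-picked functions this is of course not a proof of independence; the kernel trick below lets us sweep an entire basis at once.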
In fact, the form of $L_H$ already reflects the diagnostic logic of HSIC. Let's follow this logic and gradually move towards the final form of HSIC.
First, let's expand $\big(C[f,g]\big)^2$: \begin{equation}\begin{aligned}\big(C[f,g]\big)^2=&\big(\mathbb{E}_{(x,y)\sim p(x,y)}[f(x)g(y)]\big)^2 + \big(\mathbb{E}_{x\sim p(x)}[f(x)]\big)^2 \big(\mathbb{E}_{y\sim p(y)}[g(y)]\big)^2\\ & - 2\big(\mathbb{E}_{(x,y)\sim p(x,y)}[f(x)g(y)]\big)\big(\mathbb{E}_{x\sim p(x)}[f(x)]\big)\big(\mathbb{E}_{y\sim p(y)}[g(y)]\big)\end{aligned}\end{equation}
Then we use a trick: since $\mathbb{E}_{x\sim p(x)}[f(x)]=\mathbb{E}_{x'\sim p(x)}[f(x')]$, the value of an expectation does not depend on the name we give the integration variable. So we have: \begin{equation}\begin{aligned}\big(\mathbb{E}_{x\sim p(x)}[f(x)]\big)^2=&\big(\mathbb{E}_{x_1\sim p(x)}[f(x_1)]\big)\big(\mathbb{E}_{x_2\sim p(x)}[f(x_2)]\big)\\ =&\mathbb{E}_{x_1\sim p(x),x_2\sim p(x)}[f(x_1)f(x_2)]\end{aligned}\end{equation}
By transforming all other terms in this way, we finally get: \begin{equation}\begin{aligned}\big(C[f,g]\big)^2=&\mathbb{E}_{(x_1,y_1)\sim p(x,y),(x_2,y_2)\sim p(x,y)}[f(x_1)f(x_2)g(y_1)g(y_2)] \\ & + \mathbb{E}_{x_1\sim p(x),x_2\sim p(x),y_1\sim p(y),y_2\sim p(y)}[f(x_1)f(x_2)g(y_1)g(y_2)]\\ & - 2 \mathbb{E}_{(x_1,y_1)\sim p(x,y),x_2\sim p(x),y_2\sim p(y)}[f(x_1)f(x_2)g(y_1)g(y_2)]\end{aligned}\label{eq:c}\end{equation} In this way, every term is an expectation of $f(x_1)f(x_2)g(y_1)g(y_2)$, just with different sampling distributions for the variables.
Now the question is: which $f, g$ should we choose? How many are "enough"?
By analogy with vector spaces, all possible $f(x)$ form a vector space $\mathcal{F}$, and all possible $g(y)$ likewise form a vector space $\mathcal{G}$. If we could run over a complete basis of each of these two spaces, that would certainly be enough. The question then becomes: how do we find such bases?
This is where the "kernel function" comes into play. What exactly a kernel function is would take a while to explain properly (and I do not claim to understand it fully myself). Roughly speaking, a kernel function plays a role similar to a "positive definite matrix" in linear algebra. It is a symmetric function of two arguments, $K(x_1, x_2)=K(x_2, x_1)$, defined on $X\times X$. If we treat a unary function $f(x)$ as a vector, then $$ \int K(x_1,x_2) f(x_2)dx_2 $$ is the analogue of a matrix acting on a vector (matrix times vector). Just as a matrix has eigenvalues and eigenvectors, a kernel function has eigenvalues and eigenfunctions: a unary function $\psi$ satisfying the identity \begin{equation}\int K(x_1,x_2) \psi(x_2)dx_2=\alpha \psi(x_1)\end{equation} is called an eigenfunction of the kernel, with eigenvalue $\alpha$.
The above content is preparatory; the strict definitions belong to the category of "Reproducing Kernel Hilbert Spaces" (RKHS). What we will actually use are two properties:
1. All eigenfunctions $\psi_1, \psi_2, \dots$ of a kernel function constitute an orthogonal basis for that space;
2. All eigenvalues $\alpha_1, \alpha_2, \dots$ of the kernel function are positive and satisfy \begin{equation}K(x_1,x_2)=\sum_i \alpha_i \psi_i(x_1)\psi_i(x_2)\label{eq:k}\end{equation}
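The discrete analogue of these two properties is easy to check numerically, using a Gaussian kernel matrix in place of the continuous kernel (a sketch of my own): the eigenvalues of the kernel matrix are non-negative up to floating-point error, and the matrix is recovered as $\sum_i \alpha_i \psi_i \psi_i^{\top}$ from its orthonormal eigenvectors, the discrete counterpart of \eqref{eq:k}.
import numpy as np

np.random.seed(0)
x = np.random.randn(200)
K = np.exp(-(x[:, None] - x[None, :])**2)   # Gaussian kernel matrix (sigma = 1)

alphas, psis = np.linalg.eigh(K)            # eigenvalues, orthonormal eigenvectors (columns)
print(alphas.min())                         # >= 0 up to floating-point error
K_rebuilt = (psis * alphas) @ psis.T        # sum_i alpha_i * psi_i psi_i^T
print(np.abs(K - K_rebuilt).max())          # ~ 0: the discrete version of the expansion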
With the above preparation, HSIC can basically make its debut!
First, suppose we have a kernel function $K_X(x_1,x_2)$ defined on $X\times X$. Then we can calculate the eigenvalues $\alpha_1, \alpha_2, \dots$ and eigenfunctions $\psi_1, \psi_2, \dots$ corresponding to $K_X(x_1,x_2)$. Similarly, with a kernel function $K_Y(y_1,y_2)$ defined on $Y\times Y$, we can calculate the eigenvalues $\beta_1, \beta_2, \dots$ and eigenfunctions $\phi_1, \phi_2, \dots$ corresponding to $K_Y(y_1,y_2)$.
Then, because eigenfunctions constitute a basis, in \eqref{eq:l0}, we can replace $f, g$ with the corresponding eigenfunctions $\psi_i, \phi_j$: \begin{equation}L_H=\sum_{i,j}\big(C[\psi_i, \phi_j]\big)^2\end{equation} Since all eigenvalues are positive, we can also use the eigenvalues as weights for the summation without changing the purpose of $L_H$: \begin{equation}L_H=\sum_{i,j}\alpha_i \beta_j\cdot\big(C[\psi_i, \phi_j]\big)^2\end{equation} Now, substituting \eqref{eq:c} into the above, we get: \begin{equation}\begin{aligned}L_H=&\mathbb{E}_{(x_1,y_1)\sim p(x,y),(x_2,y_2)\sim p(x,y)}\left[\sum_{i,j}\alpha_i \beta_j\psi_i(x_1)\psi_i(x_2)\phi_j(y_1)\phi_j(y_2)\right] \\ & + \mathbb{E}_{x_1\sim p(x),x_2\sim p(x),y_1\sim p(y),y_2\sim p(y)}\left[\sum_{i,j}\alpha_i \beta_j\psi_i(x_1)\psi_i(x_2)\phi_j(y_1)\phi_j(y_2)\right]\\ & - 2 \mathbb{E}_{(x_1,y_1)\sim p(x,y),x_2\sim p(x),y_2\sim p(y)}\left[\sum_{i,j}\alpha_i \beta_j\psi_i(x_1)\psi_i(x_2)\phi_j(y_1)\phi_j(y_2)\right] \end{aligned}\end{equation}
Finally, applying the equality in \eqref{eq:k}, the terms inside the square brackets are exactly $K_X(x_1,x_2)K_Y(y_1,y_2)$. Thus, HSIC makes its entrance: \begin{equation}\begin{aligned}HSIC(X,Y)=&\mathbb{E}_{(x_1,y_1)\sim p(x,y),(x_2,y_2)\sim p(x,y)}\left[K_X(x_1,x_2)K_Y(y_1,y_2)\right] \\ & + \mathbb{E}_{x_1\sim p(x),x_2\sim p(x),y_1\sim p(y),y_2\sim p(y)}\left[K_X(x_1,x_2)K_Y(y_1,y_2)\right]\\ & - 2 \mathbb{E}_{(x_1,y_1)\sim p(x,y),x_2\sim p(x),y_2\sim p(y)}\left[K_X(x_1,x_2)K_Y(y_1,y_2)\right]\end{aligned}\label{eq:hsic}\end{equation} This is the indicator for measuring correlation that we were looking for. It is purely in sampling form, and $K_X, K_Y$ are usually given and differentiable kernels. Therefore, this is an indicator that can be calculated via explicit sampling and directly optimized!
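In fact, \eqref{eq:hsic} can already be read off directly as a Monte-Carlo recipe: estimate each of the three expectations by drawing index pairs from the sample, using independent indices to mimic $p(x)p(y)$. A rough sketch of this reading (my own illustration, not the estimator used in the paper):
import numpy as np

def hsic_mc(x, y, kernel, num_pairs=100000, seed=0):
    # crude Monte-Carlo estimate of the three expectations in the HSIC formula
    rng = np.random.default_rng(seed)
    n = len(x)
    i, j = rng.integers(0, n, num_pairs), rng.integers(0, n, num_pairs)
    k, m = rng.integers(0, n, num_pairs), rng.integers(0, n, num_pairs)
    # term 1: both pairs (x1, y1) and (x2, y2) drawn jointly from p(x, y)
    t1 = np.mean(kernel(x[i], x[j]) * kernel(y[i], y[j]))
    # term 2: all four coordinates drawn independently from the marginals
    t2 = np.mean(kernel(x[i], x[j]) * kernel(y[k], y[m]))
    # term 3: (x1, y1) drawn jointly, x2 and y2 drawn from the marginals
    t3 = np.mean(kernel(x[i], x[j]) * kernel(y[i], y[m]))
    return t1 + t2 - 2 * t3

gauss = lambda a, b: np.exp(-(a - b)**2)        # 1-D Gaussian kernel with sigma = 1
x = np.random.randn(1000)
print(hsic_mc(x, x + 0.1 * np.random.randn(1000), gauss))  # dependent: clearly above 0
print(hsic_mc(x, np.random.randn(1000), gauss))            # independent: close to 0
The matrix form derived below is just a tidier, deterministic way of doing the same averaging over all index pairs at once.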
In actual calculation, there are many optional kernel functions. A commonly used one is: \begin{equation}K(x_1, x_2) = \exp\left(-\frac{\Vert x_1 - x_2\Vert_2^2}{\sigma^2}\right)\label{eq:gk}\end{equation} where $\sigma > 0$ is a constant. The paper mentioned at the beginning, "The HSIC Bottleneck: Deep Learning without Back-Propagation", also uses this kernel. Different kernel functions yield slightly different effects, but all of them can guarantee $HSIC(X,Y)=0 \Leftrightarrow p(x,y)\equiv p(x)p(y)$.
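For concreteness, a small helper for building this kernel matrix from a batch of vectors might look as follows (the name gaussian_kernel and its default $\sigma$ are my own choices, not from the paper):
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # X: array of shape (n, d); returns the n x n matrix exp(-||x_i - x_j||^2 / sigma^2)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return np.exp(-sq_dists / sigma**2)
The experiment later in this article uses the one-dimensional special case of this kernel with $\sigma=1$.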
Finally, let's derive the matrix form of \eqref{eq:hsic} for a finite sample.
According to the idea of approximating expectations by sample averages, $\mathbb{E}_{(x,y)\sim p(x,y)}$ is an average over all sample pairs $(x_i, y_i)$, and $\mathbb{E}_{(x_1,y_1)\sim p(x,y),(x_2,y_2)\sim p(x,y)}$ simply performs this averaging twice. Thus: \begin{equation}\mathbb{E}_{(x_1,y_1)\sim p(x,y),(x_2,y_2)\sim p(x,y)}\left[K_X(x_1,x_2)K_Y(y_1,y_2)\right]=\frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n \left[K_X(x_i,x_j)K_Y(y_i,y_j)\right]\end{equation} With a slight abuse of notation, let $K_X$ and $K_Y$ denote the $n\times n$ symmetric matrices with entries $K_X(x_i,x_j)$ and $K_Y(y_i,y_j)$. Then the sum above can be written as $\frac{1}{n^2}\text{Tr}(K_X K_Y)$, where $\text{Tr}$ denotes the trace. By the same logic, the second term is "the average of all elements of $K_X$ times the average of all elements of $K_Y$," which in matrix form is $\frac{1}{n^4}\text{Tr}(K_X \boldsymbol{1}K_Y \boldsymbol{1})$, where the bold $\boldsymbol{1}$ denotes the $n\times n$ matrix of all ones. Correspondingly, the last term is twice the average of all elements of $K_X K_Y$, divided by $n$, namely $\frac{2}{n^3}\text{Tr}(K_X K_Y \boldsymbol{1})$.
So, expressing HSIC in matrix form gives: \begin{equation}\begin{aligned}HSIC(X,Y)=&\frac{1}{n^2}\text{Tr}(K_X K_Y)+\frac{1}{n^4}\text{Tr}(K_X \boldsymbol{1}K_Y \boldsymbol{1})-\frac{2}{n^3}\text{Tr}(K_X K_Y \boldsymbol{1})\\ =&\frac{1}{n^2}\text{Tr}(K_X J K_Y J) \end{aligned}\end{equation} Here $J = \boldsymbol{I}-\boldsymbol{1}/n$, where $\boldsymbol{I}$ is the identity matrix of order $n$. Similar to the discussion in "Brief Introduction to Unbiased and Biased Estimation", this is actually a biased estimate. Replacing the preceding $1/n$ with $1/(n-1)$ gives the unbiased estimate: \begin{equation}HSIC(X,Y)=\frac{1}{(n-1)^2}\text{Tr}(K_X J K_Y J)\label{eq:hsic-m}\end{equation} This is the final HSIC formula in matrix form (note that the $1/n$ inside $J$ does not need to be replaced with $1/(n-1)$).
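The algebra behind the second equality (expanding $J = \boldsymbol{I}-\boldsymbol{1}/n$ and using the symmetry of the kernel matrices) is easy to verify numerically; a quick check of my own:
import numpy as np

n = 50
rng = np.random.default_rng(1)
x, y = rng.standard_normal(n), rng.standard_normal(n)
Kx = np.exp(-(x[:, None] - x[None, :])**2)
Ky = np.exp(-(y[:, None] - y[None, :])**2)

# term-by-term form: Tr(Kx Ky)/n^2 + mean(Kx)*mean(Ky) - 2*mean(Kx Ky)/n
term_form = np.trace(Kx @ Ky) / n**2 + Kx.mean() * Ky.mean() - 2 * (Kx @ Ky).mean() / n
# centered matrix form: Tr(Kx J Ky J)/n^2 with J = I - 1/n
J = np.eye(n) - np.ones((n, n)) / n
matrix_form = np.trace(Kx @ J @ Ky @ J) / n**2
print(abs(term_form - matrix_form))   # agrees up to floating-point error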
Here I provide a reference implementation and perform a simple experiment to verify the effectiveness of HSIC. Then, in the next section, we will reflect on potential issues with HSIC.
If the kernel matrices $K_X, K_Y$ are already known, the reference implementation for calculating HSIC is as follows:
import numpy as np

def hsic(Kx, Ky):
    # HSIC estimate from precomputed kernel matrices Kx, Ky
    Kxy = np.dot(Kx, Ky)
    n = Kxy.shape[0]
    # the three expectation terms: Tr(Kx Ky)/n^2 + mean(Kx)*mean(Ky) - 2*mean(Kx Ky)/n
    h = np.trace(Kxy) / n**2 + np.mean(Kx) * np.mean(Ky) - 2 * np.mean(Kxy) / n
    # rescale the leading 1/n^2 to 1/(n-1)^2
    return h * n**2 / (n - 1)**2
Note that this implementation is based on the meaning of each term in \eqref{eq:hsic}, rather than the matrix form in \eqref{eq:hsic-m}. In fact, the matrix form in \eqref{eq:hsic-m} is not efficient because it involves three matrix multiplications.
Below is a simple experiment to verify the effectiveness of HSIC:
# Generate two sets of independent random variables
x = np.random.randn(1000)
y = np.random.randn(1000)
Kx = np.expand_dims(x, 0) - np.expand_dims(x, 1)
Kx = np.exp(- Kx**2) # Calculate kernel matrix
Ky = np.expand_dims(y, 0) - np.expand_dims(y, 1)
Ky = np.exp(- Ky**2) # Calculate kernel matrix
print(hsic(Kx, Ky)) # Calculate HSIC
The output result is around 0.0002. If we change $x, y$ to:
x = np.random.randn(1000)
y = x + 0.1 * np.random.randn(1000)
This means $x$ and $y$ have a strong correlation, and the HSIC result reflects this, being approximately 0.096—more than two orders of magnitude larger than 0.0002. This shows that HSIC is indeed effective. (Note: HSIC output values generally only have meaning in relative comparison; their absolute values don't have a distinct intrinsic meaning.)
Clearly, the HSIC calculation result given by \eqref{eq:hsic} depends on the choice of the kernel function. Regardless of which kernel function is used, theoretically it can be guaranteed that: \begin{equation}HSIC(X,Y)=0 \Leftrightarrow p(x,y)\equiv p(x)p(y)\end{equation} But the question is, when $HSIC(X,Y) > 0$, exactly how correlated are $X$ and $Y$?
This depends heavily on the choice of kernel function and the background of the original problem. From the form of the common kernel function \eqref{eq:gk}, we can roughly sense that the kernel function is equivalent to the similarity between two samples. The problem is what kind of similarity definition truly fits the context of the problem; there is no standard answer, and this is usually a very difficult problem.
For example, suppose $x_1, x_2, x_3$ represent three images. We know that $\Vert x_1 - x_2\Vert_2 = 0$ means the two images $x_1, x_2$ are exactly the same. But when $\Vert x_1 - x_2\Vert_2$ and $\Vert x_1 - x_3\Vert_2$ are both non-zero, we cannot say that image $x_2$ "looks" more like $x_1$ than $x_3$ just because $\Vert x_1 - x_2\Vert_2 < \Vert x_1 - x_3\Vert_2$, because the $\Vert\cdot\Vert_2$ norm is not a perfect metric for our visual perception.
In fact, I believe this is a common drawback of all kernel methods: kernel methods can only guarantee that when a certain indicator is 0, it matches our ideal pursuit, but when the indicator is not 0, it cannot perfectly measure the distance between us and the ideal. A good metric should be carefully designed according to the specific problem, or automatically learned from the data set through methods like GANs.
Of course, this is not to say that HSIC has no value. Its value lies in serving as an auxiliary objective to optimize, much as when we train an image autoencoder: even if we adopt a GAN-style objective, we usually still keep the MSE between the original and reconstructed images as an auxiliary loss.
In summary, this article has introduced the concept of HSIC in a relatively intuitive and plain manner. The presentation involved some mathematics but omitted the rigorous definitions and proofs, keeping only the core ideas. I believe this treatment makes HSIC easier for most readers to absorb; readers who want full rigor will, I hope, bear with these omissions.