By 苏剑林 | July 08, 2021
The normal distribution is one of the most common continuous probability distributions. It is the maximum entropy distribution for a given mean and covariance (refer to "Entropy: From Entropy, Maximum Entropy Principle to Maximum Entropy Models (Part II)"), and it can also be viewed as a second-order approximation of any continuous distribution, playing a role analogous to that of the linear approximation for general functions. From this perspective, the normal distribution can be considered the simplest continuous distribution, and precisely because of this simplicity, many quantities of interest admit closed-form solutions.
This article primarily calculates several metrics between two multivariate normal distributions, including KL divergence, Bhattacharyya distance, and Wasserstein distance, all of which have explicit analytical solutions.
Here we briefly review some basic knowledge of the normal distribution. Note that this is only a review and is not intended as an introductory tutorial on the subject.
The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution defined on $\mathbb{R}^n$. Its probability density function is:
\begin{equation}p(\boldsymbol{x})=\frac{1}{\sqrt{(2\pi)^n \det(\boldsymbol{\Sigma})}}\exp\left\{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right\}\end{equation}Here $\boldsymbol{x},\boldsymbol{\mu}\in\mathbb{R}^n$, where $\boldsymbol{\mu}$ is the mean vector (vectors in this article are column vectors by default), and $\boldsymbol{\Sigma}\in\mathbb{R}^{n\times n}$ is the covariance matrix, which must be symmetric and positive definite. As can be seen, a normal distribution is uniquely determined by $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$; therefore, its statistics are functions of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. When $\boldsymbol{\mu}=\boldsymbol{0}, \boldsymbol{\Sigma}=\boldsymbol{I}$, the corresponding distribution is called the "standard normal distribution."
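As a quick numerical illustration, here is a minimal numpy/scipy sketch (the dimension, mean and covariance below are arbitrary choices of ours) that evaluates the density formula above directly and compares it against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n = 3
mu = rng.standard_normal(n)
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)        # a symmetric positive definite covariance
x = rng.standard_normal(n)

# density from the formula above
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
p_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

# density from scipy, for comparison
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(np.isclose(p_manual, p_scipy))   # True
```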
The most basic statistics are the mean and the covariance, which are exactly the two parameters of the normal distribution:
\begin{equation}\begin{aligned} \mathbb{E}_{\boldsymbol{x}}\left[\boldsymbol{x}\right]=&\int p(\boldsymbol{x}) \boldsymbol{x} d\boldsymbol{x}=\boldsymbol{\mu}\\ \mathbb{E}_{\boldsymbol{x}}\left[(\boldsymbol{x}-\boldsymbol{\mu})(\boldsymbol{x}-\boldsymbol{\mu})^{\top}\right]=&\int p(\boldsymbol{x}) (\boldsymbol{x}-\boldsymbol{\mu})(\boldsymbol{x}-\boldsymbol{\mu})^{\top} d\boldsymbol{x}=\boldsymbol{\Sigma}\\ \end{aligned}\end{equation}From the above, the result for the second-order moment can be derived:
\begin{equation} \mathbb{E}_{\boldsymbol{x}}\left[\boldsymbol{x}\boldsymbol{x}^{\top}\right]=\boldsymbol{\mu}\boldsymbol{\mu}^{\top} + \mathbb{E}_{\boldsymbol{x}}\left[(\boldsymbol{x}-\boldsymbol{\mu})(\boldsymbol{x}-\boldsymbol{\mu})^{\top}\right]=\boldsymbol{\mu}\boldsymbol{\mu}^{\top} + \boldsymbol{\Sigma}\end{equation}Another commonly used statistic is its entropy:
\begin{equation}\mathcal{H} = \mathbb{E}_{\boldsymbol{x}}\left[-\log p(\boldsymbol{x})\right]=\frac{n}{2}(1 + \log 2\pi) + \frac{1}{2}\log \det(\boldsymbol{\Sigma}) \end{equation}Its calculation is analogous to the derivation of the KL divergence given later.
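This closed form is easy to verify numerically; below is a small sketch (with an arbitrary positive definite Sigma) that compares it with the entropy reported by scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
Sigma = A @ A.T + np.eye(n)

# closed form: n/2 * (1 + log(2*pi)) + 1/2 * log det(Sigma)
H_formula = 0.5 * n * (1 + np.log(2 * np.pi)) + 0.5 * np.linalg.slogdet(Sigma)[1]
H_scipy = multivariate_normal(mean=np.zeros(n), cov=Sigma).entropy()
print(np.isclose(H_formula, H_scipy))  # True
```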
The probability density function implies that $\int p(\boldsymbol{x}) d\boldsymbol{x} = 1$, which leads to:
\begin{equation}\begin{aligned} \sqrt{(2\pi)^n \det(\boldsymbol{\Sigma})} =& \int\exp\left\{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right\}d\boldsymbol{x} \\ =& \int\exp\left\{-\frac{1}{2}\boldsymbol{x}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{x}+\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{x}-\frac{1}{2}\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right\}d\boldsymbol{x} \end{aligned}\end{equation}Let $\boldsymbol{\omega} = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$, then we obtain the Gaussian integral:
\begin{equation} \int\exp\left\{-\frac{1}{2}\boldsymbol{x}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{x}+\boldsymbol{\omega}^{\top}\boldsymbol{x}\right\}d\boldsymbol{x} = \sqrt{(2\pi)^n \det(\boldsymbol{\Sigma})}\exp\left\{\frac{1}{2}\boldsymbol{\omega}^{\top}\boldsymbol{\Sigma}\boldsymbol{\omega}\right\}\label{eq:g-int} \end{equation}Using this, we can compute the moment-generating function of the normal distribution:
\begin{equation}\mathbb{E}_{\boldsymbol{x}}\left[\exp\left(\boldsymbol{\omega}^{\top}\boldsymbol{x}\right)\right]=\exp\left(\boldsymbol{\omega}^{\top}\boldsymbol{\mu}+\frac{1}{2}\boldsymbol{\omega}^{\top}\boldsymbol{\Sigma}\boldsymbol{\omega}\right)\end{equation}The moment-generating function can be used to obtain all moments of the normal distribution.
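As a sanity check, the identity above can be verified by Monte Carlo. The sketch below (with a small, arbitrarily chosen $\boldsymbol{\omega}$ so that the sample average is stable) compares the empirical mean of $\exp(\boldsymbol{\omega}^{\top}\boldsymbol{x})$ with the closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2
mu = np.array([0.3, -0.1])
A = rng.standard_normal((n, n))
Sigma = A @ A.T + np.eye(n)
omega = np.array([0.2, 0.5])

x = rng.multivariate_normal(mu, Sigma, size=1_000_000)
mc = np.mean(np.exp(x @ omega))                            # sample average of exp(omega^T x)
closed = np.exp(omega @ mu + 0.5 * omega @ Sigma @ omega)  # closed form above
print(mc, closed)                                          # the two agree closely
```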
Here we supplement some basic linear algebra concepts that will be frequently used in the subsequent derivations. Similarly, this is only a "review."
First, we define the inner product and norm. For vectors $\boldsymbol{x}=(x_1,\dots,x_n)$ and $\boldsymbol{y}=(y_1,\dots,y_n)$, the inner product is defined as:
\begin{equation}\langle\boldsymbol{x},\boldsymbol{y}\rangle = \sum_{i=1}^n x_i y_i\end{equation}And the magnitude (length) is defined as $\|\boldsymbol{x}\| = \sqrt{\langle\boldsymbol{x},\boldsymbol{x}\rangle}$. For $m\times n$ matrices $\boldsymbol{A}=(a_{i,j})$ and $\boldsymbol{B}=(b_{i,j})$, we define similarly:
\begin{equation}\langle\boldsymbol{A},\boldsymbol{B}\rangle_F = \sum_{i=1}^m\sum_{j=1}^n a_{i,j} b_{i,j}\end{equation}This is called the Frobenius inner product, and the corresponding $\|\boldsymbol{A}\|_F = \sqrt{\langle\boldsymbol{A},\boldsymbol{A}\rangle_F}$ is called the Frobenius norm. It is easy to see that the Frobenius inner product and norm are essentially equivalent to flattening the matrix into a vector and performing standard vector operations.
Regarding the Frobenius inner product, one of the most critical properties is the identity:
\begin{equation}\langle\boldsymbol{A},\boldsymbol{B}\rangle_F = \text{Tr}\left(\boldsymbol{A}^{\top}\boldsymbol{B}\right) = \text{Tr}\left(\boldsymbol{B}\boldsymbol{A}^{\top}\right) = \text{Tr}\left(\boldsymbol{A}\boldsymbol{B}^{\top}\right) = \text{Tr}\left(\boldsymbol{B}^{\top}\boldsymbol{A}\right) \end{equation}That is, the Frobenius inner product of matrices can be converted into the trace of matrix multiplication, and swapping the order of multiplication does not change the result of the trace (while it does change the resulting matrix of the multiplication itself).
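A two-line numerical check of this identity (a sketch with arbitrary rectangular matrices):

```python
import numpy as np

rng = np.random.default_rng(3)
A, B = rng.standard_normal((4, 5)), rng.standard_normal((4, 5))

frob = np.sum(A * B)                   # <A, B>_F as an elementwise sum
print(np.allclose(frob, np.trace(A.T @ B)),
      np.allclose(frob, np.trace(B @ A.T)),
      np.allclose(frob, np.trace(A @ B.T)))   # True True True
```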
Next, let's look at the properties of symmetric positive definite matrices. If $\boldsymbol{\Sigma}$ is a symmetric positive definite matrix, "symmetric" means $\boldsymbol{\Sigma}^{\top}=\boldsymbol{\Sigma}$, and "positive definite" means that for any non-zero vector $\boldsymbol{\xi}\in\mathbb{R}^n$, $\boldsymbol{\xi}^{\top}\boldsymbol{\Sigma}\boldsymbol{\xi} > 0$. It can be proven that if $\boldsymbol{\Sigma}_1,\boldsymbol{\Sigma}_2$ are both symmetric positive definite matrices, then $\boldsymbol{\Sigma}_1^{-1},\boldsymbol{\Sigma}_2^{-1},\boldsymbol{\Sigma}_1+\boldsymbol{\Sigma}_2$ are also symmetric positive definite. If $\boldsymbol{C} = \boldsymbol{B}^{\top}\boldsymbol{A}\boldsymbol{B}$ and $\boldsymbol{B}$ is invertible, then $\boldsymbol{C}$ is symmetric positive definite if and only if $\boldsymbol{A}$ is symmetric positive definite.
There is also the concept of positive semi-definiteness, meaning that for any non-zero vector $\boldsymbol{\xi}\in\mathbb{R}^n$, $\boldsymbol{\xi}^{\top}\boldsymbol{\Sigma}\boldsymbol{\xi} \geq 0$. However, considering that positive definite matrices are dense within positive semi-definite matrices, we do not strictly distinguish between them here and treat them uniformly as positive definite matrices.
Symmetric positive definite matrices have an important property: their SVD decomposition is identical to their eigenvalue decomposition, having the following form:
\begin{equation}\boldsymbol{\Sigma} = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^{\top}\end{equation}where $\boldsymbol{U}$ is an orthogonal matrix and $\boldsymbol{\Lambda}$ is a diagonal matrix with positive diagonal entries. A direct corollary is that a symmetric positive definite matrix admits a "square root" $\boldsymbol{\Sigma}^{1/2} = \boldsymbol{U}\boldsymbol{\Lambda}^{1/2}\boldsymbol{U}^{\top}$, where $\boldsymbol{\Lambda}^{1/2}$ takes the square root of each diagonal entry. One can verify that this square root is itself symmetric positive definite. Conversely, a symmetric matrix that admits such a square root must be symmetric positive definite.
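In code, this eigendecomposition-based square root can be written as follows (a sketch; for symmetric positive definite input, scipy.linalg.sqrtm gives the same result):

```python
import numpy as np

def spd_sqrt(S):
    """Square root of a symmetric positive (semi-)definite matrix via eigendecomposition."""
    lam, U = np.linalg.eigh(S)                           # S = U diag(lam) U^T
    return U @ np.diag(np.sqrt(np.clip(lam, 0, None))) @ U.T

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + np.eye(5)
R = spd_sqrt(Sigma)
print(np.allclose(R @ R, Sigma), np.allclose(R, R.T))    # True True
```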
Finally, in calculating the Wasserstein distance, some matrix calculus formulas are required. Readers who are not familiar can refer to Wikipedia's "Matrix Calculus". The main formula used is:
\begin{equation}\frac{\partial\,\text{Tr}\left(\boldsymbol{X}\boldsymbol{A}\right)}{\partial \boldsymbol{X}} = \boldsymbol{A}\end{equation}Other formulas can be derived using trace identities, such as:
\begin{equation}\frac{\partial\,\text{Tr}\left(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\right)}{\partial \boldsymbol{X}} = \frac{\partial\,\text{Tr}\left(\boldsymbol{X}\boldsymbol{B}\boldsymbol{A}\right)}{\partial \boldsymbol{X}} = \boldsymbol{B}\boldsymbol{A}\end{equation}As our first objective, we calculate the KL divergence (Kullback-Leibler divergence) between two Gaussian distributions. KL divergence is one of the most commonly used distribution metrics because it involves a logarithm before integration, which usually yields relatively simple results for exponential family distributions. Additionally, it is closely related to "entropy."
The KL divergence between two probability distributions is defined as:
\begin{equation}KL(p(\boldsymbol{x})\Vert q(\boldsymbol{x}))=\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[\log \frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}\right]=\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[\log p(\boldsymbol{x})\right]+\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[-\log q(\boldsymbol{x})\right]\end{equation}For two normal distributions, the calculated result is:
\begin{equation} KL(p(\boldsymbol{x})\Vert q(\boldsymbol{x}))=\frac{1}{2}\left[(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)-\log \det(\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Sigma}_p) + \text{Tr}\left(\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Sigma}_p\right) - n\right] \end{equation}In particular, when $q$ is the standard normal distribution, the result simplifies to:
\begin{equation} KL(p(\boldsymbol{x})\Vert q(\boldsymbol{x}))=\frac{1}{2}\left[\|\boldsymbol{\mu}_p\|^2-\log \det(\boldsymbol{\Sigma}_p) + \text{Tr}(\boldsymbol{\Sigma}_p) - n\right]\end{equation}From the definition of KL divergence, we mainly need to compute $\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[-\log q(\boldsymbol{x})\right]$:
\begin{equation}\begin{aligned} \mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[-\log q(\boldsymbol{x})\right] =&\, \mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[\frac{n}{2}\log 2\pi + \frac{1}{2}\log \det(\boldsymbol{\Sigma}_q) + \frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_q)\right]\\ =&\,\frac{n}{2}\log 2\pi + \frac{1}{2}\log \det(\boldsymbol{\Sigma}_q) + \frac{1}{2}\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[(\boldsymbol{x}-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_q)\right] \end{aligned}\end{equation}Now, the trace identities come into play:
\begin{equation}\begin{aligned} \mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[(\boldsymbol{x}-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_q)\right]=&\,\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[\text{Tr}\left((\boldsymbol{x}-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_q)\right)\right]\\ =&\,\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[\text{Tr}\left(\boldsymbol{\Sigma}_q^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_q)(\boldsymbol{x}-\boldsymbol{\mu}_q)^{\top}\right)\right]\\ =&\,\text{Tr}\left(\boldsymbol{\Sigma}_q^{-1}\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[(\boldsymbol{x}-\boldsymbol{\mu}_q)(\boldsymbol{x}-\boldsymbol{\mu}_q)^{\top}\right]\right)\\ =&\,\text{Tr}\left(\boldsymbol{\Sigma}_q^{-1}\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[\boldsymbol{x}\boldsymbol{x}^{\top}-\boldsymbol{\mu}_q\boldsymbol{x}^{\top} - \boldsymbol{x}\boldsymbol{\mu}_q^{\top} + \boldsymbol{\mu}_q\boldsymbol{\mu}_q^{\top}\right]\right)\\ =&\,\text{Tr}\left(\boldsymbol{\Sigma}_q^{-1}\left(\boldsymbol{\Sigma}_p + \boldsymbol{\mu}_p\boldsymbol{\mu}_p^{\top}-\boldsymbol{\mu}_q\boldsymbol{\mu}_p^{\top} - \boldsymbol{\mu}_p\boldsymbol{\mu}_q^{\top} + \boldsymbol{\mu}_q\boldsymbol{\mu}_q^{\top}\right)\right)\\ =&\,\text{Tr}\left(\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Sigma}_p + \boldsymbol{\Sigma}_q^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^{\top}\right)\\ =&\,\text{Tr}\left(\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Sigma}_p\right) + (\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)\\ \end{aligned}\end{equation}Note that when $\boldsymbol{\mu}_q=\boldsymbol{\mu}_p$ and $\boldsymbol{\Sigma}_q=\boldsymbol{\Sigma}_p$, the above equation equals $n$, which corresponds to the entropy of the normal distribution. Thus, the final result is:
\begin{equation}\begin{aligned} KL(p(\boldsymbol{x})\Vert q(\boldsymbol{x}))=&\,\frac{1}{2}\left[n\log 2\pi + \log \det(\boldsymbol{\Sigma}_q) + \text{Tr}\left(\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Sigma}_p\right) + (\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)\right] \\ &\,\qquad- \frac{1}{2}\left[n\log 2\pi + \log \det(\boldsymbol{\Sigma}_p) + n\right]\\ =&\,\frac{1}{2}\left[(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)-\log \det(\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Sigma}_p) + \text{Tr}\left(\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Sigma}_p\right) - n\right] \end{aligned}\end{equation}
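The closed form just derived is straightforward to implement. Below is a minimal numpy/scipy sketch (the helper name gaussian_kl and the random test parameters are ours) that also checks the result against a Monte Carlo estimate of $\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}\left[\log p(\boldsymbol{x}) - \log q(\boldsymbol{x})\right]$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_kl(mu_p, Sigma_p, mu_q, Sigma_q):
    """KL(N(mu_p, Sigma_p) || N(mu_q, Sigma_q)) via the closed form above."""
    n = len(mu_p)
    Sq_inv = np.linalg.inv(Sigma_q)
    d = mu_p - mu_q
    logdet = np.linalg.slogdet(Sigma_q)[1] - np.linalg.slogdet(Sigma_p)[1]
    return 0.5 * (d @ Sq_inv @ d + logdet + np.trace(Sq_inv @ Sigma_p) - n)

rng = np.random.default_rng(5)
n = 3
mu_p, mu_q = rng.standard_normal(n), rng.standard_normal(n)
Ap, Aq = rng.standard_normal((n, n)), rng.standard_normal((n, n))
Sigma_p, Sigma_q = Ap @ Ap.T + np.eye(n), Aq @ Aq.T + np.eye(n)

p = multivariate_normal(mu_p, Sigma_p)
q = multivariate_normal(mu_q, Sigma_q)
x = p.rvs(size=500_000, random_state=0)
print(gaussian_kl(mu_p, Sigma_p, mu_q, Sigma_q),   # closed form
      np.mean(p.logpdf(x) - q.logpdf(x)))          # Monte Carlo estimate, close to it
```

Next, let's examine the Bhattacharyya distance (BD), defined as: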
\begin{equation}BD(p(\boldsymbol{x}), q(\boldsymbol{x})) = -\log \int \sqrt{p(\boldsymbol{x}) q(\boldsymbol{x})} d\boldsymbol{x}\end{equation}A related concept is the "Hellinger distance", whose squared value is defined as $\frac{1}{2}\int \left(\sqrt{p(\boldsymbol{x})} - \sqrt{q(\boldsymbol{x})}\right)^2 d\boldsymbol{x}$. Expanding the square shows that it equals $1 - \int\sqrt{p(\boldsymbol{x})q(\boldsymbol{x})}\,d\boldsymbol{x}$, a monotonic function of the same integral, so it is essentially equivalent to the Bhattacharyya distance.
For two normal distributions, their Bhattacharyya distance is:
\begin{equation} BD(p(\boldsymbol{x}), q(\boldsymbol{x})) = \frac{1}{2}\log \frac{\det(\boldsymbol{\Sigma})}{\sqrt{\det(\boldsymbol{\Sigma}_p\boldsymbol{\Sigma}_q)}} + \frac{1}{8}(\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_p - \boldsymbol{\mu}_q) \end{equation}Here $\boldsymbol{\Sigma}=\frac{1}{2}(\boldsymbol{\Sigma}_p + \boldsymbol{\Sigma}_q)$. Note that the result is symmetric, as the definition of the Bhattacharyya distance itself is symmetric.
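This formula is also only a few lines of code. The sketch below (the helper name gaussian_bd and the test parameters are ours) checks it against a Monte Carlo estimate of the Bhattacharyya coefficient $\int\sqrt{p(\boldsymbol{x})q(\boldsymbol{x})}\,d\boldsymbol{x}$, sampling from the mixture $\frac{1}{2}(p+q)$ so that the integrand stays bounded:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_bd(mu_p, Sigma_p, mu_q, Sigma_q):
    """Bhattacharyya distance between two Gaussians via the closed form above."""
    Sigma = 0.5 * (Sigma_p + Sigma_q)
    d = mu_p - mu_q
    half_logdets = 0.5 * (np.linalg.slogdet(Sigma_p)[1] + np.linalg.slogdet(Sigma_q)[1])
    return (0.5 * (np.linalg.slogdet(Sigma)[1] - half_logdets)
            + 0.125 * d @ np.linalg.inv(Sigma) @ d)

rng = np.random.default_rng(6)
n = 3
mu_p, mu_q = rng.standard_normal(n), rng.standard_normal(n)
Ap, Aq = rng.standard_normal((n, n)), rng.standard_normal((n, n))
Sigma_p, Sigma_q = Ap @ Ap.T + np.eye(n), Aq @ Aq.T + np.eye(n)
p = multivariate_normal(mu_p, Sigma_p)
q = multivariate_normal(mu_q, Sigma_q)

# Monte Carlo estimate of int sqrt(pq) dx, sampling from the mixture (p+q)/2 so that
# the integrand sqrt(pq) / mixture is bounded by 1 and the estimate is stable
x = np.vstack([p.rvs(size=250_000, random_state=0), q.rvs(size=250_000, random_state=1)])
lp, lq = p.logpdf(x), q.logpdf(x)
bc = np.mean(np.exp(0.5 * (lp + lq)) / (0.5 * (np.exp(lp) + np.exp(lq))))
print(gaussian_bd(mu_p, Sigma_p, mu_q, Sigma_q), -np.log(bc))   # the two agree closely
```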
When one of the distributions is the standard normal distribution, the result does not simplify significantly, so it is not listed separately here.
By definition, the Bhattacharyya distance between two normal distributions is the negative logarithm of the following integral:
\begin{equation}\begin{aligned} &\qquad\int \sqrt{p(\boldsymbol{x}) q(\boldsymbol{x})} d\boldsymbol{x}=\frac{1}{\sqrt[4]{(2\pi)^{2n}\det(\boldsymbol{\Sigma}_p\boldsymbol{\Sigma}_q)}}\times \\ &\int \exp\left\{-\frac{1}{4}(\boldsymbol{x}-\boldsymbol{\mu}_p)^{\top}\boldsymbol{\Sigma}_p^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_p)-\frac{1}{4}(\boldsymbol{x}-\boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}_q^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_q)\right\}d\boldsymbol{x} \end{aligned}\end{equation}Let $\boldsymbol{y}=\boldsymbol{x}-\boldsymbol{\mu}_p$ and $\boldsymbol{\Delta}=\boldsymbol{\mu}_p - \boldsymbol{\mu}_q$. Under this change of variables, the integral becomes:
\begin{equation}\begin{aligned} &\int \exp\left\{-\frac{1}{4}\boldsymbol{y}^{\top}\boldsymbol{\Sigma}_p^{-1}\boldsymbol{y}-\frac{1}{4}(\boldsymbol{y}+\boldsymbol{\Delta})^{\top}\boldsymbol{\Sigma}_q^{-1}(\boldsymbol{y}+\boldsymbol{\Delta})\right\}d\boldsymbol{y}\\ =&\int \exp\left\{-\frac{1}{4}\boldsymbol{y}^{\top}\left(\boldsymbol{\Sigma}_p^{-1}+\boldsymbol{\Sigma}_q^{-1}\right)\boldsymbol{y}-\frac{1}{2}\boldsymbol{\Delta}^{\top}\boldsymbol{\Sigma}_q^{-1}\boldsymbol{y} - \frac{1}{4}\boldsymbol{\Delta}^{\top}\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Delta}\right\}d\boldsymbol{y}\\ =&\int \exp\left\{-\frac{1}{2}\boldsymbol{y}^{\top}\left(\boldsymbol{\Sigma}_p^{-1}\boldsymbol{\Sigma}\boldsymbol{\Sigma}_q^{-1}\right)\boldsymbol{y}-\frac{1}{2}\boldsymbol{\Delta}^{\top}\boldsymbol{\Sigma}_q^{-1}\boldsymbol{y} - \frac{1}{4}\boldsymbol{\Delta}^{\top}\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Delta}\right\}d\boldsymbol{y}\end{aligned}\end{equation}where $\boldsymbol{\Sigma}=\frac{1}{2}(\boldsymbol{\Sigma}_p + \boldsymbol{\Sigma}_q)$. According to the Gaussian integral formula $\eqref{eq:g-int}$, the integral result is:
\begin{equation}\begin{aligned} &\,\sqrt{(2\pi)^n \det(\boldsymbol{\Sigma}_p^{-1}\boldsymbol{\Sigma}\boldsymbol{\Sigma}_q^{-1})^{-1}}\exp\left\{\frac{1}{8}\left(\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Delta}\right)^{\top}\left(\boldsymbol{\Sigma}_p^{-1}\boldsymbol{\Sigma}\boldsymbol{\Sigma}_q^{-1}\right)^{-1}\left(\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Delta}\right)-\frac{1}{4}\boldsymbol{\Delta}^{\top}\boldsymbol{\Sigma}_q^{-1}\boldsymbol{\Delta}\right\}\\ =&\,\sqrt{(2\pi)^n \det(\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}_p)}\exp\left\{\frac{1}{8}\boldsymbol{\Delta}^{\top}\left(\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}_p\boldsymbol{\Sigma}_q^{-1} - 2\boldsymbol{\Sigma}_q^{-1}\right)\boldsymbol{\Delta}\right\}\\ =&\,\sqrt{(2\pi)^n \det(\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}_p)}\exp\left\{\frac{1}{8}\boldsymbol{\Delta}^{\top}\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{\Sigma}_p\boldsymbol{\Sigma}_q^{-1} - 2\boldsymbol{\Sigma}\boldsymbol{\Sigma}_q^{-1}\right)\boldsymbol{\Delta}\right\}\\ =&\,\sqrt{(2\pi)^n \det(\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}_p)}\exp\left\{-\frac{1}{8}\boldsymbol{\Delta}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Delta}\right\} \end{aligned}\end{equation}Thus, finally:
\begin{equation}\begin{aligned} BD(p(\boldsymbol{x}), q(\boldsymbol{x})) =&\, -\log \left[\frac{\sqrt{(2\pi)^n \det(\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}_p)}}{\sqrt[4]{(2\pi)^{2n}\det(\boldsymbol{\Sigma}_p\boldsymbol{\Sigma}_q)}}\exp\left\{-\frac{1}{8}\boldsymbol{\Delta}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Delta}\right\}\right] \\ =&\, -\log \left[\frac{\sqrt[4]{\det(\boldsymbol{\Sigma}_p\boldsymbol{\Sigma}_q)}}{\sqrt{\det\left(\boldsymbol{\Sigma}\right)}}\exp\left\{-\frac{1}{8}\boldsymbol{\Delta}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Delta}\right\}\right]\\ =&\,\frac{1}{2}\log \frac{\det(\boldsymbol{\Sigma})}{\sqrt{\det(\boldsymbol{\Sigma}_p\boldsymbol{\Sigma}_q)}} + \frac{1}{8}(\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_p - \boldsymbol{\mu}_q) \end{aligned} \end{equation}If readers wish to explore more about probability divergence, they can refer to the book "Statistical Inference Based on Divergence Measures". Now we turn to another class of probability metrics—the Wasserstein distance based on optimal transport.
Following the notation in "From Wasserstein Distance, Duality Theory to WGAN", the Wasserstein distance is defined as:
\begin{equation}\begin{aligned} \mathcal{W}_{\rho}[p,q]=&\,\left(\inf_{\gamma\in \Pi[p,q]} \iint \gamma(\boldsymbol{x},\boldsymbol{y}) \|\boldsymbol{x} - \boldsymbol{y}\|^{\rho} d\boldsymbol{x}d\boldsymbol{y}\right)^{1/\rho}\\ =&\,\left(\inf_{\gamma\in \Pi[p,q]} \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\gamma(\boldsymbol{x},\boldsymbol{y})} \left[\|\boldsymbol{x} - \boldsymbol{y}\|^{\rho}\right]\right)^{1/\rho} \end{aligned}\end{equation}According to the references found by the author, an analytical solution for the Wasserstein distance between two normal distributions is currently known only for $\rho=2$. Below we solve for $\mathcal{W}_2[p,q]$, and for simplicity, let:
\begin{equation}\mathcal{W}_2^2[p,q] = \left(\mathcal{W}_2[p,q]\right)^2\end{equation}Interestingly, two different versions of the result for the Wasserstein distance between two normal distributions are in circulation. Both are fairly widely cited, yet no reference seems to state explicitly that they are equivalent. The two versions come from different papers and go by different names.
The more widely circulated version, which is what one typically finds when searching for "Wasserstein distance of normal distributions", is:
\begin{equation}\mathcal{W}_2^2[p,q]=\|\boldsymbol{\mu}_p - \boldsymbol{\mu}_q\|^2 + \text{Tr}(\boldsymbol{\Sigma}_p) + \text{Tr}(\boldsymbol{\Sigma}_q) - 2\text{Tr}((\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2})^{1/2})\label{eq:w-v1}\end{equation}Some readers might be confused as to why this doesn't look symmetric with respect to $p$ and $q$. In fact, it is symmetric because:
\begin{equation}\begin{aligned}\text{Tr}((\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2})^{1/2})=&\,\text{Tr}((\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2})^{1/2}\boldsymbol{\Sigma}_p^{-1/2}\boldsymbol{\Sigma}_q^{-1/2}\boldsymbol{\Sigma}_q^{1/2}\boldsymbol{\Sigma}_p^{1/2})\\ =&\,\text{Tr}(\boldsymbol{\Sigma}_q^{1/2}\boldsymbol{\Sigma}_p^{1/2}(\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2})^{1/2}\boldsymbol{\Sigma}_p^{-1/2}\boldsymbol{\Sigma}_q^{-1/2}) \end{aligned}\end{equation}Then we can directly verify that $(\boldsymbol{\Sigma}_q^{1/2}\boldsymbol{\Sigma}_p^{1/2}(\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2})^{1/2}\boldsymbol{\Sigma}_p^{-1/2}\boldsymbol{\Sigma}_q^{-1/2})^2=\boldsymbol{\Sigma}_q^{1/2}\boldsymbol{\Sigma}_p\boldsymbol{\Sigma}_q^{1/2}$, so $\text{Tr}((\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2})^{1/2})=\text{Tr}((\boldsymbol{\Sigma}_q^{1/2}\boldsymbol{\Sigma}_p\boldsymbol{\Sigma}_q^{1/2})^{1/2})$.
The second version looks slightly simpler:
\begin{equation}\mathcal{W}_2^2[p,q]=\|\boldsymbol{\mu}_p - \boldsymbol{\mu}_q\|^2 + \text{Tr}(\boldsymbol{\Sigma}_p) + \text{Tr}(\boldsymbol{\Sigma}_q) - 2\text{Tr}((\boldsymbol{\Sigma}_p\boldsymbol{\Sigma}_q)^{1/2})\label{eq:w-v2}\end{equation}This version is commonly called the "Fréchet distance"; typically one only finds this form by searching for "Fréchet distance of normal distributions". The FID (Fréchet Inception Distance) metric widely used to evaluate GANs is computed with this formula. Its symmetry can be proven in the same way as the previous proof, or read off directly from the equivalence discussion below.
Logically, Version 2 is more concise and should be the standard. Why both versions still circulate simultaneously is quite baffling. Theoretically, proving the equivalence is not difficult. Based on trace identities, we have:
\begin{equation}\begin{aligned}\text{Tr}((\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2})^{1/2})=&\,\text{Tr}((\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2})^{1/2}\boldsymbol{\Sigma}_p^{-1/2}\boldsymbol{\Sigma}_p^{1/2})\\ =&\,\text{Tr}(\boldsymbol{\Sigma}_p^{1/2}(\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2})^{1/2}\boldsymbol{\Sigma}_p^{-1/2}) \end{aligned}\end{equation}Then just verify directly that $(\boldsymbol{\Sigma}_p^{1/2}(\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2})^{1/2}\boldsymbol{\Sigma}_p^{-1/2})^2=\boldsymbol{\Sigma}_p \boldsymbol{\Sigma}_q$.
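Both versions, and their equivalence, are easy to check numerically. Here is a small sketch using scipy.linalg.sqrtm (the helper names and test matrices are ours):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared_v1(mu_p, Sigma_p, mu_q, Sigma_q):
    """Version 1: uses (Sigma_p^{1/2} Sigma_q Sigma_p^{1/2})^{1/2}."""
    root_p = sqrtm(Sigma_p).real
    cross = sqrtm(root_p @ Sigma_q @ root_p).real
    return np.sum((mu_p - mu_q) ** 2) + np.trace(Sigma_p + Sigma_q - 2 * cross)

def w2_squared_v2(mu_p, Sigma_p, mu_q, Sigma_q):
    """Version 2 (the 'Fréchet distance' / FID form): uses (Sigma_p Sigma_q)^{1/2}."""
    cross = sqrtm(Sigma_p @ Sigma_q).real   # sqrtm may return a negligible imaginary part
    return np.sum((mu_p - mu_q) ** 2) + np.trace(Sigma_p + Sigma_q - 2 * cross)

rng = np.random.default_rng(7)
n = 4
mu_p, mu_q = rng.standard_normal(n), rng.standard_normal(n)
Ap, Aq = rng.standard_normal((n, n)), rng.standard_normal((n, n))
Sigma_p, Sigma_q = Ap @ Ap.T + np.eye(n), Aq @ Aq.T + np.eye(n)

print(w2_squared_v1(mu_p, Sigma_p, mu_q, Sigma_q),
      w2_squared_v2(mu_p, Sigma_p, mu_q, Sigma_q))   # equal up to numerical error
```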
In particular, if $\boldsymbol{\Sigma}_p$ and $\boldsymbol{\Sigma}_q$ commute, the formula simplifies into a very intuitive form:
\begin{equation}\mathcal{W}_2^2[p,q]=\|\boldsymbol{\mu}_p - \boldsymbol{\mu}_q\|^2 + \|\boldsymbol{\Sigma}_p^{1/2} - \boldsymbol{\Sigma}_q^{1/2}\|_F^2\label{eq:w-jiaohuan}\end{equation}Why is this intuitive? Since the parameters of the normal distributions are $\boldsymbol{\mu},\boldsymbol{\Sigma}$, comparing their difference is essentially comparing the differences in $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. According to machine learning conventions, a natural metric would be the squared error:
\begin{equation}\|\boldsymbol{\mu}_p - \boldsymbol{\mu}_q\|^2 + \|\boldsymbol{\Sigma}_p - \boldsymbol{\Sigma}_q\|_F^2\end{equation}However, from a physical perspective, this metric is inappropriate. If $\boldsymbol{\mu}$ is considered to have a dimension of length, then $\boldsymbol{\Sigma}$ has a dimension of length squared. Therefore, $\|\boldsymbol{\mu}_p - \boldsymbol{\mu}_q\|^2$ and $\|\boldsymbol{\Sigma}_p - \boldsymbol{\Sigma}_q\|_F^2$ have different dimensions and cannot be added. To maintain dimensional consistency, the intuitive approach is to "take the square root" of $\boldsymbol{\Sigma}$ before calculating the squared error, leading to equation $\eqref{eq:w-jiaohuan}$.
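A quick numerical sanity check of the commuting case (a sketch with diagonal, hence commuting, covariance matrices):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(8)
n = 5
mu_p, mu_q = rng.standard_normal(n), rng.standard_normal(n)
Sigma_p = np.diag(rng.uniform(0.5, 2.0, n))   # diagonal covariances, so Sigma_p and Sigma_q commute
Sigma_q = np.diag(rng.uniform(0.5, 2.0, n))

general = (np.sum((mu_p - mu_q) ** 2)
           + np.trace(Sigma_p + Sigma_q - 2 * sqrtm(Sigma_p @ Sigma_q).real))
commuting = (np.sum((mu_p - mu_q) ** 2)
             + np.sum((np.sqrt(np.diag(Sigma_p)) - np.sqrt(np.diag(Sigma_q))) ** 2))
print(np.isclose(general, commuting))   # True
```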
Specifically, when $q$ is the standard normal distribution, the result simplifies to:
\begin{equation}\mathcal{W}_2^2[p,q]=\|\boldsymbol{\mu}_p\|^2 + \|\boldsymbol{\Sigma}_p^{1/2} - \boldsymbol{I}\|_F^2\end{equation}We now introduce the first proof, which mainly refers to the paper "A class of Wasserstein metrics for probability distributions". Additionally, "The distance between two random vectors with given dispersion matrices" provides a similar proof.
The following derivation has been simplified by the author. It is easier than the original paper's proof but still involves significant linear algebra. We will introduce it in several parts.
Without loss of generality, we can consider distributions $p,q$ with zero mean. If the means of $p,q$ are non-zero, let the corresponding zero-mean distributions be $\tilde{p},\tilde{q}$. We have:
\begin{equation}\begin{aligned} &\,\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\gamma(\boldsymbol{x},\boldsymbol{y})}\left[\| \boldsymbol{x} - \boldsymbol{y}\|^2\right] \\ =&\, \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\tilde{\gamma}(\boldsymbol{x},\boldsymbol{y})}\left[\| (\boldsymbol{x} + \boldsymbol{\mu}_p) - (\boldsymbol{y} + \boldsymbol{\mu}_q)\|^2 \right]\\ =&\,\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\tilde{\gamma}(\boldsymbol{x},\boldsymbol{y})}\left[\| \boldsymbol{x} - \boldsymbol{y}\|^2 + \| \boldsymbol{\mu}_p - \boldsymbol{\mu}_q\|^2 + 2\langle\boldsymbol{x} - \boldsymbol{y}, \boldsymbol{\mu}_p - \boldsymbol{\mu}_q\rangle\right]\\ =&\,\,\| \boldsymbol{\mu}_p - \boldsymbol{\mu}_q\|^2 + \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\tilde{\gamma}(\boldsymbol{x},\boldsymbol{y})}\left[\| \boldsymbol{x} - \boldsymbol{y}\|^2 \right] \end{aligned}\end{equation}where the cross term vanishes because $\boldsymbol{x}$ and $\boldsymbol{y}$ both have zero mean under $\tilde{\gamma}$. This implies:
\begin{equation}\mathcal{W}_2^2[p,q]=\|\boldsymbol{\mu}_p - \boldsymbol{\mu}_q\|^2 + \mathcal{W}_2^2[\tilde{p},\tilde{q}]\end{equation}Therefore, we only need to calculate the Wasserstein distance for the zero-mean case and then add $\|\boldsymbol{\mu}_p - \boldsymbol{\mu}_q\|^2$.
Assuming the means of $p,q$ are 0, we calculate:
\begin{equation}\begin{aligned} \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\gamma(\boldsymbol{x},\boldsymbol{y})}\left[\|\boldsymbol{x} - \boldsymbol{y}\|^2\right] =&\, \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\gamma(\boldsymbol{x},\boldsymbol{y})}\left[\boldsymbol{x}^{\top} \boldsymbol{x} + \boldsymbol{y}^{\top} \boldsymbol{y} - 2\boldsymbol{y}^{\top} \boldsymbol{x}\right]\\ =&\, \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\gamma(\boldsymbol{x},\boldsymbol{y})}\left[\text{Tr}\left(\boldsymbol{x} \boldsymbol{x}^{\top} + \boldsymbol{y} \boldsymbol{y}^{\top} - 2\boldsymbol{x}\boldsymbol{y}^{\top} \right)\right]\\ =&\, \text{Tr}\left(\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\gamma(\boldsymbol{x},\boldsymbol{y})}\left[\boldsymbol{x} \boldsymbol{x}^{\top} + \boldsymbol{y} \boldsymbol{y}^{\top} - 2\boldsymbol{x}\boldsymbol{y}^{\top}\right]\right)\\ =&\, \text{Tr}(\boldsymbol{\Sigma}_p) + \text{Tr}(\boldsymbol{\Sigma}_q) - 2\text{Tr}(\boldsymbol{C}) \end{aligned}\end{equation}where
\begin{equation}\boldsymbol{\Sigma}_{\gamma}= \begin{pmatrix} \boldsymbol{\Sigma}_p & \boldsymbol{C}\\ \boldsymbol{C}^{\top} & \boldsymbol{\Sigma}_q\end{pmatrix}=\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\gamma(\boldsymbol{x},\boldsymbol{y})}\left[\begin{pmatrix}\boldsymbol{x} \\ \boldsymbol{y}\end{pmatrix}\begin{pmatrix}\boldsymbol{x}^{\top} & \boldsymbol{y}^{\top}\end{pmatrix}\right]\end{equation}is the joint covariance matrix of $\gamma$. Any covariance matrix is symmetric positive semi-definite; conversely, any symmetric positive semi-definite matrix with diagonal blocks $\boldsymbol{\Sigma}_p,\boldsymbol{\Sigma}_q$ is the covariance of some coupling (for example, a jointly Gaussian one), so minimizing over $\gamma$ reduces to the following problem:
Given that $\boldsymbol{\Sigma}_{\gamma}= \begin{pmatrix} \boldsymbol{\Sigma}_p & \boldsymbol{C}\\ \boldsymbol{C}^{\top} & \boldsymbol{\Sigma}_q\end{pmatrix}$ is symmetric positive definite, find the maximum value of $\text{Tr}(\boldsymbol{C})$.
For this, we use the following identity regarding the Schur complement:
\begin{equation} \begin{pmatrix} \boldsymbol{\Sigma}_p & \boldsymbol{C}\\ \boldsymbol{C}^{\top} & \boldsymbol{\Sigma}_q\end{pmatrix} = \begin{pmatrix} \boldsymbol{I} & \boldsymbol{0}\\ \boldsymbol{C}^{\top}\boldsymbol{\Sigma}_p^{-1} & \boldsymbol{I}\end{pmatrix} \begin{pmatrix} \boldsymbol{\Sigma}_p & \boldsymbol{0}\\ \boldsymbol{0} & \boldsymbol{\Sigma}_q - \boldsymbol{C}^{\top}\boldsymbol{\Sigma}_p^{-1}\boldsymbol{C}\end{pmatrix} \begin{pmatrix} \boldsymbol{I} & \boldsymbol{\Sigma}_p^{-1}\boldsymbol{C} \\ \boldsymbol{0} & \boldsymbol{I}\end{pmatrix} \end{equation}where the symmetric matrix $\boldsymbol{S} = \boldsymbol{\Sigma}_q - \boldsymbol{C}^{\top}\boldsymbol{\Sigma}_p^{-1}\boldsymbol{C}$ is called the "Schur Complement." This decomposition has the form $\boldsymbol{B}^{\top}\boldsymbol{A}\boldsymbol{B}$. For it to be positive definite, $\boldsymbol{A}$ must be positive definite. Since $\boldsymbol{\Sigma}_p$ is already positive definite, $\boldsymbol{S}$ must also be positive definite.
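For readers who want to see the factorization concretely, here is a small numerical check (a sketch with a random symmetric positive definite block matrix):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 3
D = rng.standard_normal((2 * n, 2 * n))
J = D @ D.T + np.eye(2 * n)                       # random symmetric positive definite 2n x 2n matrix
Sigma_p, C, Sigma_q = J[:n, :n], J[:n, n:], J[n:, n:]

I, Z = np.eye(n), np.zeros((n, n))
S = Sigma_q - C.T @ np.linalg.inv(Sigma_p) @ C    # the Schur complement
L = np.block([[I, Z], [C.T @ np.linalg.inv(Sigma_p), I]])
M = np.block([[Sigma_p, Z], [Z, S]])
print(np.allclose(L @ M @ L.T, J),                # the factorization above holds
      np.all(np.linalg.eigvalsh(S) > 0))          # and the Schur complement is positive definite
```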
We now look for a convenient parameterization of $\boldsymbol{C}$, starting from $\boldsymbol{S} = \boldsymbol{\Sigma}_q - \boldsymbol{C}^{\top}\boldsymbol{\Sigma}_p^{-1}\boldsymbol{C}$. Rearranging gives $\boldsymbol{\Sigma}_q - \boldsymbol{S} = \boldsymbol{C}^{\top}\boldsymbol{\Sigma}_p^{-1}\boldsymbol{C}$. Since $\boldsymbol{\Sigma}_p$ is symmetric positive definite, so is $\boldsymbol{\Sigma}_p^{-1}$, and hence so is $\boldsymbol{C}^{\top}\boldsymbol{\Sigma}_p^{-1}\boldsymbol{C}$. It therefore has a symmetric positive definite square root, say $\boldsymbol{R}$, such that:
\begin{equation}\boldsymbol{C}^{\top}\boldsymbol{\Sigma}_p^{-1}\boldsymbol{C} = \boldsymbol{R}^2\quad\Leftrightarrow\quad \left(\boldsymbol{\Sigma}_p^{-1/2}\boldsymbol{C}\boldsymbol{R}^{-1}\right)^{\top}\left(\boldsymbol{\Sigma}_p^{-1/2}\boldsymbol{C}\boldsymbol{R}^{-1}\right)=\boldsymbol{I}\end{equation}This implies that $\boldsymbol{\Sigma}_p^{-1/2}\boldsymbol{C}\boldsymbol{R}^{-1}$ is an orthogonal matrix, denoted as $\boldsymbol{O}$, so $\boldsymbol{C} = \boldsymbol{\Sigma}_p^{1/2}\boldsymbol{O}\boldsymbol{R}$.
The variables now are $\boldsymbol{O}$ and $\boldsymbol{R}$. We want to maximize $\text{Tr}(\boldsymbol{C})=\text{Tr}(\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{O}\boldsymbol{R})$. Fixing $\boldsymbol{R}$ first, we find the $\boldsymbol{O}$ that maximizes this under the constraint $\boldsymbol{O}^{\top}\boldsymbol{O}=\boldsymbol{I}$. We use Lagrange Multipliers: introduce a symmetric parameter matrix $\boldsymbol{W}$ and transform this into an unconstrained problem:
\begin{equation}F = \text{Tr}(\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{O}\boldsymbol{R}) - \frac{1}{2}\text{Tr}(\boldsymbol{W}(\boldsymbol{O}^{\top}\boldsymbol{O} - \boldsymbol{I}))\end{equation}Taking derivatives:
\begin{equation}\begin{aligned} &\frac{\partial F}{\partial \boldsymbol{O}} = \boldsymbol{0} \quad \Rightarrow\quad \boldsymbol{R}\boldsymbol{\Sigma}_p^{1/2} = \boldsymbol{W}\boldsymbol{O}^{\top}\\ &\frac{\partial F}{\partial \boldsymbol{W}} = \boldsymbol{0} \quad \Rightarrow\quad \boldsymbol{O}^{\top}\boldsymbol{O} = \boldsymbol{I}\\ \end{aligned}\end{equation}Taking the multiplier $\boldsymbol{W}$ to be symmetric loses no generality, since the constraint $\boldsymbol{O}^{\top}\boldsymbol{O} - \boldsymbol{I}$ is itself symmetric. We then have:
\begin{equation}\left(\boldsymbol{O}\boldsymbol{W}\boldsymbol{O}^{\top}\right)^2=\left(\boldsymbol{W}\boldsymbol{O}^{\top}\right)^{\top}\left(\boldsymbol{W}\boldsymbol{O}^{\top}\right)=\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{R}^2\boldsymbol{\Sigma}_p^{1/2}\end{equation}That is, $\boldsymbol{O}\boldsymbol{W}\boldsymbol{O}^{\top}=(\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{R}^2\boldsymbol{\Sigma}_p^{1/2})^{1/2}$, so at this point:
\begin{equation}\text{Tr}(\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{O}\boldsymbol{R})=\text{Tr}(\boldsymbol{O}\boldsymbol{R}\boldsymbol{\Sigma}_p^{1/2})=\text{Tr}(\boldsymbol{O}\boldsymbol{W}\boldsymbol{O}^{\top})=\text{Tr}((\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{R}^2\boldsymbol{\Sigma}_p^{1/2})^{1/2})\end{equation}Finally, we need to determine $\boldsymbol{R}$. Recall that $\boldsymbol{R}^2=\boldsymbol{\Sigma}_q - \boldsymbol{S}$, where $\boldsymbol{S}$ is positive definite. Intuitively, the maximum should be achieved at $\boldsymbol{S}=\boldsymbol{0}$, and this is indeed a direct corollary of Weyl's inequality.
According to Weyl's inequality, if $\boldsymbol{A}$ and $\boldsymbol{B}$ are symmetric positive definite matrices whose eigenvalues, listed in non-decreasing order, are $0\leq\lambda_1^{(A)} \leq \dots \leq \lambda_n^{(A)}$, $0\leq\lambda_1^{(B)} \leq \dots \leq \lambda_n^{(B)}$ and $0\leq\lambda_1^{(A+B)} \leq \dots \leq \lambda_n^{(A+B)}$, then for any $1\leq i \leq n$ we have $\lambda_i^{(A)}\leq \lambda_i^{(A+B)}$ and $\lambda_i^{(B)}\leq \lambda_i^{(A+B)}$. That is:
When sorted, the eigenvalues of a sum of symmetric positive definite matrices are at least as large, entry by entry, as the eigenvalues of either summand.
With this conclusion, let the eigenvalues of $(\boldsymbol{\Sigma}_p^{1/2}(\boldsymbol{\Sigma}_q - \boldsymbol{S})\boldsymbol{\Sigma}_p^{1/2})^{1/2}$ be $0 \leq \lambda_1 \leq \dots \leq \lambda_n$. Its trace is $\lambda_1 + \dots + \lambda_n$, and the eigenvalues of $\boldsymbol{\Sigma}_p^{1/2}(\boldsymbol{\Sigma}_q - \boldsymbol{S})\boldsymbol{\Sigma}_p^{1/2}$ are $0 \leq \lambda_1^2 \leq \dots \leq \lambda_n^2$. Since $\boldsymbol{\Sigma}_p^{1/2}(\boldsymbol{\Sigma}_q - \boldsymbol{S})\boldsymbol{\Sigma}_p^{1/2}$ and $\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{S}\boldsymbol{\Sigma}_p^{1/2}$ are both symmetric positive definite, their eigenvalues do not exceed those of their sum, $\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2}$. Thus, the maximum of each eigenvalue (and the trace) of $(\boldsymbol{\Sigma}_p^{1/2}(\boldsymbol{\Sigma}_q - \boldsymbol{S})\boldsymbol{\Sigma}_p^{1/2})^{1/2}$ is achieved at $\boldsymbol{S}=\boldsymbol{0}$.
As for the proof of Weyl's inequality, it mainly relies on the Rayleigh quotient and the Courant–Fischer theorem; interested readers can look these up. Once one is familiar with them, Weyl's inequality is essentially self-evident.
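Numerically, the statement used above is easy to observe (a sketch with two random symmetric positive definite matrices):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 6
X, Y = rng.standard_normal((2, n, n))
A, B = X @ X.T, Y @ Y.T                       # two random symmetric positive definite matrices

eig_A = np.linalg.eigvalsh(A)                 # eigenvalues in ascending order
eig_B = np.linalg.eigvalsh(B)
eig_sum = np.linalg.eigvalsh(A + B)
print(np.all(eig_A <= eig_sum), np.all(eig_B <= eig_sum))   # True True
```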
We provide a simpler proof found in "The Fréchet distance between multivariate normal distributions". Compared to the first proof, this one is more direct and requires less pure linear algebra theory. The following process has been further simplified by the author.
The de-meaning and algebraic reformulation steps are the same as in Derivation 1, so the problem is again reduced to:
Given that $\boldsymbol{\Sigma}_{\gamma}= \begin{pmatrix} \boldsymbol{\Sigma}_p & \boldsymbol{C}\\ \boldsymbol{C}^{\top} & \boldsymbol{\Sigma}_q\end{pmatrix}$ is symmetric positive definite, find the maximum value of $\text{Tr}(\boldsymbol{C})$.
Since $\boldsymbol{\Sigma}_{\gamma}$ is symmetric positive definite, it must be expressible in the form $\boldsymbol{D}\boldsymbol{D}^{\top}$. We express $\boldsymbol{D}$ as a block matrix $\begin{pmatrix}\boldsymbol{A} \\ \boldsymbol{B}\end{pmatrix}$ where $\boldsymbol{A},\boldsymbol{B}\in \mathbb{R}^{n\times 2n}$. Then:
\begin{equation}\begin{pmatrix} \boldsymbol{\Sigma}_p & \boldsymbol{C}\\ \boldsymbol{C}^{\top} & \boldsymbol{\Sigma}_q\end{pmatrix} = \begin{pmatrix}\boldsymbol{A} \\ \boldsymbol{B}\end{pmatrix} \begin{pmatrix}\boldsymbol{A}^{\top} & \boldsymbol{B}^{\top}\end{pmatrix} = \begin{pmatrix} \boldsymbol{A}\boldsymbol{A}^{\top} & \boldsymbol{A}\boldsymbol{B}^{\top}\\ \boldsymbol{B}\boldsymbol{A}^{\top} & \boldsymbol{B}\boldsymbol{B}^{\top}\end{pmatrix}\end{equation}Correspondingly, $\boldsymbol{\Sigma}_p=\boldsymbol{A}\boldsymbol{A}^{\top}, \boldsymbol{\Sigma}_q=\boldsymbol{B}\boldsymbol{B}^{\top},\boldsymbol{C}=\boldsymbol{A}\boldsymbol{B}^{\top}$.
Under this parameterization, the problem becomes:
Given $\boldsymbol{A}\boldsymbol{A}^{\top}=\boldsymbol{\Sigma}_p$ and $\boldsymbol{B}\boldsymbol{B}^{\top}=\boldsymbol{\Sigma}_q$, find the maximum value of $\text{Tr}(\boldsymbol{A}\boldsymbol{B}^{\top})$.
Using Lagrange Multipliers, introduce symmetric matrices $\boldsymbol{W}_p, \boldsymbol{W}_q$ and define:
\begin{equation}F = \text{Tr}(\boldsymbol{A}\boldsymbol{B}^{\top}) - \frac{1}{2}\text{Tr}(\boldsymbol{W}_p(\boldsymbol{A}\boldsymbol{A}^{\top} - \boldsymbol{\Sigma}_p)) - \frac{1}{2}\text{Tr}(\boldsymbol{W}_q(\boldsymbol{B}\boldsymbol{B}^{\top} - \boldsymbol{\Sigma}_q))\end{equation}Differentiating:
\begin{equation}\begin{aligned} &\frac{\partial F}{\partial \boldsymbol{A}} = \boldsymbol{0} \quad \Rightarrow\quad \boldsymbol{B}^{\top} = \boldsymbol{A}^{\top}\boldsymbol{W}_p\\ &\frac{\partial F}{\partial \boldsymbol{B}} = \boldsymbol{0} \quad \Rightarrow\quad \boldsymbol{A}^{\top} = \boldsymbol{B}^{\top}\boldsymbol{W}_q\\ &\frac{\partial F}{\partial \boldsymbol{W}_p} = \boldsymbol{0} \quad \Rightarrow\quad \boldsymbol{A}\boldsymbol{A}^{\top} = \boldsymbol{\Sigma}_p\\ &\frac{\partial F}{\partial \boldsymbol{W}_q} = \boldsymbol{0} \quad \Rightarrow\quad \boldsymbol{B}\boldsymbol{B}^{\top} = \boldsymbol{\Sigma}_q\\ \end{aligned}\end{equation}Since $\boldsymbol{W}_p, \boldsymbol{W}_q$ are symmetric, we have:
\begin{equation}\boldsymbol{\Sigma}_q = \boldsymbol{B}\boldsymbol{B}^{\top} = \left(\boldsymbol{A}^{\top}\boldsymbol{W}_p\right)^{\top}\left(\boldsymbol{A}^{\top}\boldsymbol{W}_p\right)=\boldsymbol{W}_p\boldsymbol{A}\boldsymbol{A}^{\top}\boldsymbol{W}_p=\boldsymbol{W}_p\boldsymbol{\Sigma}_p\boldsymbol{W}_p\\ \end{equation}Let $\boldsymbol{W}_p=\boldsymbol{\Sigma}_p^{-1/2}\boldsymbol{R}\boldsymbol{\Sigma}_p^{-1/2}$, then $\boldsymbol{\Sigma}_q=\boldsymbol{\Sigma}_p^{-1/2}\boldsymbol{R}^2\boldsymbol{\Sigma}_p^{-1/2}$, so:
\begin{equation}\boldsymbol{R} = (\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{\Sigma}_q\boldsymbol{\Sigma}_p^{1/2})^{1/2} \end{equation}And finally:
\begin{equation}\begin{aligned} \text{Tr}(\boldsymbol{A}\boldsymbol{B}^{\top}) =&\, \text{Tr}(\boldsymbol{A}\boldsymbol{A}^{\top}\boldsymbol{W}_p)=\text{Tr}(\boldsymbol{\Sigma}_p\boldsymbol{W}_p)\\ =&\,\text{Tr}(\boldsymbol{\Sigma}_p^{1/2}\boldsymbol{R}\boldsymbol{\Sigma}_p^{-1/2}) = \text{Tr}(\boldsymbol{R}\boldsymbol{\Sigma}_p^{-1/2}\boldsymbol{\Sigma}_p^{1/2})\\ =&\,\text{Tr}(\boldsymbol{R}) \end{aligned}\end{equation}This article detailed the calculation of KL divergence, Bhattacharyya distance, and Wasserstein distance between two multivariate normal distributions, providing explicit analytical solutions. These results can serve as regularization terms for latent variables in certain scenarios to constrain their distribution. Furthermore, this article can be viewed as a set of challenging linear algebra exercises for readers to practice with.