By 苏剑林 | March 15, 2018
Recently, I have been thinking about content related to unsupervised learning in NLP and probabilistic graphical models, so I revisited some parameter estimation methods. In deep learning, parameter estimation is one of the most basic steps, which is what we call the model training process. To train a model, there must be a loss function. For readers who haven't systematically studied probability theory, the most natural loss function they might think of is Mean Squared Error (MSE), which corresponds to what we call Euclidean distance. Theoretically speaking, the best match for a probabilistic model should be the "Cross-Entropy" function, which originates from the Maximum Likelihood function in probability theory.
Maximum Likelihood
Rational Existence
What is Maximum Likelihood? There is a philosophical saying: "To exist is to be rational." Maximum Likelihood says "to exist is to be most rational." Specifically, suppose the random variable $X$ has probability distribution $p(X)$, and suppose that in one round of observations we obtain the values $X_1, X_2, \dots, X_n$. Assuming the observations are independent, Maximum Likelihood asserts that the probability of this batch of observations,
$$\mathcal{P} = \prod_{i=1}^n p(X_i)\tag{1}$$
is maximized. If $p(X)$ is a probability distribution $p_{\theta}(X)$ with a parameter $\theta$, then we should find a way to choose $\theta$ that maximizes $\mathcal{P}(\theta)$, i.e.,
$$\theta = \mathop{\text{argmax}}_{\theta} \mathcal{P}(\theta) = \mathop{\text{argmax}}_{\theta}\prod_{i=1}^n p_{\theta}(X_i)\tag{2}$$
Taking the logarithm of the probability yields the equivalent form:
$$\theta = \mathop{\text{argmax}}_{\theta}\sum_{i=1}^n \log p_{\theta}(X_i)\tag{3}$$
If we divide the right side by $n$ (which does not change the $\text{argmax}$), we get the neater expectation form:
$$\theta = \mathop{\text{argmax}}_{\theta} \mathcal{L}(\theta) = \mathop{\text{argmax}}_{\theta} \mathbb{E}\big[\log p_{\theta}(X_i)\big]\tag{4}$$
where we refer to $-\mathcal{L}(\theta)$ as the cross-entropy.
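As a small illustration (my own toy example, not from the original text), the following Python snippet estimates the parameter of a Bernoulli distribution by maximizing the average log-likelihood in $(4)$, i.e., by minimizing the cross-entropy $-\mathcal{L}(\theta)$ over a grid of candidate $\theta$ values:

```python
import numpy as np

# Toy example: maximum likelihood for a Bernoulli parameter via equation (4).
rng = np.random.default_rng(0)
samples = rng.binomial(1, 0.3, size=1000)          # observations X_1, ..., X_n

thetas = np.linspace(0.01, 0.99, 99)               # candidate parameters theta
log_lik = np.array([
    np.mean(np.log(np.where(samples == 1, t, 1 - t)))   # E[log p_theta(X_i)]
    for t in thetas
])
theta_hat = thetas[np.argmax(log_lik)]             # the argmax in (4)
print(theta_hat)  # close to the sample mean, which is the analytic MLE for a Bernoulli
```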
Theoretical Form
Theoretically, based on existing data, we can obtain the statistical frequency $\tilde{p}(X)$ for each $X$. Then we can obtain the equivalent form of the above equation:
$$\theta = \mathop{\text{argmax}}_{\theta} \mathcal{L}(\theta) = \mathop{\text{argmax}}_{\theta}\sum_X \tilde{p}(X)\log p_{\theta}(X)\tag{5}$$
However, in practice, it is almost impossible for us to obtain $\tilde{p}(X)$ (especially for continuous distributions). What we can compute directly is the expectation in equation $(4)$, because calculating the expectation only requires evaluating the value on each sample, summing, and dividing by $n$. Therefore, equation $(5)$ has only theoretical value; it facilitates the subsequent derivations.
It should be noted that the description above is very general. $X$ can be any object, and it might be a continuous real number, in which case the summation would be replaced by an integral and $p(X)$ would become a probability density function. Of course, this does not present any fundamental difficulty.
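To make the relation between $(4)$ and $(5)$ concrete, here is a small check (again with toy numbers of my own): for discrete data, the sample average in $(4)$ coincides exactly with the frequency-weighted sum in $(5)$:

```python
import numpy as np
from collections import Counter

# Toy check: for discrete data the sample average in (4) equals the
# frequency-weighted sum in (5).
samples = [0, 1, 1, 2, 0, 1, 2, 2, 2, 1]            # observations of X
p_theta = {0: 0.2, 1: 0.3, 2: 0.5}                  # some model distribution p_theta(X)

# Form (4): average of log p_theta(X_i) over the samples.
form4 = np.mean([np.log(p_theta[x]) for x in samples])

# Form (5): sum over distinct X of the empirical frequency times log p_theta(X).
freq = {x: c / len(samples) for x, c in Counter(samples).items()}
form5 = sum(freq[x] * np.log(p_theta[x]) for x in freq)

print(np.isclose(form4, form5))                     # True
```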
Broader KL Divergence
The form of Maximum Likelihood can also be derived from KL divergence. Suppose there are two distributions $\tilde{p}(X)$ and $p(X)$. We use KL divergence to measure the distance between them:
$$\begin{aligned}KL\Big(\tilde{p}(X)\Big\Vert p(X)\Big) =& \sum_X \tilde{p}(X) \ln \frac{\tilde{p}(X)}{p(X)}\\
=&\mathbb{E}\left[\ln \frac{\tilde{p}(X)}{p(X)}\right]\end{aligned}\tag{6}$$
When the two distributions are the same, the KL divergence is 0; when they are different, the KL divergence is greater than 0. It is assumed the reader is already familiar with these properties.
Suppose the samples of $X$ have been given, which means $\tilde{p}(X)$ can be regarded as known. In this case:
$$\begin{aligned}\theta =& \mathop{\text{argmin}}_{\theta} KL\Big(\tilde{p}(X)\Big\Vert p_{\theta}(X)\Big)\\
=& \mathop{\text{argmax}}_{\theta}\sum_X \tilde{p}(X)\log p_{\theta}(X)\\
=& \mathop{\text{argmax}}_{\theta}\mathbb{E}\big[\log p_{\theta}(X_i)\big]\end{aligned}\tag{7}$$
This re-derives $(4)$ and $(5)$ (in the second line the $\theta$-independent term $\sum_X \tilde{p}(X)\log \tilde{p}(X)$ has been dropped). In fact, KL divergence is much richer in meaning than plain Maximum Likelihood, because Maximum Likelihood amounts to assuming that $\tilde{p}(X)$ is known (i.e., samples of $X$ are given), which is not always the case (such as in EM-algorithm scenarios). Often we only know partial information about $X$, and then we must return to KL divergence.
Note: If you are not comfortable with sampling-based calculation, please read the section "Numerical Calculation vs. Sampling Calculation" in "Variational Autoencoder (2): From a Bayesian Perspective".
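As a quick numerical sanity check of $(7)$ (a sketch with made-up numbers), minimizing the KL divergence over $\theta$ picks out the same parameter as maximizing the $\tilde{p}$-weighted log-likelihood, because the entropy term $\sum_X \tilde{p}(X)\log\tilde{p}(X)$ does not depend on $\theta$:

```python
import numpy as np

# Toy check of (7): argmin over theta of KL(p_tilde || p_theta) coincides with
# argmax over theta of the p_tilde-weighted log-likelihood.
p_tilde = np.array([0.25, 0.75])                    # empirical distribution over {0, 1}
thetas = np.linspace(0.01, 0.99, 99)                # Bernoulli family p_theta = (1-theta, theta)

def kl(p, q):
    return np.sum(p * np.log(p / q))

kls = np.array([kl(p_tilde, np.array([1 - t, t])) for t in thetas])
lls = np.array([np.sum(p_tilde * np.log(np.array([1 - t, t]))) for t in thetas])

print(thetas[np.argmin(kls)], thetas[np.argmax(lls)])   # the same theta, about 0.75
```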
Supervised Models
Now let's observe how the above content is applied in supervised learning. Assume the input is $X$ and the label is $Y$. Then $(X, Y)$ constitutes an event, so according to $(4)$ we have:
$$\theta = \mathop{\text{argmax}}_{\theta} \mathbb{E}_{X,Y}\big[\log p_{\theta}(X,Y)\big]\tag{8}$$
Note that the expectation here is taken over the pair $(X, Y)$ as a whole; however, this formula is not yet practical enough.
Classification Problems
Taking classification problems as an example, we usually model $p(Y|X)$ instead of $p(X, Y)$. That is, we determine the distribution of the output based on the input, rather than their joint distribution. So we still start from equation $(5)$, using $p(X,Y)=p(X) p(Y|X)$, to first get:
$$\theta = \mathop{\text{argmax}}_{\theta} \sum_{X,Y} \tilde{p}(X,Y)\log \big[p_{\theta}(X)p_{\theta}(Y|X)\big]\tag{9}$$
Since we only model $p(Y|X)$, we simply take $p_{\theta}(X)$ to be $\tilde{p}(X)$. This only adds a $\theta$-independent constant term to the optimization objective, so $(9)$ is equivalent to:
$$\theta = \mathop{\text{argmax}}_{\theta} \sum_{X,Y} \tilde{p}(X, Y)\log p_{\theta}(Y|X)\tag{10}$$
Then, since $\tilde{p}(X,Y)=\tilde{p}(X)\tilde{p}(Y|X)$, equation $(10)$ can be further transformed into:
$$\theta = \mathop{\text{argmax}}_{\theta} \sum_X \tilde{p}(X) \sum_Y \tilde{p}(Y|X)\log p_{\theta}(Y|X)\tag{11}$$
Finally, let's not forget that we are dealing with classification problems in supervised learning. Generally speaking, in training data, there is only one category for a specific input $X$, so $\tilde{p}(Y_t|X)=1$ and the rest are 0, where $Y_t$ is the target label for $X$. Therefore:
$$\theta = \mathop{\text{argmax}}_{\theta} \sum_X \tilde{p}(X) \log p_{\theta}(Y_t|X)\tag{12}$$
This is the most common Maximum Likelihood function for classification problems:
$$\theta = \mathop{\text{argmax}}_{\theta} \mathbb{E}_{X}\big[\log p_{\theta}(Y_t|X)\big]\tag{13}$$
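In code, $(13)$ is just the familiar softmax cross-entropy. The sketch below (function names and toy logits are mine; the logits would come from whatever model carries the parameters $\theta$) averages $-\log p_{\theta}(Y_t|X)$ over the inputs:

```python
import numpy as np

# Softmax cross-entropy as in (13): the loss is the average of -log p_theta(Y_t | X).
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)            # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_loss(logits, targets):
    probs = softmax(logits)                          # p_theta(Y | X) for each sample
    picked = probs[np.arange(len(targets)), targets] # p_theta(Y_t | X)
    return -np.mean(np.log(picked))                  # -E_X[log p_theta(Y_t | X)]

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2, 0.3]])                 # two samples, three classes
targets = np.array([0, 1])                           # target labels Y_t
print(cross_entropy_loss(logits, targets))
```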
Transformations
In fact, the content above consists merely of identity transformations. It should be said that they have no particularly important value in themselves, and their end result (the cross-entropy loss for classification problems) is something we have long been using fluently. This section merely demonstrates how to start from the most primitive form of the Maximum Likelihood function and arrive at a specific problem, so that readers become familiar with this step-by-step transformation process.
Latent Variables
Now is the time to show its value. We will **use it to give a direct derivation of the EM algorithm** (this blog also offers another perspective, see "Gradient Descent and EM Algorithm: Same Origin, Same Lineage"). The EM algorithm is generally divided into an E-step and an M-step. The M-step is relatively easy to understand; the difficulty lies in why the $Q$ function in the E-step is constructed the way it is. Many tutorials offer no explanation for this $Q$ function, and some explain it via Jensen's inequality, but I believe these approaches do not really bring out the essence of the EM algorithm.
Generally, the EM algorithm is used to optimize probabilistic problems containing latent variables. What is a latent variable? It's simple. Taking the previous classification problem as an example: the classification problem models $p(Y|X)$, which is also equivalent to $p(X, Y)$. We said we should use the Maximum Likelihood function as the objective, leading to equation $(8)$:
$$\theta = \mathop{\text{argmax}}_{\theta} \mathbb{E}_{X,Y}\big[\log p_{\theta}(X,Y)\big]\tag{8}$$
If labeled data pairs $(X, Y)$ are given, it is an ordinary supervised learning problem. However, what if only $X$ is given and $Y$ is not? In this case, $Y$ is called a latent variable. It exists, but we cannot see it, hence "latent."
GMM Model
Wait, you want to solve a classification problem without labels? Of course, it's possible. Isn't the GMM (Gaussian Mixture Model) such a model? In GMM, it is assumed that:
$$p_{\theta}(X,Y) = p_{\theta}(Y) p_{\theta}(X|Y)\tag{14}$$
Note that it is $p_{\theta}(Y) p_{\theta}(X|Y)$ and not $p_{\theta}(X) p_{\theta}(Y|X)$. The difference is that it is hard for us to estimate $p(X)$ directly, and it is also hard to guess the form of $p(Y|X)$ directly. However, $p(Y)$ and $p(X|Y)$ are comparatively easy: $Y$ is usually taken to be the category, so $p(Y)$ is just a vector over finitely many classes, while $p(X|Y)$ describes the distribution of the objects within each class. Since these objects all belong to the same class, they should look more or less alike, so GMM assumes $p(X|Y)$ is a normal distribution. These assumptions are therefore well grounded; otherwise, with all the data mixed together, who would know what distribution to assume?
In this case, our complete data should be $(X, Y)$, but we do not have such pairs of samples $(X_1, Y_1), \dots, (X_n, Y_n)$ (otherwise it would reduce to supervised learning). We only know the samples of $X$: $X_1, \dots, X_n$. This corresponds to the situation we described in the section on KL divergence.
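The generative story in $(14)$ is easy to simulate. The following sketch (with made-up parameters) first draws the class $Y$ from $p(Y)$ and then draws $X$ from the class-conditional normal $p(X|Y)$; in the unsupervised setting we keep only the $X$'s:

```python
import numpy as np

# Sampling from the GMM generative story (14): first Y ~ p(Y), then X ~ p(X | Y).
rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])       # p(Y): a finite vector over the classes
means = np.array([-2.0, 3.0])        # each p(X | Y) is a normal distribution
stds = np.array([0.5, 1.0])

Y = rng.choice(2, size=1000, p=weights)   # latent class labels
X = rng.normal(means[Y], stds[Y])         # observed data; in the unsupervised case the Y's are discarded
```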
pLSA Model
Of course, unsupervised learning is not the only one with latent variables; supervised learning can have them too. For example, we could set:
$$p(Y|X)=\sum_{Z}p_{\theta}(Y|Z)p_{\theta}(Z|X)\tag{15}$$
Here an additional variable $Z$ appears. Even if labeled data pairs like $(X, Y)$ are given, there is still no data for $Z$. It is a variable we imagined, which is a latent variable. pLSA is such a problem. That is to say, the complete data pairs at this time should be in the form of $(X, Y, Z)$, but we only know partial samples like $(X_1, Y_1), \dots, (X_n, Y_n)$.
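For discrete $X$, $Y$, $Z$, equation $(15)$ is simply a product of conditional-probability tables. A toy sketch (the numbers are mine):

```python
import numpy as np

# Equation (15) for discrete variables: p(Y|X) is a product of conditional tables.
p_Y_given_Z = np.array([[0.9, 0.2],        # rows: Y, columns: Z; each column sums to 1
                        [0.1, 0.8]])
p_Z_given_X = np.array([[0.7, 0.4, 0.1],   # rows: Z, columns: X; each column sums to 1
                        [0.3, 0.6, 0.9]])

p_Y_given_X = p_Y_given_Z @ p_Z_given_X    # p(Y|X) = sum_Z p(Y|Z) p(Z|X)
print(p_Y_given_X.sum(axis=0))             # each column sums to 1, as a conditional distribution should
```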
Bayesian School
Some readers might "let their imaginations run wild": Can the parameter $\theta$ also be seen as a latent variable? Congratulations, if you have this level of insight, you have entered the realm of the Bayesian school of thought. The Bayesian school believes that everything is random and everything follows some probability distribution, and the parameter $\theta$ is no exception. However, unfortunately, Bayesian probability theory is quite profound, and we cannot use it here yet. (In fact, more importantly, I don't understand it yet either~~)
EM Algorithm
Alright, no more rambling. Let's officially enter the discussion of the EM algorithm.
Joint KL Divergence
Let's look at how general tutorials handle problems involving latent variables. Since latent variables are unobservable, they generally take the maximum likelihood of the marginal distribution (i.e., the distribution of the observable variables):
$$\theta = \mathop{\text{argmax}}_{\theta}\sum_X \tilde{p}(X)\log\sum_Z p_{\theta}(X|Z)p_{\theta}(Z)\tag{16}$$
as the objective to be maximized.
This approach is not impossible, but it requires introducing more mathematical knowledge to derive the EM algorithm, and a rigorous proof requires lengthy derivations. In fact, starting from KL divergence and analyzing the KL divergence of the joint probability distribution can greatly simplify the derivation of the EM algorithm. Furthermore, if we use the maximum likelihood of the marginal distribution, we cannot intuitively understand the origin of the $Q$ function.
Taking GMM as an example, first let's calculate the KL divergence between $\tilde{p}(X,Y)$ and $p_{\theta}(X,Y)$:
$$\begin{aligned} &KL\Big(\tilde{p}(X,Y)\Big\Vert p_{\theta}(X,Y)\Big)\\
=& \sum_{X,Y}\tilde{p}(X,Y)\log \frac{\tilde{p}(X,Y)}{p_{\theta}(X,Y)}\\
=& \sum_{X}\tilde{p}(X)\sum_{Y} \tilde{p}(Y|X)\log \frac{\tilde{p}(Y|X)\tilde{p}(X)}{p_{\theta}(X|Y)p_{\theta}(Y)}\\
=& \mathbb{E}_X\left[\sum_{Y} \tilde{p}(Y|X)\log \frac{\tilde{p}(Y|X)\tilde{p}(X)}{p_{\theta}(X|Y)p_{\theta}(Y)}\right]\\
=& \mathbb{E}_X\left[\sum_{Y} \tilde{p}(Y|X)\log \frac{\tilde{p}(Y|X)}{p_{\theta}(X|Y)p_{\theta}(Y)}+\sum_{Y} \tilde{p}(Y|X)\log \tilde{p}(X)\right]\\
=& \mathbb{E}_X\left[\sum_{Y} \tilde{p}(Y|X)\log \frac{\tilde{p}(Y|X)}{p_{\theta}(X|Y)p_{\theta}(Y)}\right]+\mathbb{E}\left[\log \tilde{p}(X)\right]\\
=& \mathbb{E}_X\left[\sum_{Y} \tilde{p}(Y|X)\log \frac{\tilde{p}(Y|X)}{p_{\theta}(X|Y)p_{\theta}(Y)}\right]+\text{constant}
\end{aligned}\tag{17}$$
Although this process is long, there are no circuitous transformations, and it is relatively easy to accept.
The EM Giant Arrives
Reviewing the origin of equation $(17)$, we hope to find a set of distribution parameters $\theta$ such that $KL\Big(\tilde{p}(X,Y)\Big\Vert p_{\theta}(X,Y)\Big)$ is as small as possible. $p_{\theta}(X,Y)$ has already been given in the form of $p_{\theta}(X|Y)p_{\theta}(Y)$, and only the parameter $\theta$ is unknown. However, in equation $(17)$, $\tilde{p}(Y|X)$ is also unknown, including its form.
At this point, the "Giant" speaks: just assume $\tilde{p}(Y|X)$ is known for the moment. Then it can be treated as a constant, and we can solve for the parameter $\theta$:
$$\begin{aligned}\theta^{(r)} =& \mathop{\text{argmin}}_{\theta} \mathbb{E}_X\left[\sum_{Y} \tilde{p}^{(r-1)}(Y|X)\log \frac{\tilde{p}^{(r-1)}(Y|X)}{p_{\theta}(X|Y)p_{\theta}(Y)}\right]\\
=& \mathop{\text{argmax}}_{\theta} \mathbb{E}_X\left[\sum_{Y}\tilde{p}^{(r-1)}(Y|X)\log p_{\theta}(Y) p_{\theta}(X|Y)\right]\end{aligned}\tag{18}$$
Then, having obtained the new $\theta^{(r)}$, we fix it (i.e., treat $p_{\theta^{(r)}}(Y)$ and $p_{\theta^{(r)}}(X|Y)$ as known) and solve for $\tilde{p}(Y|X)$:
$$\tilde{p}^{(r)}(Y|X) = \mathop{\text{argmin}}_{\tilde{p}(Y|X)} \mathbb{E}_X\left[\sum_{Y} \tilde{p}(Y|X)\log \frac{\tilde{p}(Y|X)}{p_{\theta^{(r)}}(X|Y)p_{\theta^{(r)}}(Y)}\right]\tag{19}$$
In fact, equation $(19)$ can be directly solved analytically. The answer is:
$$\tilde{p}^{(r)}(Y|X)=\frac{p_{\theta^{(r)}}(Y)p_{\theta^{(r)}}(X|Y)}{\sum\limits_Y p_{\theta^{(r)}}(Y)p_{\theta^{(r)}}(X|Y)}\tag{20}$$
Supplementary Derivation: The part within the square brackets of equation $(19)$ can be rewritten as:
$$\begin{aligned}&\sum_{Y} \tilde{p}(Y|X)\log \frac{\tilde{p}(Y|X)}{p_{\theta^{(r)}}(X,Y)}\\
=&\sum_{Y} \tilde{p}(Y|X)\log \frac{\tilde{p}(Y|X)}{p_{\theta^{(r)}}(Y|X)} - \sum_{Y} \tilde{p}(Y|X)\log p_{\theta^{(r)}}(X)\\
=& KL\Big(\tilde{p}(Y|X)\Big\Vert p_{\theta^{(r)}}(Y|X)\Big) - \text{constant}
\end{aligned}$$
Therefore, minimizing equation $(19)$ is equivalent to minimizing $KL\Big(\tilde{p}(Y|X)\Big\Vert p_{\theta^{(r)}}(Y|X)\Big)$. According to the properties of KL divergence, the optimal solution is clearly that the two distributions are identical, i.e.,
$$\tilde{p}(Y|X) = p_{\theta^{(r)}}(Y|X) = \frac{p_{\theta^{(r)}}(Y)p_{\theta^{(r)}}(X|Y)}{\sum\limits_Y p_{\theta^{(r)}}(Y)p_{\theta^{(r)}}(X|Y)}$$
This gives equation $(20)$.
Because we cannot minimize $(17)$ in one step, we optimize it alternately: first fix one part and optimize over the other, then swap. The EM algorithm is nothing but an alternating optimization scheme for a complicated objective function!
The combination of equations $(18)$ and $(20)$ constitutes the entire solving algorithm. Now look at equation $(18)$: it has an E (taking the expectation) and an M ($\mathop{\text{argmax}}$), so let's call it the EM algorithm. Let's call the expression for which the expectation is taken the $Q$ function. Thus, the EM Giant has appeared, and the $Q$ function has appeared, just like that...
Of course, the original meaning of E in the EM algorithm is to view $\sum\limits_{Y}\tilde{p}^{(r-1)}(Y|X)\log p_{\theta}(Y) p_{\theta}(X|Y)$ as calculating the expectation for the latent variable $Y$. We're being a bit more casual here, as long as the conclusion is correct~
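To see the alternation of $(20)$ and $(18)$ in action, here is a minimal sketch of a one-dimensional, two-component GMM fitted by EM (the data generation, initial values, and the closed-form Gaussian updates for the M-step are my own illustrative choices, not spelled out above):

```python
import numpy as np

# Minimal EM for a 1-D, two-component GMM, alternating (20) and (18).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

weights = np.array([0.5, 0.5])                  # p_theta(Y)
means = np.array([-1.0, 1.0])                   # means of p_theta(X | Y)
stds = np.array([1.0, 1.0])                     # standard deviations of p_theta(X | Y)

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(100):
    # E-step, equation (20): p_tilde(Y|X_i) proportional to p_theta(Y) p_theta(X_i|Y).
    joint = weights * gaussian(X[:, None], means, stds)   # shape (n, 2)
    resp = joint / joint.sum(axis=1, keepdims=True)       # responsibilities p_tilde(Y | X_i)

    # M-step, equation (18): maximize E_X[ sum_Y p_tilde(Y|X) log p_theta(Y) p_theta(X|Y) ];
    # for Gaussian components this has the familiar weighted-average solution.
    Nk = resp.sum(axis=0)
    weights = Nk / len(X)
    means = (resp * X[:, None]).sum(axis=0) / Nk
    stds = np.sqrt((resp * (X[:, None] - means) ** 2).sum(axis=0) / Nk)

print(weights, means, stds)   # should recover roughly (0.3, 0.7), (-2, 3), (0.5, 1.0)
```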
Does it feel abrupt? It seems we hardly did anything, and yet the EM algorithm has been explained in just a couple of sentences, derivation included?
What Is Going On?
The EM algorithm for pLSA or other latent-variable models can be derived similarly. Compared with the EM-algorithm derivations I can currently find, I believe the process above is fairly concise. Although a lot of groundwork was laid earlier, it is really all just basic material.
How was this achieved? Reviewing the whole process, we didn't really do anything; we simply used KL divergence as a measure of the difference between joint distributions and then minimized the KL divergence alternately~ Applying the derivation this way turned out to be much faster and more direct than starting from the maximum likelihood of the marginal distribution, which was a pleasant surprise.
Consistent Understanding
This article reflects the author's thoughts on the Principle of Maximum Likelihood. The general idea is to start from the principles and forms of maximum likelihood and induce results for supervised/unsupervised learning, hoping to use a unified idea to thread together various related contents. In the end, I found the results quite satisfying, especially the EM algorithm section. From now on, one only needs to remember that the root of everything is the maximum likelihood of (joint) distributions or KL divergence, and there is no longer a need to memorize the form of the Q function in the EM algorithm by heart.
Of course, some viewpoints in this article are simply "what I think," so there may be places where they are not quite right; readers should judge for themselves. However, I can guarantee that the final results agree with the existing ones. Comments and discussion are welcome~