“Entropy” Can't Afford It: From Entropy and the Maximum Entropy Principle to the Maximum Entropy Model (Part 3)

By 苏剑林 | December 20, 2015

Recap of the Previous Episode #

In the previous article, I shared my understanding of the Maximum Entropy Principle, including its significance, methods for solving it, and some simple yet common applications. At the end of the previous post, we also derived the Normal distribution using the Maximum Entropy Principle to illustrate its profound implications and broad significance.

In this article, I will introduce a model based on the Maximum Entropy Principle: the Maximum Entropy Model. I approach it through the lens of a supervised classification problem, where "supervised" means that the model is built from data that has already been labeled.

In fact, the Maximum Entropy Principle discussed in the second article is the primary concept; the Maximum Entropy Model is essentially just an extension or application of that principle.

Maximum Entropy Model #

Classification: What Does It Mean? #

Before introducing the Maximum Entropy Model, let's discuss what a classification problem entails. Suppose we have a batch of labeled data:

$$\begin{array}{c|cccccccc} \hline \text{Data } x & 1 & 2 & 3 & 4 & 5 & 6 & \dots & 100 \\ \hline \text{Label } y & 1 & 0 & 1 & 0 & 1 & 0 & \dots & 0\\ \hline \end{array}$$

A classification problem means providing a model where, if I tell the model $x$, the model can tell me the corresponding $y$. Therefore, it is natural to view the classification problem as a fitting problem — that is, finding a function $y=f(x)$ to fit this batch of data. Many functions can be used for fitting, ranging from simple linear functions (linear regression) and Logistic functions (Logistic regression) to complex multilayer neural networks. All of these resolve classification by treating it as a fitting problem.

However, this is not the only perspective. We can also look at classification from a probabilistic point of view. Since almost nothing in this world is certain, when we input $x$, the output $y$ is not deterministic. For example, with the data above, when we input 1 the model outputs 1 or 0 each with some probability; it is just that the probability of 1 is higher. From this viewpoint, the classification problem becomes one of finding the probability distribution of $y$ given $x$, that is, the conditional distribution $p(Y|X)$.
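To make the probabilistic viewpoint concrete, here is a toy Python sketch that estimates $p(y|x)$ by simple counting. The repeated $(1, 0)$ and $(1, 1)$ pairs are made up purely for illustration; the Maximum Entropy Model introduced below is what we turn to when such direct counting is not enough.

```python
from collections import Counter, defaultdict

# A toy labeled dataset in the spirit of the table above (the repeated
# pairs for x = 1 are made up so that the labels are not deterministic).
data = [(1, 1), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0), (1, 0), (1, 1)]

# Estimate p(y|x) by counting: for each x, the fraction of samples
# carrying each label y.
counts = defaultdict(Counter)
for x, y in data:
    counts[x][y] += 1

p_y_given_x = {
    x: {y: c / sum(counter.values()) for y, c in counter.items()}
    for x, counter in counts.items()
}

print(p_y_given_x[1])  # {1: 0.666..., 0: 0.333...}: x = 1 usually maps to y = 1
```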

As we learned in the previous article, the Maximum Entropy Principle is naturally designed to find probability distributions. Therefore, the Maximum Entropy Principle naturally integrates with classification problems to produce the Maximum Entropy Model.

Maximum Entropy Model #

Now, the classification problem has been transformed into finding the conditional distribution $p(Y|X)$. We still use the Maximum Entropy Principle to solve it. The difference is that since we are dealing with a conditional distribution, the entropy we need to maximize is the conditional entropy (refer to Part 1):

\begin{equation} S(Y|X)=S[p(x,y)]-S[p(x)] \tag{37} \end{equation}

And since

\begin{equation} p(y|x)=\frac{p(x,y)}{p(x)} \tag{38} \end{equation}

Finding $p(x,y)$ and $p(x)$ therefore gives us the conditional distribution $p(Y|X)$. Both are unknown a priori, but we assume that, with enough data, we can tally the empirical distribution $\tilde{p}(x)$ of $x$ (since $x$ is simply the data we collected, this statistical estimate is feasible). That leaves $p(x,y)$ as the only unknown distribution, and we concentrate on solving for it. Since $S[p(x)]$ is then a fixed number determined by $\tilde{p}(x)$, maximizing equation $(37)$ is equivalent to maximizing $S[p(x,y)]$.

How do we determine the relationship between $x$ and $y$? Simple: Statistics! By analyzing the labeled data, we can derive the probabilistic relationships between them. To describe this mathematically, we need to define feature functions:

\begin{equation} \chi(x,y)=\left\{\begin{aligned}&1,\quad x,y \text{ satisfy a certain fact}\\ &0,\quad x,y \text{ do not satisfy a certain fact}\end{aligned}\right. \tag{39} \end{equation}

For example, in the labeled data table provided above, we notice that when $x$ is odd, $y=1$, and when $x$ is even, $y=0$. Thus, we suspect the result is related to parity. Therefore, we can define a feature function:

\begin{equation} \chi(x,y)=\left\{\begin{aligned}&1,\quad \text{when } x \text{ is odd and } y=1\\ &0,\quad \text{otherwise}\end{aligned}\right. \tag{40} \end{equation}

Next, we count the number of samples $N_{\chi}$ that satisfy the feature function and divide by the total number of samples $N$ to get $\tau = N_{\chi}/N$. We take it that, when there is enough data, this frequency is a good estimate of the expectation of the feature function:

\begin{equation} E[\chi(x,y)]=\sum_{x,y} p(x,y)\chi(x,y)=\tau \tag{41} \end{equation}

We can define multiple features to obtain multiple feature functions and multiple statistical results:

\begin{equation} \left\{\begin{aligned}&E[\chi_1(x,y)]=\sum_{x,y} p(x,y)\chi_1(x,y)=\tau_1\\ &\vdots\\ &E[\chi_k(x,y)]=\sum_{x,y} p(x,y)\chi_k(x,y)=\tau_k\end{aligned}\right. \tag{42} \end{equation}
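As a small sketch of this "count and divide" step, here is how the parity feature above, together with a second, made-up feature, can be turned into empirical expectations $\tau_i$; the data is a toy stand-in for the table at the start.

```python
# Feature functions chi_i(x, y): each returns 1 when (x, y) satisfies
# its "fact" and 0 otherwise, as in equation (39).
def chi_1(x, y):
    # x is odd and y = 1 (the parity feature of equation (40))
    return 1 if x % 2 == 1 and y == 1 else 0

def chi_2(x, y):
    # a second, made-up feature: x is even and y = 0
    return 1 if x % 2 == 0 and y == 0 else 0

features = [chi_1, chi_2]

# Toy labeled data (x, y), standing in for the table at the start.
data = [(x, 1 if x % 2 == 1 else 0) for x in range(1, 101)]

# Empirical expectations tau_i = N_chi / N, as in equation (41):
# count the samples on which each feature fires and divide by N.
N = len(data)
tau = [sum(chi(x, y) for x, y in data) / N for chi in features]
print(tau)  # [0.5, 0.5] for this perfectly alternating toy data
```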

Equations $(42)$ serve as the constraints under which we maximize the entropy, so we are back to a constrained maximization problem. The result was already given in the second part (compare equation $(20)$):

\begin{equation} p(x,y)=\frac{1}{Z}\exp\left(-\sum_{i=1}^k \lambda_i \chi_i (x,y)\right) \tag{43} \end{equation}

where

\begin{equation} Z=\sum_{x,y} \exp\left(-\sum_{i=1}^k \lambda_i \chi_i (x,y)\right) \tag{44} \end{equation}

Note that we have solved for $p(x,y)$ here. If we are only interested in $p(y|x)$, it is better to convert it to the $p(y|x)$ form, namely:

\begin{equation} p(y|x)=\frac{p(x,y)}{p(x)}=\frac{1}{Z(x)}\exp\left(-\sum_{i=1}^k \lambda_i \chi_i (x,y)\right) \tag{45} \end{equation}

Here, $Z(x)=Z \times p(x)$ is the normalization factor depending on $x$; since $p(y|x)$ must sum to 1 over $y$, it can also be computed directly as $Z(x)=\sum_{y} \exp\left(-\sum_{i=1}^k \lambda_i \chi_i (x,y)\right)$.
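Concretely, once the $\lambda_i$ are known, equation $(45)$ is just a per-$x$ normalization over the candidate labels $y$. The sketch below uses made-up $\lambda$ values and the toy parity features from earlier, and only illustrates the mechanics of evaluating the model.

```python
import math

def chi_1(x, y):
    # x is odd and y = 1 (the parity feature of equation (40))
    return 1 if x % 2 == 1 and y == 1 else 0

def chi_2(x, y):
    # x is even and y = 0 (a second, made-up feature)
    return 1 if x % 2 == 0 and y == 0 else 0

features = [chi_1, chi_2]
lambdas = [-2.0, -2.0]   # made-up values; negative, so a firing feature raises the probability
labels = [0, 1]

def p_y_given_x(x, y):
    """Equation (45): p(y|x) = exp(-sum_i lambda_i chi_i(x, y)) / Z(x)."""
    def score(y_):
        return math.exp(-sum(l * chi(x, y_) for l, chi in zip(lambdas, features)))
    return score(y) / sum(score(y_) for y_ in labels)

print(p_y_given_x(1, 1))  # about 0.88: an odd x is assigned y = 1 with high probability
```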

A Brief Discussion on Model Application #

Next comes the question of actually solving the model. As mentioned before, models built on the Maximum Entropy Principle have simple forms and adapt to a wide range of problems, but they share a fatal flaw: they are hard to solve, and in general can only be solved by numerical methods. I do not have particular insights in this area, so I won't go into detail; interested readers can consult other literature on the topic.
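Still, just to give a flavour of what "numerical methods" can look like, below is a rough sketch (not one of the standard algorithms such as GIS or IIS that the literature describes) that fits the $\lambda_i$ by plain gradient ascent on the conditional log-likelihood, nudging each $\lambda_i$ until the model's average feature value matches the empirical value $\tau_i$. For this perfectly regular toy data the constraints are only met in the limit, so the loop simply runs a fixed number of steps.

```python
import math

def chi_1(x, y):
    # x is odd and y = 1
    return 1 if x % 2 == 1 and y == 1 else 0

def chi_2(x, y):
    # x is even and y = 0
    return 1 if x % 2 == 0 and y == 0 else 0

features = [chi_1, chi_2]
labels = [0, 1]
data = [(x, 1 if x % 2 == 1 else 0) for x in range(1, 101)]  # toy labeled data

def model_probs(x, lambdas):
    # p(y|x) from equation (45)
    scores = {y: math.exp(-sum(l * chi(x, y) for l, chi in zip(lambdas, features)))
              for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# Empirical feature averages tau_i, as in equation (41).
N = len(data)
tau = [sum(chi(x, y) for x, y in data) / N for chi in features]

# Toy gradient ascent: with the exp(-sum lambda_i * chi_i) convention, the
# gradient of the conditional log-likelihood w.r.t. lambda_i is
# (model average of chi_i) - (empirical average tau_i).
lambdas = [0.0] * len(features)
lr = 1.0
for _ in range(200):
    model_avg = [0.0] * len(features)
    for x, _y in data:
        probs = model_probs(x, lambdas)
        for i, chi in enumerate(features):
            model_avg[i] += sum(probs[y] * chi(x, y) for y in labels) / N
    lambdas = [l + lr * (m - t) for l, m, t in zip(lambdas, model_avg, tau)]

print(lambdas)                   # the lambdas drift negative, boosting both features
print(model_probs(1, lambdas))   # p(y = 1 | odd x) is now close to 1
```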

Let's talk more carefully about how to use the Maximum Entropy Model. Suppose we extract a feature, and this feature has $N$ possible values (for example, the feature is the part-of-speech of a word, with values like verb, noun, adjective, etc.), and we are performing binary classification (such as sentiment analysis, positive or negative). In that case, we can actually construct $2N$ feature functions (verb-positive, noun-positive, verb-negative, noun-negative, and other combinations). Furthermore, if multiple features are extracted, the number of feature functions will be quite substantial. Therefore, the main work of the Maximum Entropy Model lies in (manual) feature extraction. Once feature extraction is completed, the Maximum Entropy Model provides an optimal scheme for utilizing those features (the entropy maximization scheme).
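As a small sketch of that combinatorial construction, here is one way to enumerate the $2N$ indicator feature functions in code; the part-of-speech values and label names are placeholders taken from the example above.

```python
from itertools import product

# Hypothetical feature values and labels, following the part-of-speech /
# sentiment example in the text.
pos_tags = ["verb", "noun", "adjective"]   # N possible feature values
labels = ["positive", "negative"]          # binary classification

def make_feature(tag, label):
    # Indicator feature function that fires only on this (tag, label) pair.
    def chi(x, y):
        return 1 if x == tag and y == label else 0
    return chi

# One feature function per (value, label) combination: 2 * N in total.
feature_functions = [make_feature(t, l) for t, l in product(pos_tags, labels)]

print(len(feature_functions))                     # 2 * N = 6 here
print(feature_functions[0]("verb", "positive"))   # 1: the (verb, positive) indicator fires
```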

Therefore, although the Maximum Entropy Model has strong adaptability (because it can solve for probability distributions based on any arbitrary constraints we propose, with no theoretical limit on the number of constraints) and yields very good results (if it can be solved, the effects are usually excellent), its application is relatively narrow because we lack general means to extract features effectively. Consequently, the Maximum Entropy Model is rarely used directly for general classification problems. Especially after deep learning algorithms became popular, the powerful fitting capabilities and excellent performance of multilayer neural networks mean that we almost never use the Maximum Entropy Model in standard classification problems anymore.

So when is the Maximum Entropy Model actually used? In practice we "pick out" the problems that suit it, typically ones where the feature values are binary or there are not many features. On such problems it is a good fit, and as a standalone model its results are usually better than those of other models (please recall the significance of maximum entropy).

Conclusion #

This series concludes here. Many online articles or books also explain the derivation and solving of the Maximum Entropy Model, but I still feel that the way it's taught in textbooks is not clear or explicit enough. Aspects such as the definition of entropy, the meaning of entropy, the significance of maximum entropy, and the derivation of the Maximum Entropy Model were not entirely satisfying to me. Therefore, combining my own understanding, I wrote these three brief articles, both as my own notes and as a way to share my understanding with everyone. If there are any inaccuracies, I welcome corrections from readers.

In this series, the Maximum Entropy Model served only as a springboard. In fact, the focus of the three articles is the concept of entropy and the Maximum Entropy Principle; they are what TRULY deserve repeated contemplation. Entropy quantifies information and uncertainty, and we can use it to do many things. Although this series has ended, content regarding information and entropy will still appear frequently in future posts on this blog, because there is simply too much wonderful and necessary content to say about entropy.

Regarding the Maximum Entropy Model, readers can also refer to Mr. Wu Jun's "The Beauty of Mathematics" (数学之美). The book contains two chapters introducing the Maximum Entropy Model, along with many other data analysis materials and stories, all of which are well worth reading.