Entropy Explained: From Entropy and the Maximum Entropy Principle to Maximum Entropy Models (Part 1)

By 苏剑林 | December 1, 2015

The Concept of Entropy

As a physics enthusiast, I have always been fascinated and curious about the concept of "entropy" in statistical mechanics. Therefore, when I encountered data science, I developed a strong interest in maximum entropy models as well.

What is entropy? In popular introductions, entropy is generally given two explanations: (1) entropy is a measure of uncertainty; (2) entropy is a measure of information. While they might seem different, they actually refer to the same thing. First, entropy as a measure of uncertainty quantifies our "degree of ignorance" about a certain phenomenon. Why is entropy also a measure of information? Since entropy represents our ignorance, the process of going from "ignorance" to "full knowledge" results in obtaining a certain amount of information. The more ignorant we are at the start, the greater the amount of information we gain upon reaching "full knowledge." Thus, entropy, as a measure of uncertainty, can also be seen as a measure of information—more precisely, the maximum amount of information we can potentially obtain from it.

How do we measure uncertainty? In other words, how do we calculate entropy? Recall that we describe uncertainty with probability: we can give the probability (or probability density; we will not distinguish the two here) that a system is in a certain state, and as long as that probability is not 1, uncertainty exists. Entropy must therefore be related to the probability distribution. But how exactly is it calculated? Textbooks and popular science works usually just hand us the formula. Here I attempt to present the origin of the entropy formula from two perspectives, so that readers can see where it comes from and how it develops.

The fundamental formula for calculating entropy is: \begin{equation} S=-\sum_x p(x)\log p(x)\tag{1} \end{equation} Here, $\log$ can be the natural logarithm or the base-2 logarithm. In fact, any base greater than 1 works, as changing the base is merely equivalent to changing the units of information. $p(x)$ is the probability that the variable $X$ takes the value $x$. For continuous probability distributions, entropy is similarly defined (replacing the summation with an integral): \begin{equation} S=-\int p(x)\log p(x) dx\tag{2} \end{equation} where $p(x)$ is the probability density function of $X$.
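
As a concrete illustration of equation $(1)$ (a minimal sketch of my own, not from the original text; the function name and the example distributions are arbitrary):

```python
import math

def entropy(probs, base=2):
    # S = -sum_x p(x) * log p(x); terms with p == 0 contribute nothing,
    # since p * log(p) -> 0 as p -> 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin: 1 bit of uncertainty
print(entropy([0.9, 0.1]))   # a biased coin: ~0.47 bits, much less uncertain
```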

Please note several features of equations $(1)$ and $(2)$: the logarithm $\log$, the summation $\sum_x$, and the negative sign in front. Below, I will focus on explaining these features from two perspectives: one common-sense and one more abstract (mathematical).

A Common-Sense Perspective

Let's start with the following example:

After the World Cup ends, everyone cares about who the champion is. Suppose I missed watching the World Cup. After the match, I ask a spectator who knows the result, "Which team is the champion?" He is unwilling to tell me directly but wants me to guess. For every guess I make, he charges me one dollar to tell me whether I guessed correctly. How much money do I need to pay him to find out who the champion is? I can number the teams from 1 to 32 and then ask: "Is the champion team among numbers 1-16?" If he tells me I'm right, I then ask: "Is the champion among numbers 1-8?" If he tells me I'm wrong, I naturally know the champion is among 9-16. In this way, it only takes 5 guesses for me to know which team is the champion. Therefore, the information content of "who is the World Cup champion" is worth only 5 dollars.

The above example is cited from Dr. Wu Jun's book The Beauty of Mathematics, in the chapter "Talking about the Maximum Entropy Model." The premise of this example is that "I" know nothing about football—I haven't studied it, nor have I heard any news about it. In this case, I can only rely on guessing. The most primitive way is to guess team by team: Is it China? Is it Brazil? Is it Japan? ... This way, we might ask up to 31 questions to get the final result. Obviously, this is not the most efficient method. In the example, we first number the teams (in data processing, we call this creating an "index") and then use binary search, which has an efficiency of $\mathcal{O}(\log_2 N)$.

This gives us two insights: First, creating an index and using binary search can greatly speed up the search process, though this isn't the main point of this article. Second, $\log_2 N$—the logarithm appears! It can be rewritten as: $$\log_2 N=-\log_2 \frac{1}{N}=-\sum_{N\text{ teams}}\frac{1}{N}\log_2 \frac{1}{N}$$ This is exactly the form of equation $(1)$. Here, because we know nothing about football, the probability of each team winning the championship is the same, which is $p=1/N$.
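
We can check this arithmetic directly: with $N=32$ equally likely teams, the entropy of the uniform distribution is exactly the 5 bits that binary search needs. The snippet below is my own illustration (not from the book):

```python
import math

N = 32
uniform = [1 / N] * N

# Entropy of the uniform distribution over N teams, in bits (base-2 logarithm).
S = -sum(p * math.log2(p) for p in uniform)

# Number of yes/no questions needed when halving the candidate set each time.
guesses = math.ceil(math.log2(N))

print(S, guesses)  # 5.0 5
```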

Readers might feel this example is too specific and its result not representative. Indeed, it is only an intuitive illustration. We can also understand entropy from a more abstract and precise mathematical perspective, where the logic runs in the opposite direction.

An Abstract Perspective

First, we hope to construct a formula that represents the amount of information, or equivalently, the degree of uncertainty, and call it "entropy." (Note that the logic here is the reverse of the usual presentation: rather than being handed a quantity and then proving that it is useful, we start from the intended use and construct the quantity we need.) Since "entropy" is meant to represent the amount of information, it should possess the following simple properties:

1. It is a function of the probability distribution $p(x)$. For convenience of study, we also hope it is a smooth function;
2. It is additive, which means entropy has the form: \begin{equation}S[p(x)]=\sum_{x}f(p(x))\tag{3}\end{equation}

Now the question is, what is the specific form of $f()$? We need some additional information to determine it. For example, suppose $X$ and $Y$ are two independent random variables with probability distributions $p(x)$ and $p(y)$, respectively. Their joint probability is $p(x)p(y)$. Because the two random variables are independent, the information contained in the joint distribution $p(x)p(y)$ is equal to the sum of the information in the individual distributions $p(x)$ and $p(y)$ (for independent variables, the joint distribution determines the individual distributions, and vice versa). That is:

3. When $X$ and $Y$ are independent random variables, we have: \begin{equation}S[p(x)p(y)]=S[p(x)]+S[p(y)]\tag{4}\end{equation}

In fact, the three constraints 1, 2, and 3 are sufficient to determine the expression for entropy. To determine the form of $f()$ from equation $(4)$, we need only start from the simplest binary distribution. Suppose $X$ and $Y$ are both binary random variables. $X$ has the probability distribution $p, 1-p$, and $Y$ has the probability distribution $q, 1-q$. Then the joint distribution is $pq, p(1-q), (1-p)q, (1-p)(1-q)$. According to equation $(4)$, we have: \begin{equation}\begin{aligned} &f(pq)+f\big(p(1-q)\big)+f\big((1-p)q\big)+f\big((1-p)(1-q)\big)\\ =&f(p)+f(1-p)+f(q)+f(1-q) \end{aligned}\tag{5}\end{equation}

This is a functional equation for $f()$. With appropriate and reasonable constraints added, it has a unique solution. We attempt to solve it here without proving uniqueness. The solution process is exploratory. Notice that the left side contains products of the variables, such as $pq$, while the right side contains single variables, such as $p$. The operation that turns products into sums is the logarithm, so we might as well set $f(x)=h(x)\ln x$, which yields: \begin{equation}\begin{aligned} &h(p)\ln p+h(1-p) \ln (1-p)+h(q)\ln q+h(1-q)\ln (1-q)\\ =&h(pq)\ln p+h(pq)\ln q \\ &+ h\big(p(1-q)\big)\ln p + h\big(p(1-q)\big)\ln(1-q)\\ &+h\big((1-p)q\big)\ln(1-p)+h\big((1-p)q\big)\ln q \\ &+ h\big((1-p)(1-q)\big)\ln(1-p)+h\big((1-p)(1-q)\big)\ln(1-q) \end{aligned}\tag{6}\end{equation}

Too long, messy, and dizzying? Don't worry, we are near the finish line. We group terms with the same logarithm together. For example, the $\ln p$ terms are: \begin{equation}\big[h(p)-h(pq)-h(p(1-q))\big]\ln p\tag{7}\end{equation} The remaining three groups are similar. We find that if $h()$ is linear with no constant term, say $h(x)=\alpha x$, then $h(p)-h(pq)-h(p(1-q))=\alpha p-\alpha pq-\alpha p(1-q)=0$, and likewise the other three groups vanish; the equation is automatically satisfied! Thus, we have found a solution: \begin{equation}f(x)=\alpha x\ln x\tag{8}\end{equation} Therefore, we have the expression for entropy: \begin{equation}S=\sum_{x}\alpha p(x)\ln p(x)\tag{9}\end{equation}
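
As a quick numerical sanity check (my own, not part of the derivation), we can verify that $f(x)=\alpha x\ln x$ satisfies the functional equation $(5)$ for arbitrary $p$ and $q$:

```python
import math
import random

ALPHA = -1.0  # any constant works for equation (5); its sign is pinned down later in the text

def f(x):
    # Candidate solution f(x) = alpha * x * ln(x) from equation (8).
    return ALPHA * x * math.log(x)

random.seed(0)
for _ in range(5):
    p, q = random.uniform(0.01, 0.99), random.uniform(0.01, 0.99)
    lhs = f(p * q) + f(p * (1 - q)) + f((1 - p) * q) + f((1 - p) * (1 - q))
    rhs = f(p) + f(1 - p) + f(q) + f(1 - q)
    assert abs(lhs - rhs) < 1e-12
print("equation (5) holds for random p, q")
```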

Finally, we need to determine $\alpha$. Of course, the value of $\alpha$ itself is not important—it is merely the unit of information—but the sign of $\alpha$ is crucial. We require entropy to have the following property:

4. The greater the amount of information, the greater the entropy.

Point 4 simply aligns with convention; if you preferred, you could just as well define "the greater the amount of information, the smaller the entropy." Given this definition, the entropy of a certain event must be smaller than that of an uncertain event (uncertain events contain more information). For a certain event, the single possible outcome has probability 1, and since $1\cdot\ln 1=0$, its entropy is 0. For an uncertain event, take a binary distribution with probabilities $p, 1-p$, where $0 < p < 1$. Then we must have: \begin{equation}\alpha p\ln p + \alpha (1-p)\ln (1-p) > 0\tag{10}\end{equation} Since $p\ln p + (1-p)\ln(1-p) < 0$ for $0 < p < 1$, this forces $\alpha < 0$.
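
A few quick numbers (my own check, not from the original text) make the sign argument concrete: $p\ln p + (1-p)\ln(1-p)$ is negative for every $p$ strictly between 0 and 1, so inequality $(10)$ indeed requires $\alpha < 0$.

```python
import math

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    # p*ln(p) + (1-p)*ln(1-p) is negative on (0, 1),
    # so multiplying by a negative alpha makes the entropy in (10) positive.
    print(p, p * math.log(p) + (1 - p) * math.log(1 - p))
```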

A Little More Talk

Now that we have obtained the expression for entropy, we have already gone rather far. Not every reader needs all of this; I simply wanted to share my understanding of where entropy comes from and how it developed. Readers who care about the applications of entropy rather than its origins can skip this part. However, for friends who wish to understand entropy deeply from multiple angles, I believe this content is a worthwhile reference.

To emphasize again: this article determines the expression for entropy from constraints 1, 2, 3, and 4; the properties come first, and the expression is derived from them. General textbooks often do the opposite, first providing the entropy expression and then deriving properties 1, 2, 3, and 4 from it, which can feel like putting the cart before the horse.

Based on the discussion above, we find that entropy is no longer merely the abstraction of a physical concept; it has become a completely independent object. Entropy originated in physics but has essentially outgrown it, becoming a powerful tool spanning fields such as information theory, physics, and biology. In fact, within physics itself, we can work backward from the maximum entropy principle discussed later to establish physical laws, in which case entropy is no longer a mere consequence of physical laws but their very source.

"Derivatives" of Entropy

With the definitions of entropy in $(1)$ and $(2)$, we can obtain some "derivatives," such as "joint entropy": \begin{equation}S[p(x,y)]=-\sum_{x}\sum_{y} p(x,y)\ln p(x,y)\tag{11}\end{equation} This is simply the natural extension of single-variable entropy to a joint distribution.

For the maximum entropy model discussed later, we need to introduce conditional entropy, which is related to the conditional distribution $p(y|x)$. However, I feel that many textbooks make this overly complicated, leaving readers in a fog. For example, some tutorials directly define conditional entropy as: \begin{equation}S(Y|X)=\sum_x p(x)S(Y|X=x)=-\sum_x p(x)\sum_y p(y|x)\log p(y|x)\tag{12}\end{equation} Others define it as: \begin{equation}S(Y|X)=-\sum_x\sum_y p(x,y)\log p(y|x)\tag{13}\end{equation} These two formulas are equivalent and, of course, correct. But I must ask: what kind of definition is this? A reader's first reaction might be: On what basis is it defined this way? Are you just defining it however you want? Are you messing with me?

I truly don't understand why the most direct and easiest-to-understand definition isn't given. A conditional distribution starts from the joint distribution $p(x,y)$, treats $p(x)$ as known, and asks for the distribution of $Y$ once $X$ has been determined. Naturally, then, conditional entropy is what remains of the joint entropy after the entropy of $X$ has been removed: \begin{equation}S(Y|X)=S[p(x,y)]-S[p(x)]\tag{14}\end{equation} Simply put: the total uncertainty was $S[p(x,y)]$; knowing $p(x)$ supplies an amount of information $S[p(x)]$, which reduces the uncertainty; what is left over is the conditional entropy. It is not difficult to prove that equation $(14)$ is equivalent to $(12)$ and $(13)$, but clearly $(14)$ carries the most transparent meaning.
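
To see that $(13)$ and $(14)$ (and hence $(12)$) really give the same number, one can check them on any small joint distribution; the $2\times 2$ table below is an arbitrary example of my own:

```python
import math

# An arbitrary 2x2 joint distribution p(x, y): rows index x, columns index y.
p_xy = [[0.3, 0.2],
        [0.1, 0.4]]

p_x = [sum(row) for row in p_xy]  # marginal distribution p(x)

def S(probs):
    # Entropy of a discrete distribution, natural logarithm.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Equation (14): conditional entropy as joint entropy minus the entropy of X.
S_14 = S([p for row in p_xy for p in row]) - S(p_x)

# Equation (13): -sum_{x,y} p(x,y) * log p(y|x), with p(y|x) = p(x,y) / p(x).
S_13 = -sum(p_xy[i][j] * math.log(p_xy[i][j] / p_x[i])
            for i in range(2) for j in range(2))

print(S_14, S_13)  # the two values agree
```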