Already used CRF? Why not learn about the faster MEMM?

By 苏剑林 | February 24, 2020

HMM, MEMM, and CRF are known as the three classic probabilistic graphical models. In the era of machine learning before deep learning, they were widely used in various sequence labeling-related tasks. An interesting phenomenon is that in the deep learning era, HMM and MEMM seem to have "declined," leaving only CRF on the stage. I believe readers working in NLP have likely heard of, if not personally implemented, BiLSTM+CRF for tasks like Chinese word segmentation and named entity recognition, but almost never hear of BiLSTM+HMM or BiLSTM+MEMM. Why is that?

Today, let's study MEMM and, through a comparison with CRF, gain a deeper understanding of the principles and design of probabilistic graphical models.

Model Derivation

MEMM stands for Maximum Entropy Markov Model. It must be said that this name might intimidate 80% of beginners: "I don't even understand Maximum Entropy, and I don't know Markov—combined, aren't they just gibberish?" In fact, whether it's MEMM or CRF, their models are much simpler than their names suggest. Their concepts and designs are very simple and natural, and not difficult to understand.

Recalling CRF

For comparison, let's review CRF. I say "review" because I have previously written an article introducing CRF. If you are not yet familiar with CRF, you can read the earlier post "A Simple Introduction to Conditional Random Fields (CRF) with a Pure Keras Implementation". For simplicity, the CRF and MEMM introduced in this article are the simplest "linear-chain" versions.

This article uses sequence labeling as an example: given an input sequence $\boldsymbol{x}=(x_1,x_2,\dots,x_n)$, we want to output a label sequence of the same length $\boldsymbol{y}=(y_1,y_2,\dots,y_n)$. Thus, we are modeling the probability distribution:

\begin{equation}P(\boldsymbol{y}|\boldsymbol{x})=P(y_1,y_2,\dots,y_n|\boldsymbol{x})\label{eq:target}\end{equation}

CRF treats $\boldsymbol{y}$ as a whole, calculating a total score with the following formula:

\begin{equation}\begin{aligned}f(y_1,y_2,\dots,y_n;\boldsymbol{x})=&\,f(y_1;\boldsymbol{x})+g(y_1,y_2)+f(y_2;\boldsymbol{x})+\dots+g(y_{n-1},y_n)+f(y_n;\boldsymbol{x})\\ =&\,f(y_1;\boldsymbol{x}) + \sum_{k=2}^n \big(g(y_{k-1},y_k)+f(y_k;\boldsymbol{x})\big)\end{aligned}\end{equation}

The characteristic of this scoring function is that it explicitly considers the connection between adjacent labels. $g(y_{k-1},y_k)$ is usually referred to as the transition matrix. Now that the score is calculated, the probability is the softmax of the score, so the final probability distribution is set as:

\begin{equation}P(\boldsymbol{y}|\boldsymbol{x})=\frac{e^{f(y_1,y_2,\dots,y_n;\boldsymbol{x})}}{\sum\limits_{y_1,y_2,\dots,y_n}e^{f(y_1,y_2,\dots,y_n;\boldsymbol{x})}}\label{eq:crf-p}\end{equation}

If we limit ourselves to the concept, the introduction to CRF ends here. In summary, the target sequence is treated as a whole; a scoring function is designed for the target, and then a global softmax is performed on this scoring function. This modeling concept is consistent with ordinary classification problems. The difficulty of CRF lies in its code implementation, as the denominator term in the above equation involves summing over all possible paths, which is not easy. However, in terms of conceptual understanding, I believe there are no particular difficulties.
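
To make the "score plus global softmax" idea concrete, here is a minimal numpy sketch (my own illustration, not the bert4keras implementation) that scores one label path and normalizes it by brute-force enumeration over all paths; the shapes and random scores are assumptions for the example:

```python
import numpy as np
from itertools import product

n, k = 4, 3                      # sequence length, number of labels (illustrative)
f = np.random.randn(n, k)        # f(y_t; x): per-position label scores
g = np.random.randn(k, k)        # g(y_{t-1}, y_t): label transition scores

def path_score(y):
    """f(y_1;x) + sum over t>=2 of [g(y_{t-1},y_t) + f(y_t;x)] for one label path y."""
    return f[0, y[0]] + sum(g[y[t-1], y[t]] + f[t, y[t]] for t in range(1, n))

y = (0, 1, 2, 1)                 # one candidate label path
log_Z = np.logaddexp.reduce([path_score(p) for p in product(range(k), repeat=n)])
print("P(y|x) =", np.exp(path_score(y) - log_Z))   # global softmax over all k**n paths
```

Enumerating all $k^n$ paths is of course only feasible at toy sizes; avoiding exactly this enumeration is what the recursion discussed later is for.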

The Simpler MEMM

Now let's introduce MEMM, which can be seen as an extremely simplified seq2seq model. For the target in $\eqref{eq:target}$, it considers the decomposition:

\begin{equation}P(y_1,y_2,\dots,y_n|\boldsymbol{x})=P(y_1|\boldsymbol{x})P(y_2|\boldsymbol{x},y_1)P(y_3|\boldsymbol{x},y_1,y_2)\dots P(y_n|\boldsymbol{x},y_1,y_2,\dots,y_{n-1})\end{equation}

It then assumes that dependencies between labels occur only at adjacent positions, thus:

\begin{equation}P(y_1,y_2,\dots,y_n|\boldsymbol{x})=P(y_1|\boldsymbol{x})P(y_2|\boldsymbol{x},y_1)P(y_3|\boldsymbol{x},y_2)\dots P(y_n|\boldsymbol{x},y_{n-1})\label{eq:p-f}\end{equation}

Next, mimicking the design of the linear-chain CRF, we can set:

\begin{equation}P(y_1|\boldsymbol{x})=\frac{e^{f(y_1;\boldsymbol{x})}}{\sum\limits_{y_1}e^{f(y_1;\boldsymbol{x})}},\quad P(y_k|\boldsymbol{x},y_{k-1})=\frac{e^{g(y_{k-1},y_k)+f(y_k;\boldsymbol{x})}}{\sum\limits_{y_k}e^{g(y_{k-1},y_k)+f(y_k;\boldsymbol{x})}}\label{eq:memm}\end{equation}

With this, we have obtained MEMM. Since MEMM has already decomposed the global probability distribution into a product of step-by-step distributions, calculating the loss only requires summing the cross-entropy of each step.
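
As a concrete illustration of "the loss is a sum of per-step cross-entropies", here is a minimal numpy sketch of the MEMM negative log-likelihood under the same assumed shapes as before (again my own illustration, not the bert4keras code):

```python
import numpy as np

def log_softmax(z):
    return z - np.logaddexp.reduce(z)

def memm_nll(f, g, y):
    """-log P(y|x): f is (n, k) emission scores, g is (k, k) transition scores, y gold labels."""
    loss = -log_softmax(f[0])[y[0]]              # first step: P(y_1|x)
    for t in range(1, len(y)):
        logits = g[y[t-1]] + f[t]                # g(y_{t-1}, .) + f(.; x)
        loss += -log_softmax(logits)[y[t]]       # local softmax over y_t only
    return loss

n, k = 4, 3
f, g = np.random.randn(n, k), np.random.randn(k, k)
print(memm_nll(f, g, [0, 1, 2, 1]))
```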

The Relationship Between the Two

Substituting $\eqref{eq:memm}$ back into $\eqref{eq:p-f}$, we get:

\begin{equation}P(\boldsymbol{y}|\boldsymbol{x})=\frac{e^{f(y_1;\boldsymbol{x})+g(y_1,y_2)+\dots+g(y_{n-1},y_n)+f(y_n;\boldsymbol{x})}}{\left(\sum\limits_{y_1}e^{f(y_1;\boldsymbol{x})}\right)\left(\sum\limits_{y_2}e^{g(y_1,y_2)+f(y_2;\boldsymbol{x})}\right)\dots\left(\sum\limits_{y_n}e^{g(y_{n-1},y_n)+f(y_n;\boldsymbol{x})}\right)}\label{eq:memm-p}\end{equation}

Comparing $\eqref{eq:memm-p}$ with $\eqref{eq:crf-p}$, we can see that the only difference between MEMM and CRF is the calculation of the denominator (i.e., the normalization factor). We call CRF in $\eqref{eq:crf-p}$ globally normalized, while MEMM in $\eqref{eq:memm-p}$ is locally normalized.
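
A quick numerical sanity check of this comparison (a toy numpy sketch; the sizes are assumptions): with the same $f$ and $g$, the two formulas share the numerator, each still sums to 1 over all label paths, and they differ only in that the CRF normalizer is a single path-independent quantity whereas the MEMM normalizer depends on the labels along the path:

```python
import numpy as np
from itertools import product

n, k = 3, 2
f, g = np.random.randn(n, k), np.random.randn(k, k)
score = lambda y: f[0, y[0]] + sum(g[y[t-1], y[t]] + f[t, y[t]] for t in range(1, n))

def memm_log_Z(y):   # product of local normalizers: depends on the labels in the path
    return np.logaddexp.reduce(f[0]) + sum(
        np.logaddexp.reduce(g[y[t-1]] + f[t]) for t in range(1, n))

paths = list(product(range(k), repeat=n))
crf_log_Z = np.logaddexp.reduce([score(y) for y in paths])   # one global normalizer
print(sum(np.exp(score(y) - crf_log_Z) for y in paths))      # -> 1.0
print(sum(np.exp(score(y) - memm_log_Z(y)) for y in paths))  # -> 1.0
```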

Model Analysis

In this section, we analyze the strengths, weaknesses, improvements, and performance of MEMM.

Pros and Cons of MEMM

One obvious feature of MEMM is its simple implementation and high speed. Since it only needs to perform softmax at each step independently, MEMM is fully parallelizable, and its speed is basically the same as direct step-by-step Softmax. For CRF, the denominator in $\eqref{eq:crf-p}$ is not so easy to compute; it eventually transforms into a recursive calculation that can be solved in $\mathcal{O}(n)$ time (for details, please refer to "A Simple Introduction to Conditional Random Fields (CRF) with a Pure Keras Implementation"). Recursion implies serial processing, so when the main part of our model is a highly parallelizable architecture (such as pure CNN or pure Attention architectures), CRF will significantly slow down the model's training speed. Later, we will compare the training speeds of MEMM and CRF (of course, it's only the training that is slower; in the prediction phase, MEMM and CRF have the same speed).
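
For reference, here is a sketch of that recursion, the standard forward algorithm in log space, in plain numpy (illustrative only, not the bert4keras code). The loop over positions is inherently serial, which is exactly the contrast with MEMM's independent per-step softmaxes:

```python
import numpy as np

def crf_log_Z(f, g):
    """Log of the CRF normalizer; f is (n, k) emission scores, g is (k, k) transition scores."""
    alpha = f[0]                         # log-sum of scores of partial paths ending in each label
    for t in range(1, len(f)):           # serial scan over positions: O(n) steps
        alpha = np.logaddexp.reduce(alpha[:, None] + g, axis=0) + f[t]
    return np.logaddexp.reduce(alpha)

n, k = 4, 3
f, g = np.random.randn(n, k), np.random.randn(k, k)
print(crf_log_Z(f, g))   # equals what brute-force enumeration over all k**n paths would give
```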

As for disadvantages, they naturally exist. As mentioned earlier, MEMM can be seen as an extremely simplified seq2seq model. Since this is the case, it inherits all the drawbacks of ordinary seq2seq models. There is a prominent problem in seq2seq called exposure bias, which corresponds to "label bias" in MEMM. Roughly speaking: when training MEMM, for the current step's prediction, it is assumed that the previous step's true label is known. Consequently, if a specific label $A$ can only be followed by label $B$, the model only needs to optimize the transition matrix to achieve this, without needing to optimize the influence of the input $\boldsymbol{x}$ on $B$ (i.e., $f(B;\boldsymbol{x})$ is not well optimized). However, during the prediction phase, the true labels are unknown. We might not be able to predict label $A$ in the previous step with high confidence. Since $f(B;\boldsymbol{x})$ was not strengthened during training, current step $B$ cannot be accurately predicted either, which may lead to incorrect prediction results.

Bidirectional MEMM

Label bias might be difficult to understand, but we can look at MEMM's deficiency from another perspective: compared to CRF, MEMM has a noticeably less "elegant" aspect—its asymmetry, as it decomposes probabilities from left to right. My experiments indicate that if this asymmetry can be resolved, the performance of MEMM can be slightly improved. My approach is to perform MEMM from right to left as well. At this point, the corresponding probability distribution is:

\begin{equation}P(\boldsymbol{y}|\boldsymbol{x})=\frac{e^{f(y_1;\boldsymbol{x})+g(y_1,y_2)+\dots+g(y_{n-1},y_n)+f(y_n;\boldsymbol{x})}}{\left(\sum\limits_{y_n}e^{f(y_n;\boldsymbol{x})}\right)\left(\sum\limits_{y_{n-1}}e^{g(y_{n-1},y_n)+f(y_{n-1};\boldsymbol{x})}\right)\dots\left(\sum\limits_{y_1}e^{g(y_1,y_2)+f(y_1;\boldsymbol{x})}\right)}\end{equation}

Then, calculate a cross-entropy for this direction and average it with the left-to-right cross-entropy from $\eqref{eq:memm-p}$ to serve as the final loss. In this way, the model considers both directions simultaneously without adding any parameters, compensating for the asymmetry. To distinguish it from the original, I call it Bi-MEMM, by analogy with the naming of Bi-LSTM.
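
A minimal numpy sketch of this bidirectional loss (my own illustration; the actual bert4keras layer may differ in details). The right-to-left pass reuses the same $f$ and $g$, which is why no parameters are added:

```python
import numpy as np

def log_softmax(z):
    return z - np.logaddexp.reduce(z)

def directional_nll(f, g, y):
    """Left-to-right MEMM loss: f is (n, k) emissions, g is (k, k) transitions, y gold labels."""
    loss = -log_softmax(f[0])[y[0]]
    for t in range(1, len(y)):
        loss += -log_softmax(g[y[t-1]] + f[t])[y[t]]
    return loss

def bi_memm_nll(f, g, y):
    # Right-to-left pass: reverse the sequence and transpose g so that the same
    # g(y_{k-1}, y_k) scores are reused, then average the two cross-entropies.
    return 0.5 * (directional_nll(f, g, y) + directional_nll(f[::-1], g.T, y[::-1]))

n, k = 4, 3
f, g = np.random.randn(n, k), np.random.randn(k, k)
print(bi_memm_nll(f, g, [0, 1, 2, 1]))
```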

Note: The term Bi-MEMM did not first appear here. To the best of my knowledge, the first to propose the concept of Bi-MEMM was the paper "Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data". The Bi-MEMM in that paper refers to a bidirectional decoding strategy for MEMM, which is different from the meaning of Bi-MEMM in this blog.

Experimental Results

To verify and compare the effectiveness of MEMM, I implemented both CRF and MEMM in bert4keras and wrote two example scripts, one for Chinese word segmentation (task_sequence_labeling_cws_crf.py) and one for Chinese named entity recognition (task_sequence_labeling_ner_crf.py). In these scripts, switching from CRF to MEMM is very simple: just replace ConditionalRandomField with MaximumEntropyMarkovModel.
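
For concreteness, the swap amounts to something like the following (a hedged sketch: I am assuming both layers are importable from bert4keras.layers, as the CRF layer is in the released example scripts, and that everything else in the scripts stays unchanged):

```python
# Before (the CRF version of the example scripts):
# from bert4keras.layers import ConditionalRandomField
# After (the MEMM / Bi-MEMM version):
from bert4keras.layers import MaximumEntropyMarkovModel
# The rest of task_sequence_labeling_cws_crf.py / task_sequence_labeling_ner_crf.py
# (output layer, loss, decoding) is assumed to stay as-is.
```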

I won't post detailed experimental data; they're just numbers anyway. Here are some relative comparison results:

1. Under the same experimental parameters, Bi-MEMM is always better than MEMM, and MEMM is always better than Softmax;

2. Under the same experimental parameters, CRF is basically no worse than Bi-MEMM;

3. When the encoding model's capacity is strong, CRF and Bi-MEMM perform equally; when the encoding model is weaker, CRF is superior to Bi-MEMM by about 0.5%;

4. Using a 12-layer BERT-base model as the encoding model, Bi-MEMM is 25% faster than CRF; using a 2-layer BERT-base model, Bi-MEMM is 1.5 times faster than CRF.

(Note: Since I found that Bi-MEMM always performs slightly better than MEMM and their training times are virtually identical, MaximumEntropyMarkovModel in bert4keras defaults to Bi-MEMM.)

Thoughts and Extensions

Based on the above conclusions, the "decline" of MEMM in the deep learning era seems understandable—besides being faster to train, MEMM doesn't seem to offer much advantage over CRF. Their prediction speeds are the same, and often we care most about prediction speed and performance; a slightly slower training time is acceptable. The comparison between these two models is representative; it essentially reflects the difference between all globally normalized and locally normalized models: globally normalized models usually perform better but are relatively harder to implement; locally normalized models usually don't outperform global ones but excel in ease of implementation and extensibility.

How are they easier to extend? Here are two example directions.

The first example: suppose the number of labels is very large, for example, when using sequence labeling for text correction or text generation (relevant examples can be found in the paper "Fast Structured Decoding for Sequence Models"). The number of labels is then the vocabulary size $|V|$; even with subword methods, there are often tens of thousands of tokens. At this point, the parameter count of the transition matrix reaches hundreds of millions ($|V|^2$), making it difficult to train. Readers might think of low-rank decomposition; indeed, low-rank decomposition can keep the parameter count of the transition matrix at $2d|V|$, where $d$ is the dimension of the decomposition. Unfortunately, for CRF, low-rank decomposition does not change the fact that the normalization factor is expensive to compute, because evaluating it still requires reconstructing the full $|V| \times |V|$ transition matrix. Therefore, in scenarios with massive label sets, CRF cannot be used directly. Fortunately, for MEMM, low-rank decomposition does effectively reduce the computation during training, so it remains usable. MaximumEntropyMarkovModel in bert4keras already integrates low-rank decomposition; interested readers can check the source code for details.
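
To make the low-rank idea concrete, here is an illustrative numpy sketch (not the actual bert4keras code; shapes and names are assumptions). The transition score is factored as $g(a,b)=\langle u_a, v_b\rangle$ with two $|V|\times d$ tables, so an MEMM step never materializes the $|V|\times |V|$ matrix:

```python
import numpy as np

V, d = 30000, 128                  # label (vocabulary) size, decomposition dimension
U = 0.01 * np.random.randn(V, d)   # u_a: "outgoing" label embeddings
W = 0.01 * np.random.randn(V, d)   # v_b: "incoming" label embeddings (2*d*|V| parameters in total)

def memm_step_logits(f_t, prev_label):
    """Local logits g(prev, .) + f(.; x) for one step, without building a |V| x |V| matrix."""
    return W @ U[prev_label] + f_t                       # (V,) transition scores + (V,) emissions

f_t = np.random.randn(V)
logits = memm_step_logits(f_t, prev_label=123)
probs = np.exp(logits - np.logaddexp.reduce(logits))     # local softmax over all |V| labels
```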

The second example: the CRF and MEMM introduced above only consider dependencies between adjacent labels. What if I want to consider more complex neighborhood dependencies? For instance, considering the relationship of $y_k$ with both $y_{k-1}$ and $y_{k-2}$? In this case, the global normalization approach of CRF is very difficult to handle, ultimately because the normalization factor is hard to compute. However, with the local normalization approach of MEMM, it is easy to proceed. In fact, the hierarchical labeling approach for information extraction I designed earlier can also be said to be a locally normalized probabilistic graphical model similar to MEMM, and it considers even more complex dependencies.
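
For concreteness, a second-order local factor of the kind described here could be set up as follows (my notation for a sketch, not the formulation used in the hierarchical labeling article; $g_1$ and $g_2$ are two separate transition functions, and the sum in the denominator again runs only over $y_k$):

\begin{equation}P(y_k|\boldsymbol{x},y_{k-1},y_{k-2})=\frac{e^{g_1(y_{k-1},y_k)+g_2(y_{k-2},y_k)+f(y_k;\boldsymbol{x})}}{\sum\limits_{y_k}e^{g_1(y_{k-1},y_k)+g_2(y_{k-2},y_k)+f(y_k;\boldsymbol{x})}}\end{equation}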

Article Summary

This article introduced and slightly extended MEMM, which, alongside CRF, is a classic example of a probabilistic graphical model; the main difference between the two lies in the normalization method. I then compared them experimentally, concluding that MEMM trains faster but does not outperform CRF. Nevertheless, I believe MEMM still has its merits, so I closed with some brainstormed extensions of it.