By 苏剑林 | November 25, 2017
This article aims to explain the principles of CRF (Conditional Random Field) using language that is as concise as possible. The term "In A Nutshell" carries meanings such as "introduction" or "popular science" (Hawking wrote a book called The Universe in a Nutshell, and I am clumsily following suit here).
Most articles introducing CRF on the web, whether in Chinese or English, start from the concept of probabilistic graphs and then present an exponential formula over feature functions, declaring that to be CRF. A "probabilistic graph" is just one illustrative way of understanding the model; if the underlying principle is never made clear, piling on figurative analogies only confuses the reader and comes across as deliberate mystification. (While I'm at it, let me vent again: training a neural network simply means computing a gradient and iterating, which is easy to understand, yet it was given the grand name "backpropagation." If you never state plainly that its essence is differentiation plus iterative updating, how many readers will truly understand from the word "backpropagation" alone?)
Alright, enough with the filler. Let's get to the main point.
Per-tag Softmax
CRF is commonly found in sequence labeling tasks. Suppose our model input is $Q$ and the output target is a sequence $a_1, a_2, \dots, a_n$; then, according to our usual modeling logic, we naturally hope to maximize the probability of the target sequence:
$$P(a_1, a_2, \dots, a_n | Q)$$
Whether using traditional methods or deep learning methods, modeling the complete sequence directly is difficult. Therefore, we usually use some assumptions to simplify it. For example, using the naive assumption gives us:
$$P(a_1, a_2, \dots, a_n | Q) = P(a_1 | Q) P(a_2 | Q) \dots P(a_n | Q)$$
Note that $Q$ here is not necessarily the original input; for instance, it might be the hidden outputs $q_1, q_2, \dots, q_n$ after multiple layers of LSTM. We assume that global correlations have already been captured by the preceding model, so in the final step, we can assume the features are independent of each other, thus:
\[\begin{aligned}
P(a_1 | Q) &= P(a_1 | q_1, q_2, \dots, q_n) = P(a_1 | q_1) \\
P(a_2 | Q) &= P(a_2 | q_1, q_2, \dots, q_n) = P(a_2 | q_2) \\
&\vdots \\
P(a_n | Q) &= P(a_n | q_1, q_2, \dots, q_n) = P(a_n | q_n)
\end{aligned}\]
Consequently:
$$P(a_1, a_2, \dots, a_n | Q) = P(a_1 | q_1) P(a_2 | q_2) \dots P(a_n | q_n)$$
This yields the most common scheme: at each position, directly output the tag with the maximum probability. The preceding model is usually a multi-layer bidirectional LSTM.
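As a concrete illustration, here is a minimal Keras sketch of this scheme (the sizes, layer widths, and names below are my own assumptions; the input is taken to be integer token ids):

```python
# Per-tag softmax: a BiLSTM encoder followed by an independent softmax
# at every position (a minimal sketch; all sizes are arbitrary assumptions).
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

num_words, num_tags, maxlen = 10000, 4, 100   # assumed vocabulary / tag / length sizes

x_in = Input(shape=(maxlen,))                            # token ids
h = Embedding(num_words, 128)(x_in)
h = Bidirectional(LSTM(64, return_sequences=True))(h)    # produces the features q_k
p = Dense(num_tags, activation='softmax')(h)             # P(a_k | q_k), independent per position

model = Model(x_in, p)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# Prediction is simply argmax over the last axis, position by position.
```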
Conditional Random Field
Per-tag softmax is a simple and effective method, but it sometimes produces unreasonable results. For example, in 4-tag (s, b, m, e) word segmentation, per-tag softmax cannot rule out a sequence such as bbbb, even though it violates our decoding rules (a b can only be followed by m or e). Some say per-tag softmax needs no dynamic programming, but that is not quite right: in this case we still need at least a "0 or 1" transition matrix that sets the unreasonable transition probabilities to 0 (such as $P(b|b)=0$), and then use dynamic programming to guarantee a valid sequence, as in the sketch below.
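As a sketch of what that dynamic programming looks like (the tag order and all names below are my own assumptions), the 0/1 mask can be folded into a standard Viterbi decode over the per-position probabilities:

```python
# Viterbi decoding of per-tag softmax outputs under a hard 0/1 transition
# mask (a sketch; tag order s, b, m, e is assumed, initial-state constraints omitted).
import numpy as np

tags = ['s', 'b', 'm', 'e']
# allowed[i, j] = 1 if tag j may follow tag i
allowed = np.array([[1, 1, 0, 0],   # s -> s, b
                    [0, 0, 1, 1],   # b -> m, e
                    [0, 0, 1, 1],   # m -> m, e
                    [1, 1, 0, 0]])  # e -> s, b

def viterbi_decode(probs, allowed):
    """probs: (n, num_tags) per-position softmax outputs."""
    log_p = np.log(probs + 1e-12)
    log_trans = np.where(allowed == 1, 0.0, -np.inf)   # the 0/1 mask in the log domain
    n, k = log_p.shape
    score = log_p[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + log_trans + log_p[t][None, :]
        back[t] = cand.argmax(axis=0)                  # best previous tag for each current tag
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[i] for i in reversed(path)]
```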
The problem with the above scheme fundamentally lies in the fact that when modeling, we used the naive assumption of complete output independence (unigram model), but our actual output sequences are context-dependent, resulting in a mismatch between the optimization objective and the model assumptions. Can we directly incorporate the context? Simple: just use a bigram model.
\[\begin{aligned}
P_Q(a_1, a_2, \dots, a_n) &= P_Q(a_1) P_Q(a_2 | a_1) P_Q(a_3 | a_1, a_2) \dots P_Q(a_n | a_1, \dots, a_{n-1}) \\
&= P_Q(a_1) P_Q(a_2 | a_1) P_Q(a_3 | a_2) \dots P_Q(a_n | a_{n-1})
\end{aligned}\]
To make the expression look better, I moved the input $Q$ into the subscript. This is already very close to CRF!
Look at the equation above again: it is a product of transition probabilities. But why must every factor be a transition probability? CRF drops that requirement, which is what makes it so general: it first defines a function $f(x, y; Q)$ (it could be a sum of some simple feature functions, but the specific form does not actually matter) and then directly sets:
$$P_Q(a_1, a_2, \dots, a_n) = \frac{1}{Z} \exp \left( \sum_k f(a_{k-1}, a_k; Q) \right)$$
where $Z$ is the normalization factor. Compared to the previous equation, the difference is that $P_Q(a_k | a_{k-1})$ has probabilistic meaning (conditional probability), while the individual term $e^{f(a_{k-1}, a_k; Q)}/Z$ does not have probabilistic meaning. Therefore, CRF is a more general form.
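Written out explicitly, the normalization factor sums over every possible tag sequence:

$$Z = \sum_{a_1, a_2, \dots, a_n} \exp \left( \sum_k f(a_{k-1}, a_k; Q) \right)$$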
That is all there is to CRF.
A more complete reference link: https://zhuanlan.zhihu.com/p/28465510
Linear Chain CRF
Wait? Are you kidding me? This is CRF? What on earth are these $P_Q(a_k | a_{k-1})$ terms? And what about $f(x, y; Q)$?
Dear reader, you've stumped me. I don't know what they are either. If you don't believe me, look at some tutorials online; the formulas they give are roughly like this (copied directly from the reference link above):
\[\begin{aligned}
p(l | s) &= \frac{\exp[score(l | s)]}{\sum_{l'} \exp[score(l' | s)]} \\
&= \frac{\exp[\sum_{j=1}^m \sum_{i=1}^n \lambda_j f_j(s, i, l_i, l_{i-1})]}{\sum_{l'} \exp[\sum_{j=1}^m \sum_{i=1}^n \lambda_j f_j(s, i, l'_i, l'_{i-1})]}
\end{aligned}\]
The $f$ terms here are all unknown "feature functions" that need to be specifically designed according to the problem, which is equivalent to saying they are unknown $f(a_{k-1}, a_k; Q)$. So, I really don't know what they are.
Fine, even if you are right, at least teach me how to use it?
Here, I'll introduce a frequently used version—the linear chain CRF, which is the version included with TensorFlow. Let's first write out:
\[\begin{aligned}
&P_Q(a_1, a_2, \dots, a_n) \\
&= P_Q(a_1) P_Q(a_2 | a_1) P_Q(a_3 | a_2) \dots P_Q(a_n | a_{n-1}) \\
&= P_Q(a_1) \frac{P_Q(a_1, a_2)}{P_Q(a_1) P_Q(a_2)} P_Q(a_2) \frac{P_Q(a_2, a_3)}{P_Q(a_2) P_Q(a_3)} P_Q(a_3) \dots \frac{P_Q(a_{n-1}, a_n)}{P_Q(a_{n-1}) P_Q(a_n)} P_Q(a_n)
\end{aligned}\]
Doesn't that look quite nice? Building on the general idea of CRF, we discard the probabilistic meaning of each term and directly write:
\[\begin{aligned}
&P_Q(a_1, a_2, \dots, a_n) \\
&= \frac{1}{Z} \exp \Big[ f(a_1; Q) + g(a_1, a_2; Q) + f(a_2; Q) + \dots + g(a_{n-1}, a_n; Q) + f(a_n; Q) \Big]
\end{aligned}\]
The so-called "linear chain" simply assumes that the function $g$ is independent of $Q$: every input shares one common $g(a_{k-1}, a_k)$, which is nothing more than a matrix of parameters to be learned. The rest is handled as in the per-tag softmax case, assuming $f(a_k; Q) \equiv f(a_k; q_k)$. By the maximum likelihood principle, the loss should be:
\[\begin{aligned}
&-\log P_Q(a_1, a_2, \dots, a_n) \\
&= - \sum_{k=1}^n f(a_k; q_k) - \sum_{k=2}^n g(a_{k-1}, a_k) + \log Z
\end{aligned}\]
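To make this loss concrete, here is a minimal NumPy sketch for a single sequence, under the assumptions above: the scores $f(a_k; q_k)$ arrive as a matrix (called `emissions` below), $g$ is a parameter matrix (called `trans`; both names are mine), and $\log Z$ is computed by the standard forward recursion:

```python
# Linear-chain CRF negative log-likelihood for one sequence (a sketch).
# emissions: (n, num_tags) array of f(a_k; q_k); trans: (num_tags, num_tags)
# array of g(a_{k-1}, a_k); tags: length-n integer array of gold labels.
import numpy as np
from scipy.special import logsumexp

def crf_neg_log_likelihood(emissions, trans, tags):
    n, _ = emissions.shape
    # Score of the gold path: sum_k f(a_k; q_k) + sum_k g(a_{k-1}, a_k)
    gold = emissions[np.arange(n), tags].sum() + trans[tags[:-1], tags[1:]].sum()
    # log Z via the forward algorithm in the log domain
    alpha = emissions[0].copy()
    for t in range(1, n):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emissions[t]
    return logsumexp(alpha) - gold   # -log P = log Z - gold path score
```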
If the preceding model uses a BiLSTM to obtain the features $q_k$, then we have the classic BiLSTM-CRF for sequence labeling tasks.
So, it is now easy to understand the CRF functions provided in TensorFlow:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf
Compared to per-tag softmax, CRF is merely a different loss function. Of course, it also adds a mutual information matrix (transition matrix), and decoding requires the Viterbi algorithm. But none of that is important, because TensorFlow has written it all for us.
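For reference, here is a minimal usage sketch of the TF 1.x contrib API linked above (placeholder shapes are assumed; in practice `unary_scores` would come from the BiLSTM):

```python
# Using tf.contrib.crf (TensorFlow 1.x): the CRF is just a different loss plus
# a learned transition matrix, with Viterbi decoding provided for us.
import tensorflow as tf

num_tags, max_len = 4, 100  # assumed sizes
unary_scores = tf.placeholder(tf.float32, [None, max_len, num_tags])  # f(a_k; q_k)
gold_tags = tf.placeholder(tf.int32, [None, max_len])
seq_lens = tf.placeholder(tf.int32, [None])

# Negative log-likelihood; transition_params is the learned g(a_{k-1}, a_k)
log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    unary_scores, gold_tags, seq_lens)
loss = tf.reduce_mean(-log_likelihood)

# Viterbi decoding inside the graph
pred_tags, best_score = tf.contrib.crf.crf_decode(
    unary_scores, transition_params, seq_lens)
```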
Redesign?
The linear chain CRF can be seen as a simplified template. Is it possible for us to refer to this template and design an improved CRF? For example, using a model to generate a mutual information matrix that is dependent on $Q$? It might be possible.
Only by cracking open the shell can you properly taste the kernel; that is the point of a "nutshell" piece. Once you know both the "what" and the "why," nothing about it feels mysterious anymore.
(New introduction: https://kexue.fm/archives/5542)