By 苏剑林 | May 18, 2018
Last year, I wrote a blog post "CRF In A Nutshell", which gave a rather rough introduction to the Conditional Random Field (CRF) model. That article clearly had its shortcomings: it was neither clear nor complete enough, and it lacked an implementation. Here, we revisit the model and fill in the missing pieces.
This article provides a concise introduction to the basic principles of CRF. Of course, "concise" is relative; to truly understand CRF, it is inevitable to mention some formulas. Readers who only care about usage can skip directly to the end of the text.
Following our previous line of thought, let's compare the similarities and differences between standard frame-wise softmax and CRF.
CRF is mainly used for sequence labeling problems, which can be simply understood as classifying each frame in a sequence. Since it is classification, it is natural to think of encoding the sequence using CNN or RNN and then connecting a fully connected layer with softmax activation, as shown below:

Frame-wise softmax does not directly consider output context correlation
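As a concrete reference point, here is a minimal Keras sketch of such a frame-wise softmax tagger (the bidirectional LSTM encoder and all hyper-parameters below are illustrative assumptions, not taken from the original post):

# Minimal sketch of the frame-wise softmax baseline: encode the sequence,
# then classify every frame independently with a softmax.
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

num_words, embed_dim, num_labels = 5000, 128, 4  # assumed vocabulary size, embedding size, tag count

model = Sequential([
    Embedding(num_words, embed_dim, mask_zero=True),
    Bidirectional(LSTM(64, return_sequences=True)),             # frame-wise encoding
    TimeDistributed(Dense(num_labels, activation='softmax')),   # independent per-frame softmax
])
model.compile(optimizer='adam', loss='categorical_crossentropy')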
However, when we design labels—for example, using four labels s, b, m, and e for word segmentation—the target output sequence itself carries certain contextual correlations. For instance, 's' cannot be followed by 'm' or 'e', and so on. Frame-wise softmax does not consider these contextual correlations at the output level, which means it places the burden of learning these correlations onto the encoding level, hoping the model can learn them on its own, which sometimes "asks too much" of the model.
CRF is more direct: it separates the correlations at the output level, allowing the model to learn more "comfortably":

CRF explicitly considers contextual correlation at the output end
Of course, simply introducing output correlations is not the whole story of CRF. The truly ingenious part of CRF is that it takes the path as the unit and considers the probability of the path.
Suppose an input has $n$ frames and each frame has $k$ possible labels. Theoretically, there are $k^n$ different possible outputs. We can visualize this using the following network diagram. In the diagram below, each node represents the possibility of a label, the lines between nodes represent the correlations between labels, and each labeling result corresponds to a complete path in the diagram.

Output network diagram in a 4-tag segmentation model
In sequence labeling tasks, our correct answer is generally unique. For example, for "今天天气不错" (Today's weather is good), if the segmentation result is "今天/天气/不/错", then the target output sequence is bebess. No other path fits the requirement. In other words, in sequence labeling tasks, our basic unit of study should be the path. What we want to do is select the correct one from $k^n$ paths. This means that if we treat it as a classification problem, it is a task of selecting one category out of $k^n$!
This is the fundamental difference between frame-wise softmax and CRF: the former views sequence labeling as $n$ independent $k$-classification problems, while the latter views it as a single $k^n$-classification problem.
Specifically, in the CRF sequence labeling problem, we want to calculate the conditional probability:
\begin{equation} P(y_1,\dots,y_n|x_1,\dots,x_n)=P(y_1,\dots,y_n|\boldsymbol{x}),\quad \boldsymbol{x}=(x_1,\dots,x_n) \tag{1} \end{equation}To obtain an estimate of this probability, CRF makes two assumptions:
Assumption 1: The distribution belongs to the exponential family.
This assumption implies there exists a function $f(y_1,\dots,y_n;\boldsymbol{x})$ such that:
\begin{equation} P(y_1,\dots,y_n|\boldsymbol{x})=\frac{1}{Z(\boldsymbol{x})}\exp\Big(f(y_1,\dots,y_n;\boldsymbol{x})\Big) \tag{2} \end{equation}where $Z(\boldsymbol{x})$ is the normalization factor. Since this is a conditional distribution, the normalization factor depends on $\boldsymbol{x}$. This function $f$ can be viewed as a scoring function. The probability distribution is obtained by taking the exponential of the scoring function and normalizing it.
Assumption 2: Correlations between outputs occur only at adjacent positions, and the correlations are exponentially additive.
This assumption implies that $f(y_1,\dots,y_n;\boldsymbol{x})$ can be further simplified as:
\begin{equation} \begin{aligned}f(y_1,\dots,y_n;\boldsymbol{x})=&h(y_1;\boldsymbol{x})+g(y_1,y_2;\boldsymbol{x})+h(y_2;\boldsymbol{x})+g(y_2,y_3;\boldsymbol{x})+h(y_3;\boldsymbol{x})\\ &+\dots+g(y_{n-1},y_n;\boldsymbol{x})+h(y_n;\boldsymbol{x})\end{aligned} \tag{3} \end{equation}This means we only need to score each individual label and each adjacent label pair, and then sum all these scores to get the total score.
Despite significant simplifications, the probability model represented by Equation $(3)$ is generally still too complex to solve. Given that in current deep learning models, RNNs or stacked CNNs are already capable of capturing the relationship between each $y$ and the input $\boldsymbol{x}$ quite effectively, we might as well consider the function $g$ to be independent of $\boldsymbol{x}$:
\begin{equation} \begin{aligned}f(y_1,\dots,y_n;\boldsymbol{x})=h(y_1;\boldsymbol{x})+&g(y_1,y_2)+h(y_2;\boldsymbol{x})+\dots\\ +&g(y_{n-1},y_n)+h(y_n;\boldsymbol{x})\end{aligned} \tag{4} \end{equation}In this case, $g$ is effectively just a finite, trainable parameter matrix, while the single-label scoring function $h(y_i;\boldsymbol{x})$ can be modeled through an RNN or CNN. Thus, the model can be established, where the probability distribution becomes:
\begin{equation} P(y_1,\dots,y_n|\boldsymbol{x})=\frac{1}{Z(\boldsymbol{x})}\exp\left(h(y_1;\boldsymbol{x})+\sum_{t=1}^{n-1}\Big[g(y_t,y_{t+1})+h(y_{t+1};\boldsymbol{x})\Big]\right) \tag{5} \end{equation}This is the concept of a Linear Chain CRF.
To train the CRF model, we use the maximum likelihood method, which involves using:
\begin{equation} -\log P(y_1,\dots,y_n|\boldsymbol{x}) \tag{6} \end{equation}as the loss function. It can be calculated that this equals:
\begin{equation} -\left(h(y_1;\boldsymbol{x})+\sum_{t=1}^{n-1}\Big[g(y_t,y_{t+1})+h(y_{t+1};\boldsymbol{x})\Big]\right)+\log Z(\boldsymbol{x}) \tag{7} \end{equation}The first term is the logarithm of the numerator of the original probability expression, which is the score of the target sequence. Although it looks circuitous, it is not difficult to compute. The real difficulty lies in calculating the logarithm of the denominator, $\log Z(\boldsymbol{x})$.
The normalization factor, also called the partition function in physics, requires us to sum the exponential scores of all possible paths. As we mentioned earlier, the number of such paths is exponential ($k^n$), making direct calculation practically impossible.
In fact, the difficulty of calculating the normalization factor is a common problem in almost all probabilistic graphical models. Fortunately, in the CRF model, because we only consider connections between adjacent labels (the Markov assumption), we can calculate the normalization factor recursively, which reduces the computational complexity from exponential to linear. Specifically, we denote the normalization factor calculated up to time $t$ as $Z_t$ and decompose it into $k$ parts:
\begin{equation} Z_t = Z^{(1)}_t + Z^{(2)}_t + \dots + Z^{(k)}_t \tag{8} \end{equation}where $Z^{(1)}_t, \dots, Z^{(k)}_t$ are the sums of the exponential scores of all paths up to current time $t$ that end with labels $1, \dots, k$, respectively. Then, we can calculate recursively:
\begin{equation} \begin{aligned}Z^{(1)}_{t+1} = &\Big(Z^{(1)}_t G_{11} + Z^{(2)}_t G_{21} + \dots + Z^{(k)}_t G_{k1} \Big) H_{t+1}(1|\boldsymbol{x})\\ Z^{(2)}_{t+1} = &\Big(Z^{(1)}_t G_{12} + Z^{(2)}_t G_{22} + \dots + Z^{(k)}_t G_{k2} \Big) H_{t+1}(2|\boldsymbol{x})\\ &\qquad\qquad\vdots\\ Z^{(k)}_{t+1} =& \Big(Z^{(1)}_t G_{1k} + Z^{(2)}_t G_{2k} + \dots + Z^{(k)}_t G_{kk} \Big) H_{t+1}(k|\boldsymbol{x}) \end{aligned} \tag{9} \end{equation}This can be simply written in matrix form as:
\begin{equation} \boldsymbol{Z}_{t+1} = \boldsymbol{Z}_{t} \boldsymbol{G} \otimes H(y_{t+1}|\boldsymbol{x}) \tag{10} \end{equation}where $\boldsymbol{Z}_{t}=\Big[Z^{(1)}_t,\dots,Z^{(k)}_t\Big]$; $\boldsymbol{G}$ is the matrix obtained by taking the exponential of each element of matrix $g$ (as mentioned, in the simplest case, $g$ is just a matrix representing the scores from one label to another), i.e., $\boldsymbol{G}_{ij}=e^{g_{ij}}$; and $H(y_{t+1}|\boldsymbol{x})$ is the exponential of the scores for each label at position $t+1$ from the encoding model $h(y_{t+1}|\boldsymbol{x})$ (RNN, CNN, etc.), i.e., $H(y_{t+1}|\boldsymbol{x})=e^{h(y_{t+1}|\boldsymbol{x})}$, which is also a vector. In Equation $(10)$, the step $\boldsymbol{Z}_{t} \boldsymbol{G}$ is a matrix multiplication resulting in a vector, and $\otimes$ is element-wise multiplication of two vectors.
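In code, this recursion is only a few lines. The following NumPy sketch (an illustration of Equation $(10)$ under assumed array shapes, not the Keras implementation given later) computes the normalization factor:

# H: (n, k) array of exponentiated node scores, H[t, i] = exp(h(i; x)) at frame t
# G: (k, k) array of exponentiated transition scores, G[i, j] = exp(g(i, j))
import numpy as np

def partition_function(H, G):
    Z = H[0]                    # Z_1: one-frame paths ending at each label
    for t in range(1, len(H)):
        Z = (Z @ G) * H[t]      # Equation (10): matrix product, then element-wise product
    return Z.sum()              # total normalization factor Z(x)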

Illustration of the recursive calculation of the normalization factor. The step from time t to t+1 involves the transition probabilities and the probability of node t+1 itself.
For readers who are unfamiliar, Equation $(10)$ might be difficult to accept at first glance. Readers can try writing out the normalization factors for $n=1, n=2, n=3$, look for their recursive relationships, and gradually understand Equation $(10)$.
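To spell out the first two cases: for $n=1$, $Z^{(i)}_1=e^{h(i;\boldsymbol{x})}=H_1(i|\boldsymbol{x})$; for $n=2$, \begin{equation}Z_2=\sum_{y_1,y_2}e^{h(y_1;\boldsymbol{x})+g(y_1,y_2)+h(y_2;\boldsymbol{x})}=\sum_{y_2}\Big(\sum_{y_1}Z^{(y_1)}_1 G_{y_1 y_2}\Big)H_2(y_2|\boldsymbol{x})=\sum_{y_2}Z^{(y_2)}_2\end{equation}which is exactly one application of Equation $(10)$ followed by summing the components of $\boldsymbol{Z}_2$.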
Once we have written the loss function $-\log P(y_1,\dots,y_n|\boldsymbol{x})$, model training can be completed because current deep learning frameworks already include automatic differentiation functions. As long as we can write a differentiable loss, the framework can handle the optimization process for us.
The remaining final step is how to find the optimal path based on an input after model training is complete. As before, this is a problem of selecting the best path out of $k^n$ possibilities. Similarly, due to the Markov assumption, this can be transformed into a dynamic programming problem solved by the Viterbi algorithm, with a computational complexity proportional to $n$.
Dynamic programming has appeared multiple times in this blog. Its recursive idea is: if an optimal path is cut into two segments, each segment must be a (locally) optimal path. You can type "Dynamic Programming" in the search box on the right of this blog to find many related introductions, so I won't repeat it here~
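For reference, here is a minimal NumPy sketch of Viterbi decoding for this linear-chain CRF (an illustration under the notation above, not the version in the author's GitHub example mentioned at the end). The array node_scores holds the per-frame label scores $h$, and trans holds the transition scores $g$:

import numpy as np

def viterbi_decode(node_scores, trans):
    """node_scores: (n, k) per-frame label scores; trans: (k, k) transition scores."""
    n, k = node_scores.shape
    score = node_scores[0]                         # best score of a path ending at each label, t = 0
    backpointers = []
    for t in range(1, n):
        # score[i] + trans[i, j] + node_scores[t, j] for every label pair (i, j)
        total = score[:, None] + trans + node_scores[t][None, :]
        backpointers.append(total.argmax(axis=0))  # best previous label for each current label
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for back in reversed(backpointers):            # trace the back-pointers to recover the path
        best.append(int(back[best[-1]]))
    return best[::-1]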
After debugging, I obtained a concise implementation of a linear chain CRF based on the Keras framework. This might be the shortest CRF implementation. Here, I share the final implementation and explain the key points.
As we explained earlier, the difficulty in implementing a CRF is the calculation of $-\log P(y_1,\dots,y_n|\boldsymbol{x})$, and the core difficulty is the calculation of the normalization factor $Z(\boldsymbol{x})$. Thanks to the Markov assumption, we obtained recursive equations $(9)$ or $(10)$, which represent the general calculation for $Z(\boldsymbol{x})$.
How do we implement this recursive calculation in a deep learning framework? Note that from the perspective of a computational graph, this is defining a graph through recursion, and the length of this graph is not fixed. This shouldn't be a problem for dynamic graph frameworks like PyTorch, but it is difficult for static graph frameworks like TensorFlow or Keras (which is based on TensorFlow).
However, it is not impossible. We can use the ready-made rnn function for the calculation! We know that an RNN essentially performs the recursive calculation \begin{equation} \text{new state} = f(\text{old state},\,\text{current input}) \tag{11} \end{equation}
Newer versions of TensorFlow and Keras already allow us to define custom RNN cells, which means the function $f$ can be defined by ourselves, while the backend automatically handles the recursive calculation for us. So, we only need to design an RNN such that the $\boldsymbol{Z}$ we want to calculate corresponds to the hidden vector of the RNN!
This is the most exquisite part of the CRF implementation.
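To see the mechanism in isolation, here is a toy sketch (with assumed shapes, unrelated to the CRF code below) of how the backend rnn function drives such a recursion: the step function plays the role of $f$, and the state it carries forward is what the $\boldsymbol{Z}$ vector will be in the CRF layer.

import numpy as np
import keras.backend as K

def step(inputs, states):
    # new state = f(old state, current input); here f is just a running sum
    new_state = states[0] + inputs
    return new_state, [new_state]

x = K.constant(np.ones((2, 5, 3)))             # (batch_size, timesteps, dim), toy shapes
init_state = [K.constant(np.zeros((2, 3)))]
last_output, outputs, states = K.rnn(step, x, init_state)
# last_output is the carried state after the final time step, i.e. the accumulated sum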
The rest are some minor details, including:
1. To prevent overflow, we usually take logs. Since the normalization factor is a sum of exponentials, it is actually in the format $\log\left(\sum_{i=1}^k e^{a_i}\right)$. The trick for this calculation is:
$$\log\left(\sum_{i=1}^k e^{a_i}\right)=A + \log\left(\sum_{i=1}^k e^{a_i-A}\right),\quad A = \max \{a_1,\dots,a_k\}$$
TensorFlow and Keras both already provide the corresponding logsumexp function; just call it directly.
2. For the calculation of the numerator (the score of the target sequence), the trick used in the code is to multiply the (one-hot) "target sequence" with the "predicted sequence" to extract the target score. Both tricks are illustrated in the small NumPy sketch after this list.
3. Regarding how to mask padding parts for variable-length inputs: I think Keras doesn't do a very good job in this area. To implement this masking simply, my approach is to introduce one extra label. For example, if there were originally four labels s, b, m, e for segmentation, I introduce a fifth label, say 'x'. Set the labels for the padding parts to 'x', and then ignore the existence of the fifth label during CRF loss calculation. See the code for specific details.
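To make tricks 1 and 2 concrete, here is a tiny NumPy illustration with toy numbers (purely for exposition):

import numpy as np

# Trick 1: shift by the maximum before exponentiating, so nothing overflows.
a = np.array([1000.0, 1001.0, 999.0])
safe = a.max() + np.log(np.sum(np.exp(a - a.max())))  # finite; the naive log-sum-exp would give inf

# Trick 2: multiplying one-hot targets with the score matrix picks out the target path's scores.
scores = np.array([[0.2, 1.5, -0.3],       # h(y_t; x) for a toy 3-frame, 3-label problem
                   [0.7, 0.1,  0.9],
                   [1.2, 0.4,  0.8]])
labels = np.eye(3)[[1, 2, 0]]              # one-hot target sequence: labels 1, 2, 0
point_score = np.sum(scores * labels)      # = 1.5 + 0.9 + 1.2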
A pure Keras implementation of the CRF layer, feel free to use it~
# -*- coding:utf-8 -*-
from keras.layers import Layer
import keras.backend as K


class CRF(Layer):
    """Pure Keras implementation of the CRF layer.
    The CRF layer is essentially a loss calculation layer with trainable parameters.
    Therefore, the CRF layer is only used for training the model,
    while prediction requires building a separate model.
    """
    def __init__(self, ignore_last_label=False, **kwargs):
        """ignore_last_label: whether to ignore the last label, which serves as a mask.
        """
        self.ignore_last_label = 1 if ignore_last_label else 0
        super(CRF, self).__init__(**kwargs)

    def build(self, input_shape):
        self.num_labels = input_shape[-1] - self.ignore_last_label
        self.trans = self.add_weight(name='crf_trans',
                                     shape=(self.num_labels, self.num_labels),
                                     initializer='glorot_uniform',
                                     trainable=True)

    def log_norm_step(self, inputs, states):
        """Recursive calculation of the normalization factor.
        Key points: 1. recursive calculation; 2. use logsumexp to avoid overflow.
        Trick: use expand_dims to align tensors.
        """
        inputs, mask = inputs[:, :-1], inputs[:, -1:]
        states = K.expand_dims(states[0], 2)      # (batch_size, output_dim, 1)
        trans = K.expand_dims(self.trans, 0)      # (1, output_dim, output_dim)
        outputs = K.logsumexp(states + trans, 1)  # (batch_size, output_dim)
        outputs = outputs + inputs
        outputs = mask * outputs + (1 - mask) * states[:, :, 0]
        return outputs, [outputs]

    def path_score(self, inputs, labels):
        """Calculate the relative probability of the target path (not yet normalized).
        Key point: the sum of the individual label scores plus the transition scores.
        Trick: multiply "prediction" by "target" to extract the score of the target path.
        """
        point_score = K.sum(K.sum(inputs * labels, 2), 1, keepdims=True)  # individual label scores
        labels1 = K.expand_dims(labels[:, :-1], 3)
        labels2 = K.expand_dims(labels[:, 1:], 2)
        labels = labels1 * labels2  # labels offset by one step, used to extract the target transition scores
        trans = K.expand_dims(K.expand_dims(self.trans, 0), 0)
        trans_score = K.sum(K.sum(trans * labels, [2, 3]), 1, keepdims=True)
        return point_score + trans_score  # sum of the two parts

    def call(self, inputs):  # The CRF itself does not change the output; it only defines a loss
        return inputs

    def loss(self, y_true, y_pred):  # The target y_true needs to be in one-hot format
        if self.ignore_last_label:
            mask = 1 - y_true[:, :, -1:]
        else:
            mask = K.ones_like(y_pred[:, :, :1])
        y_true, y_pred = y_true[:, :, :self.num_labels], y_pred[:, :, :self.num_labels]
        path_score = self.path_score(y_pred, y_true)  # calculate the numerator (in log space)
        init_states = [y_pred[:, 0]]  # initial state
        y_pred = K.concatenate([y_pred, mask])
        log_norm, _, _ = K.rnn(self.log_norm_step, y_pred[:, 1:], init_states)  # calculate the Z vector (in log space)
        log_norm = K.logsumexp(log_norm, 1, keepdims=True)  # calculate Z (in log space)
        return log_norm - path_score  # -log P = log(denominator) - log(numerator)

    def accuracy(self, y_true, y_pred):  # frame-wise accuracy shown during training, excluding the mask
        mask = 1 - y_true[:, :, -1] if self.ignore_last_label else None
        y_true, y_pred = y_true[:, :, :self.num_labels], y_pred[:, :, :self.num_labels]
        isequal = K.equal(K.argmax(y_true, 2), K.argmax(y_pred, 2))
        isequal = K.cast(isequal, 'float32')
        if mask is None:
            return K.mean(isequal)
        else:
            return K.sum(isequal * mask) / K.sum(mask)
Excluding the comments and the accuracy code, the actual CRF code is only about 30 lines; compared with implementations in any framework, it counts as a concise CRF implementation~
Implementing complex models purely with Keras is quite interesting. It is currently only tested on the TensorFlow backend, but theoretically compatible with Theano and CNTK backends with minor adjustments.
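For completeness, here is a minimal usage sketch of wiring this layer into a tagging model (the CNN encoder and all hyper-parameters are illustrative assumptions; the author's full segmentation model is in the GitHub example below):

from keras.models import Model
from keras.layers import Input, Embedding, Conv1D, Dense

num_words, embed_dim, num_labels = 5000, 128, 5   # 4 real tags (s, b, m, e) + 1 mask tag 'x'

x_in = Input(shape=(None,))
x = Embedding(num_words, embed_dim)(x_in)
x = Conv1D(128, 3, padding='same', activation='relu')(x)
x = Dense(num_labels)(x)                          # per-frame scores h(y_t; x)

crf = CRF(ignore_last_label=True)                 # the 5th label acts as the padding mask
out = crf(x)

model = Model(x_in, out)
model.compile(loss=crf.loss, optimizer='adam', metrics=[crf.accuracy])
# model.fit(X, Y, ...)  # Y: (batch, time, 5) one-hot labels with padding frames marked as 'x'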
My GitHub also contains an example of Chinese word segmentation implemented using CNN+CRF, using the Bakeoff 2005 corpus. The example is a complete word segmentation implementation, including the Viterbi algorithm, segmentation output, etc.
GitHub address: https://github.com/bojone/crf/
Related content can also be found in my previous articles.
Finally, the introduction is complete. I hope everyone has gained something, and I hope the final implementation will be helpful to everyone~