GlobalPointer: A Unified Manner to Handle Nested and Non-nested NER

By 苏剑林 | May 01, 2021

(Note: The relevant content of this article has been organized into the paper 《Global Pointer: Novel Efficient Span-based Approach for Named Entity Recognition》. If you need to cite it, you can cite the English paper directly. Thank you.)

This article will introduce a design called GlobalPointer, which uses the idea of global normalization to perform Named Entity Recognition (NER). It can identify both nested and non-nested entities without distinction. In the case of non-nested (Flat NER), it achieves results comparable to CRF, while in nested (Nested NER) scenarios, it also performs well. Furthermore, theoretically, the design philosophy of GlobalPointer is more reasonable than CRF; and in practice, it does not require recursive denominator calculation during training like CRF, nor does it require dynamic programming during prediction—it is completely parallel. In ideal conditions, the time complexity is $\mathcal{O}(1)$!

In short: more elegant, faster, and more powerful! Is there really such a good design? Let's take a look.

GlobalPointer

Conventional Pointer Network designs for entity recognition or reading comprehension usually use two modules to identify the start and end of an entity separately, which causes inconsistency between training and prediction. GlobalPointer is designed specifically to address this inconsistency. It treats the head and tail as a unified whole for discrimination, thus having a more "global view" (more Global).

Basic Idea

Specifically, suppose the input text sequence length is $n$. For simplicity, assume there is only one type of entity to be recognized, and assume each entity is a continuous segment of the sequence with no length limit, and they can be nested (overlapping entities). How many "candidate entities" are there in such a sequence? It is not difficult to deduce that the answer is $n(n+1)/2$, which is the number of all possible continuous subsequences. Our task is to pick the true entities from these $n(n+1)/2$ candidates, which is essentially a "k-out-of-$n(n+1)/2$" multi-label classification problem. If there are $m$ entity types, we carry out $m$ such multi-label classification problems. This is the basic idea of GlobalPointer: recognizing entities as the basic unit, as shown in the image at the beginning of this article.

Some readers might ask: Isn't the complexity of this design $\mathcal{O}(n^2)$? Won't it be very slow? In the era of RNNs/CNNs, it might have been perceived as slow. But now, in the era of Transformers, every layer of a Transformer already has $\mathcal{O}(n^2)$ complexity. Adding one layer of GlobalPointer or removing one doesn't make much difference. Crucially, $\mathcal{O}(n^2)$ is only the spatial complexity. If parallel performance is good, the time complexity can even drop to $\mathcal{O}(1)$, so there is no noticeable delay.

Mathematical Form

Let the input $t$ of length $n$ be encoded into a vector sequence $[\boldsymbol{h}_1,\boldsymbol{h}_2,\cdots,\boldsymbol{h}_n]$. Through transformations $\boldsymbol{q}_{i,\alpha}=\boldsymbol{W}_{q,\alpha}\boldsymbol{h}_i+\boldsymbol{b}_{q,\alpha}$ and $\boldsymbol{k}_{i,\alpha}=\boldsymbol{W}_{k,\alpha}\boldsymbol{h}_i+\boldsymbol{b}_{k,\alpha}$, we obtain vector sequences $[\boldsymbol{q}_{1,\alpha},\boldsymbol{q}_{2,\alpha},\cdots,\boldsymbol{q}_{n,\alpha}]$ and $[\boldsymbol{k}_{1,\alpha},\boldsymbol{k}_{2,\alpha},\cdots,\boldsymbol{k}_{n,\alpha}]$, which are used to identify entities of type $\alpha$. We can then define: \begin{equation}s_{\alpha}(i,j) = \boldsymbol{q}_{i,\alpha}^{\top}\boldsymbol{k}_{j,\alpha}\label{eq:s}\end{equation} as the score for the continuous segment from $i$ to $j$ being an entity of type $\alpha$. That is, the inner product of $\boldsymbol{q}_{i,\alpha}$ and $\boldsymbol{k}_{j,\alpha}$ is used as the logit for the segment $t_{[i:j]}$ being type $\alpha$. Under this design, GlobalPointer is actually a simplified version of Multi-Head Attention, where each entity type corresponds to one head, omitting calculations related to $\boldsymbol{V}$.

Relative Position

Theoretically, the design in Eq. $\eqref{eq:s}$ is sufficient. However, in practice, when training data is limited, its performance is often subpar because it does not explicitly contain relative position information. In the experiments later, we will see that the presence or absence of relative position information can make a difference of over 30 percentage points!

For example, if we want to identify place names in weather forecast text like "Beijing: 21 degrees; Shanghai: 22 degrees; Hangzhou: 23 degrees...", there are many entities to identify. Without relative position information, GlobalPointer is not very sensitive to the length and span of entities. Therefore, it easily predicts any combination of a start from one entity and an end from another (e.g., predicting "Beijing: 21 degrees; Shanghai" as an entity). Conversely, with relative position information, GlobalPointer becomes sensitive to entity spans and can better distinguish true entities.

Which relative position encoding should be used? Theoretically, any relative position encoding used in Transformers can be considered (refer to "Transformer Position Encodings that Rack Researchers' Brains"). However, in practice, most relative position encodings truncate the relative position. While this truncation range is usually enough for entity recognition, it lacks elegance. Without truncation, one faces the problem of too many learnable parameters. After consideration, I found the Rotary Positional Embedding (RoPE) I previously conceived to be suitable.

The introduction to RoPE can be found in "Transformer Upgrade Road: 2. Rotary Positional Embedding". It is essentially a transformation matrix $\boldsymbol{\mathcal{R}}_i$ satisfying the relationship $\boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j = \boldsymbol{\mathcal{R}}_{j-i}$. Applying this to $\boldsymbol{q}$ and $\boldsymbol{k}$ respectively, we have: \begin{equation}s_{\alpha}(i,j) = (\boldsymbol{\mathcal{R}}_i\boldsymbol{q}_{i,\alpha})^{\top}(\boldsymbol{\mathcal{R}}_j\boldsymbol{k}_{j,\alpha}) = \boldsymbol{q}_{i,\alpha}^{\top} \boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j\boldsymbol{k}_{j,\alpha} = \boldsymbol{q}_{i,\alpha}^{\top} \boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{k}_{j,\alpha}\end{equation} This explicitly injects relative position information into the score $s_{\alpha}(i,j)$.

Optimization Details

In this section, we discuss details during the training of GlobalPointer, including the choice of loss function and the calculation and optimization of evaluation metrics. We can see that GlobalPointer's entity-based design offers many elegant conveniences.

Loss Function

So far, we have designed the score $s_{\alpha}(i,j)$. Identifying entities of a specific category $\alpha$ has become a multi-label classification problem with $n(n+1)/2$ classes. The next key is the design of the loss function. The most naive approach is to treat it as $n(n+1)/2$ binary classifications. However, in practice, $n$ is not small, $n(n+1)/2$ is even larger, while the number of entities in each sentence is small (usually in the single digits). Therefore, treating it as $n(n+1)/2$ binary classifications would lead to an extremely severe class imbalance problem.

This is where our previous research "Generalizing Softmax+CrossEntropy to Multi-label Classification" comes in handy. Simply put, this is a loss function for multi-label classification that is a generalization of single-target multi-class cross-entropy, particularly suitable for scenarios where the total number of categories is large and the number of target categories is small. Its form is not complex; for the GlobalPointer scenario, it is: \begin{equation}\log \left(1 + \sum\limits_{(i,j)\in P_{\alpha}} e^{-s_{\alpha}(i,j)}\right) + \log \left(1 + \sum\limits_{(i,j)\in Q_{\alpha}} e^{s_{\alpha}(i,j)}\right)\end{equation} where $P_{\alpha}$ is the set of (start, end) indices of all entities of type $\alpha$ in the sample, and $Q_{\alpha}$ is the set of indices for all non-entities or entities not of type $\alpha$. Note that we only need to consider combinations where $i\leq j$: \begin{equation}\begin{aligned} \Omega=&\, \big\{(i,j) \,\big\|\, 1\leq i\leq j\leq n\big\}\\ P_{\alpha}=&\, \big\{(i,j) \,\big\|\, t_{[i:j]} \text{ is an entity of type } \alpha\big\}\\ Q_{\alpha}=&\, \Omega - P_{\alpha} \end{aligned}\end{equation} In the decoding phase, all segments $t_{[i:j]}$ satisfying $s_{\alpha}(i,j) > 0$ are treated as entities of type $\alpha$. As can be seen, the decoding process is extremely simple, and under full parallelism, the decoding efficiency is $\mathcal{O}(1)$!

Evaluation Metrics

The common evaluation metric for NER is F1. Note that this is entity-level F1, not labeling-tag-level F1. Under traditional Pointer Network or CRF designs, calculating entity-level F1 during training is difficult. However, under the GlobalPointer design, calculating entity-level F1 or accuracy is very easy. For example, the F1 calculation is as follows:

def global_pointer_f1_score(y_true, y_pred):
    """F1 designed for GlobalPointer
    """
    y_pred = K.cast(K.greater(y_pred, 0), K.floatx())
    return 2 * K.sum(y_true * y_pred) / K.sum(y_true + y_pred)

It is this simple primarily due to GlobalPointer being "Global". Both y_true and y_pred are already at the entity level. By checking y_pred > 0, we know which entities were extracted, and then by matching them, we can calculate various entity-level metrics, achieving consistency between training, evaluation, and prediction.

Optimizing F1 Value

Another benefit of GlobalPointer's "Global" nature is that if we use it for Machine Reading Comprehension (MRC), it can directly optimize the MRC F1 metric! The F1 for MRC differs from NER F1 as it is a fuzzy match score of the answer. Directly optimizing F1 may be more beneficial for improving the final score. Using GlobalPointer for MRC is equivalent to NER with only one entity type. We define: \begin{equation}p(i,j) = \frac{e^{s(i,j)}}{\sum\limits_{i \leq j} e^{s(i,j)}}\end{equation} With $p(i,j)$ defined, and following reinforcement learning ideas (refer to "Policy Gradient and Zero-order Optimization Reaching the Same Goal"), optimizing F1 can be done using the following loss: \begin{equation}-\sum_{i\leq j} p(i,j) f_1(i,j) + \lambda \sum_{i\leq j}p(i,j)\log p(i,j)\end{equation} where $f_1(i,j)$ is the pre-calculated F1 similarity between the segment $t_{[i:j]}$ and the ground truth, and $\lambda$ is a hyperparameter. Calculating all $f_1(i,j)$ might be costly, but it is a one-time operation and can be optimized (e.g., setting to zero if the start and end are too far apart). Overall, it's within an acceptable range. If the goal is to improve the final F1 of MRC, this is a direct scheme to try. (I tried this on the Baidu LIC2021 MRC track this year and it indeed showed some effect.)

Experimental Results

Now that everything is ready, let's start the experiments. The experiment code is organized as follows:

Open source address: https://github.com/bojone/GlobalPointer

Currently, GlobalPointer is built into bert4keras >= 0.10.6. The three experimental tasks are all Chinese NER tasks. The first two are non-nested, and the third is nested NER. The sequence length statistics for their training sets are:

Dataset	Avg. Length (Chars)	Std Dev
People's Daily NER	46.93	30.08
CLUENER	37.38	10.71
CMeEE	54.15	80.27

People's Daily

First, we verify whether GlobalPointer can replace CRF in non-nested scenarios using the classic People's Daily corpus. The baseline is the BERT+CRF combination, compared against BERT+GlobalPointer. The results are as follows:

People's Daily NER Results
Model	Val F1	Test F1	Training Speed	Inference Speed
CRF	96.39%	95.46%	1x	1x
GlobalPointer (w/o RoPE)	54.35%	62.59%	1.61x	1.13x
GlobalPointer (w/ RoPE)	96.25%	95.51%	1.56x	1.11x

Firstly, the most striking visual impact in the table is the gap between GlobalPointer with and without RoPE, which reaches over 30 points! This demonstrates the importance of explicitly adding relative position information to GlobalPointer. In subsequent experiments, we will no longer verify the version without RoPE and assume it is added by default.

From the table, it is also evident that in classic non-nested NER tasks, GlobalPointer matches CRF in performance and even slightly surpasses it in speed. It can be described as both fast and effective.

CLUENER

Since the starting point for People's Daily is already quite high, the gap might not be obvious. To further test, we use the newer CLUENER dataset, which is also non-nested. The current SOTA F1 is around 81%. The comparison between BERT+CRF and BERT+GlobalPointer is as follows:

CLUENER Experimental Results
Model	Val F1	Test F1	Training Speed	Inference Speed
CRF	79.51%	78.70%	1x	1x
GlobalPointer	80.03%	79.44%	1.22x	1x

This result shows that as NER difficulty increases, even in non-nested scenarios, GlobalPointer's effect can outperform CRF. This indicates that for NER scenarios, GlobalPointer is actually more useful than CRF. We will later provide a simple theoretical analysis to further explain why GlobalPointer is theoretically more reasonable than CRF.

Regarding speed, since text lengths in this task are generally short, the speed increase for GlobalPointer is not very significant.

CMeEE

Finally, we test a nested task: CMeEE. It was the "Chinese Medical Named Entity Recognition" competition on biendata last year and is Task 1 of this year's "Chinese Medical Information Processing Challenge (CBLUE)". It is essentially medical NER with nested entities. Again, comparing CRF and GlobalPointer:

CMeEE Experimental Results
Model	Val F1	Test F1	Training Speed	Inference Speed
CRF	63.81%	64.39%	1x	1x
GlobalPointer	64.84%	65.98%	1.52x	1.13x

It can be seen that GlobalPointer clearly outperforms CRF in performance. Regarding speed, combined with the results of the three tasks, generally, the longer the text in the task, the more obvious the training acceleration of GlobalPointer. Prediction speed usually also sees a slight improvement, though not as large as the increase in the training phase. Later, I tinkered with RoBERTa-large as the encoder and found it (not too difficultly) reached over 67% on the online test set, proving GlobalPointer is a "competent" design.

Of course, some readers might complain: Using a non-nested CRF for nested NER makes the comparison with GlobalPointer unfair. True, to some extent. But it's not a major issue; on one hand, the F1 for CMeEE is still relatively low, and nested entities are not that numerous. Even ignoring the nested parts and treating them as non-nested wouldn't have a massive impact. On the other hand, for nested NER, I haven't found a simple and clear design that could easily serve as a baseline, so I ran CRF for a benchmark. Readers are welcome to report results from other designs.

Reflections and Extensions

In this section, we further compare CRF and GlobalPointer theoretically and introduce some works related to GlobalPointer to help readers better understand its positioning.

Compared to CRF

CRF (Conditional Random Field) is a classic design for sequence labeling. Since most NER can be transformed into sequence labeling, CRF is a classic method for NER. I have written articles like "A Concise Introduction to Conditional Random Field CRF (with Pure Keras Implementation)" and "Your CRF Layer's Learning Rate Might not be Big Enough". In previous introductions, we noted that if the number of sequence labels is $k$, the difference between frame-wise softmax and CRF is:

The former views sequence labeling as $n$ independent $k$-classification problems, while the latter views sequence labeling as one single $k^n$-classification problem.

This statement actually points out the theoretical shortcomings of using frame-wise softmax and CRF for NER. How so? Frame-wise softmax is too lenient because even if every label is predicted correctly, it doesn't mean the entity is correctly extracted—entities require all labels in a segment to be correct. Conversely, CRF viewing sequence labeling as one $k^n$-classification problem is too strict, meaning it requires every entity in the sequence to be predicted correctly to get a "correct" score; partial entities don't earn credit. Although CRF can produce partially correct results in practice, that's more a testament to the model's generalization; the CRF design itself implies a "full correctness or nothing" approach.

Therefore, CRF has theoretical aspects that are not entirely reasonable. In contrast, GlobalPointer is more aligned with application and evaluation scenarios: it is entity-based and designed as a "multi-label classification" problem. Consequently, its loss function and evaluation metrics are at the entity granularity, granting reasonable scores even if only some entities are correct. Therefore, it is "within reason" that GlobalPointer can achieve better results than CRF even in non-nested NER scenarios.

Related Work

If readers follow progress in entity recognition and information extraction, they might notice that GlobalPointer is very similar to TPLinker, a recent design for relation extraction. However, this idea of global normalization can be traced back even further.

I first encountered this idea in the paper 《Globally Normalized Reader》 published by Baidu in 2017, which proposed a Global Normalization Reader (GNR) for MRC. It not only treats (start, end) as a whole but (sentence, start, end) as a whole (following a flow of selecting a sentence first, then start/end), resulting in many combinations. It used ideas from 《Sequence-to-Sequence Learning as Beam-Search Optimization》 to reduce computation.

With GNR as a foundation, GlobalPointer is a quite natural expansion. In fact, back in 2019 while participating in the LIC2019 relation extraction track, similar ideas occurred to me, but several issues were unresolved at the time.

First, Transformer wasn't as prevalent, and $\mathcal{O}(n^2)$ complexity seemed daunting. Second, "Generalizing Softmax+CrossEntropy to Multi-label Classification" hadn't been conceived, so there was no good solution for class imbalance in multi-label classification. Third, my understanding of NLP was shallower, and bert4keras wasn't developed, making experiments difficult—if I had omitted RoPE back then and seen a 30-point drop, I wouldn't have known how to tune it.

Thus, GlobalPointer is a culmination of accumulation over the last two years—a bit "coincidental" but also "water reaching its course." As for TPLinker, it has no direct connection to the origin of GlobalPointer. Formally, GlobalPointer is indeed similar to TPLinker. In fact, TPLinker can be traced back further to 《Joint entity recognition and relation extraction as a multi-head selection problem》, but those works mainly applied Global ideas to relation extraction and did not optimize specifically for NER.

Additive vs Multiplicative

In implementation, a major difference between TPLinker and GlobalPointer is that for Multi-Head, TPLinker uses additive Attention: \begin{equation}s_{\alpha}(i,j) = \boldsymbol{W}_{o,\alpha}\tanh\left(\boldsymbol{W}_{h,\alpha}[\boldsymbol{h}_{i},\boldsymbol{h}_{j}]+\boldsymbol{b}_{h,\alpha}\right)+\boldsymbol{b}_{o,\alpha} \end{equation} It is currently unclear how much effect difference there is between this choice and Eq. $\eqref{eq:s}$. However, compared to the multiplicative Attention in Eq. $\eqref{eq:s}$, while their theoretical complexity is similar, the actual implementation of additive Attention is much more costly, especially in terms of space (video memory).

Therefore, I believe even if additive performance is slightly better, one should optimize based on the multiplicative approach, as additive efficiency is truly lacking. Additionally, papers like TPLinker did not report the importance of relative position information. Is relative position less important in additive Attention? This is currently unknown.

Conclusion

This article introduced GlobalPointer, a new design for NER based on the idea of global pointers. It integrates several of my previous research results to implement an "ideal design" for handling both nested and non-nested NER in a unified manner. Experimental results show that in non-nested scenarios, it achieves results comparable to CRF, and it also performs well in nested scenarios.