By 苏剑林 | October 31, 2021
CLUE (Chinese GLUE) is an evaluation benchmark for Chinese Natural Language Processing, and it has already gained recognition from many teams. The official CLUE GitHub provides TensorFlow and PyTorch baselines, but they are not very readable or easy to debug. In fact, whether it's TensorFlow or PyTorch, CLUE or GLUE, I find that most available baseline code is far from user-friendly, and trying to understand it can be quite a painful experience.
Therefore, I decided to implement a set of CLUE baselines based on bert4keras. After a period of testing, I have basically reproduced the official benchmark results, and even outperformed them on some tasks. Most importantly, all the code is kept as clear and readable as possible, truly living up to the motto "Deep Learning for Humans."
Code Link: https://github.com/bojone/CLUE-bert4keras
Code Introduction
Below is a brief introduction to the construction ideas for each task baseline in this code. Before reading the article and code, readers are encouraged to observe the data format of each task themselves, as I will not provide a detailed introduction to the task data here.
Text Classification
First are the IFLYTEK and TNEWS tasks, which are ordinary text classification problems. The approach is very simple: use the conventional "[CLS] + sentence + [SEP]" input into BERT, and then take the hidden vector of [CLS] for classification.
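As a concrete illustration of this recipe, here is a minimal bert4keras-style sketch; the checkpoint paths are placeholders, and the data pipeline and training loop are omitted:

```python
from bert4keras.backend import keras
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

# Placeholder paths for Google's original Chinese BERT checkpoint.
config_path = 'chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'chinese_L-12_H-768_A-12/vocab.txt'

num_classes = 119  # IFLYTEK has 119 classes; adjust per task

tokenizer = Tokenizer(dict_path, do_lower_case=True)

# The encoder outputs the hidden vectors of the whole sequence.
bert = build_transformer_model(config_path, checkpoint_path)

# Take the [CLS] vector (position 0) and classify it.
output = keras.layers.Lambda(lambda x: x[:, 0])(bert.output)
output = keras.layers.Dense(num_classes, activation='softmax')(output)

model = keras.models.Model(bert.inputs, output)
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=keras.optimizers.Adam(2e-5),
    metrics=['accuracy'],
)
```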

Additionally, the pronoun disambiguation task WSC can also be transformed into a single-text classification task. The original task is to judge whether two fragments in a sentence (one of which is a pronoun) refer to the same object. The baseline approach marks these two fragments in the text with different symbols and then passes the marked text directly into BERT for binary classification.
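A toy sketch of the span-marking preprocessing; the marker symbols and the example sentence are purely illustrative, and the official baseline may choose different symbols:

```python
def mark_spans(text, pronoun_span, noun_span):
    """Insert marker symbols around the pronoun and the candidate noun.

    Spans are (start, end) character offsets. The markers '_' and '[ ]' are
    illustrative; what matters is that they are applied consistently at
    training and prediction time.
    """
    pieces = [(pronoun_span, '_', '_'), (noun_span, '[', ']')]
    # Work from right to left so earlier character offsets stay valid.
    for (start, end), left, right in sorted(pieces, key=lambda p: -p[0][0]):
        text = text[:start] + left + text[start:end] + right + text[end:]
    return text

# mark_spans('小明告诉小红他明天不来了。', (6, 7), (0, 2))
# -> '[小明]告诉小红_他_明天不来了。'
```

The marked sentence is then fed into BERT exactly like a single-text classification sample.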
Text Matching
Next, AFQMC, CMNLI, and OCNLI are three text matching tasks. Simply put, text matching is a sentence pair classification task. For example, similarity matching determines whether two sentences are similar; Natural Language Inference (NLI) determines the logical relationship between two sentences (entailment, neutral, contradiction), etc. In the pre-training era, the standard approach for sentence matching tasks is to connect two sentences with [SEP] and treat it as a single-text classification task.

It should be noted that in the original BERT, the SegmentIDs (originally called token type ids) of the two sentences are different. However, considering that models like RoBERTa do not have an NSP task, a SegmentID numbered 1 might not have been pre-trained. Therefore, in this implementation, the SegmentIDs are all set to 0. Experimental results show that this treatment does not reduce the effectiveness of text matching.
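A sketch of the corresponding encoding step with bert4keras's Tokenizer, zeroing out the SegmentIDs afterwards; the path is a placeholder, and argument names may vary slightly across bert4keras versions:

```python
from bert4keras.tokenizers import Tokenizer

dict_path = 'chinese_L-12_H-768_A-12/vocab.txt'  # placeholder path
tokenizer = Tokenizer(dict_path, do_lower_case=True)

# Encodes the pair as "[CLS] sent1 [SEP] sent2 [SEP]".
token_ids, segment_ids = tokenizer.encode('句子一', '句子二', maxlen=128)

# Force all SegmentIDs to 0: segment 1 may be undertrained
# for NSP-free models such as RoBERTa.
segment_ids = [0] * len(token_ids)
```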
Similarly, for the CSL task, which judges whether an abstract description matches 4 given keywords, we connect the 4 keywords with semicolons ";" and treat them as one sentence, thus converting it into a conventional text matching problem.
Reading Comprehension
Reading comprehension refers to the CMRC2018 task, which is an extractive reading comprehension task with a format similar to SQuAD. A paragraph is paired with multiple questions, and each question must have an answer that is a fragment of the paragraph. The common approach is to concatenate the question and paragraph with [SEP], input it into BERT, and then use two fully connected layers to predict the start and end positions. The problem with this is that it severs the connection between the start and end and creates an inconsistency between training and prediction behavior.
The baseline here uses GlobalPointer as the output structure, which treats the start-end combination as a single entity for classification. Specific details can be found in "GlobalPointer: A Unified Way to Handle Nested and Non-nested NER". Using GlobalPointer ensures that the training and prediction behaviors are completely consistent and significantly improves decoding speed.
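A minimal sketch of the output head, assuming the GlobalPointer layer that ships with bert4keras (argument names and defaults may differ slightly across versions); the head size of 64 is only an illustrative choice:

```python
from bert4keras.backend import keras
from bert4keras.layers import GlobalPointer
from bert4keras.models import build_transformer_model

# Placeholder paths for a Chinese BERT checkpoint.
config_path = 'chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'chinese_L-12_H-768_A-12/bert_model.ckpt'

bert = build_transformer_model(config_path, checkpoint_path)

# A single "entity type" (the answer span): the layer scores every
# (start, end) pair, giving an output of shape (batch, 1, seq_len, seq_len).
output = GlobalPointer(heads=1, head_size=64)(bert.output)

model = keras.models.Model(bert.inputs, output)
```

At prediction time, decoding is then a single argmax over the (start, end) cells within the passage region, rather than two separate argmaxes over start and end heads.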

Furthermore, in both SQuAD and CMRC2018, many paragraphs are noticeably longer than 512 tokens, and some answers do sit towards the end, so naive truncation could leave those questions with no answer in the retained text. While models like NEZHA and RoFormer can directly handle text longer than 512 tokens, models like BERT do not handle this well. To keep the code general, the sliding-window design from the original BERT baseline is followed here: the paragraph is segmented into multiple overlapping sub-paragraphs with a stride of 128, and each sub-paragraph is paired with the question and passed into the model. After such segmentation, a sub-paragraph is allowed to contain no answer to its question, in which case the answer points to the [CLS] position $(0,0)$. During prediction, long paragraphs are segmented in the same way, each part answers the same question, and the answer with the highest score is finally selected.
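A rough sketch of the sliding-window segmentation, written in plain Python and independent of the actual tokenization and offset bookkeeping in the baseline:

```python
def sliding_windows(doc_tokens, question_len, maxlen=512, stride=128):
    """Split an over-long passage into overlapping sub-passages.

    Each window leaves room for "[CLS] question [SEP] ... [SEP]" so that the
    question plus sub-passage fits within maxlen; consecutive windows are
    shifted by `stride` tokens, mirroring the original BERT SQuAD baseline.
    Returns (offset, tokens) pairs so predicted spans can be mapped back.
    """
    window = maxlen - question_len - 3  # 3 special tokens: [CLS], [SEP], [SEP]
    windows, start = [], 0
    while True:
        windows.append((start, doc_tokens[start:start + window]))
        if start + window >= len(doc_tokens):
            break
        start += stride
    return windows
```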
Multiple Choice
The multiple choice here refers to the C3 task, which can also be considered a reading comprehension task. Similarly, a paragraph has multiple questions, and the answer to each question is one of four given candidate answers, but not necessarily a fragment of the paragraph. The baseline approach for this multiple choice task might be surprising to many: it is converted into a text matching problem. Each candidate answer is matched against the paragraph and question, and the one with the highest score is taken during prediction.

As a result, each original question has to be split into 4 samples, and 4 predictions are needed to determine its answer, which greatly increases the computational cost. Surprisingly, however, this approach is essentially the best-performing of all the intuitively conceivable baselines, far better than concatenating all the candidates and performing a 4-way classification. A similar English task is DREAM, and the models on its leaderboard are essentially variants of this idea.
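A sketch of the prediction step under two assumptions: the model outputs one matching score per sample, and the passage/question/choice concatenation shown here is only one plausible layout, not necessarily the repository's exact format:

```python
import numpy as np
from bert4keras.snippets import sequence_padding

def predict_choice(model, tokenizer, context, question, choices, maxlen=512):
    """Score each candidate answer as an independent matching sample and
    return the index of the highest-scoring one."""
    batch_token_ids, batch_segment_ids = [], []
    for choice in choices:
        token_ids, segment_ids = tokenizer.encode(
            context, question + choice, maxlen=maxlen
        )
        batch_token_ids.append(token_ids)
        batch_segment_ids.append(segment_ids)
    batch_token_ids = sequence_padding(batch_token_ids)
    batch_segment_ids = sequence_padding(batch_segment_ids)
    # Assumes the model outputs a single matching score per sample.
    scores = model.predict([batch_token_ids, batch_segment_ids]).reshape(-1)
    return int(np.argmax(scores))
```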
Idiom Understanding
The idiom reading comprehension task, CHID, is essentially also a multiple choice reading comprehension problem, but it is more complex in form, so it is introduced separately here.
Specifically, each sample in CHID has 10 candidate idioms and several questions, each of which contains several blanks, and we need to decide which candidate idiom best fits each blank. If each question had only one blank, we could directly apply the multiple-choice approach from the previous section. Here, however, a question may have multiple blanks, and with the multiple-choice approach we can only identify one blank at a time, so the blank currently being identified is marked with [unused1], while the other blanks (if any) are simply replaced by 4 [MASK] tokens, for example:
[CLS] “这其实是个荒唐的闹剧,苹果发现iPad大陆商标的拥有人不属于台湾唯冠而是深圳唯冠后,开始着急了并 [unused1] 。”肖才元表示。事实上,两个戏剧性的因素让该案更显得 [MASK][MASK][MASK][MASK] 。苹果在香港法院提起的诉讼案件中,所提交的材料显示,IPADL公司实为苹果公司律师操作下成立的具有特殊目的的,旨在用于收购唯冠手中i-Pad商标权的公司。 [SEP] 一锤定音 [SEP]
In other words, a question with multiple blanks will be split into multiple sub-problems. According to the aforementioned multiple choice method, each sub-problem needs to be concatenated with candidate answers for prediction, so the computational cost of each sub-problem is equivalent to 10 samples of normal classification. This is indeed a bit taxing, but it's necessary for performance. To achieve a larger effective batch_size, gradient accumulation is usually used. Additionally, some questions are quite long, so we still need to truncate. The truncation method centers on the blank to be identified, extending as far as possible to both the left and right.
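A rough sketch of the centered truncation on a list of token ids; the position of the [unused1] placeholder and maxlen are inputs, and handling of special tokens and the appended candidate idiom is omitted:

```python
def center_truncate(token_ids, blank_index, maxlen):
    """Keep a window of at most `maxlen` tokens roughly centered on the blank.

    `blank_index` is the position of the [unused1] placeholder; the window is
    extended as evenly as possible to the left and right, and shifted when it
    hits either end of the sequence.
    """
    if len(token_ids) <= maxlen:
        return token_ids
    left = blank_index - maxlen // 2
    left = max(0, min(left, len(token_ids) - maxlen))
    return token_ids[left:left + maxlen]
```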
Finally, by the task's design, each sample has several questions, every blank in every question shares the same 10 candidate idioms, and no two blanks take the same answer. If, at prediction time, each blank independently picked its highest-scoring candidate, duplicate answers could appear, which would contradict the task design.
To guarantee non-duplicate predictions, we turn to the "Hungarian Algorithm": assuming there are $m$ blanks and each blank has $n > m$ candidate answers, we obtain an $m \times n$ matrix of scores, and we need to assign a distinct answer to each blank so that the total score is maximized. In mathematics, this is known as the "Assignment Problem," and its standard solution is the "Hungarian Algorithm"; we can directly use scipy.optimize.linear_sum_assignment to solve it. This post-processing improves accuracy by about 6% compared with simply taking the per-blank maximum (which can produce duplicate answers).
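A small self-contained illustration of this post-processing with SciPy; the score matrix here is random, just to show the call:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# scores[i, j]: the model's score of candidate idiom j for blank i,
# with m blanks and n >= m candidates (a toy 3 x 5 example here).
scores = np.random.rand(3, 5)

# linear_sum_assignment minimizes cost, so negate the scores (newer SciPy
# also accepts maximize=True) to maximize the total score under the
# constraint that each blank receives a distinct idiom.
blank_idx, idiom_idx = linear_sum_assignment(-scores)

answers = dict(zip(blank_idx.tolist(), idiom_idx.tolist()))
print(answers)  # e.g. {0: 4, 1: 2, 2: 0} -- no idiom assigned twice
```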
Entity Recognition
The final task is CLUENER, a conventional non-nested Named Entity Recognition (NER) task. Common baselines are BERT+Softmax or BERT+CRF. The one used here is BERT+GlobalPointer, which can also be referenced in "GlobalPointer: A Unified Way to Handle Nested and Non-nested NER". When GlobalPointer is used for NER, it can handle both nested and non-nested cases in a unified manner. My multiple experiments have shown that in non-nested scenarios, GlobalPointer can achieve performance comparable to CRF, with faster training and prediction speeds. Therefore, using GlobalPointer as the baseline for NER is a logical choice.
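Relative to the reading-comprehension head above, the only change is that the number of heads equals the number of entity types (10 for CLUENER). Below is a sketch of how one sample's GlobalPointer scores might be decoded into entities, assuming the usual convention that positive scores indicate entities:

```python
import numpy as np

def decode_entities(scores, threshold=0):
    """Decode GlobalPointer scores of shape (num_labels, seq_len, seq_len)
    into (label, start, end) triples: a span counts as an entity of a given
    type when its score exceeds the threshold."""
    entities = []
    for label, start, end in zip(*np.where(scores > threshold)):
        if start <= end:  # keep only valid (upper-triangular) spans
            entities.append((int(label), int(start), int(end)))
    return entities
```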
Effect Comparison
The results for each task on the CLUE test set are compared as follows. Those marked with $_{\text{-old}}$ are the results found on the official CLUE website, and those marked with $_{\text{-our}}$ are the reproduction results from this set of code. The BERT and RoBERTa models here are both base versions; BERT is the original Chinese BERT released by Google, and RoBERTa is the RoBERTa_wwm_ext open-sourced by Harbin Institute of Technology. The large versions will be tested if there is enough computing power and time.
$$
\begin{array}{c}
\text{Classification Tasks} \\
{\begin{array}{c|ccccccc}
\hline
& \text{IFLYTEK} & \text{TNEWS} & \text{AFQMC} & \text{CMNLI} & \text{OCNLI} & \text{WSC} & \text{CSL} \\
\hline
\text{BERT}_{\text{-old}} & 60.29 & 57.42 & 73.70 & 79.69 & 72.20 & 74.60 & 80.36 \\
\text{BERT}_{\text{-our}} & 61.19 & 56.29 & 73.37 & 79.37 & 71.73 & 73.85 & 84.03 \\
\hline
\text{RoBERTa}_{\text{-old}} & 60.31 & \text{-} & 74.04 & 80.51 & \text{-} & \text{-} & 81.00 \\
\text{RoBERTa}_{\text{-our}} & 61.12 & 58.35 & 73.61 & 80.81 & 74.27 & 82.28 & 85.33 \\
\hline
\end{array}}
\end{array}
$$
$$
\begin{array}{c}
\text{Reading Comprehension and NER Tasks} \\
{\begin{array}{c|cccc}
\hline
& \text{CMRC2018} & \text{C3} & \text{CHID} & \text{CLUENER} \\
\hline
\text{BERT}_{\text{-old}} & 71.60 & 64.50 & 82.04 & 78.82 \\
\text{BERT}_{\text{-our}} & 72.10 & 61.33 & 85.13 & 78.68 \\
\hline
\text{RoBERTa}_{\text{-old}} & 75.20 & 66.50 & 83.62 & \text{-} \\
\text{RoBERTa}_{\text{-our}} & 75.40 & 67.11 & 86.04 & 79.38 \\
\hline
\end{array}}
\end{array}
$$
Note: The RoBERTa entries for TNEWS and WSC are empty because the test sets of these two tasks were updated later, and the official GitHub has not updated its RoBERTa results accordingly. The entries for OCNLI and CLUENER are empty because the official results only cover BERT-base and RoBERTa-large; no RoBERTa-base results were provided.
Article Summary
This article shared the CLUE evaluation baselines built on bert4keras and briefly introduced the modeling ideas for each type of task. This set of baseline code is characterized by being simple, clear, and easy to adapt. It essentially matches the benchmark results reported by the CLUE officials, and even performs better on some tasks, so it can be considered a qualified set of baselines.