By 苏剑林 | April 02, 2020
Baidu's "2020 Language and Intelligence Challenge" has begun. This year there are five tracks: Machine Reading Comprehension, Recommendation-based Dialogue, Semantic Parsing, Relation Extraction, and Event Extraction. The organizers provide PaddlePaddle-based baseline models for every track; here I share my own bert4keras-based baselines for three of them, which show how quickly and concisely baseline models can be built with bert4keras.
Here is a brief analysis of the task characteristics of these three tracks and the design of the corresponding baselines.
A sample from the Machine Reading Comprehension data:
There isn't much to say about the Machine Reading Comprehension baseline: it is simply BERT followed by two Dense + Softmax heads that predict the start and end positions of the answer, respectively. Some training samples are annotated with multiple answers, but since only a single answer needs to be predicted at inference time, one of the annotated answers is randomly chosen at each training step.
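As a rough illustration of this design, here is a minimal sketch of a span-prediction head on top of BERT with bert4keras; the config/checkpoint paths and hyperparameters are placeholders, not the actual baseline settings.

```python
# Minimal sketch of the start/end span-prediction head described above.
# Paths and hyperparameters are placeholders, not the actual baseline code.
from bert4keras.backend import keras, K
from bert4keras.models import build_transformer_model

Dense, Lambda = keras.layers.Dense, keras.layers.Lambda

config_path = 'bert_config.json'      # placeholder
checkpoint_path = 'bert_model.ckpt'   # placeholder

bert = build_transformer_model(config_path, checkpoint_path)

output = Dense(2)(bert.output)                                          # per-token logits for start and end
output = Lambda(lambda x: K.permute_dimensions(x, (0, 2, 1)))(output)   # (batch, 2, seq_len)
output = Lambda(lambda x: K.softmax(x, axis=-1))(output)                # distribution over token positions

model = keras.models.Model(bert.inputs, output)
# Labels have shape (batch, 2): the token indices of the answer's start and end.
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.Adam(2e-5))
```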
A sample from the Relation Extraction data:
Relation Extraction is essentially last year's triplet extraction task with some upgrades. The upgrade lies in handling the polysemy of a single predicate: for example, "stars in" might refer to which TV series someone starred in, or to which role they played in it, and if the same sentence contains several different objects for "stars in", all of them must be extracted to be counted as correct. Although it is called an upgrade, nothing fundamental changes: we simply concatenate the predicate with the name of its object slot and treat the result as a distinct predicate, which reduces the problem to a conventional triplet extraction task. For instance, "stars_in_@value" and "stars_in_inWork" are treated as two different predicates and extracted separately. My baseline is still based on last year's "semi-pointer, semi-labeling" design; for details, please refer to "A Lightweight Information Extraction Model based on DGCNN and Probabilistic Graphs".
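To make the predicate-plus-slot trick concrete, here is a small sketch of flattening a structured SPO record into plain triples; the field names ('subject', 'predicate', 'object', '@value', 'inWork') follow the schema described above and should be treated as assumptions about the competition's JSON format.

```python
# Sketch: flatten one structured SPO record into (subject, predicate_slot, object)
# triples so that a conventional triplet extractor can be used.
# Field names are assumptions about the competition's JSON schema.
def flatten_spo(spo):
    triples = []
    for slot, obj in spo['object'].items():
        triples.append((spo['subject'], spo['predicate'] + '_' + slot, obj))
    return triples

spo = {
    'subject': u'Zhang San',
    'predicate': u'stars_in',
    'object': {'@value': u'Role A', 'inWork': u'TV Series B'},
}
print(flatten_spo(spo))
# [('Zhang San', 'stars_in_@value', 'Role A'),
#  ('Zhang San', 'stars_in_inWork', 'TV Series B')]
```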
A sample from the Event Extraction data:
Event Extraction is a relatively new task: it requires extracting event types and the elements that describe each event. A single sentence may contain multiple events, and a single entity can describe several events at once (for example, "Month XX, Day XX" could be the occurrence time of several events). Event extraction itself is a complex task, but for this competition the organizers only evaluate triplets of the form (event_type, role, argument): each matched triplet earns one point. Since event_type and role are discrete categories and argument is an entity from the original text, the official metric effectively reduces the task to a standard entity labeling problem, so it can be solved with conventional sequence labeling models. Both my baseline and the official baseline treat it as a sequence labeling task.
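As an illustration of this reduction, here is a sketch of turning (event_type, role, argument) annotations into BIO tags whose label set is the product of event types and roles; the field names ('event_type', 'arguments', 'role', 'argument', 'argument_start_index') are assumptions about the competition's JSON format.

```python
# Sketch: convert event annotations into character-level BIO tags with
# composite "event_type-role" labels. Field names are assumed, not verified
# against the official data schema.
def build_bio_labels(text, event_list):
    labels = ['O'] * len(text)
    for event in event_list:
        for arg in event['arguments']:
            tag = event['event_type'] + '-' + arg['role']
            start = arg['argument_start_index']
            end = start + len(arg['argument'])
            labels[start] = 'B-' + tag
            for i in range(start + 1, end):
                labels[i] = 'I-' + tag
    return labels
```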
The three competitions above are all essentially extraction problems: the output entities are fragments of the original text. However, after the original text passes through the BERT tokenizer, the tokens may no longer align perfectly with the original text, because tokenization can slightly "add", "delete", or "modify" characters (e.g., lowercasing, changes in whitespace, or conversions between visually similar characters). Such minor changes are negligible in engineering practice, but they matter a great deal in competitions and academic evaluations: two characters that look the same still count as a mismatch if they are actually different. For example, readers can try running code like the following in Python:
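Here is one illustrative snippet (an assumed example, chosen to show the kind of mismatch described above): visually similar characters with different code points, plus a lowercasing step that silently rewrites the text.

```python
# Visually similar characters can be different code points, and lowercasing
# can rewrite characters, so a token may no longer match a slice of the
# original text.
print(u'，' == u',')          # False: full-width comma vs ASCII comma
print(u'Ⅱ'.lower())           # 'ⅱ': lowercasing maps U+2161 to U+2171
print(u'Ⅱ'.lower() == u'Ⅱ')   # False: the lowercased token differs from the source
```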
In order to map the tokenized results back to the original sequence, I spent some time adding a rematch method to bert4keras's Tokenizer. Given the original text and its tokenization, it returns the mapping from each token back to positions in the original text; with this mapping, answers can be sliced directly from the original text. For the specific implementation, please refer to the baseline code.
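A rough usage sketch follows; the exact return format of rematch may vary slightly between bert4keras versions, and here it is assumed to give one list of character indices per token (empty for special tokens), with the vocab path as a placeholder.

```python
# Sketch: recover the original surface string of a token via the rematch mapping.
# Assumes rematch returns, for each token, the list of character indices it
# covers in the raw text (empty for special tokens such as [CLS]/[SEP]).
from bert4keras.tokenizers import Tokenizer

tokenizer = Tokenizer('vocab.txt', do_lower_case=True)   # placeholder vocab path

text = u'Ⅱ期临床试验于2020年开展'
tokens = tokenizer.tokenize(text)
mapping = tokenizer.rematch(text, tokens)

i = 1                                   # skip [CLS] at position 0
start, end = mapping[i][0], mapping[i][-1] + 1
print(tokens[i], text[start:end])       # the token vs its original-text slice
```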
Three baselines written, and another blog post churned out~