DGCNN: A CNN-based Reading Comprehension Question Answering Model

By 苏剑林 | April 15, 2018

Update 2019.08.20: An open-source Keras version has been released (https://kexue.fm/archives/6906)

As early as the beginning of the year, in the introductory article on "Attention is All You Need", I promised to share my experience with using CNNs in NLP, but I never found the time until now. These past few days, I finally got around to organizing the relevant content.

Background

Without further ado, let's first go over the basics of the model.

Model Features

This model—which I call DGCNN—is based on CNNs and a simple Attention mechanism. Since it does not use an RNN structure, it is quite fast. Moreover, it is specifically tailored for WebQA-style tasks, making it very lightweight. Models at the top of the SQuAD leaderboard, such as AoA and R-Net, all use RNNs and are accompanied by complex attention interaction mechanisms, none of which appear in DGCNN.

This is a model that can be trained in just a few hours on a GTX 1060!

DGCNN, which stands for Dilated Gated Convolutional Neural Network, as the name suggests, integrates two relatively new convolutional techniques: Dilated Convolution and Gated Convolution. By adding some handcrafted features and tricks, the model achieves state-of-the-art results while remaining lightweight and fast. At the time of writing this article, the model described here reached the top of the leaderboard with a score (average of accuracy and F1) of 0.7583. It is currently the only model that has never dropped out of the top three and has won the weekly championship the most times.

Competition Status

This model was a product of my participation in the CIPS-SOGOU Question Answering Competition on behalf of "Guangzhou Flame Technology Co., Ltd." The competition started last October but has dragged on, with no sign of ending or of moving on to new tasks.

Actually, in the first two or three months, the competition was quite intense. Many companies and universities submitted models, and the leaderboard was constantly being refreshed. I feel it's a bit of a letdown by SOGOU to be so indecisive, as it doesn't quite respect the enthusiasm of the participants. Most importantly, there have been no public notices about plans, changes, or the end date; contestants have just been left in limbo. I later heard the deadline is before this year's CIPS conference... a competition lasting a whole year??

Task Description

So far, this SOGOU competition has only run the factoid portion, which is basically the same as the WebQA dataset previously released by Baidu: a "One Question + Multiple Materials" format, where the goal is to determine the precise answer (usually an entity fragment) jointly from the multiple material segments.

Question: How many years of social security contributions are needed to receive a pension?
Answer: 15 years
Material 1: It's best not to quit; if you pay for 15 years until retirement, you can receive a pension. If there's a special reason you must quit, you can continue paying individually.
Material 2: Hello! Once pension insurance has been paid for 15 years and you reach retirement age, you can receive a pension.
Material 3: In life, everyone pays social security. How many years before retirement... regarding how many years you need to pay for social security to get a pension, it was introduced in the text above.

Compared to WebQA, the training set provided by SOGOU has much more noise, which increases the prediction difficulty. Furthermore, I believe this WebQA-style task leans more toward retrieval matching and preliminary semantic understanding techniques. This differs significantly from foreign tasks like SQuAD (one long passage + multiple questions). In the SQuAD corpus, some questions involve complex reasoning, so the models at the top of the SQuAD leaderboard are typically more complex and massive.

The Model

Now, let's formally introduce the model~

Architecture Overview

First, let's look at a general diagram of the model.

DGCNN General Diagram

As seen from the schematic, as a "Reading Comprehension" or "Question Answering" model, the structure is almost as simple as it gets.

The overall architecture of the model originates from the WebQA reference paper "Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering". This paper has several characteristics:

1. It directly encodes the question with an LSTM to get a "question encoding," then concatenates it with each word vector in the material;
2. It manually extracts 2 co-occurrence features;
3. It treats the final prediction as a sequence labeling task, solved with a CRF.

DGCNN essentially follows this line of thought. Our differences lie in:

1. Replacing all LSTM parts in the original model with CNNs;
2. Extracting richer co-occurrence features (8 in total);
3. Removing the CRF and changing it to "0/1 labeling" to separately identify the start and end positions of the answer, which can be seen as a "half-pointer, half-labeling" structure.

Convolutional Structure

In this part, we analyze the Conv1D Block shown in the diagram.

Gated Mechanism

The convolutional structure used in the model comes from Facebook's "Convolutional Sequence to Sequence Learning", which was also mentioned in "Sharing a Slide: Fancy Natural Language Processing". Assume the vector sequence we want to process is $\boldsymbol{X}=[\boldsymbol{x}_1,\boldsymbol{x}_2,\dots,\boldsymbol{x}_n]$. We can add a gate to a standard 1D convolution:

\[\boldsymbol{Y}=\text{Conv1D}_1(\boldsymbol{X}) \otimes \sigma\Big(\text{Conv1D}_2(\boldsymbol{X})\Big)\tag{1}\]

Note that the two Conv1D forms are identical (e.g., same number of kernels, window size), but the weights are not shared. This means the parameters are doubled. One uses a sigmoid activation, while the other has no activation function. They are then multiplied element-wise. Since the range of the sigmoid function is $(0,1)$, intuitively, it acts as a "valve" for each output of the Conv1D to control the flow. This is the GCNN structure, or it can be viewed as an activation function called GLU (Gated Linear Unit).
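
As a rough illustration, here is a minimal Keras sketch of equation $(1)$; the filter count and window size are placeholder assumptions, not the exact competition settings:

```python
from tensorflow.keras import layers

def gated_conv1d(x, filters=128, kernel_size=3):
    # Two Conv1D layers of identical shape but unshared weights:
    # a linear "content" convolution and a sigmoid "gate" convolution.
    h = layers.Conv1D(filters, kernel_size, padding='same')(x)   # no activation
    g = layers.Conv1D(filters, kernel_size, padding='same',
                      activation='sigmoid')(x)                   # the valve in (0, 1)
    return h * g                                                 # element-wise product
```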

Gated Conv with Residual

In addition to its intuitive meaning, an advantage of GCNN is the lower risk of vanishing gradients because one part of the convolution does not have any activation function. If the input and output dimensions are the same, we add the input as well, i.e., use a residual structure:

\[\boldsymbol{Y}=\boldsymbol{X} + \text{Conv1D}_1(\boldsymbol{X}) \otimes \sigma\Big(\text{Conv1D}_2(\boldsymbol{X})\Big)\tag{2}\]

It is worth mentioning that we use residual structures not just to solve vanishing gradients, but to allow information to be transmitted through multiple channels. We can rewrite the above equation into a more illustrative equivalent form to see clearly how information flows:

\begin{align}\boldsymbol{Y}=&\boldsymbol{X}\otimes \Big(1-\boldsymbol{\sigma}\Big) + \text{Conv1D}_1(\boldsymbol{X}) \otimes \boldsymbol{\sigma}\\ \boldsymbol{\sigma} =& \sigma\Big(\text{Conv1D}_2(\boldsymbol{X})\Big) \end{align}\tag{3}

From equation $(3)$, we can see the flow of information more clearly: it passes directly through with a probability of $1-\sigma$ and passes through after transformation with a probability of $\sigma$. This form is very similar to the GRU model in recurrent neural networks.

Supplementary Derivation:

\begin{align} \boldsymbol{Y}=&\boldsymbol{X}\otimes \Big[1-\sigma\Big(\text{Conv1D}_2(\boldsymbol{X})\Big)\Big] + \text{Conv1D}_1(\boldsymbol{X}) \otimes \sigma\Big(\text{Conv1D}_2(\boldsymbol{X})\Big)\\ =&\boldsymbol{X} + \Big(\text{Conv1D}_1(\boldsymbol{X}) - \boldsymbol{X}\Big)\otimes \sigma\Big(\text{Conv1D}_2(\boldsymbol{X})\Big) \end{align}

Since $\text{Conv1D}_1$ has no activation function, it is just a linear transformation. Thus, $\text{Conv1D}_1(\boldsymbol{X}) - \boldsymbol{X}$ can be combined into a single equivalent $\text{Conv1D}_1$. Simply put, during training, whatever $\text{Conv1D}_1(\boldsymbol{X}) - \boldsymbol{X}$ can achieve, $\text{Conv1D}_1(\boldsymbol{X})$ can also achieve. Thus, $(2)$ and $(3)$ are equivalent.

Dilated Convolution

Next, to enable the CNN model to capture longer distances without increasing model parameters, we use Dilated Convolution.

Standard vs Dilated Convolution

Consider a three-layer CNN (the first layer is the input) with a window size of 3. With standard convolutions, each node in the third layer sees only 3 nodes of the previous layer, and hence at most 5 of the original inputs. With dilated convolutions, a node in the third layer can be connected to 7 of the original inputs, while the number of parameters and the speed stay the same. The reason is that, in the second layer, the dilated convolution skips the inputs directly adjacent to the center and takes the center together with the next-but-one inputs (dilation rate of 2), which can be seen as a "window of size 5 with two holes." In the third layer, it skips three inputs between taps (dilation rate of 4), which can be seen as a "window of size 9 with 6 holes." If you draw lines connecting the relevant inputs and outputs, you will find that any node in the third layer is connected to exactly 7 of the original inputs.

Following the principle of "trying not to overlap or miss," the dilation rates of dilated convolutions generally grow in a geometric progression such as 1, 2, 4, 8, ... Of course, "trying" is the key word, as some overlap still remains. This progression is inspired by Google's WaveNet model.

Block

Now we can explain the Conv1D Blocks in the diagram. If the input and output dimensions match, it follows the dilated version of formula $(3)$. If they do not match, it is a simple version of formula $(1)$. The window sizes and dilation rates are noted in the diagram.
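
As a sketch, one residual dilated block of formula $(3)$ can be written as below (Keras again; sizes are assumptions and masking is omitted), and the blocks are stacked with growing dilation rates:

```python
from tensorflow.keras import layers

def res_gated_conv1d(x, filters=128, kernel_size=3, dilation_rate=1):
    # Formula (3): Y = X*(1-sigma) + Conv1D_1(X)*sigma, with a dilated window.
    # Assumes x already has `filters` channels so the residual connection is valid.
    h = layers.Conv1D(filters, kernel_size, padding='same',
                      dilation_rate=dilation_rate)(x)             # linear branch
    gate = layers.Conv1D(filters, kernel_size, padding='same',
                         dilation_rate=dilation_rate,
                         activation='sigmoid')(x)                 # the gate sigma
    return x * (1.0 - gate) + h * gate

# Stacking with geometrically growing dilation rates, e.g.:
# for rate in [1, 2, 4, 8]:
#     x = res_gated_conv1d(x, dilation_rate=rate)
```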

Attention

As seen in the schematic, in the DGCNN model, Attention is mainly used instead of simple Pooling to integrate sequence information. This includes encoding the question vector sequence into a single global question vector and encoding the material sequence into a global material vector. The Attention used here is slightly different from the one in "Attention is All You Need". This can be considered a "hard additive attention" of the form:

\begin{align}\boldsymbol{x}&=\text{Encoder}\big(\boldsymbol{x}_1,\boldsymbol{x}_2,\dots,\boldsymbol{x}_n\big)=\sum_{i=1}^n \lambda_i \boldsymbol{x}_i\\ \lambda_i&=\mathop{\text{softmax}}_i\Big(\boldsymbol{\alpha}^{\top}\,\text{Act}\big(\boldsymbol{W}\boldsymbol{x}_i\big)\Big)\end{align}\tag{4}

Here $\boldsymbol{\alpha}, \boldsymbol{W}$ are trainable parameters. $\text{Act}$ is an activation function, usually $\tanh$, though the $\text{swish}$ function can also be considered. When using $\text{swish}$, it's best to add the bias term:

\[\lambda_i=\mathop{\text{softmax}}_i\Big(\boldsymbol{\alpha}^{\top}\,\text{Act}\big(\boldsymbol{W}\boldsymbol{x}_i+\boldsymbol{b}\big)+\beta\Big)\tag{5}\]

This Attention scheme is referenced from the R-Net model. (Note: It might not have originated with R-Net, but that's where I learned it.)
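
A minimal Keras sketch of this pooling attention, i.e. equation $(4)$ with $\tanh$ (the hidden size is an assumption, and masking of padded positions is omitted):

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionPool1D(layers.Layer):
    # Eq. (4): lambda_i = softmax_i(alpha^T Act(W x_i)); output = sum_i lambda_i x_i
    def __init__(self, units=128, **kwargs):
        super().__init__(**kwargs)
        self.proj = layers.Dense(units, activation='tanh')  # W, b and Act
        self.score = layers.Dense(1)                         # alpha (and beta)

    def call(self, x):                         # x: (batch, steps, dim)
        s = self.score(self.proj(x))           # (batch, steps, 1)
        lam = tf.nn.softmax(s, axis=1)         # normalize over the time axis
        return tf.reduce_sum(lam * x, axis=1)  # (batch, dim)
```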

Position Embedding

To enhance the CNN's sense of position, we also add Position Embeddings, concatenated to each word vector in the material. The construction method directly follows the plan in "Attention is All You Need":

\[\left\{\begin{aligned}&PE_{2i}(p)=\sin\Big(p/10000^{2i/{d_{pos}}}\Big)\\ &PE_{2i+1}(p)=\cos\Big(p/10000^{2i/{d_{pos}}}\Big) \end{aligned}\right.\tag{6}\]
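
A small NumPy sketch of equation $(6)$, assuming an even $d_{pos}$ (e.g. the 128 dimensions used later):

```python
import numpy as np

def position_embedding(max_len, d_pos=128):
    # Eq. (6): PE_{2i}(p) = sin(p / 10000^{2i/d_pos}), PE_{2i+1}(p) = cos(p / 10000^{2i/d_pos})
    pe = np.zeros((max_len, d_pos), dtype='float32')
    pos = np.arange(max_len)[:, None]                               # positions p
    div = np.power(10000.0, 2.0 * (np.arange(d_pos) // 2) / d_pos)  # 10000^{2i/d_pos}
    pe[:, 0::2] = np.sin(pos / div[0::2])
    pe[:, 1::2] = np.cos(pos / div[1::2])
    return pe                                                       # (max_len, d_pos)
```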

Output Design

This part is a distinctive feature of our entire model.

Analysis

By now, the overall structure of the model should be clear. First, the question is encoded into a fixed vector through convolution and attention pooling. This vector, together with the position embeddings and handcrafted features, is concatenated to every word vector of the material, giving a feature sequence that mixes question and material information. From then on, no further interaction with the question is needed; the subsequent convolutional layers simply encode this sequence and label the answer.

In the SQuAD evaluation, materials definitely contain the answer, and the positions are labeled. SQuAD models typically perform two softmaxes over the sequence to predict the start and end positions, often called a "Pointer Network." However, in our WebQA-style QA, the material does not necessarily contain the answer. Therefore, we do not use softmax but rather use sigmoid on the entire sequence. This allows for cases where the answer doesn't exist or appears multiple times.

Double Labeling Output

Since we use labeling, theoretically, the simplest plan is to output a 0/1 sequence for each word ("is (1)" or "is not (0)" the answer). However, this doesn't work well because an answer might consist of multiple consecutive different words. Forcing the model to label all these words exactly the same might be too difficult. Thus, we use two labeling passes to mark the start and end positions separately.

\begin{align}p^{start}_i = \sigma\Big(\boldsymbol{\alpha}_1^{\top}\,\text{Act}\big(\boldsymbol{W}_1\boldsymbol{x}_i+\boldsymbol{b}_1\big)+\beta_1\Big)\\ p^{end}_i = \sigma\Big(\boldsymbol{\alpha}_2^{\top}\,\text{Act}\big(\boldsymbol{W}_2\boldsymbol{x}_i+\boldsymbol{b}_2\big)+\beta_2\Big)\end{align}\tag{7}

As a result, the output design of the model is different from both the pointer method and pure sequence labeling, or rather, a simplification and fusion of both.

Global Perspective

Finally, to increase the "global view" of the model, we encode the material sequence into a global vector, followed by a fully connected layer to get a global score. This global score is then multiplied by the previous labels:

\begin{align}\boldsymbol{o}=&\text{Encoder}\big(\boldsymbol{x}_1,\boldsymbol{x}_2,\dots,\boldsymbol{x}_n\big)\\ p^{global}=&\sigma\Big(\boldsymbol{W}\boldsymbol{o}+\boldsymbol{b}\Big)\\ p^{start}_i =& p^{global}\cdot\sigma\Big(\boldsymbol{\alpha}_1^{\top}\,\text{Act}\big(\boldsymbol{W}_1\boldsymbol{x}_i+\boldsymbol{b}_1\big)+\beta_1\Big)\\ p^{end}_i =& p^{global}\cdot\sigma\Big(\boldsymbol{\alpha}_2^{\top}\,\text{Act}\big(\boldsymbol{W}_2\boldsymbol{x}_i+\boldsymbol{b}_2\big)+\beta_2\Big)\end{align}\tag{8}

This global score is very important for convergence and final performance. Its role is to help the model judge whether the material contains an answer at all: once the material lacks an answer, it suffices to push $p^{global}$ toward 0, rather than "struggling" to force every word's label to 0.
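
Putting equations $(7)$ and $(8)$ together, the output head can be sketched as follows (Keras; it reuses the AttentionPool1D layer sketched earlier, and the hidden size is an assumption):

```python
from tensorflow.keras import layers

def output_head(x, hidden=128):
    # x: (batch, steps, dim), the final material feature sequence.
    o = AttentionPool1D()(x)                              # global material vector
    p_global = layers.Dense(1, activation='sigmoid')(o)   # "does an answer exist?"
    p_global = layers.Reshape((1, 1))(p_global)           # broadcast over time steps
    p_start = layers.Dense(1, activation='sigmoid')(
        layers.Dense(hidden, activation='tanh')(x))       # Eq. (7), start labels
    p_end = layers.Dense(1, activation='sigmoid')(
        layers.Dense(hidden, activation='tanh')(x))       # Eq. (7), end labels
    return p_global * p_start, p_global * p_end           # Eq. (8), (batch, steps, 1) each
```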

Handcrafted Features

Earlier in the article, we mentioned handcrafted features multiple times. How effective are they? Based on simple observation, these few features might improve model performance by more than 2%! This shows that well-designed features are crucial for improving performance and reducing model complexity.

The features are designed for words in the material (Q for question, E for evidence/material).

Q-E Full Match

This determines whether a word in the material appears in the question (1 if yes, 0 if no). The idea is to tell the model where the question words appear in the material, as the answer is likely nearby. This matches human logic for reading comprehension.
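
A one-line sketch of this feature over tokenized sequences (variable names are illustrative):

```python
def qe_full_match(question_words, material_words):
    # 1.0 if the material word also appears in the question, else 0.0
    q = set(question_words)
    return [1.0 if w in q else 0.0 for w in material_words]
```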

E-E Co-occurrence

This feature calculates the proportion of times a material word appears in other materials. For example, if there are 10 material segments and word $w$ appears in 4 of the other 9 segments, the word $w$ gets a feature value of 4/10.

The idea is that the more segments a word appears in, the more likely it is to be the answer.
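
A sketch of this feature; following the 4/10 example above, the count over the other segments is divided by the total number of segments:

```python
def ee_cooccurrence(materials):
    # materials: list of tokenized segments retrieved for the same question.
    # For each word, the fraction of *other* segments that also contain it.
    word_sets = [set(m) for m in materials]
    n = len(materials)
    features = []
    for i, segment in enumerate(materials):
        features.append([
            sum(w in s for j, s in enumerate(word_sets) if j != i) / n
            for w in segment
        ])
    return features
```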

Q-E Soft Match

Using the question size as a window, calculate Jaccard similarity and relative edit distance for each window in the material.

For example, take the question "What is the altitude of Baiyun Mountain?" and the material "Baiyun Mountain is located in Guangzhou, the main peak altitude is 382 meters." If the question has 6 words, the window size is 6, and we split the material as follows:

X X X Baiyun Mountain is located in
X X Baiyun Mountain is located in Guangzhou
X Baiyun Mountain is located in Guangzhou ,
Baiyun Mountain is located in Guangzhou , the
...and so on.

Where X represents a placeholder. We can then calculate the Jaccard similarity between each block and the question. The similarity result serves as a feature for the current word. Similarly, we calculate the edit distance and divide by the window size to get a number between 0 and 1, which I call "relative edit distance."

Jaccard similarity is unordered, while edit distance is ordered. Thus, these two methods measure similarity from both unordered and ordered perspectives. The idea is identical to the first feature: telling the model which part of the material is similar to the question, as the answer is likely nearby.
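
Here is a rough sketch of the two soft-match features. The exact window alignment is my own assumption from the example above (a window of question length ending at the current word, padded with placeholders):

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / max(len(a | b), 1)

def edit_distance(a, b):
    # Plain Levenshtein distance over token sequences (single-row DP).
    row = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, wb in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1,
                                       prev + (wa != wb))
    return row[-1]

def qe_soft_match(question_words, material_words):
    # For each material word: (Jaccard similarity, relative edit distance)
    # between the question and the window ending at that word.
    k = len(question_words)
    padded = ['<X>'] * (k - 1) + list(material_words)   # the "X" placeholders
    feats = []
    for i in range(len(material_words)):
        window = padded[i:i + k]
        feats.append((jaccard(question_words, window),
                      edit_distance(question_words, window) / k))
    return feats
```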

The main idea for these two features came from Yin-shen in the Keras group. Thanks!

Character Features

In top SQuAD models, word and character vectors are typically both input. To improve results, we should also input character-level information. But we didn't want the model to become too massive, so we integrated character-level features here.

The idea is simple: the four features described above are calculated per word, but they can also be calculated per character, with the character results within each word averaged to give a feature for that word. For example, for "Q-E full match," if the question contains only the word "acting" and the material contains "co-acting," then at the word level they do not match (0). At the character level, however, "co-acting" is split into characters: the characters belonging to "acting" match (1) while "co" does not (0), and the average of these character scores becomes the character-level Q-E full-match feature of the word "co-acting."

Applying this to all four features gives us another 4 features, totaling 8 handcrafted features.
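
As a sketch, the character-level version of any of the four word-level features can be obtained by computing the feature over characters and averaging within each word (here reusing the qe_full_match sketch from above):

```python
def char_averaged_feature(word_feature_fn, question_chars, material_words):
    # Compute a feature at the character level, then average the characters
    # of each material word back into one value per word.
    feats = []
    for word in material_words:
        char_feats = word_feature_fn(question_chars, list(word))
        feats.append(sum(char_feats) / max(len(char_feats), 1))
    return feats

# e.g. character-level Q-E full match:
# char_averaged_feature(qe_full_match, list(''.join(question_words)), material_words)
```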

Implementation

Now, all parts of the model have been explained. The overall model is simple and clear, giving a "great truths are simple" feeling. Below are the implementation details.

Model Settings

Chinese Tokenization

As seen from the introduction, this model is word-based, with character-level information introduced through the handcrafted features. However, to make the model more flexible and able to answer more questions, I only performed basic tokenization, keeping the granularity as fine as possible.

Specifically: I wrote a tokenization module based on a unigram model, with a self-prepared dictionary of about 500k words. All English and numbers are split into single letters and digits. For example, "apple" becomes five "words": a p p l e; "382" becomes three "words": 3 8 2.

Since there is no new word discovery, the total vocabulary does not exceed 500k. In fact, our final model's total vocabulary was only about 300k.
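
The English/digit splitting can be sketched as a tiny post-processing step after word segmentation (the regular expression here is only an illustrative assumption):

```python
import re

def split_ascii(token):
    # Break runs of Latin letters or digits into single characters;
    # leave other tokens (e.g. Chinese words) unchanged.
    if re.fullmatch(r'[A-Za-z0-9]+', token):
        return list(token)
    return [token]

# split_ascii('apple') -> ['a', 'p', 'p', 'l', 'e'];  split_ascii('382') -> ['3', '8', '2']
```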

Parameters

1. Word vectors are 128 dimensions, pre-trained using Word2Vec (Skip Gram, window 5, 8 negative samples, 8 iterations) on provided training data, WebQA corpus, 500k Baidu Baike entries, and 1 million Baike Zhidao questions. Training took about 12 hours;
2. Padding tokens use all-zero vectors; word vectors remain fixed during DGCNN training;
3. All Conv1D output dimensions are 128; position vectors are 128;
4. Maximum sequence length is 100; masking is used for padding in batches;
5. Since it's a binary labeling format with class imbalance, binary focal loss is used as the loss function;
6. Adam optimizer is used: first at $10^{-3}$ until optimal (around 6 epochs), then at $10^{-4}$ (3 epochs).
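
A minimal sketch of the binary focal loss mentioned in item 5; the gamma and alpha values are illustrative assumptions rather than the competition settings:

```python
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.5):
    # Focal loss for the imbalanced 0/1 start/end labels:
    # easy examples are down-weighted by the factor (1 - p_t)^gamma.
    def loss(y_true, y_pred):
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
        w = tf.where(tf.equal(y_true, 1.0), alpha, 1.0 - alpha)
        return -tf.reduce_mean(w * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss
```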

Regularization

Late in the competition, we found that a DropPath-like regularization slightly improved performance. This technique is built on equation $(3)$; the idea is to perturb the "gate" during training:

\begin{align}\boldsymbol{Y}=&\boldsymbol{X}\otimes \Big(1-\boldsymbol{\sigma}\Big) + \text{Conv1D}_1(\boldsymbol{X}) \otimes \boldsymbol{\sigma}\\ \boldsymbol{\sigma} =& \sigma\Big(\text{Conv1D}_2(\boldsymbol{X})\otimes (1 + \boldsymbol{\varepsilon})\Big) \end{align}\tag{9}

Here, $\boldsymbol{\varepsilon}$ is a uniform random noise tensor in $[-0.1, 0.1]$. By adding "multiplicative noise" to the GCNN gate, the model becomes more robust to parameter perturbations.

This was inspired by regularization techniques in "FractalNet" and "Shake-Shake regularization".
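
A Keras sketch of equation $(9)$: a small custom layer perturbs the gate's pre-activation with the multiplicative noise at training time only (noise range as stated above; other sizes are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

class GateNoise(layers.Layer):
    # Multiplies its input by (1 + eps), eps ~ U[-0.1, 0.1], during training only.
    def call(self, x, training=None):
        if training:
            eps = tf.random.uniform(tf.shape(x), -0.1, 0.1)
            return x * (1.0 + eps)
        return x

def noisy_res_gated_conv1d(x, filters=128, kernel_size=3, dilation_rate=1):
    # The same block as formula (3), but with a perturbed gate (formula (9)).
    h = layers.Conv1D(filters, kernel_size, padding='same',
                      dilation_rate=dilation_rate)(x)
    g = layers.Conv1D(filters, kernel_size, padding='same',
                      dilation_rate=dilation_rate)(x)      # gate pre-activation
    gate = layers.Activation('sigmoid')(GateNoise()(g))
    return x * (1.0 - gate) + h * gate
```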

Data Preparation

Data Pre-processing

Since SOGOU allowed external data, we and most teams used the WebQA dataset. Given that WebQA is more structured while SOGOU's data is noisier, we mixed them at a 2:1 ratio.

The provided data is "one question + multiple materials + one answer" without specifying where the answer is in the materials. Thus, we treated all substrings matching the answer as labels. While sometimes unreasonable, this was the best approach without additional manual labeling.
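
A sketch of this distant-supervision labeling over tokenized material and answer (variable names are illustrative):

```python
def label_answer_spans(material_words, answer_words):
    # Mark every exact occurrence of the answer as a pair of start/end labels.
    start = [0.0] * len(material_words)
    end = [0.0] * len(material_words)
    k = len(answer_words)
    for i in range(len(material_words) - k + 1):
        if material_words[i:i + k] == list(answer_words):
            start[i] = 1.0
            end[i + k - 1] = 1.0
    return start, end
```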

Another issue is synonymous answers. SOGOU's evaluation script considers synonyms, so we extracted them from the script to label all synonymous answers in the materials.

Data Shuffling

SOGOU provided 30,000 labeled questions, pre-divided into 25,000 training and 5,000 validation. However, the pre-divided validation set didn't match the online test set's distribution well. We mixed and reshuffled all data, taking 20,000 for training and 10,000 for validation. This resulted in a validation score (~0.76) close to the online leaderboards.

Data Augmentation

Three operations were used during training:

1. Randomly zeroing out word IDs in questions/materials (analogous to replacing words with [UNK]);
2. Repeatedly concatenating and randomly cropping materials to create new segments;
3. For materials with multiple answers, randomly removing some answer labels.

The first method significantly improved stability and accuracy. The others had a minor effect.
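
A sketch of the first augmentation; the 10% rate and the id 0 for the unknown/zero token are assumptions:

```python
import numpy as np

def random_word_dropout(token_ids, rate=0.1, unk_id=0):
    # Randomly replace word ids with unk_id; padding positions should be
    # excluded in a real implementation.
    token_ids = np.asarray(token_ids)
    mask = np.random.rand(*token_ids.shape) < rate
    out = token_ids.copy()
    out[mask] = unk_id
    return out
```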

Decoding Strategy

A detail many contestants overlook: Answer decoding has huge optimization space. Improving decoding can yield more gains than repeated hyperparameter tuning!

Scoring Method

What is answer decoding? Whether you use softmax pointers or sigmoid labeling, the model outputs two columns of floats: a start score and an end score for every position. How do we turn them into an answer interval? The standard way is to set a maximum answer length (e.g., 10), traverse all candidate intervals, and compute a score from the start and end scores (their sum or their product). Is "sum" or "product" better? In my experiments, the product worked noticeably better than the square root of the product (about a 1% gain).
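
A sketch of the basic decoding loop, scoring every candidate span by the product of its start and end scores:

```python
def decode_best_span(p_start, p_end, max_answer_len=10):
    # p_start, p_end: per-position scores for one material segment.
    best = (0, 0, -1.0)
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_answer_len, len(p_end))):
            score = ps * p_end[j]        # "product" scoring
            if score > best[2]:
                best = (i, j, score)
    return best                          # (start index, end index, score)
```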

Voting Method

If the same fragment appears multiple times within one passage, should we sum, average, or take the max of its scores? And across different passages, how do we vote? For example, given the candidates (A, 0.7), (B, 0.2), (B, 0.2), (B, 0.2), (B, 0.2), should we pick A or B? Some would say B, because 4 * 0.2 = 0.8 > 0.7. But A's single score of 0.7 is one confident, authoritative prediction.

Our voting scheme reflects two points: 1. the number of occurrences matters; 2. the gains are sub-additive ($1+1 < 2$). We used a "sum of squares" scheme:

1. Within a passage, if a fragment appears multiple times, take only the max score;
2. Across passages, sum the squared scores of the same answer:
$$s_a = \sum_{i=1}^n s^2_{a,i}$$
Squaring amplifies the weight of high-score samples.
3. In the competition, I actually used a slightly different formula which is a bit more smoothed:
$$s_a = \frac{\sum\limits_{i=1}^n s^2_{a,i}}{1+\sum\limits_{i=1}^n s_{a,i}}$$
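
Putting the whole voting step together as a sketch: keep only the maximum score of each answer within a passage, then rank answers across passages with the smoothed squared-sum formula above.

```python
from collections import defaultdict

def vote(candidates):
    # candidates: list of (passage_id, answer_text, score) triples.
    per_passage = defaultdict(float)
    for pid, ans, s in candidates:
        key = (pid, ans)
        per_passage[key] = max(per_passage[key], s)   # rule 1: max within a passage
    sq_sum, lin_sum = defaultdict(float), defaultdict(float)
    for (pid, ans), s in per_passage.items():
        sq_sum[ans] += s * s                          # rule 2: squared sum across passages
        lin_sum[ans] += s
    # smoothed score: s_a = sum(s^2) / (1 + sum(s))
    return max(sq_sum, key=lambda a: sq_sum[a] / (1.0 + lin_sum[a]))
```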

Model Ensemble

With the above, a score of 0.74-0.75 is achievable. To reach 0.7583, you need model ensemble. We used single-model ensemble through K-fold cross-validation, training the same architecture multiple times on different folds and averaging the results.

Conclusion

Evaluation

The model reached 0.7583 on the noisy SOGOU test set. Given the noise, I suspect the actual performance in clean scenarios is even higher. Being a pure CNN model, it is lightweight and ready for industrial use. On SQuAD, it scored around 50-60% without tuning, showing that WebQA-style and SQuAD-style tasks are quite different.

Code & Testing

The model is online on the Flame Technology official website for testing:
http://www.birdbot.cn/online-factual-qa.html

The code is not public as it was done for a company. However, the model is simple to implement based on this article. The SOGOU data is also not ours to share, but WebQA data can be used for testing.

Hundreds of Trials

The development involved hundreds of iterations and experiments. This is not a formal paper, but if it helps you, please cite this article.

Finally, thanks to Guangzhou Flame Technology for the support in hardware and software.

PS: I later discovered that this model "collided" with "Fast Reading Comprehension with ConvNets" and "QANet", although I had not referenced them at the time. I started from the WebQA paper and tried CNNs out of curiosity, and never looked back.