By 苏剑林 | October 19, 2020
Those years of "Chickens and Rabbits in a Cage," "Surplus and Shortage Problems," "Age Problems," "Tree Planting Problems," "Cows Eating Grass Problems," "Profit Problems"... Were you ever tortured by all kinds of fancy math word problems back in primary school? No matter; machine learning models can now help us solve word problems too, so let's see what grade level they can reach!

This article presents a baseline for solving primary school Math Word Problems (MWP): trained on the ape210k dataset, it uses a Seq2Seq model to directly generate executable mathematical expressions. The Large version of the model ultimately reaches an accuracy of 75%, clearly higher than the result reported in the ape210k paper. The "hardline" approach here means generating human-readable expressions directly, much as a person would write them when solving the problem, without any special expression transformations or template processing.
First, let's observe the situation of the ape210k dataset:
{
"id": "254761",
"segmented_text": "小 王 要 将 150 千 克 含 药 量 20% 的 农 药 稀 释 成 含 药 量 5% 的 药 水 . 需 要 加 水 多 少 千 克 ?",
"original_text": "Xiao Wang wants to dilute 150 kg of pesticide with a 20% drug content into a pesticide solution with a 5% drug content. How many kilograms of water need to be added?",
"ans": "450",
"equation": "x=150*20%/5%-150"
}
{
"id": "325488",
"segmented_text": "一 个 圆 形 花 坛 的 半 径 是 4 米 , 现 在 要 扩 建 花 坛 , 将 半 径 增 加 1 米 , 这 时 花 坛 的 占 地 面 积 增 加 了 多 少 米 * * 2 .",
"original_text": "The radius of a circular flower bed is 4 meters. Now the flower bed is to be expanded by increasing the radius by 1 meter. How many square meters (m**2) has the area of the flower bed increased by?",
"ans": "28.26",
"equation": "x=(3.14*(4+1)**2)-(3.14*4**2)"
}
As we can see, we are primarily concerned with the original_text, equation, and ans fields. original_text is the problem, equation is the calculation process (usually starting with x=), and ans is the final answer. We want to train a model to generate the equation from the original_text, and then use Python's eval function to obtain the ans.
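As a quick sanity check of this idea (my own illustration, using the second sample above, not part of the project code), we can strip the leading x= and hand the rest to eval:

equation = 'x=(3.14*(4+1)**2)-(3.14*4**2)'
result = eval(equation[2:])  # drop the leading 'x='
print(round(result, 2))  # 28.26, which matches the "ans" field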
However, we need to do some preprocessing because the equation provided by ape210k cannot always be evaluated directly. For example, in the case above, 150*20%/5%-150 is an illegal expression for Python. The processing I performed is as follows:
1. a%: uniformly replace with (a/100);
2. a(b/c) (mixed numbers): uniformly replace with (a+b/c);
3. (a/b) appearing in the problem text: remove the parentheses so it becomes a/b;
4. the colon ":" used in ratios: uniformly replace with /.

After this processing, most equation strings can be evaluated directly, and the results can be compared against ans so that only problems where the two match are retained. There is one further refinement: the resulting expressions may contain redundant parentheses (i.e., parentheses whose removal leaves the value unchanged). So a parenthesis-removal step is added: iterate over every pair of parentheses, and if removing a pair yields an expression equivalent to the original, remove it. This gives shorter expressions on average, and shorter sequences are easier to generate.
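Here is a rough sketch of these two preprocessing steps (my own illustrative code, not the project's exact implementation): normalize_equation covers the percentage, mixed-number, and ratio rules, and drop_redundant_parens removes any parenthesis pair whose removal leaves the evaluated value unchanged.

import re

def normalize_equation(eq):
    eq = eq[2:] if eq.startswith('x=') else eq           # drop the leading 'x='
    eq = re.sub(r'(\d+(?:\.\d+)?)%', r'(\1/100)', eq)    # a% -> (a/100)
    eq = re.sub(r'(\d+)\((\d+/\d+)\)', r'(\1+\2)', eq)   # a(b/c) -> (a+b/c)
    eq = eq.replace(':', '/')                            # ratio a:b -> a/b
    return eq

def drop_redundant_parens(eq):
    target = eval(eq)
    changed = True
    while changed:
        changed = False
        stack, pairs = [], []
        for i, c in enumerate(eq):                       # collect matching parenthesis pairs
            if c == '(':
                stack.append(i)
            elif c == ')':
                pairs.append((stack.pop(), i))
        for l, r in pairs:
            candidate = eq[:l] + eq[l + 1:r] + eq[r + 1:]
            try:
                if abs(eval(candidate) - target) < 1e-9:  # still equivalent without this pair
                    eq, changed = candidate, True
                    break
            except (SyntaxError, ZeroDivisionError):
                pass
    return eq

print(normalize_equation('x=150*20%/5%-150'))            # 150*(20/100)/(5/100)-150
print(eval(normalize_equation('x=150*20%/5%-150')))      # 450.0, matching ans
print(drop_redundant_parens('(3.14*(4+1)**2)-(3.14*4**2)'))  # 3.14*(4+1)**2-3.14*4**2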
Ultimately, we obtained the following usable dataset:
\[ \begin{array}{c|ccc} \hline & \text{Training Set} & \text{Validation Set} & \text{Test Set} \\ \hline \text{Original Count} & 200488 & 5000 & 5000\\ \hline \text{Retained Count} & 200390 & 4999 & 4998\\ \hline \end{array} \]

The discarded items are basically erroneous or otherwise messy problems, and we ignore them for now.
The model itself is the part least worth talking about. It uses original_text as input and equation as output, based on the "BERT+UniLM" architecture to train a Seq2Seq model. If you have any doubts about the model, please read "From Language Models to Seq2Seq: Transformer is all about the Mask".
Project Link: http://github.com/bojone/ape210k_baseline
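To give a feel for the setup, here is a minimal sketch of how such a "BERT+UniLM" model can be built, assuming the bert4keras library (the paths are placeholders for pretrained Chinese BERT weights; the actual training script, data pipeline, and decoder are in the repository above):

from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

# placeholder paths to pretrained Chinese BERT weights
config_path = 'chinese_bert/bert_config.json'
checkpoint_path = 'chinese_bert/bert_model.ckpt'
dict_path = 'chinese_bert/vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)

# application='unilm' applies the UniLM attention mask, so the BERT encoder is
# trained as a Seq2Seq model: the input is original_text, the target is equation
model = build_transformer_model(
    config_path,
    checkpoint_path,
    application='unilm',
)
model.summary()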
My training was conducted on a single 22G TITAN RTX card. The optimizer is Adam, and the learning rate is 2e-5. The Base version used a batch_size=32 and required about 25 epochs, with each epoch taking about 50 minutes (including validation set evaluation time); the Large version used a batch_size=16 and required about 15 epochs, with each epoch taking about 2 hours (including validation set evaluation time).
Regarding "Large," since UniLM borrows weights from the MLM part, we cannot use the RoBERTa-wwm-ext-large released by HFL because the MLM weights in that version are randomly initialized (though its Base version is normal and usable). For the Large version, I recommend the weights released by Tencent UER. They were originally in PyTorch format, but I have converted them to TF format, which can be downloaded here (extraction code: l0k6).
The results are shown in the following table:
\[ \begin{array}{c|ccc} \hline & \text{beam_size} & \text{Validation Set} & \text{Test Set} \\ \hline \text{Base} & 1 & 71.67\% & 71.65\%\\ \text{Base} & 2 & 71.81\% & 72.27\%\\ \text{Base} & 3 & \textbf{71.85}\% & \textbf{72.35}\%\\ \hline \text{Large} & 1 & 74.51\% & 74.43\%\\ \text{Large} & 2 & 74.97\% & 74.99\%\\ \text{Large} & 3 & \textbf{75.04}\% & \textbf{75.01}\%\\ \hline \end{array} \]

The Large model's results are clearly higher than the 70.20% reported in the ape210k paper "Ape210K: A Large-Scale and Template-Rich Dataset of Math Word Problems", which suggests our model is a decent baseline. It feels like applying some Seq2Seq techniques to alleviate the Exposure Bias problem (see "A Brief Analysis and Countermeasures for the Exposure Bias Phenomenon in Seq2Seq") could improve the model further; one could also introduce a copy mechanism to improve the consistency between the numbers in the output and those in the input; in addition, one could try to further shorten the sequences (for example, replacing the four characters of 3.14 with the two-letter symbol pi). I'll leave these for everyone to try~
From a purely modeling perspective, our task is already complete—that is, the model only needs to output the equation, and evaluation only requires judging whether the result after eval matches the reference answer. However, from a practical application perspective, we need to further standardize the output; specifically, deciding whether the output should be a decimal, integer, fraction, or percentage based on the problem. This requires us to:
1. Decide when to output which format;
2. Convert the result according to the specified format.
The first step is relatively simple: generally it can be decided from keywords in the problem or the equation. For example, if there are decimals in the expression, the output is generally also a decimal; if the question asks "how many cars," "how many items," "how many people," etc., the output should be an integer; if it directly asks "what fraction" or "what percentage," then naturally it should be a fraction or a percentage. The trickier part is rounding, as in "Each cake costs 7.90 yuan. At most how many cakes can you buy with 50 yuan?", which requires taking the floor of 50/7.90; in other problems the ceiling is needed instead. Rather to my surprise, however, there are no rounding-type problems in ape210k, so this issue does not arise there. If you encounter a dataset with rounding and rule-based judgment proves difficult, the most direct approach is to add rounding symbols to the equation for the model to predict.
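For illustration, the two rounding directions look like this in Python (the cake question is the one above; the boat question is a made-up example of mine):

import math

# "Each cake costs 7.90 yuan. At most how many cakes can you buy with 50 yuan?"
print(math.floor(50 / 7.90))  # 6: round down, a 7th cake is unaffordable
# "Each boat seats 8 people. How many boats do 50 people need?"
print(math.ceil(50 / 8))      # 7: round up, everyone needs a seat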
The second step looks a bit more complicated, mainly for fractions. Many readers probably don't know how to make an expression keep its result in fractional form: eval('(1+2)/4') gives 0.75 (in Python 3), but sometimes we want the fractional result 3/4. Keeping operations in fraction form is really the territory of a CAS (Computer Algebra System), i.e. symbolic rather than numerical computation. Python happens to have such a tool: SymPy. With SymPy we can achieve our goal; see the example below:
import re
from sympy import Integer

r = (Integer(1) + Integer(2)) / Integer(4)
print(r)  # Output is 3/4 instead of 0.75

equation = '(1+2)/4'
print(eval(equation))  # Output is 0.75

new_equation = re.sub(r'(\d+)', r'Integer(\1)', equation)
print(new_equation)  # Output is (Integer(1)+Integer(2))/Integer(4)
print(eval(new_equation))  # Output is 3/4
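As a small aside (not something this post relies on): for purely rational expressions, the standard-library fractions.Fraction achieves the same effect without SymPy:

from fractions import Fraction
import re

equation = '(1+2)/4'
new_equation = re.sub(r'(\d+)', r'Fraction(\1)', equation)
print(eval(new_equation))  # 3/4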
This article introduced a Seq2Seq baseline for solving math word problems: the main idea is to use "BERT+UniLM" to convert a problem directly into an evaluatable expression, and we also shared some experience on standardizing the results. With a Large BERT model under UniLM, we reached an accuracy of 75%, exceeding the results reported in the original paper.
So, which grade do you think it could be in by now~