Start with Nonsense, Fabricate the Data? Truly Outraged by a "Divine Paper"

By 苏剑林 | December 04, 2021

This article discusses my experience of being outraged by a "divine paper" that came out yesterday.

This "divine paper" is 《How not to Lie with a Benchmark: Rearranging NLP Leaderboards》. The paper argues that many current leaderboards average task scores with the arithmetic mean, whereas the geometric mean and harmonic mean would be more reasonable. Most critically, it recomputes the rankings on leaderboards such as GLUE and SuperGLUE using geometric and harmonic means, and claims to find that models which previously surpassed humans no longer do so under the new averaging schemes.

Doesn't that sound quite interesting? I thought so too, which is why I planned to write a blog post to introduce it. However, as I was finishing the post and cross-checking the data, I discovered that the data in the tables was completely fabricated!!! The actual results do not support its conclusions at all!!! Consequently, this blog post has turned from a "commendation ceremony" into a "criticism session"...

Talking Nonsense

First, let's look at the first table from this "divine paper," which concerns partial results from the GLUE leaderboard:

Calculation results for the GLUE leaderboard in the "divine paper"

Setting everything else aside, the fact that the decimal comma (",") and the decimal point (".") are mixed up in the tables of this "divine paper" is disgusting enough (it is even worse in the SuperGLUE table below). But if it were just minor issues like that, I might have endured it. What is truly intolerable is that the calculation rules for AM (Arithmetic Mean), GM (Geometric Mean), and HM (Harmonic Mean) in it are completely "arbitrary"!

I experimented for a long time and finally figured out the calculation rules for this table:

1. All AMs are calculated using the scores of the first 10 tasks (even though the table above only shows the first 8 tasks);

2. The GM and HM for the Human row are calculated using the scores of the first 10 tasks;

3. The GM and HM for the other model rows are calculated using the scores of all 11 tasks.

Since the score for the 11th task is lower than the others, computing the models' GM and HM this way drags them below the Human scores. The authors then directly concluded that under GM and HM, human performance remains in first place. In fact, if everyone were measured over the same set of tasks, there would be basically no difference in the rankings produced by AM, GM, and HM. Moreover, anyone with a reasonably sound mathematical sense can see that these results cannot be right: model performance significantly exceeds humans on many tasks and is only slightly lower on a few, so no sensible averaging scheme could conclude that humans far exceed the models. Yet the authors simply believed it...
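The trick described above is easy to reproduce numerically. The sketch below uses invented placeholder scores (NOT the actual GLUE numbers): a model that beats the human baseline on every one of 10 shared tasks, plus one extra low-scoring 11th task counted only for the model. Averaging the model over 11 tasks but the human over 10 manufactures exactly the paper's "finding":

```python
from statistics import geometric_mean

# Hypothetical scores for illustration only (not the real GLUE results).
human = [87, 91, 86, 93, 88, 90, 85, 92, 89, 90]   # 10 tasks
model = [90, 94, 89, 96, 91, 93, 88, 95, 92, 93]   # same 10 tasks, all higher
model_11 = model + [45]                            # plus a low-scoring 11th task

# Consistent comparison: same 10 tasks for both -> the model wins.
print(geometric_mean(model) > geometric_mean(human))     # True

# The paper's inconsistent rule: model GM over 11 tasks, human GM over 10.
# The extra low score drags the model's GM below the human's.
print(geometric_mean(model_11) > geometric_mean(human))  # False
```

In other words, the "humans still rank first under GM/HM" conclusion comes entirely from averaging the two rows over different task sets, not from any property of the geometric or harmonic mean itself.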

The same error appears in SuperGLUE:

Calculation results for the SuperGLUE leaderboard in the "divine paper"

The calculation rules are as follows:

1. All AMs are calculated using the scores of the first 8 tasks;

2. All GMs and HMs are calculated using the scores of all 10 tasks.

In fact, if the AM were also calculated over all 10 tasks, human performance would rank first under AM as well. In other words, as long as the calculation standard is the same for everyone, AM, GM, and HM produce essentially the same rankings.
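This point can also be checked directly: when every leaderboard entry is averaged over the same set of tasks, all three means order the entries identically. The scores below are again hypothetical placeholders (not real SuperGLUE numbers):

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical leaderboard, every row scored on the SAME 8 tasks.
scores = {
    "Model A": [90, 94, 89, 96, 91, 93, 88, 95],
    "Model B": [88, 92, 87, 94, 89, 91, 86, 93],
    "Human":   [87, 91, 86, 93, 88, 90, 85, 92],
}

for name, avg in [("AM", mean), ("GM", geometric_mean), ("HM", harmonic_mean)]:
    ranking = sorted(scores, key=lambda m: avg(scores[m]), reverse=True)
    print(name, ranking)
# AM, GM, and HM all yield the ordering: Model A, Model B, Human.
```

Of course, AM ≥ GM ≥ HM always holds for positive scores, so the choice of mean can shift the absolute numbers; but on a fixed task set it rarely reshuffles the ranking, which is the whole point the paper's tables obscure.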

Truly Helpless

Incidentally, this paper was accepted into a NeurIPS 2021 Workshop. Although workshop papers are usually held to far lower standards than main-conference papers, they shouldn't be a mess to this extent. Looking at the title of this paper again, I wonder if it should be changed to "How not to Lie with this paper"?

It seems that in the future, when we read papers, we must not only care about the reproducibility of the results but also check whether their sums, means, and variances are calculated correctly~ Truly, every kind of bizarre possibility exists~

Originally published at: https://kexue.fm/archives/8783