How to split a validation set that is closer to the test set?

By 苏剑林 | Oct 16, 2020

Whether in competitions, experiments, or engineering, we often encounter situations where the distributions of the training set and the test set differ. We generally split a validation set from the training set to tune hyperparameters (see "The Significance of Training, Validation, and Test Sets"), for example to control the number of training epochs and prevent overfitting. However, if the validation set itself differs markedly from the test set, a model that performs well on the validation set will not necessarily perform well on the test set. How to make the distribution of the split validation set closer to that of the test set is therefore a topic worth studying.

Two Cases

First, let us clarify the scenario this article considers: we can access the test set inputs themselves, but we do not know the test set labels. If models are submitted for closed evaluation and we cannot see the test set at all, there is not much we can do. Why does the inconsistency between training and test distributions arise in the first place? There are mainly two situations.

The first is inconsistency of the label distributions. That is, if we only look at the inputs $x$, the distributions are basically the same, but the corresponding distributions of $y$ differ. A typical example is information extraction, where training sets are often constructed via "distant supervision plus coarse manual labeling": they are large, but may contain many errors and omissions. The test set, by contrast, might be built through multiple rounds of careful manual labeling, with very few errors. In this case, simply splitting the data differently cannot produce a better validation set.

The second is the inconsistency of input distributions. Simply put, the distribution of $x$ is inconsistent, but the annotation of $y$ is basically correct. For example, in classification problems, the category distribution of the training set may differ from that of the test set; or in reading comprehension problems, the proportion of factual vs. non-factual question types in the training set may differ from that in the test set, and so on. In this case, we can appropriately adjust the sampling strategy to make the validation set distribution more consistent with the test set, so that the results on the validation set better reflect the results on the test set.

Discriminator

To achieve our goal, we label every training-set sample 0 and every test-set sample 1, and train a binary discriminator $D(x)$ by minimizing the cross-entropy loss:

\[-\mathbb{E}_{x\sim p(x)}[\log (1 - D(x))] - \mathbb{E}_{x\sim q(x)}[\log D(x)]\]

where $p(x)$ denotes the distribution of the training set and $q(x)$ that of the test set. Note that we do not simply mix the two sets and draw batches at random; instead, each batch contains an equal number of samples from the training set and from the test set, which means the smaller set must be oversampled to achieve class balance.
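As a minimal numpy sketch of this balanced batching (the function and array names are placeholders, not from the original post), note that drawing both halves with replacement implicitly oversamples the smaller pool:

```python
import numpy as np

def balanced_batches(x_train_pool, x_test_pool, batch_size=64, rng=None):
    """Yield class-balanced discriminator batches: half the batch comes
    from the training pool (label 0), half from the test pool (label 1)."""
    rng = rng or np.random.default_rng()
    half = batch_size // 2
    steps = max(len(x_train_pool), len(x_test_pool)) // half
    for _ in range(steps):
        i = rng.integers(0, len(x_train_pool), size=half)  # with replacement,
        j = rng.integers(0, len(x_test_pool), size=half)   # so the smaller pool is oversampled
        x = np.concatenate([x_train_pool[i], x_test_pool[j]])
        y = np.concatenate([np.zeros(half), np.ones(half)])
        yield x, y
```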

Some readers might worry about overfitting, i.e. that the discriminator could learn to separate the training and test sets perfectly. In practice, when training the discriminator we should also hold out a validation set, just as in ordinary supervised training, and use early stopping to choose the number of training epochs. Alternatively, as in some write-ups online, we can use Logistic Regression as the discriminator: such a simple model carries a much smaller risk of overfitting.
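For the Logistic Regression route, a minimal scikit-learn sketch, assuming hypothetical feature matrices `x_train_feats` and `x_test_feats` for the two sets (here `class_weight="balanced"` stands in for oversampling):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.concatenate([x_train_feats, x_test_feats])
y = np.concatenate([np.zeros(len(x_train_feats)), np.ones(len(x_test_feats))])

# Hold out part of the mixed data to monitor the discriminator itself.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, stratify=y)

disc = LogisticRegression(max_iter=1000, class_weight="balanced")
disc.fit(X_fit, y_fit)
print("held-out accuracy:", disc.score(X_val, y_val))
```

As a side benefit, a held-out accuracy near 0.5 tells us the two sets are hard to distinguish in the first place, in which case an ordinary random split is already adequate.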

Similar to the discriminator in a GAN, it is not difficult to derive that the theoretical optimal solution for $D(x)$ is:

\begin{equation}D(x) = \frac{q(x)}{p(x)+q(x)}\label{eq:d}\end{equation}
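(For readers who want the intermediate step: the loss above can be written as $-\int \left[p(x)\log (1 - D(x)) + q(x)\log D(x)\right] dx$ and minimized pointwise for each $x$; setting the derivative with respect to $D(x)$ to zero yields $\frac{p(x)}{1 - D(x)} = \frac{q(x)}{D(x)}$, which rearranges to the formula above.)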

In other words, once trained, the discriminator's output can be read as the density of the test distribution relative to the mixture of the two: the larger $D(x)$ is, the more "test-like" the sample $x$ is.

Importance Sampling

Whether we are optimizing a model or computing metrics, what we really want is to do so on the test set. That is, for a given objective $f(x)$ (such as the model's loss), we want to compute:

\[\mathbb{E}_{x\sim q(x)}[f(x)] = \int q(x) f(x) dx\]

However, evaluating the objective $f(x)$ usually requires knowing the true label of $x$, which we lack for the test set, so this expectation cannot be computed directly. We do know the labels of the training set, though, so importance sampling comes to the rescue:

\[\int q(x) f(x) dx=\int p(x)\frac{q(x)}{p(x)} f(x) dx=\mathbb{E}_{x\sim p(x)}\left[\frac{q(x)}{p(x)} f(x)\right]\]

According to formula $\eqref{eq:d}$, we know that $\frac{q(x)}{p(x)}=\frac{D(x)}{1-D(x)}$, so it finally becomes:

\begin{equation}\mathbb{E}_{x\sim q(x)}[f(x)] = \mathbb{E}_{x\sim p(x)}\left[\frac{D(x)}{1-D(x)} f(x)\right]\label{eq:w}\end{equation}

Simply put, the idea of importance sampling is to "pick out" those samples from the training set that are similar to the test set and assign them higher weights.
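Concretely, the weights can be computed from the discriminator's predicted probabilities. A short sketch continuing the hypothetical names from above (the clipping threshold is an arbitrary numerical-stability choice, not part of the derivation):

```python
import numpy as np

# disc.predict_proba(...)[:, 1] gives D(x) for each training sample.
d = disc.predict_proba(x_train_feats)[:, 1]

# Clip away saturated predictions so the ratio stays finite.
d = np.clip(d, 1e-6, 1 - 1e-6)
w = d / (1 - d)  # importance weight w(x) = D(x) / (1 - D(x))
```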

Final Strategy

From formula $\eqref{eq:w}$, we can obtain two strategies:

The first is to weight samples directly according to the formula: still split the training and validation sets by random shuffling, but assign each sample the weight $w(x)=\frac{D(x)}{1-D(x)}$ during training. It is worth noting that a similar approach has already circulated among competition participants, except that the weight used was $D(x)$. I cannot assert which works better in practice, but from the theoretical derivation, $\frac{D(x)}{1-D(x)}$ is the more principled choice.
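Most training APIs accept per-sample weights directly. A short sketch with a hypothetical task model and labels (`y_train_labels` is assumed; Keras's `model.fit(..., sample_weight=w)` works analogously):

```python
from sklearn.linear_model import LogisticRegression

# Any estimator/trainer that accepts per-sample weights can consume w.
task_model = LogisticRegression(max_iter=1000)
task_model.fit(x_train_feats, y_train_labels, sample_weight=w)
```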

The second strategy is to actually sample the corresponding validation set. This is not difficult. Suppose all samples in the training set are $x_1, x_2, \dots, x_N$. We normalize the weights:

\begin{equation}p_i = \frac{w(x_i)}{\sum\limits_{j=1}^N w(x_j)}\end{equation}

Then sample independently with replacement from the distribution $p_1, p_2, \dots, p_N$ until the designated validation-set size is reached. Note that sampling with replacement is essential: the same sample may be drawn several times and must then appear that many times in the validation set. Deduplicating would distort the distribution.
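With numpy, this sampling step is a one-liner, reusing the hypothetical arrays from the sketches above (the validation-set size is a placeholder):

```python
import numpy as np

p = w / w.sum()                  # the normalized weights p_i
n_val = 5000                     # placeholder validation-set size
rng = np.random.default_rng(0)

# Sample WITH replacement, as required; duplicates are kept.
val_idx = rng.choice(len(w), size=n_val, replace=True, p=p)
x_val, y_val = x_train_feats[val_idx], y_train_labels[val_idx]
```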

Summary

This article measured the discrepancy between the training set and the test set by training a discriminator on the two. Combined with importance sampling, this lets us either sample a validation set that is closer to the test set, or weight the training samples so that optimization on the training set deviates less from the test set.