By 苏剑林 | Oct 14, 2017
In supervised machine learning, we often speak of the training set (train), the validation set (validation), and the test set (test). The distinction among these three can be confusing, especially the difference between the validation set and the test set.
Partitioning
If we already have a large labeled dataset and want to evaluate a supervised model on it, we typically partition the dataset by uniform random sampling into a training set, a validation set, and a test set, with no overlap among the three. A common ratio is 8:1:1, though the proportion is essentially arbitrary. Partitioned this way, all three sets are identically distributed.
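To make this concrete, here is a minimal NumPy sketch of such a uniform random split (the 8:1:1 ratio, the function name, and the random seed are assumptions for illustration, not code from this post):

```python
import numpy as np

def split_dataset(X, y, ratios=(0.8, 0.1, 0.1), seed=42):
    """Randomly split a labeled dataset into non-overlapping
    train / validation / test subsets (here in an 8:1:1 ratio)."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    idx = np.random.default_rng(seed).permutation(len(X))  # uniform random shuffle
    n_train = int(ratios[0] * len(X))
    n_val = int(ratios[1] * len(X))
    train, val, test = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```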
If it is a competition where the organizers provide a labeled dataset (as the training set) and an unlabeled test set, we usually carve a validation set out of the training set ourselves, and in this case we typically do not partition a further test set. There are two likely reasons: 1. competition organizers are generally very stingy, so samples in the training set are already scarce; 2. we cannot guarantee that the test set to be submitted is identically distributed with the training set, so partitioning another test set that is identically distributed with the training set would not be very meaningful.
Parameters
Once we have a model, the training set is used to train its parameters; to be precise, it is generally used for gradient descent. The validation set is used after each epoch to measure the accuracy of the current model. Because the validation set does not overlap with the training set, this accuracy is trustworthy. So why do we still need a test set?
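As an illustration only (a hedged Keras sketch on synthetic data, not code from this post), the snippet below shows the typical arrangement: gradient descent sees only the training set, while the validation set is merely evaluated at the end of every epoch:

```python
import numpy as np
from tensorflow import keras

# Hypothetical synthetic data, just so the sketch runs end to end;
# in practice these would come from the split described earlier.
x_train = np.random.randn(800, 20).astype("float32")
y_train = np.random.randint(0, 2, size=800)
x_val = np.random.randn(100, 20).astype("float32")
y_val = np.random.randint(0, 2, size=100)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Gradient descent runs only on (x_train, y_train); the validation set
# is evaluated once per epoch and never used to update the weights.
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=10, batch_size=32)
# history.history["val_accuracy"] holds the per-epoch validation accuracy.
```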
This requires distinguishing between different kinds of model parameters. For a model, parameters can be divided into ordinary parameters and hyperparameters. Assuming we are not bringing in reinforcement learning, the ordinary parameters are those that can be updated by gradient descent, i.e. the parameters updated using the training set. In addition, there are hyperparameters, such as the number of network layers, the number of nodes per layer, the number of iterations, the learning rate, and so on. These are not within the scope of gradient descent updates. Although there are algorithms for searching over hyperparameters, in most cases we still tune them by hand based on the validation set.
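For instance, a very crude form of such validation-driven tuning might look like the loop below (continuing the sketch above; the grid of hidden sizes and learning rates is an arbitrary assumption):

```python
# Try a few hyperparameter settings and keep the one with the best
# validation accuracy; the training set still handles gradient descent.
best_val_acc, best_config = 0.0, None
for hidden in [32, 64, 128]:
    for lr in [1e-2, 1e-3]:
        candidate = keras.Sequential([
            keras.Input(shape=(20,)),
            keras.layers.Dense(hidden, activation="relu"),
            keras.layers.Dense(2, activation="softmax"),
        ])
        candidate.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])
        candidate.fit(x_train, y_train, epochs=10, batch_size=32, verbose=0)
        _, val_acc = candidate.evaluate(x_val, y_val, verbose=0)  # validation decides
        if val_acc > best_val_acc:
            best_val_acc, best_config = val_acc, (hidden, lr)
```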
So
In a narrow sense, the validation set does not participate in gradient descent, which is to say it is not "trained on." In a broad sense, however, the validation set does participate in a "manual tuning" process: we adjust the number of iterations, the learning rate, and so on based on its results, so that the result is optimal on the validation set. In that sense, the validation set can also be considered to have participated in training.
Thus it becomes clear: we still need a set that has never participated in training in any way, and that is the test set. The test set is used neither for gradient descent nor for choosing hyperparameters; it is used only to measure the final accuracy once the model has finished training.
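Continuing the same hedged sketch, the test set would only enter at this very last step (x_test and y_test are assumed to come from the earlier split and to have played no role so far):

```python
best_hidden, best_lr = best_config

# Retrain with the chosen hyperparameters, still only on the training set.
final_model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(best_hidden, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])
final_model.compile(optimizer=keras.optimizers.Adam(learning_rate=best_lr),
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
final_model.fit(x_train, y_train, epochs=10, batch_size=32, verbose=0)

# The test set is touched exactly once, only to report the final accuracy.
test_loss, test_acc = final_model.evaluate(x_test, y_test, verbose=0)
```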
However
Sharp-eyed readers may have noticed that this is, in fact, an endless regress. If the accuracy on the test set turns out to be poor, we will still go back and adjust various parameters of the model, at which point the test set could also be said to have participated in training. Well, perhaps we need a "test-test set," and then a "test-test-test set"...
Forget it, let's just stop at the test set.