Most computational models rely on labelled data for training, but labelled data is not always easily available for a given subject, and labelling data requires human oversight, time and cost. Unlabelled data, by contrast, is plentiful yet often deemed useless. Researchers at Carnegie Mellon University have consistently worked with such unlabelled data to show that it can boost learning accuracy across a variety of problems. In their most recent research, the CMU team explored how we can “tap into rich information in unlabelled data and leverage them to assess a model’s performance without labels”.
In machine learning, the generalisation performance of a learning algorithm is the performance of its learned models on out-of-sample data. It assesses a model’s ability to process new data and generate accurate predictions after training on a training set. Traditionally, generalisation performance is measured by a supervised procedure that divides labelled data into a training set (the examples the network is trained on), a validation set (for fine-tuning hyperparameters) and a test set (for evaluating performance). The research demonstrates a simple procedure that can accurately estimate the generalisation error with only unlabelled data.
Stochastic Gradient Descent
The method leverages stochastic gradient descent (SGD), a popular neural network optimisation algorithm that computes the error and updates the model for each example in the training dataset. Because the model is updated one training example at a time, different runs of SGD find different solutions. If these solutions are not perfect, they will disagree with each other on some of the unseen data. CMU states that this disagreement can be used to estimate generalisation error without labels.
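The per-example update at the heart of SGD can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration on a linear least-squares model (the learning rate, epoch count and data are made up, not from the paper); note how the seed controls both the initialisation and the ordering of examples, so different seeds trace different optimisation paths.

```python
import numpy as np

# Minimal sketch of per-example SGD on a linear least-squares model.
# All names and values here (lr, n_epochs, the toy data) are illustrative.
def sgd(X, y, lr=0.1, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])          # random initialisation
    for _ in range(n_epochs):
        order = rng.permutation(len(X))      # random ordering of examples
        for i in order:
            grad = (X[i] @ w - y[i]) * X[i]  # error gradient for ONE example
            w -= lr * grad                   # update after each example
    return w

# Two runs that differ only in their random seed.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w1, w2 = sgd(X, y, seed=0), sgd(X, y, seed=1)
```

In overparameterised deep networks, unlike this tiny convex example, such seed changes typically lead to genuinely different solutions, which is what the disagreement-based estimate exploits.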
The paper suggests running SGD twice on a given model, keeping the hyperparameters and training data the same but using different random seeds, which yields two different solutions. One then measures how often the two networks’ predictions disagree on a new, unlabelled test dataset.
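The procedure above can be sketched end to end. This is a hedged, small-scale stand-in: a hand-rolled logistic model trained with SGD replaces the deep networks used in the research, and all data and hyperparameters are invented for illustration. The key point is that the disagreement rate at the end is computed without ever touching a label.

```python
import numpy as np

# Toy stand-in for the paper's procedure: train two copies of the same
# model with SGD -- identical data and hyperparameters, different seeds --
# then measure how often they disagree on unlabelled data.
def train_logistic_sgd(X, y, seed, lr=0.5, n_epochs=5):
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])   # seed affects init...
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):        # ...and example order
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w -= lr * (p - y[i]) * X[i]
    return w

rng = np.random.default_rng(42)
X_train = rng.normal(size=(400, 10))
y_train = (X_train @ rng.normal(size=10) + rng.normal(size=400) > 0).astype(float)
X_unlab = rng.normal(size=(2000, 10))            # unlabelled test pool

w1 = train_logistic_sgd(X_train, y_train, seed=0)
w2 = train_logistic_sgd(X_train, y_train, seed=1)
pred1, pred2 = (X_unlab @ w1 > 0), (X_unlab @ w2 > 0)
disagreement = np.mean(pred1 != pred2)           # needs no labels at all
```

The research’s claim is that, for deep networks trained this way, this disagreement rate tracks the (unknown) test error.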
The team’s research estimated the test error with unlabelled data, finding the disagreement rate to be approximately equal to the average test error of the two models. This builds on, and strengthens, Nakkiran and Bansal’s (2020) observation that the disagreement rate on test data is nearly equal to the average test error. That earlier work, however, required the second run to be trained on an altogether fresh training set. Thus, the team said, “our observation generalises prior work by showing that the same phenomenon holds for small changes in hyperparameters and, more importantly, the same dataset, which makes the procedure relevant for practical applications.” Furthermore, the team’s disagreement rate is a meaningful estimator of test accuracy because its calculation requires only a new unlabelled dataset, not a new labelled dataset as in N&B’20.
The team’s mathematical formulation rests on the triangle inequality: 0 ≤ E[h(x) ≠ h′(x)] ≤ E[h(x) ≠ y] + E[h′(x) ≠ y], where h(x) is a classifier’s prediction and y is the true label. “By the triangle inequality, the disagreement rate can be anywhere between 0 and 2 times the test error”, the paper notes.
The final parameters found by SGD depend on several random variables, including the random initialisation, the random ordering of a fixed training dataset and the random sampling of the dataset. Understanding which source of randomness is responsible for which properties is integral to understanding the phenomenon. For the research, this was enabled by observing how the behaviour of disagreement changed when different randomness sources were isolated. Fixing one variable while changing another allowed the team to examine, for instance, the effects of different initialisations.
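Isolating randomness sources amounts to giving each source its own seed. The sketch below (a hypothetical toy setup, reusing a simple logistic model rather than the paper’s networks) separates the seed for initialisation from the seed for data ordering, so either can be held fixed while the other varies.

```python
import numpy as np

# Sketch of isolating randomness sources: separate seeds for the random
# initialisation and for the random ordering of the training data.
# Model, data and hyperparameters are illustrative, not from the paper.
def train(X, y, init_seed, order_seed, lr=0.5, n_epochs=5):
    w = np.random.default_rng(init_seed).normal(scale=0.1, size=X.shape[1])
    order_rng = np.random.default_rng(order_seed)
    for _ in range(n_epochs):
        for i in order_rng.permutation(len(X)):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w -= lr * (p - y[i]) * X[i]
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(float)

# Same initialisation, different ordering -- and vice versa.
w_same_init  = [train(X, y, init_seed=0, order_seed=s) for s in (0, 1)]
w_same_order = [train(X, y, init_seed=s, order_seed=0) for s in (0, 1)]
```

Comparing the disagreement of each pair reveals how much each randomness source contributes on its own.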
The disagreement remained consistently close to the test error across these sources of randomness. Notably, the method used only one set of training data on two copies of the model, and the team empirically observed the phenomenon across different model architectures and image recognition benchmarks.
While the research explains the observation, the question of why a single pair of models suffices still stands. The proposed method can be leveraged to estimate the generalisation error of black-box deep neural networks with only unlabelled data, and the researchers note that, in practice, a single pair of models is enough for an accurate estimate. In any case, this research broadens the horizons of leveraging unlabelled data to estimate generalisation error.