
Research shows unlabelled data can help in estimating generalisation 

The research demonstrates a simple procedure that can accurately estimate the generalisation error with only unlabelled data.


Most computational models rely on labelled data for training, but labelled data is not always readily available, and labelling it requires human oversight, time and cost. Unlabelled data, by contrast, is available in plenty yet is often deemed useless. Carnegie Mellon University has long worked on leveraging such unlabelled data, showing that it can boost learning accuracy for various problems. In its most recent research, CMU has explored how we can “tap into rich information in unlabelled data and leverage them to assess a model’s performance without labels”. 

In machine learning, the generalisation performance of a learning algorithm is the performance of the models it learns on out-of-sample data. It reflects a model’s ability to process new data and generate accurate predictions after being trained on a training set. Traditionally, generalisation performance is measured in a supervised manner by splitting labelled data into a training set (examples to train the network on), a validation set (to fine-tune hyperparameters) and a test set (to evaluate performance). The research demonstrates a simple procedure that can accurately estimate the generalisation error with only unlabelled data. 
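For context, here is a minimal sketch of that conventional, label-dependent evaluation; the dataset, model and split ratios are illustrative choices, not taken from the paper.

```python
# Conventional, label-dependent evaluation: split labelled data into
# train/validation/test, tune on validation, report accuracy on the test set.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# 60% train, 20% validation, 20% test -- all three splits need labels.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))  # used for tuning
print("test accuracy:", model.score(X_test, y_test))      # the generalisation estimate
```

The CMU procedure described below removes the need for the labelled test set in this last step.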

Stochastic Gradient Descent

The method leverages SGD, a popular neural network optimisation algorithm that computes the error and updates the model for each example in the training dataset. Because the model is updated one training example at a time, different runs of SGD find different solutions. If those solutions are not perfect, they will disagree with each other on some of the unseen data, and CMU states that this disagreement can be used to estimate generalisation error without labels.

The paper suggests running SGD twice on a given model: the hyperparameters and training data stay the same, but the random seeds differ, yielding two different solutions. One then measures how often the two networks’ predictions disagree on a new, unlabelled test dataset.
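A rough sketch of what this procedure might look like in practice is shown below; the dataset, architecture and hyperparameters are placeholder choices for illustration, not the paper’s actual setup.

```python
# Disagreement-based estimate: train two copies of the same network with
# identical hyperparameters but different random seeds, then measure how
# often they disagree on data whose labels are never used.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def train_with_seed(seed):
    # Same architecture and hyperparameters; only the seed (which controls
    # initialisation and data ordering) changes between the two runs.
    net = MLPClassifier(hidden_layer_sizes=(64,), solver="sgd",
                        learning_rate_init=0.05, max_iter=200, random_state=seed)
    return net.fit(X_train, y_train)

h1, h2 = train_with_seed(1), train_with_seed(2)

# The disagreement rate needs only the unlabelled inputs X_test ...
disagreement = np.mean(h1.predict(X_test) != h2.predict(X_test))
# ... whereas the true average test error needs y_test (computed here only
# so the two quantities can be compared).
avg_test_error = 1 - 0.5 * (h1.score(X_test, y_test) + h2.score(X_test, y_test))

print(f"disagreement rate:  {disagreement:.3f}")
print(f"average test error: {avg_test_error:.3f}")
```

The key point is that the disagreement rate never touches the labels; the labelled test error appears only to check how close the label-free estimate comes.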

In its experiments, the team estimated the test error with unlabelled data and found the disagreement rate to be approximately equal to the average test error of the two models. This builds on, and strengthens, an observation by Nakkiran and Bansal (2020), who noted that the disagreement rate on test data is nearly equal to the average test error; their result, however, required the second model to be trained on an altogether fresh training set. Thus, the team said, “our observation generalises prior work by showing that the same phenomenon holds for small changes in hyperparameters and, more importantly, the same dataset, which makes the procedure relevant for practical applications.” Moreover, the disagreement rate is a practical estimator of test accuracy because computing it requires only a new unlabelled dataset, not a new labelled dataset as in N&B’20.

Methodology

The team’s analysis starts from the triangle inequality: 0 ≤ E[h(x) ≠ h′(x)] ≤ E[h(x) ≠ y] + E[h′(x) ≠ y], where h(x) and h′(x) are the two classifiers’ predictions and y is the true label. Intuitively, whenever the two classifiers disagree on an example, at least one of them must be wrong, so the disagreement rate cannot exceed the sum of their error rates. “By the triangle inequality, the disagreement rate can be anywhere between 0 and 2 times the test error”, the paper notes.
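A toy numerical check of the bound, with hand-picked predictions rather than real model outputs:

```python
# Toy illustration of the triangle inequality on hard-coded predictions.
import numpy as np

y  = np.array([0, 1, 1, 0, 1, 0, 1, 1])   # true labels (illustrative)
h1 = np.array([0, 1, 1, 1, 1, 0, 1, 0])   # model 1: wrong on examples 3 and 7
h2 = np.array([0, 1, 0, 1, 1, 0, 1, 1])   # model 2: wrong on examples 2 and 3

disagreement = np.mean(h1 != h2)                    # 0.25, needs no labels
err1, err2 = np.mean(h1 != y), np.mean(h2 != y)     # 0.25 and 0.25

# The bound: 0 <= disagreement <= err1 + err2
print(disagreement, err1, err2, disagreement <= err1 + err2)
```

In this toy case the disagreement happens to match the average error of 0.25, echoing the paper’s empirical finding, although the bound alone only guarantees the much wider range of 0 to 0.5.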

The final parameters found by SGD depend on several random variables, including the random initialisation, the random ordering of a fixed training dataset and the random sampling of that dataset. Understanding which source of randomness is responsible for which property is integral to understanding the phenomenon. The researchers probed this by isolating the different sources of randomness and observing how the disagreement behaved; fixing one variable while changing another allowed the team, for instance, to examine the effect of different initialisations alone.
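Below is a hedged sketch of how such an ablation might be set up, using a tiny hand-rolled logistic model trained by SGD on synthetic data (not the paper’s models or datasets), with separate seeds for the initialisation and the data ordering.

```python
# Isolating randomness sources: separate seeds control the initialisation
# and the ordering of a fixed dataset, so each can be varied while the
# other is held fixed.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
w_true = rng.normal(size=20)
y = (X @ w_true > 0).astype(int)
X_train, y_train, X_unlab = X[:1000], y[:1000], X[1000:]   # labels of X_unlab unused

def sgd_train(init_seed, order_seed, epochs=5, lr=0.1):
    w = np.random.default_rng(init_seed).normal(scale=0.1, size=20)  # random init
    order_rng = np.random.default_rng(order_seed)                    # random ordering
    for _ in range(epochs):
        for i in order_rng.permutation(len(X_train)):
            p = 1 / (1 + np.exp(-X_train[i] @ w))     # logistic prediction
            w -= lr * (p - y_train[i]) * X_train[i]   # SGD update on one example
    return w

def disagreement(w1, w2):
    return np.mean((X_unlab @ w1 > 0) != (X_unlab @ w2 > 0))

# Vary only the initialisation, then only the data ordering.
print("different init, same order:", disagreement(sgd_train(1, 0), sgd_train(2, 0)))
print("same init, different order:", disagreement(sgd_train(1, 0), sgd_train(1, 1)))
```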

Across these sources of randomness, the disagreement stayed consistently close to the test error. Moreover, the method used only one training set and two copies of the model, and the phenomenon was observed empirically across different model architectures and image-recognition benchmarks. 

In conclusion

While the research explains the observation, the question of how and why a single pair of models is sufficient remains open. The proposed method can be used to estimate the generalisation error of black-box deep neural networks with only unlabelled data, and the authors note that, in practice, a single pair of models is enough to estimate it accurately. Either way, the work broadens the horizons of leveraging unlabelled data to estimate generalisation error. 


Avi Gopani

Avi Gopani is a technology journalist who analyses industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories, curated with a focus on the evolving technologies of artificial intelligence and data analytics.