How Important Is Labelled Data For Improving Machine Learning Robustness?

A self-driving car should be accurate — there is no room for second-guessing. A self-driving car’s accuracy improves drastically if it has been trained on data that has been annotated with parameters like colours, shapes, sizes, signs and angles.

The question is: where can one get that kind of data?

Today, data labelling has become an industry of its own. In developing nations such as India, data labellers with minimal formal education work out of remote locations. A common assumption is that more labelled data leads to more robust machine learning models.

However, that’s not always the case. Real-time data comes with its own set of uncertainties, and poor data collection practices add noise.

So, assessing the reliability of a machine learning model shouldn’t stop at measuring robustness. It also calls for a diverse toolbox for understanding the model: visualisation, disentanglement of relevant features, and measuring how well it extrapolates to different datasets or to the long tail of natural but unusual inputs.

Researchers have also been looking at common, non-adversarial visual corruptions such as fog, blur, or pixelation as a rich source of ideas for achieving adversarial robustness.

For example, fog or blur effects on images have emerged as another avenue for measuring the robustness of computer vision models. Robustness to such common corruptions is considered to be linked to adversarial robustness, and corruption robustness has been proposed as an easily computed indicator of it.
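To make the idea of corruption robustness concrete, here is a minimal sketch that compares a classifier's accuracy on clean images with its accuracy on blurred copies. Everything in it is an illustrative assumption rather than anything from the research discussed here: the 3x3 mean filter stands in for "blur", and `bright_spot_classifier` is a hypothetical toy model.

```python
from typing import Callable, List, Optional, Sequence

Image = List[List[float]]  # grayscale image: rows of pixel intensities in [0, 1]


def box_blur(img: Image) -> Image:
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            vals = [img[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def accuracy(classify: Callable[[Image], int],
             images: Sequence[Image],
             labels: Sequence[int],
             corrupt: Optional[Callable[[Image], Image]] = None) -> float:
    """Fraction of correct predictions, optionally after corrupting each input."""
    correct = 0
    for img, y in zip(images, labels):
        x = corrupt(img) if corrupt is not None else img
        correct += int(classify(x) == y)
    return correct / len(labels)


def bright_spot_classifier(img: Image) -> int:
    """Hypothetical toy model: class 1 if any pixel is brighter than 0.5."""
    return 1 if max(max(row) for row in img) > 0.5 else 0


bright = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
dark = [[0.0] * 3 for _ in range(3)]
images, labels = [bright, dark], [1, 0]

clean_acc = accuracy(bright_spot_classifier, images, labels)
blur_acc = accuracy(bright_spot_classifier, images, labels, corrupt=box_blur)
print(clean_acc, blur_acc)  # 1.0 0.5
```

The gap between the two numbers is the kind of easily computed signal the paragraph above refers to. A real evaluation would use a trained model and a standard set of corruptions at several severity levels.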

Despite the promise of adversarial training, its reliance on large numbers of labeled examples has presented a major challenge towards developing robust classifiers. 

To assess the importance of annotated data for training, researchers at DeepMind propose two simple Unsupervised Adversarial Training (UAT) approaches, tested on two standard image classification benchmarks.

Why Generalise Adversities

One of the most successful approaches for obtaining adversarially robust classifiers is adversarial training. A central challenge for adversarial training has been the difficulty of adversarial generalisation. Previous works have argued that adversarial generalisation may simply require more data than natural generalisation. In this paper, researchers at DeepMind pose a simple question: is labeled data necessary, or is unsupervised data sufficient?
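For readers new to adversarial training, the loop can be sketched on a tiny logistic-regression model: perturb each input in the direction that most increases the loss, then update the weights on the perturbed input. This is a minimal sketch under stated assumptions. The single-step FGSM attack is used here only for brevity and is not taken from the paper, and the model, `eps`, and `lr` values are purely illustrative.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w: List[float], b: float, x: List[float], y: int) -> float:
    """Binary cross-entropy for one example."""
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm(w: List[float], b: float, x: List[float], y: int, eps: float) -> List[float]:
    """One-step attack: move each feature by eps along the sign of d(loss)/dx."""
    p = predict(w, b, x)
    # For logistic regression, d(loss)/d(x_i) = (p - y) * w_i.
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]


def adversarial_training_step(w: List[float], b: float, x: List[float], y: int,
                              eps: float, lr: float) -> Tuple[List[float], float]:
    """Gradient-descent update computed on the adversarial example, not the clean one."""
    x_adv = fgsm(w, b, x, y, eps)
    p = predict(w, b, x_adv)
    new_w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]
    new_b = b - lr * (p - y)
    return new_w, new_b
```

Note that the update needs the label `y` twice: once to craft the attack and once to compute the training loss. That dependence on labels is what the question above is probing.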

To test this, they formalise two approaches: Unsupervised Adversarial Training (UAT) with online targets and UAT with fixed targets.
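The article does not spell out how the two variants differ, so the following is only a sketch of one plausible reading. It assumes that "fixed targets" are pseudo-labels produced once by a separately trained base classifier, and that "online targets" are the model's own current predictions on the clean input. The function names are hypothetical, and the inputs are plain probability vectors rather than real network outputs.

```python
import math
from typing import List, Sequence


def one_hot(label: int, num_classes: int) -> List[float]:
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def cross_entropy(target: Sequence[float], pred: Sequence[float]) -> float:
    """H(target, pred); terms with zero target mass are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q); terms with zero mass in p are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def fixed_target_loss(pseudo_label: int, probs_adv: Sequence[float]) -> float:
    """Assumed 'fixed targets': the label comes from a frozen base classifier,
    and the model is penalised on its prediction for the perturbed input."""
    return cross_entropy(one_hot(pseudo_label, len(probs_adv)), probs_adv)


def online_target_loss(probs_clean: Sequence[float],
                       probs_adv: Sequence[float]) -> float:
    """Assumed 'online targets': the target is the model's own prediction on the
    clean input, so the loss only asks that a perturbation not change the output."""
    return kl_divergence(probs_clean, probs_adv)
```

Under this reading, neither loss needs a ground-truth label for the example itself, which is why both can be applied to unlabeled data. In a full training loop the perturbed input would be chosen to maximise these losses before the model is updated to minimise them.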

In the experiment, the CIFAR-10 training set was first split into halves: the first 20,000 examples were used to train the base classifier and the latter 20,000 to train a UAT model. Of the latter 20,000, 4,000 examples were treated as labeled and the remaining 16,000 as unlabeled.
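The split described above can be written down as index ranges. This is a minimal sketch: the sizes come from the text, but the function name is hypothetical, and taking the first 4,000 examples of the second half as the labeled ones is an assumption, since the text does not say how they were chosen.

```python
from typing import Dict


def split_indices(n_base: int = 20_000,
                  n_uat: int = 20_000,
                  n_labeled: int = 4_000) -> Dict[str, range]:
    """Return index ranges for the base-classifier half and the UAT half.

    Assumption: the labeled examples are the first n_labeled of the UAT half.
    """
    uat_start = n_base
    uat_end = n_base + n_uat
    return {
        "base": range(0, n_base),                              # trains the base classifier
        "uat_labeled": range(uat_start, uat_start + n_labeled),  # labels kept
        "uat_unlabeled": range(uat_start + n_labeled, uat_end),  # labels discarded
    }


splits = split_indices()
print({name: len(idx) for name, idx in splits.items()})
# {'base': 20000, 'uat_labeled': 4000, 'uat_unlabeled': 16000}
```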

These experiments reveal that one can reach near state-of-the-art adversarial robustness with as few as 4,000 labels for CIFAR-10 (10 times fewer than in the original dataset) and as few as 1,000 labels for SVHN (100 times fewer than in the original dataset). The authors also demonstrate that their method can be applied to uncurated data obtained from simple web queries.

This approach improves the state-of-the-art on CIFAR-10 by 4% against the strongest known attack. These findings open a new avenue for improving adversarial robustness using unlabeled data.

Key Takeaways

This work:

  • Addresses the more realistic case where unlabeled data is also uncurated, opening a new avenue for improving adversarial training.
  • Posits that unlabeled data can be a competitive alternative to labelled data for training adversarially robust models.
  • Shows theoretically that, in a simple statistical setting, the sample complexity for learning an adversarially robust model from unlabeled data matches the fully supervised case.

Since increasing robustness against one distortion type can decrease robustness against others, measuring performance on different distortions is important to avoid overfitting to a specific type, especially when a defence is constructed with adversarial training. This kind of broad evaluation is proving to be crucial for the future of machine learning reliability.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
