Weak Supervision: The Art Of Training ML Models From Noisy Data


A deep learning model’s performance improves as the size of the dataset increases. However, there is a catch: deep learning models have hundreds of millions of parameters, which means they require a large amount of labelled data.

Hand-labelled training sets are expensive and time-consuming (taking months to years) to create. Some datasets call for domain expertise (e.g., medical datasets). More often than not, such labelled datasets cannot even be repurposed for new objectives. Given the costs and inflexibility of hand-labelling, training sets pose a big hurdle in deploying machine learning models.



Enter weak supervision, a branch of machine learning where limited and imprecise sources are used to label large amounts of training data in a supervised setting. In this approach, inexpensive weak labels are used to build a strong predictive model.

Weak supervision

The most common approaches in machine learning are supervised and unsupervised learning. However, there is a whole spectrum of supervision between the two extremes. Weak supervision sits on this spectrum, between fully supervised and unsupervised learning. It can be described as an approach that uses data with noisy labels. These labels are usually generated programmatically, by applying heuristics to signals in the unlabelled data.

In its most common form, weak supervision requires a small amount of labelled training data and a large amount of unlabelled data. The goal is to create labels for the unlabelled data so that it can be used to train the model. There are two prerequisites: the unlabelled data must contain information relevant to the task, and the developer must generate enough correctly labelled data to overcome the noise introduced by the weak supervision approach.
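To make the idea concrete, here is a minimal sketch (not from the article; all data and function names are invented) of labelling an unlabelled corpus with cheap keyword heuristics and combining their votes:

```python
# Weak supervision sketch: a few keyword heuristics each vote on a label or
# abstain, and a majority vote over the non-abstaining heuristics produces a
# (noisy) weak label for each unlabelled document.

ABSTAIN = None

def lf_contains_refund(text):
    # Weak heuristic: mails mentioning "refund" are often complaints.
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_contains_thanks(text):
    # Weak heuristic: mails mentioning "thank" are often praise.
    return "praise" if "thank" in text.lower() else ABSTAIN

def weak_label(text, heuristics):
    """Apply every heuristic and majority-vote the non-abstaining ones."""
    votes = [v for v in (h(text) for h in heuristics) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN  # no heuristic fired; the instance stays unlabelled
    return max(set(votes), key=votes.count)

unlabelled = [
    "I want a refund for this broken item",
    "Thanks so much, great service!",
    "Where is my order?",
]
heuristics = [lf_contains_refund, lf_contains_thanks]
labels = [weak_label(t, heuristics) for t in unlabelled]
```

The resulting labels are noisy, which is why a small hand-labelled set is still needed to estimate and correct for that noise.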

There are three types of weak supervision:

Incomplete supervision: Only a small subset of the training data is labelled, while the rest remains unlabelled.

Inexact supervision: Only coarse-grained labels are given.

Inaccurate supervision: The given labels may or may not be the ground truth. This usually happens when the annotator is careless or the data is too difficult to categorise correctly.
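The three defects can be simulated on a toy label list (all data here is invented for illustration):

```python
# Simulating the three types of weak supervision on a clean 3-class label list.
import random

random.seed(0)
clean = ["cat", "dog", "bird", "cat", "dog", "bird"]

# Incomplete supervision: only a small subset keeps its label.
incomplete = [y if i < 2 else None for i, y in enumerate(clean)]

# Inexact supervision: only coarse-grained labels survive
# (every fine-grained class collapses to "animal").
inexact = ["animal" for _ in clean]

# Inaccurate supervision: some labels are randomly flipped to a wrong class.
classes = ["cat", "dog", "bird"]
inaccurate = [
    random.choice([c for c in classes if c != y]) if random.random() < 0.3 else y
    for y in clean
]
noise_rate = sum(a != b for a, b in zip(clean, inaccurate)) / len(clean)
```

Each variant is a different way the same clean supervision signal can degrade in practice.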

Weak supervision frameworks

Some of the most significant weak supervision frameworks are:

Snorkel: An open-source weak supervision framework from Stanford University. Using a small amount of labelled data and a large amount of unlabelled data, Snorkel lets users write labelling functions in Python over multiple signals in the dataset. The weak signals from the hand-labelled data and from the labelling-function-generated labels are then used to train a generative model. This model produces probabilistic labels that can in turn train the target model.


ASTRA: A weak supervision framework for training deep neural networks. It uses automatically generated, weakly labelled data for tasks where collecting large-scale labelled training data is expensive. ASTRA employs a teacher-student architecture and leverages domain-specific rules, a large amount of unlabelled data, and a small amount of labelled data. The key components of this framework are:


  • Weak rules: Expressed as Python-labelling functions, these are domain-specific rules that rely on heuristics for annotating text instances with weak labels.
  • Student: It is a base model that provides pseudo-labels for all instances.
  • RAN teacher: The Rule Attention Teacher Network aggregates the predictions of multiple weak sources with instance-specific weights to compute a single pseudo-label for each instance.
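The teacher's aggregation step can be sketched as follows. This is a toy version with invented numbers, simplified far beyond the real ASTRA code: each weak rule votes or abstains, each rule carries an instance-specific attention weight, and the weighted rule votes plus the student's own prediction combine into one soft pseudo-label.

```python
# RAN-teacher-style aggregation: instance-specific attention weights over weak
# rules, with the student's prediction as an always-present fallback voter.
def ran_teacher(rule_votes, rule_attention, student_prob):
    """rule_votes: P(POS) in [0,1] per rule, or None when the rule abstains.
    rule_attention: instance-specific weight in [0,1] for each rule.
    student_prob: the student model's own P(POS) for this instance."""
    num = den = 0.0
    for vote, att in zip(rule_votes, rule_attention):
        if vote is None:
            continue  # abstaining rules contribute nothing
        num += att * vote
        den += att
    # The student always votes, with weight 1 in this toy version.
    num += student_prob
    den += 1.0
    return num / den

# One instance: the first and third rules fire, the second abstains.
soft_label = ran_teacher([1.0, None, 0.0], [0.9, 0.5, 0.8], student_prob=0.7)
```

When every rule abstains, the pseudo-label falls back to the student's prediction, which is the behaviour the teacher-student loop relies on for uncovered instances.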

ConWea: This framework provides contextualised weak supervision for text classification. Contextualised representations of word occurrences, together with seed-word information, are used to automatically distinguish multiple interpretations of the same word. This yields a contextualised corpus, which is then used to train the classifier and to expand the seed words iteratively, making the whole weak supervision process fully contextualised.
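A toy sketch of the disambiguation idea (all vectors here are invented; a real system would use contextual embeddings from a language model): the same surface word gets different class associations depending on which class's seed contexts its occurrence vector is closest to.

```python
# ConWea-style disambiguation sketch: assign each occurrence of an ambiguous
# word to the class whose seed-context centroid is most similar (by cosine).
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / ((dot(a, a) ** 0.5) * (dot(b, b) ** 0.5))

# Pretend contextual embeddings for two occurrences of the word "penalty".
occurrences = {
    "the striker scored from the penalty spot": [0.9, 0.1],
    "the court imposed a heavy penalty": [0.1, 0.9],
}
# Centroids of seed-word contexts for each class (also invented).
class_centroids = {"sports": [1.0, 0.0], "law": [0.0, 1.0]}

def disambiguate(vec):
    # Pick the class whose seed-context centroid is most similar.
    return max(class_centroids, key=lambda c: cosine(vec, class_centroids[c]))

senses = {sent: disambiguate(vec) for sent, vec in occurrences.items()}
```

Resolving "penalty" to different classes per occurrence is what lets the contextualised corpus give cleaner weak labels than raw keyword matching.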

Snuba: With weak supervision, users have to rely on imperfect sources of labels like pattern matching and user-defined heuristics. Snuba is a system that automatically generates heuristics that label a subset of the data, and it iteratively repeats this process until a large portion of the unlabelled data is covered. Snuba can generate heuristics in under five minutes and outperform most user-defined heuristics.
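A much-simplified sketch of automatic heuristic generation (invented data; the real Snuba system is considerably richer): enumerate one-feature threshold heuristics, keep the one most accurate on the small labelled set, and apply it to the unlabelled points it covers.

```python
# Snuba-style sketch: generate candidate threshold heuristics over features,
# score them on a small labelled set, and use the best to weakly label data.
labelled = [((0.9, 0.2), 1), ((0.8, 0.4), 1), ((0.1, 0.7), 0), ((0.2, 0.9), 0)]
unlabelled = [(0.85, 0.3), (0.15, 0.8), (0.5, 0.5)]

def make_heuristic(feature_idx, threshold):
    # Label 1 above the threshold, 0 below; abstain near the boundary.
    def h(x):
        v = x[feature_idx]
        if v > threshold + 0.1:
            return 1
        if v < threshold - 0.1:
            return 0
        return None  # abstain on uncertain points
    return h

def score(h):
    # Accuracy on the labelled set, counting only non-abstained predictions.
    preds = [(h(x), y) for x, y in labelled]
    hits = [p == y for p, y in preds if p is not None]
    return sum(hits) / len(hits) if hits else 0.0

# Candidate heuristics over both features and a coarse threshold grid.
candidates = [make_heuristic(i, t) for i in (0, 1) for t in (0.3, 0.5, 0.7)]
best = max(candidates, key=score)
weak_labels = [best(x) for x in unlabelled]
```

Snuba then repeats this loop, generating further heuristics for the points the current set abstains on, until coverage is high enough.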

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.
