Weak Supervision: The Art Of Training ML Models From Noisy Data


A deep learning model’s performance improves as the size of the dataset increases. However, there is a catch: deep learning models have hundreds of millions of parameters, which means they require a large amount of labelled data.

Hand-labelled training sets are expensive and time-consuming (taking months to years) to create. Some datasets call for domain expertise (e.g., medical datasets). More often than not, such labelled datasets cannot even be repurposed for new objectives. Given the costs and inflexibility of hand-labelling, training sets pose a big hurdle in deploying machine learning models.



Enter weak supervision, a branch of machine learning where limited and imprecise sources are used to label large amounts of training data in a supervised setting. In this approach, inexpensive weak labels are used to build a strong predictive model.

Weak supervision

The most common approaches in machine learning are supervised and unsupervised learning. However, there is a whole spectrum of supervision between the two extremes. Weak supervision sits on this spectrum, between fully supervised and unsupervised learning. It can be described as an approach that uses data with noisy labels. These labels are usually generated programmatically, by applying heuristics to signals in the unlabelled data.

In its most common form, weak supervision requires a small amount of labelled training data and a large amount of unlabelled data. The goal is to create labels for the unlabelled data so that it can be used to train the model. There are two prerequisites: the unlabelled data must contain information relevant to the task, and the developer must generate enough correctly labelled data to overcome the noise introduced by the weak supervision approach.
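To make the idea concrete, here is a minimal sketch (not from the article; all data and function names are invented) of labelling an unlabelled corpus with cheap keyword heuristics and combining their votes:

```python
# Weak supervision sketch: a few keyword heuristics each vote on a label or
# abstain, and a majority vote over the non-abstaining heuristics produces a
# (noisy) weak label for each unlabelled document.

ABSTAIN = None

def lf_contains_refund(text):
    # Weak heuristic: mails mentioning "refund" are often complaints.
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_contains_thanks(text):
    # Weak heuristic: mails mentioning "thank" are often praise.
    return "praise" if "thank" in text.lower() else ABSTAIN

def weak_label(text, heuristics):
    """Apply every heuristic and majority-vote the non-abstaining ones."""
    votes = [v for v in (h(text) for h in heuristics) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN  # no heuristic fired; the instance stays unlabelled
    return max(set(votes), key=votes.count)

unlabelled = [
    "I want a refund for this broken item",
    "Thanks so much, great service!",
    "Where is my order?",
]
heuristics = [lf_contains_refund, lf_contains_thanks]
labels = [weak_label(t, heuristics) for t in unlabelled]
```

The resulting labels are noisy, which is why a small hand-labelled set is still needed to estimate and correct for that noise.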

There are three types of weak supervision:

Incomplete supervision: Only a small subset of the training data is labelled, while the rest remains unlabelled.

Inexact supervision: Only coarse-grained labels are given.

Inaccurate supervision: The given labels may or may not be the ground truth. This usually happens when the annotator is careless or the data is too difficult to categorise correctly.
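The three defects can be simulated on a toy label list (all data here is invented for illustration):

```python
# Simulating the three types of weak supervision on a clean 3-class label list.
import random

random.seed(0)
clean = ["cat", "dog", "bird", "cat", "dog", "bird"]

# Incomplete supervision: only a small subset keeps its label.
incomplete = [y if i < 2 else None for i, y in enumerate(clean)]

# Inexact supervision: only coarse-grained labels survive
# (every fine-grained class collapses to "animal").
inexact = ["animal" for _ in clean]

# Inaccurate supervision: some labels are randomly flipped to a wrong class.
classes = ["cat", "dog", "bird"]
inaccurate = [
    random.choice([c for c in classes if c != y]) if random.random() < 0.3 else y
    for y in clean
]
noise_rate = sum(a != b for a, b in zip(clean, inaccurate)) / len(clean)
```

Each variant is a different way the same clean supervision signal can degrade in practice.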

Weak supervision frameworks

Some of the most significant weak supervision frameworks are:

Snorkel: An open-source weak supervision framework from Stanford University. Using a small amount of labelled data and a large amount of unlabelled data, Snorkel lets users write labelling functions in Python over multiple signals in the dataset. The weak signals from the hand-labelled data and from the labelling-function-generated labels are then used to train a generative model. This model produces probabilistic labels that can in turn train the target model.


ASTRA: A weak supervision framework for training deep neural networks. It uses automatically generated, weakly labelled data for tasks where collecting large-scale labelled training data is expensive. ASTRA employs a teacher-student architecture and leverages domain-specific rules, a large amount of unlabelled data, and a small amount of labelled data. The key components of this framework are:


  • Weak rules: Expressed as Python-labelling functions, these are domain-specific rules that rely on heuristics for annotating text instances with weak labels.
  • Student: It is a base model that provides pseudo-labels for all instances.
  • RAN teacher: The Rule Attention Teacher Network aggregates the predictions of multiple weak sources with instance-specific weights to compute a single pseudo-label for each instance.
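The teacher's aggregation step can be sketched as follows. This is a toy version with invented numbers, simplified far beyond the real ASTRA code: each weak rule votes or abstains, each rule carries an instance-specific attention weight, and the weighted rule votes plus the student's own prediction combine into one soft pseudo-label.

```python
# RAN-teacher-style aggregation: instance-specific attention weights over weak
# rules, with the student's prediction as an always-present fallback voter.
def ran_teacher(rule_votes, rule_attention, student_prob):
    """rule_votes: P(POS) in [0,1] per rule, or None when the rule abstains.
    rule_attention: instance-specific weight in [0,1] for each rule.
    student_prob: the student model's own P(POS) for this instance."""
    num = den = 0.0
    for vote, att in zip(rule_votes, rule_attention):
        if vote is None:
            continue  # abstaining rules contribute nothing
        num += att * vote
        den += att
    # The student always votes, with weight 1 in this toy version.
    num += student_prob
    den += 1.0
    return num / den

# One instance: the first and third rules fire, the second abstains.
soft_label = ran_teacher([1.0, None, 0.0], [0.9, 0.5, 0.8], student_prob=0.7)
```

When every rule abstains, the pseudo-label falls back to the student's prediction, which is the behaviour the teacher-student loop relies on for uncovered instances.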

ConWea: This framework provides contextualised weak supervision for text classification. Contextualised representations of word occurrences, together with seed-word information, are used to automatically distinguish multiple interpretations of the same word. This yields a contextualised corpus, which is then used to train the classifier and to expand the seed words iteratively, making the whole weak supervision process fully contextualised.
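A toy sketch of the disambiguation idea (all vectors here are invented; a real system would use contextual embeddings from a language model): the same surface word gets different class associations depending on which class's seed contexts its occurrence vector is closest to.

```python
# ConWea-style disambiguation sketch: assign each occurrence of an ambiguous
# word to the class whose seed-context centroid is most similar (by cosine).
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / ((dot(a, a) ** 0.5) * (dot(b, b) ** 0.5))

# Pretend contextual embeddings for two occurrences of the word "penalty".
occurrences = {
    "the striker scored from the penalty spot": [0.9, 0.1],
    "the court imposed a heavy penalty": [0.1, 0.9],
}
# Centroids of seed-word contexts for each class (also invented).
class_centroids = {"sports": [1.0, 0.0], "law": [0.0, 1.0]}

def disambiguate(vec):
    # Pick the class whose seed-context centroid is most similar.
    return max(class_centroids, key=lambda c: cosine(vec, class_centroids[c]))

senses = {sent: disambiguate(vec) for sent, vec in occurrences.items()}
```

Resolving "penalty" to different classes per occurrence is what lets the contextualised corpus give cleaner weak labels than raw keyword matching.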

Snuba: With weak supervision, users have to rely on imperfect sources of labels like pattern matching and user-defined heuristics. Snuba is a system that automatically generates heuristics that label a subset of the data, and it iteratively repeats this process until a large portion of the unlabelled data is covered. Snuba can generate heuristics in under five minutes and outperform most user-defined heuristics.
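A much-simplified sketch of automatic heuristic generation (invented data; the real Snuba system is considerably richer): enumerate one-feature threshold heuristics, keep the one most accurate on the small labelled set, and apply it to the unlabelled points it covers.

```python
# Snuba-style sketch: generate candidate threshold heuristics over features,
# score them on a small labelled set, and use the best to weakly label data.
labelled = [((0.9, 0.2), 1), ((0.8, 0.4), 1), ((0.1, 0.7), 0), ((0.2, 0.9), 0)]
unlabelled = [(0.85, 0.3), (0.15, 0.8), (0.5, 0.5)]

def make_heuristic(feature_idx, threshold):
    # Label 1 above the threshold, 0 below; abstain near the boundary.
    def h(x):
        v = x[feature_idx]
        if v > threshold + 0.1:
            return 1
        if v < threshold - 0.1:
            return 0
        return None  # abstain on uncertain points
    return h

def score(h):
    # Accuracy on the labelled set, counting only non-abstained predictions.
    preds = [(h(x), y) for x, y in labelled]
    hits = [p == y for p, y in preds if p is not None]
    return sum(hits) / len(hits) if hits else 0.0

# Candidate heuristics over both features and a coarse threshold grid.
candidates = [make_heuristic(i, t) for i in (0, 1) for t in (0.3, 0.5, 0.7)]
best = max(candidates, key=score)
weak_labels = [best(x) for x in unlabelled]
```

Snuba then repeats this loop, generating further heuristics for the points the current set abstains on, until coverage is high enough.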

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.
