Weak Supervision: The Art Of Training ML Models From Noisy Data


A deep learning model’s performance generally improves as the size of the dataset increases. However, there is a catch: deep learning models can have hundreds of millions of parameters, which means they require a large amount of labelled data.

Hand-labelled training sets are expensive and time-consuming to create, taking anywhere from months to years. Some datasets call for domain expertise (e.g. medical datasets). More often than not, such labelled datasets cannot even be repurposed for new objectives. Given the costs and inflexibility of hand-labelling, training sets pose a big hurdle in deploying machine learning models.


Enter weak supervision, a branch of machine learning where limited and imprecise sources can be used to label large amounts of training data in a supervised setting. In this approach, inexpensive weak labels are used to create a strong predictive model.

Weak supervision

The most common approaches in machine learning are supervised and unsupervised learning. However, there is a whole spectrum of supervision between the two extremes. Weak supervision lies between fully supervised learning and semi-supervised learning. It can be described as an approach that uses data with noisy labels. These labels are usually generated programmatically, by applying heuristics to signals in the unlabelled data.

In its most common form, weak supervision requires a small amount of labelled training data and a large amount of unlabelled data. The goal is to create labels for the unlabelled data so that it can be used to train the model. However, there are two prerequisites: the unlabelled data must contain relevant information, and the developer must generate enough correctly labelled data to overcome the noise introduced by the weak supervision approach.

There are three types of weak supervision:

Incomplete supervision: Only a small subset of the training data is labelled; the rest remains unlabelled.

Inexact supervision: Only coarse-grained labels are given.

Inaccurate supervision: The given labels may or may not be the ground truth. This usually happens when the annotator is careless or the data is too difficult to categorise correctly.
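
To make the three types concrete, here is a small Python sketch of what each kind of training set might look like. The sentiment examples, field names and labels are invented purely for illustration and do not come from any real dataset.

```python
# Incomplete supervision: only some examples carry a label (None = unlabelled).
incomplete = [
    ("great film", 1),
    ("terrible plot", 0),
    ("watched it twice", None),
    ("fell asleep halfway", None),
]

# Inexact supervision: one coarse label per document, none per sentence.
inexact = [
    {"sentences": ["the cast is fine", "the ending is superb"], "document_label": 1},
    {"sentences": ["the pacing drags", "the score is forgettable"], "document_label": 0},
]

# Inaccurate supervision: every example is labelled, but some labels are wrong.
# The true labels are listed here only so that the noise rate can be measured.
inaccurate = [
    {"text": "great film", "given": 1, "true": 1},
    {"text": "terrible plot", "given": 1, "true": 0},  # annotator slip
    {"text": "loved every minute", "given": 1, "true": 1},
    {"text": "a waste of time", "given": 0, "true": 0},
]

labelled_fraction = sum(y is not None for _, y in incomplete) / len(incomplete)
noise_rate = sum(r["given"] != r["true"] for r in inaccurate) / len(inaccurate)
print(f"labelled fraction: {labelled_fraction:.2f}, noise rate: {noise_rate:.2f}")
```

In practice the true labels of an inaccurately labelled set are unknown, so the noise rate has to be estimated rather than computed directly.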

Weak supervision frameworks

Some of the most significant weak supervision frameworks are:

Snorkel: An open-source weak supervision framework from a team at Stanford University. Using a small amount of labelled data and a large amount of unlabelled data, Snorkel lets users write labelling functions in Python, each capturing a different signal in the dataset. The weak labels these functions produce, together with the labelled data, are then used to train a generative model. This model produces probabilistic labels that can in turn train the target model.
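
The idea behind labelling functions can be sketched in plain Python. The snippet below deliberately does not use Snorkel's own API, and it replaces the learned generative model with simple vote counting. The spam task, the function names, the keyword rules and the ABSTAIN convention are all assumptions made for illustration.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1  # hypothetical label ids for a toy spam task

# Each labelling function encodes one noisy heuristic and may abstain.
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELLING_FUNCTIONS = [lf_contains_free, lf_contains_link, lf_short_message]

def probabilistic_label(text, n_classes=2):
    """Turn the non-abstaining votes into one probability per class.

    Plain vote counting stands in for a learned label model here.
    """
    votes = [lf(text) for lf in LABELLING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    total = sum(counts.values())
    if total == 0:
        return None  # no heuristic fired, so the example stays unlabelled
    return [counts.get(c, 0) / total for c in range(n_classes)]

unlabelled = [
    "Claim your FREE prize at http://example.com now, limited offer",
    "See you at lunch",
    "Get free tickets today",
    "The quarterly report is attached for your review before Monday",
]
for text in unlabelled:
    print(probabilistic_label(text), "<-", text)
```

The third message shows why probabilistic labels are useful: two heuristics disagree, so the label is split evenly between the classes instead of being forced to one side. A learned label model would additionally weigh each function by its estimated accuracy, which vote counting ignores.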


ASTRA: A weak supervision framework for training deep neural networks. It uses automatically generated weakly labelled data for tasks where collecting large-scale labelled training data is expensive. ASTRA employs a teacher-student architecture and leverages domain-specific rules, a large amount of unlabelled data, and a small amount of labelled data. The key components of this framework are:


  • Weak rules: Domain-specific rules, expressed as Python labelling functions, that rely on heuristics to annotate text instances with weak labels.
  • Student: A base model that provides pseudo-labels for all instances.
  • RAN teacher: The Rule Attention Network (RAN) teacher aggregates the predictions of multiple weak sources with instance-specific weights to compute a single pseudo-label for each instance.
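
The teacher's aggregation step can be illustrated with a minimal sketch. In a real teacher the instance-specific weights would come from a trained attention network. Here they are derived from hand-supplied scores through a softmax, which is an assumption made for illustration. The function names and numbers are hypothetical and are not taken from the ASTRA code base.

```python
import math

def softmax(scores):
    """Normalise raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def aggregate(source_probs, attention_scores):
    """Combine per-source class distributions into a single pseudo-label.

    source_probs: one class distribution per weak source that fired on
        this instance (the rules and the student).
    attention_scores: one raw score per source for this instance. A trained
        network would produce these; here they are passed in by hand.
    """
    weights = softmax(attention_scores)
    n_classes = len(source_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, source_probs))
        for c in range(n_classes)
    ]

# One instance: two rules that disagree, plus the student's prediction.
source_probs = [[1.0, 0.0], [0.0, 1.0], [0.3, 0.7]]

equal = aggregate(source_probs, [0.0, 0.0, 0.0])    # every source trusted equally
trusting = aggregate(source_probs, [5.0, 0.0, 0.0])  # first rule trusted most
print("equal weights:", [round(p, 3) for p in equal])
print("first rule favoured:", [round(p, 3) for p in trusting])
```

Because the weights are computed per instance, the same rule can dominate the pseudo-label on one example and be nearly ignored on another.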

ConWea: This framework provides contextualised weak supervision for text classification. Contextualised representations of word occurrences, together with seed word information, are used to automatically differentiate multiple interpretations of the same word. This helps create a contextualised corpus, which is then used to train the classifier and expand the seed words iteratively. The framework offers a fully contextualised weak supervision process.

Snuba: With weak supervision, users have to rely on imperfect sources of labels such as pattern matching and user-defined heuristics. Snuba is a system that automatically generates heuristics, each of which labels a subset of the data, and repeats the process iteratively until a large portion of the unlabelled data is covered. Snuba can automatically generate heuristics in under five minutes and outperform most user-defined heuristics.
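
The loop of generating a heuristic, labelling a subset and repeating until coverage is high can be sketched as follows. This is a heavily simplified illustration, not Snuba's actual algorithm. It assumes a small labelled set is available to score candidates, uses single-feature threshold rules as the heuristics, and introduces hypothetical parameters (margin, min_accuracy, target_coverage).

```python
ABSTAIN = None

def make_stump(feature, threshold, positive_above, margin):
    """Heuristic that thresholds one feature and abstains near the boundary."""
    def heuristic(x):
        if abs(x[feature] - threshold) < margin:
            return ABSTAIN
        return int((x[feature] > threshold) == positive_above)
    return heuristic

def accuracy(heuristic, labelled):
    """Accuracy on the labelled points where the heuristic does not abstain."""
    fired = [(heuristic(x), y) for x, y in labelled if heuristic(x) is not ABSTAIN]
    return sum(p == y for p, y in fired) / len(fired) if fired else 0.0

def coverage(labels):
    return sum(l is not ABSTAIN for l in labels) / len(labels)

def generate_heuristics(labelled, unlabelled, margin=0.5,
                        min_accuracy=0.7, target_coverage=0.9):
    labels = [ABSTAIN] * len(unlabelled)
    unused = set(range(len(labelled[0][0])))  # features not yet turned into a rule
    heuristics = []
    while unused and coverage(labels) < target_coverage:
        # Candidate rules: one threshold (the feature mean) per unused feature,
        # tried in both directions.
        candidates = []
        for f in sorted(unused):
            threshold = sum(x[f] for x, _ in labelled) / len(labelled)
            for positive_above in (True, False):
                h = make_stump(f, threshold, positive_above, margin)
                candidates.append((accuracy(h, labelled), f, h))
        best_acc, best_f, best_h = max(candidates, key=lambda c: c[0])
        if best_acc < min_accuracy:
            break  # remaining rules are too unreliable to keep
        unused.discard(best_f)
        heuristics.append(best_h)
        # Only points that are still unlabelled receive a label from the new rule.
        for i, x in enumerate(unlabelled):
            if labels[i] is ABSTAIN:
                labels[i] = best_h(x)
    return heuristics, labels

labelled = [((1.0, 1.0), 0), ((2.0, 2.0), 0), ((3.0, 8.0), 0),
            ((8.0, 7.0), 1), ((9.0, 9.0), 1), ((7.0, 9.0), 1)]
unlabelled = [(0.5, 3.0), (7.5, 4.0), (5.2, 9.0), (9.5, 0.0)]

heuristics, labels = generate_heuristics(labelled, unlabelled)
print(len(heuristics), "heuristics, labels:", labels)
```

In this toy run the first rule abstains on the point that sits close to its threshold, and a second, less accurate rule on the other feature is generated to cover it. Earlier rules keep priority, so a later rule never overwrites a label that has already been assigned.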


Copyright Analytics India Magazine Pvt Ltd
