Why You Should Use Weak Supervision For Your Data Labelling Chores

If a model is tasked with identifying dogs in a random collection of content, one can use named-entity recognition to label any content that does not contain the names of dog breeds as unrelated to dogs. Existing knowledge resources can be combined with such simple logic to label training data for a new model.

Such a labelling function frequently returns ‘None’, so only a small part of the data ends up labelled.

Any organisation that looks towards machine learning will have to deal with the challenges that come with data labelling. The need for hand-labelled training datasets becomes obvious right at the beginning of the pipeline.

Data labelling requires the following:

  • Collecting labels
  • Developing label instructions
  • Training subject matter experts to carry out those instructions
  • Dealing with training datasets that fail as applications evolve

In order to generate high-quality labels, researchers at Google introduced Snorkel DryBell, a framework that uses a generative modeling technique.

In industry and other domains there has been an increased affinity towards programmatic or otherwise more efficient but noisier ways of generating training labels, often referred to as weak supervision.

Snorkel DryBell adapts the open-source Snorkel framework to use diverse organizational knowledge resources, such as internal models, ontologies and knowledge graphs, to generate training data for machine learning models at web scale.

Snorkel DryBell integrates with Google's distributed production systems, which lets users develop weak supervision strategies over millions of examples in less than thirty minutes.

Unlike previous weakly supervised approaches, this technique aims to build a complete system that can manage multiple sources of weak supervision with differing accuracies and correlations.

Snorkel DryBell enables writing labelling functions that label training data programmatically.

This technique automatically estimates the accuracies and correlations of the labelling functions without any ground-truth training labels.
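
To give a feel for how accuracies can be recovered without ground truth, here is a minimal sketch. It does not reproduce DryBell's generative model; it swaps in a simpler closed-form trick based on agreement rates between three labelling functions. It assumes the functions vote in {-1, +1}, never abstain, make independent errors and are each better than chance. The names `estimate_accuracies` and `simulate_votes` are hypothetical.

```python
import math
import random
from typing import List, Sequence


def mean_product(votes: Sequence[Sequence[int]], i: int, j: int) -> float:
    """Average of vote_i * vote_j; it is high when functions i and j tend to agree."""
    return sum(row[i] * row[j] for row in votes) / len(votes)


def estimate_accuracies(votes: Sequence[Sequence[int]]) -> List[float]:
    """Estimate the accuracy of exactly three labelling functions from agreements only."""
    m01 = mean_product(votes, 0, 1)
    m02 = mean_product(votes, 0, 2)
    m12 = mean_product(votes, 1, 2)
    # With independent errors, m_ij = a_i * a_j where a_i = 2 * accuracy_i - 1,
    # so each a_i can be solved for from the three pairwise agreement rates.
    a0 = math.sqrt(abs(m01 * m02 / m12))
    a1 = math.sqrt(abs(m01 * m12 / m02))
    a2 = math.sqrt(abs(m02 * m12 / m01))
    return [(1 + a) / 2 for a in (a0, a1, a2)]


def simulate_votes(true_accuracies: Sequence[float], n: int, seed: int = 0) -> List[List[int]]:
    """Generate synthetic votes; the hidden label is used here and then thrown away."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hidden_label = rng.choice([-1, 1])
        rows.append([hidden_label if rng.random() < acc else -hidden_label
                     for acc in true_accuracies])
    return rows


votes = simulate_votes([0.9, 0.8, 0.7], n=20000)
print([round(a, 2) for a in estimate_accuracies(votes)])
```

On synthetic data like this, the estimates should land near the simulated accuracies even though the estimator never sees the hidden labels.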

Source: Google AI

This framework makes use of the following information:

  • Heuristics and rules
  • Taggers and classifiers
  • Aggregate statistics
  • Entity graphs

This information is then used to write labelling functions in a MapReduce-based pipeline.

Each labelling function takes in a data point and either gives a label or just stays silent.
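
As a rough illustration of that contract, the sketch below writes two labelling functions for the dog example from the top of the article. It is plain Python, not the DryBell or Snorkel API, and the breed list, keyword list and function names are all hypothetical.

```python
from typing import Callable, List, Optional, Set

DOG = 1
NOT_DOG = 0
ABSTAIN = None  # the labelling function stays silent

# Hypothetical stand-in for an existing knowledge resource such as an ontology.
DOG_BREEDS = {"labrador", "poodle", "beagle", "pug"}


def tokens(text: str) -> Set[str]:
    """Lower-cased words with trailing punctuation stripped."""
    return {word.strip(".,!?").lower() for word in text.split()}


def lf_no_breed_named(text: str) -> Optional[int]:
    """Label content that names no dog breed as NOT_DOG; otherwise abstain."""
    return ABSTAIN if tokens(text) & DOG_BREEDS else NOT_DOG


def lf_dog_keyword(text: str) -> Optional[int]:
    """Label content that mentions a generic dog word as DOG; otherwise abstain."""
    return DOG if tokens(text) & {"dog", "dogs", "puppy"} else ABSTAIN


def apply_lfs(lfs: List[Callable[[str], Optional[int]]],
              texts: List[str]) -> List[List[Optional[int]]]:
    """Build one row of votes per text, one column per labelling function."""
    return [[lf(text) for lf in lfs] for text in texts]


texts = ["My poodle loves the park.", "The puppy chased a ball.", "Stock prices fell."]
for text, row in zip(texts, apply_lfs([lf_no_breed_named, lf_dog_keyword], texts)):
    print(text, row)
```

The second text gets conflicting votes from the two functions, which is exactly the kind of noise the next step has to resolve.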

Labelling functions are noisy, so their outputs have to be reconciled. To do this, Snorkel DryBell combines the outputs from the labelling functions into a single, confidence-weighted training label for each data point.
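
One simple way to picture a confidence-weighted label is a weighted vote in log-odds space. The sketch below assumes binary labels, independent labelling functions and accuracies that are already known; in DryBell the accuracies are estimated by the generative model, so the fixed numbers and the function name `combine_votes` are illustrative assumptions only.

```python
import math
from typing import Optional, Sequence


def combine_votes(votes: Sequence[Optional[int]],
                  accuracies: Sequence[float],
                  prior: float = 0.5) -> float:
    """Return P(label = 1 | votes), treating the functions as independent."""
    log_odds = math.log(prior / (1 - prior))
    for vote, accuracy in zip(votes, accuracies):
        if vote is None:
            continue  # an abstaining function contributes no evidence
        weight = math.log(accuracy / (1 - accuracy))
        log_odds += weight if vote == 1 else -weight
    return 1 / (1 + math.exp(-log_odds))


# A 90%-accurate function votes 1, a 60%-accurate function votes 0.
print(combine_votes([1, 0], accuracies=[0.9, 0.6]))
# Both functions abstain, so the result falls back to the prior.
print(combine_votes([None, None], accuracies=[0.9, 0.6]))
```

The output is a probability rather than a hard label, so a downstream model can weight confident examples more heavily than contested ones.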

To evaluate topic and product classification, the training labels estimated by this framework are used to train logistic regression classifiers with features similar to those in production.

Training is done using the FTRL (Follow-The-Regularized-Leader) optimisation algorithm, a variant of stochastic gradient descent. The initial step size is 0.2, and training runs for 10,000 iterations on the topic classification task and more than 100K iterations on the product classification task.
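
For readers unfamiliar with FTRL (Follow-The-Regularized-Leader), here is a minimal sketch of logistic regression trained with the commonly described per-coordinate FTRL-Proximal update on probabilistic labels. Only the step size of 0.2 comes from the article; the class name, the `beta`, `l1` and `l2` defaults, the toy features and the number of passes are assumptions, and this is not claimed to match the production setup.

```python
import math
from typing import Dict


class FTRLLogisticRegression:
    """Logistic regression with per-coordinate FTRL-Proximal updates (a sketch)."""

    def __init__(self, alpha: float = 0.2, beta: float = 1.0,
                 l1: float = 0.0, l2: float = 0.0) -> None:
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z: Dict[str, float] = {}  # accumulated adjusted gradients
        self.n: Dict[str, float] = {}  # accumulated squared gradients

    def weight(self, key: str) -> float:
        z = self.z.get(key, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 regularisation keeps this coordinate at zero
        sign = 1.0 if z > 0 else -1.0
        scale = (self.beta + math.sqrt(self.n.get(key, 0.0))) / self.alpha + self.l2
        return -(z - sign * self.l1) / scale

    def predict(self, features: Dict[str, float]) -> float:
        score = sum(self.weight(k) * v for k, v in features.items())
        score = max(min(score, 35.0), -35.0)  # avoid overflow in exp
        return 1 / (1 + math.exp(-score))

    def update(self, features: Dict[str, float], soft_label: float) -> None:
        """One step on cross-entropy loss against a label in [0, 1]."""
        p = self.predict(features)
        for key, value in features.items():
            g = (p - soft_label) * value
            n_old = self.n.get(key, 0.0)
            w_old = self.weight(key)
            sigma = (math.sqrt(n_old + g * g) - math.sqrt(n_old)) / self.alpha
            self.z[key] = self.z.get(key, 0.0) + g - sigma * w_old
            self.n[key] = n_old + g * g


# Toy data: the soft labels play the role of confidence-weighted training labels.
data = [({"bias": 1.0, "mentions_dog": 1.0}, 0.9),
        ({"bias": 1.0, "mentions_dog": 0.0}, 0.1)]
model = FTRLLogisticRegression(alpha=0.2)
for _ in range(2000):
    for features, soft_label in data:
        model.update(features, soft_label)
print(model.predict({"bias": 1.0, "mentions_dog": 1.0}))
print(model.predict({"bias": 1.0, "mentions_dog": 0.0}))
```

Because the gradient of the cross-entropy loss is the same expression for soft and hard labels, the probabilistic labels can be used directly without thresholding them first.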

Key Findings

  • Users can write labelling functions over an unservable feature set and then use the output to train a model over a different, servable feature set.
  • This boosts performance by an average of 52% on the benchmark datasets.
  • Efficient and inexpensive deployment of models.
  • A new kind of transfer learning: transferring domain knowledge between different feature sets.
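
The transfer between feature sets in the findings above can be sketched as follows. The labelling function reads only an unservable signal, while the model is trained and queried only on servable features. The field names, thresholds and the plain SGD trainer are hypothetical choices made for illustration, not details from the framework.

```python
import math
from typing import Dict, List, Optional

Example = Dict[str, Dict[str, float]]


def lf_offline_score(example: Example) -> Optional[float]:
    """Labelling function over an unservable signal (a hypothetical offline model score)."""
    score = example["unservable"]["offline_topic_score"]
    if score > 0.8:
        return 1.0
    if score < 0.2:
        return 0.0
    return None  # abstain on uncertain scores


def predict(weights: Dict[str, float], servable: Dict[str, float]) -> float:
    """Probability from servable features only, as it would run in production."""
    z = sum(weights.get(k, 0.0) * v for k, v in servable.items())
    return 1 / (1 + math.exp(-z))


def train_on_servable(examples: List[Example], epochs: int = 200,
                      lr: float = 0.5) -> Dict[str, float]:
    """Fit logistic regression on servable features using labels from the unservable side."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for example in examples:
            label = lf_offline_score(example)
            if label is None:
                continue  # unlabelled examples are skipped
            p = predict(weights, example["servable"])
            for k, v in example["servable"].items():
                weights[k] = weights.get(k, 0.0) - lr * (p - label) * v
    return weights


examples = [
    {"unservable": {"offline_topic_score": 0.95}, "servable": {"bias": 1.0, "has_word_dog": 1.0}},
    {"unservable": {"offline_topic_score": 0.05}, "servable": {"bias": 1.0, "has_word_dog": 0.0}},
    {"unservable": {"offline_topic_score": 0.50}, "servable": {"bias": 1.0, "has_word_dog": 1.0}},
    {"unservable": {"offline_topic_score": 0.90}, "servable": {"bias": 1.0, "has_word_dog": 1.0}},
    {"unservable": {"offline_topic_score": 0.10}, "servable": {"bias": 1.0, "has_word_dog": 0.0}},
]
weights = train_on_servable(examples)
print(predict(weights, {"bias": 1.0, "has_word_dog": 1.0}))
```

At serving time only `predict` is needed, so the expensive or slow unservable signal never has to be computed for live traffic.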

The transfer of knowledge between feature sets has great potential in medical applications where the datasets are large and vague.

Other challenges surface when dealing with large data, such as data fusion and truth discovery. The aim of data cleaning is to identify and rectify errors in the datasets. The generative models used in the Snorkel DryBell framework can also be applied to these data cleaning challenges.

PS: The story was written using a keyboard.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.