Last updated September 2, 2019

Why You Should Use Weak Supervision For Your Data Labelling Chores

Published on March 25, 2019

by Ram Sagar

If a model is tasked with identifying dog-related content in a random collection, one can use named-entity recognition to label any content that does not contain the name of a dog breed as unrelated to dogs. Existing knowledge resources can be combined with such simple logic to label training data for a new model.

Such a labelling function often returns ‘None’, so only a small part of the data gets labelled.

Any organisation that looks towards machine learning will have to deal with the challenges that come with data labelling. The need for hand-labelled training datasets becomes obvious right at the beginning of the pipeline.

Data labelling requires the following:

Collecting labels
Developing label instructions
Training subject matter experts to carry out those instructions
Dealing with training datasets that fail as applications evolve

To generate high-quality labels from noisy, correlated sources, researchers at Google introduced Snorkel DryBell, a framework that uses a generative modelling technique.

In industry and other domains there has been an increased affinity towards programmatic or otherwise more efficient but noisier ways of generating training labels, often referred to as weak supervision.

Snorkel DryBell adapts the open-source Snorkel framework to use diverse organisational knowledge resources, such as internal models, ontologies and knowledge graphs, to generate training data for machine learning models at web scale.

Snorkel DryBell integrates with Google's distributed production systems and lets users develop weak supervision strategies over millions of examples in less than thirty minutes.

Unlike earlier weakly supervised approaches, this technique aims to build a complete system that can manage multiple sources of weak supervision with diverse accuracies and correlations.

Snorkel DryBell enables writing labelling functions that label training data programmatically.

This technique automatically estimates the accuracies and correlations of the labelling functions, without any ground-truth training labels.

This framework makes use of the following information:

Heuristics and rules
Taggers and classifiers
Aggregate statistics
Entity graphs

This information is then used to write labelling functions in a MapReduce-based pipeline.

Each labelling function takes in a data point and either gives a label or just stays silent.
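As a minimal sketch of this idea, the Python below writes two labelling functions for the dog example. It is an illustration only: the breed list, the keyword list, the function names and the label constants are all hypothetical, plain word matching stands in for a real named-entity recogniser, and none of this is Snorkel DryBell's actual API.

```python
from typing import Callable, List, Optional

DOG, NOT_DOG = 1, 0
ABSTAIN = None  # the labelling function "stays silent"

# Hypothetical knowledge resources (assumptions for this sketch).
DOG_BREEDS = {"beagle", "poodle", "labrador", "pug"}
DOG_KEYWORDS = {"dog", "puppy"}


def lf_no_breed_mentioned(text: str) -> Optional[int]:
    """Vote NOT_DOG when no known breed is named, otherwise abstain."""
    words = set(text.lower().split())
    return ABSTAIN if words & DOG_BREEDS else NOT_DOG


def lf_dog_keyword(text: str) -> Optional[int]:
    """Vote DOG when a dog-related keyword appears, otherwise abstain."""
    words = set(text.lower().split())
    return DOG if words & DOG_KEYWORDS else ABSTAIN


LABELLING_FUNCTIONS: List[Callable[[str], Optional[int]]] = [
    lf_no_breed_mentioned,
    lf_dog_keyword,
]


def apply_lfs(text: str) -> List[Optional[int]]:
    """Collect one vote (or abstention) per labelling function."""
    return [lf(text) for lf in LABELLING_FUNCTIONS]
```

A sentence such as "A puppy played outside" receives conflicting votes from the two functions, which is the kind of noise that a later aggregation step has to resolve.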

Because the labelling functions are noisy, Snorkel DryBell combines their outputs into a single, confidence-weighted training label for each data point.

To evaluate the approach on topic and product classification tasks, the training labels estimated by the framework are used to train logistic regression classifiers with features similar to those in production.

Training uses the FTRL optimisation algorithm, a variant of stochastic gradient descent, with an initial step size of 0.2. The classifiers are trained for 10,000 iterations on the topic classification task and for more than 100K iterations on the product classification task.
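One simple way to picture a confidence-weighted label is a weighted vote. The sketch below is a simplification and not the generative model the framework uses: it assumes binary labels, assumes each labelling function's accuracy is already known (Snorkel DryBell estimates these without ground truth), assumes the functions err independently, and starts from a uniform prior. The function name and the accuracy values in the test are hypothetical.

```python
import math
from typing import Optional, Sequence


def combine_votes(votes: Sequence[Optional[int]],
                  accuracies: Sequence[float]) -> float:
    """Return P(label = 1) from binary votes (1, 0, or None to abstain).

    Each accuracy must lie strictly between 0 and 1.
    """
    log_odds = 0.0  # uniform prior: both labels equally likely
    for vote, acc in zip(votes, accuracies):
        if vote is None:
            continue  # an abstaining function contributes no evidence
        weight = math.log(acc / (1.0 - acc))  # more accurate, larger weight
        log_odds += weight if vote == 1 else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))
```

When every function abstains the result stays at 0.5, and a single vote from a function assumed to be 90% accurate yields a label of 0.9. A downstream model can then be trained on these probabilities rather than on hard labels.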
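The training step can be sketched with the per-coordinate FTRL-Proximal update (follow-the-regularised-leader) applied to logistic regression. This is a toy illustration under stated assumptions, not the production setup: only the step size of 0.2 comes from the text above, while the class name, the other hyperparameters (beta, l1, l2), the two-example dataset and the number of passes are hypothetical. The targets are probabilistic labels between 0 and 1, as a weak supervision pipeline would produce.

```python
import math
from typing import Sequence


class FTRLLogisticRegression:
    """Logistic regression trained with per-coordinate FTRL-Proximal."""

    def __init__(self, n_features: int, alpha: float = 0.2,
                 beta: float = 1.0, l1: float = 0.0, l2: float = 0.0) -> None:
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * n_features  # accumulated adjusted gradients
        self.n = [0.0] * n_features  # accumulated squared gradients

    def _weight(self, i: int) -> float:
        """Closed-form weight for coordinate i given its z and n."""
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 regularisation keeps this weight at zero
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict_proba(self, x: Sequence[float]) -> float:
        score = sum(self._weight(i) * xi for i, xi in enumerate(x))
        score = max(-35.0, min(35.0, score))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, x: Sequence[float], y: float) -> None:
        """One online step; y may be a probabilistic label in [0, 1]."""
        p = self.predict_proba(x)
        for i, xi in enumerate(x):
            g = (p - y) * xi  # gradient of the log loss for weight i
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g


# Toy data: (features with a leading bias term, probabilistic label).
data = [([1.0, 1.0], 0.9), ([1.0, -1.0], 0.1)]
model = FTRLLogisticRegression(n_features=2)
for _ in range(500):
    for x, y in data:
        model.update(x, y)
```

After training, the model should score the first example above 0.5 and the second below it, moving towards the soft targets rather than towards hard 0 or 1 labels.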

Key Findings

Users can write labelling functions over an unservable feature set and then use the output to train a model over a different, servable feature set.
This feature boosts performance by an average of 52% on the benchmark datasets.
Efficient and inexpensive deployment of models.
A new kind of transfer learning: transferring domain knowledge between different feature sets.
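The first and last findings can be pictured with a small sketch. Here "unservable" is taken to mean signals that are available offline but not at serving time, which is an assumption of this example. The `Example` class, the `offline_model_score` key and the thresholds are all hypothetical. The labelling function reads only the unservable signal, while the resulting training rows contain only servable features.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Example:
    servable: List[float]  # cheap features available at serving time
    unservable: Dict[str, float] = field(default_factory=dict)  # offline only


def lf_offline_score(ex: Example) -> Optional[int]:
    """Label from an offline-only signal; abstain when it is uncertain."""
    score = ex.unservable.get("offline_model_score")
    if score is None:
        return None
    if score >= 0.8:
        return 1
    if score <= 0.2:
        return 0
    return None


def build_training_set(examples: List[Example]) -> List[Tuple[List[float], int]]:
    """Pair servable features with labels derived from unservable signals."""
    rows = []
    for ex in examples:
        label = lf_offline_score(ex)
        if label is not None:
            rows.append((ex.servable, label))
    return rows
```

A model trained on these rows never sees the offline signal, so it can be deployed cheaply while still benefiting from the knowledge that signal encodes.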

The transfer of knowledge between feature sets has great potential in medical applications where the datasets are large and vague.

Other challenges surface when dealing with large datasets, such as data fusion and truth discovery. The aim of data cleaning is to identify and rectify the errors in the datasets. The generative models used in the Snorkel DryBell framework can also be applied to these data cleaning challenges.



Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
