Big Data To Good Data: Andrew Ng Urges ML Community To Be More Data-Centric And Less Model-Centric

“If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.”

Andrew Ng

Progress in machine learning owes a lot to teams downloading models and trying to do better on standard benchmark datasets. The bulk of the time is spent on improving the code, the model or the algorithms. “What I’m finding is that for a lot of problems, it’d be useful to shift our mindset toward not just improving the code but in a more systematic way of improving the data,” said Andrew Ng.

Last week, Andrew Ng drew the ML community’s attention towards MLOps, a field dealing with building and deploying machine learning models more systematically. He explained how machine learning development could accelerate if teams put more emphasis on being data-centric than on being model-centric. Traditional software is powered by code, whereas AI systems are built using both code (models + algorithms) and data. “When a system isn’t performing well, many teams instinctually try to improve the code. But for many practical applications, it’s more effective instead to focus on improving the data,” he said.

Progress in machine learning, says Andrew Ng, has been driven by efforts to improve performance on benchmark datasets. The common practice amongst researchers is to hold the data fixed while trying to improve the code. But when the dataset is modest in size (<10,000 examples), Andrew Ng suggests that ML teams will make faster progress, provided the dataset is of good quality.

Improving code vs improving data quality (Source: Deeplearning.AI)


It is commonly assumed that 80 percent of machine learning is data cleaning. If 80 percent of our work is data preparation, asks Andrew Ng, then why is ensuring data quality not treated as a matter of the utmost importance for a machine learning team?

Andrew Ng mentioned how everyone jokes that ML is 80% data preparation, but no one seems to care. A quick look at arXiv gives an idea of the direction ML research is going. There is unprecedented competition around beating the benchmarks. If Google has BERT, then OpenAI has GPT-3. But these fancy models account for only 20% of a business problem. What differentiates a good deployment is the quality of data; everyone can get their hands on pre-trained models or licensed APIs.

Source: Paper by Paleyes et al.

According to a study done by Cambridge researchers, the most important yet often ignored problem is data dispersion. The problem arises when data is streamed from different sources, which may have different schemas, different conventions, and their own ways of storing and accessing the data. Combining this information into a single dataset suitable for machine learning is a tedious process for ML engineers.
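
As a rough illustration of what combining dispersed sources involves, here is a minimal Python sketch. The two sources, their field names, units, and date formats are all hypothetical and are not taken from the study; the point is only that each source needs its own mapping onto one shared schema before the rows can be used together.

```python
from datetime import datetime

# Hypothetical records from two sources that describe the same kind of event
# with different field names, units, and date conventions.
source_a = [{"id": "a1", "temp_c": 21.5, "ts": "2021-03-24"}]
source_b = [{"record_id": "b7", "temperature_f": 70.7, "date": "24/03/2021"}]


def from_source_a(row):
    """Map a source A row onto the shared schema."""
    return {
        "id": row["id"],
        "temp_c": float(row["temp_c"]),
        "date": datetime.strptime(row["ts"], "%Y-%m-%d").date(),
        "source": "a",
    }


def from_source_b(row):
    """Map a source B row onto the shared schema, converting Fahrenheit to Celsius."""
    return {
        "id": row["record_id"],
        "temp_c": round((float(row["temperature_f"]) - 32) * 5 / 9, 2),
        "date": datetime.strptime(row["date"], "%d/%m/%Y").date(),
        "source": "b",
    }


def combine(rows_a, rows_b):
    """Return one list of rows that all share the same keys and units."""
    return [from_source_a(r) for r in rows_a] + [from_source_b(r) for r in rows_b]


dataset = combine(source_a, source_b)
```

Keeping one small mapping function per source makes each convention explicit, so a new source means one new function rather than changes scattered through the pipeline.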

While smaller datasets have trouble with noisy data, larger volumes of data can make labelling difficult. Access to experts can be another bottleneck for collecting high-quality labels. According to experts, lack of access to high-variance data is one of the main challenges when moving machine learning solutions from the lab environment to the real world.

Source: Deeplearning.AI

A consumer internet software company with many users has a dataset with a large number of training examples. Now imagine deploying AI in a different setting, such as agriculture or healthcare, where there aren’t enough data points. You cannot expect to have a million tractors!

So, here are a few rules of thumb Andrew Ng has proposed to help deploy ML efficiently:

  • The most important task of MLOps is to make high-quality data available.
  • Labelling consistency is key. For example, check how your labellers are using the bounding boxes. There can be multiple ways of labelling, and even if each is good on its own, a lack of consistency can degrade the outcome.
  • Systematic improvement of data quality on a basic model is better than chasing the state-of-the-art models with low-quality data.
  • In case of errors during training, take a data-centric approach.
  • With a data-centric view, there is significant room for improvement in problems with smaller datasets (<10k examples).
  • When working with smaller datasets, tools and services to promote data quality are critical.
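
The bounding-box consistency check mentioned in the list above could be sketched as follows. This is only one possible approach, not a method described by Andrew Ng: it assumes two labellers have each drawn one box per image, compares the boxes with intersection over union (IoU), and flags images where the overlap is low. The function names, the sample annotations, and the 0.5 threshold are all illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def flag_inconsistent(labels_1, labels_2, threshold=0.5):
    """Return image ids where the two labellers' boxes overlap less than threshold."""
    return sorted(
        image_id
        for image_id in labels_1.keys() & labels_2.keys()
        if iou(labels_1[image_id], labels_2[image_id]) < threshold
    )


# Hypothetical annotations: one box per image from each labeller.
labeller_1 = {"img1": (10, 10, 50, 50), "img2": (0, 0, 20, 20)}
labeller_2 = {"img1": (12, 11, 52, 49), "img2": (0, 0, 60, 60)}

to_review = flag_inconsistent(labeller_1, labeller_2)
```

Here `img2` ends up in `to_review` because one labeller drew a tight box and the other a much larger one. Flagged images are candidates for a shared labelling guideline rather than proof that either labeller is wrong.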

According to Andrew Ng, good data is defined consistently, covers all edge cases, has timely feedback from production data and is sized appropriately. He advised against counting on engineers to chance upon the best way to improve a dataset. Instead, he hopes the ML community will develop MLOps tools that make building high-quality datasets and AI systems repeatable and systematic. He also said MLOps is a nascent field, and going forward, the most important objective of MLOps teams should be to ensure a high-quality and consistent flow of data throughout all stages of a project.
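
Two of these properties lend themselves to simple automated checks. The sketch below is an assumption about how such checks might look, not a tool from the talk: it treats "defined consistently" as "identical inputs should not carry different labels" and uses a per-class count as a crude proxy for edge-case coverage. The dataset, the helper names, and the `min_count` threshold are hypothetical.

```python
from collections import Counter, defaultdict


def conflicting_labels(examples):
    """Return inputs that appear more than once with different labels."""
    seen = defaultdict(set)
    for text, label in examples:
        seen[text].add(label)
    return {text: sorted(labels) for text, labels in seen.items() if len(labels) > 1}


def rare_classes(examples, min_count=2):
    """Return labels with fewer than min_count examples."""
    counts = Counter(label for _, label in examples)
    return sorted(label for label, n in counts.items() if n < min_count)


# Hypothetical (input, label) pairs for a small defect-inspection dataset.
examples = [
    ("scratch on left edge", "defect"),
    ("scratch on left edge", "ok"),
    ("clean surface", "ok"),
    ("dent near corner", "defect"),
    ("hairline crack", "crack"),
]

conflicts = conflicting_labels(examples)
rare = rare_classes(examples)
```

In this toy dataset the duplicated "scratch on left edge" input is reported as a conflict and the "crack" class as under-represented, which are the kinds of findings a team could act on before touching the model.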

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
