“If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.” – Andrew Ng
Progress in machine learning owes a lot to teams downloading models and trying to do better on standard benchmark datasets. The bulk of the time is spent on improving the code, the model or the algorithms. “What I’m finding is that for a lot of problems, it’d be useful to shift our mindset toward not just improving the code but in a more systematic way of improving the data,” said Andrew Ng.
Last week, Andrew Ng drew the ML community’s attention towards MLOps, a field dealing with building and deploying machine learning models more systematically. Andrew Ng explained how machine learning development could accelerate if more emphasis were placed on being data-centric rather than model-centric. Traditional software is powered by code, whereas AI systems are built using both code (models + algorithms) and data. “When a system isn’t performing well, many teams instinctually try to improve the code. But for many practical applications, it’s more effective instead to focus on improving the data,” he said.
Progress in machine learning, says Andrew Ng, has been driven by efforts to improve performance on benchmark datasets. The common practice among researchers is to hold the data fixed while trying to improve the code. But when the dataset size is modest (<10,000 examples), Andrew Ng suggests ML teams will make faster progress by improving the data, provided its quality is high.
Improving code vs improving data quality (Source: Deeplearning.AI)
It is commonly assumed that 80 percent of machine learning is data cleaning. If 80 percent of our work is data preparation, asks Andrew Ng, then why are we not treating data quality as a matter of the utmost importance for a machine learning team?
Andrew Ng mentioned how everyone jokes that ML is 80% data preparation, but no one seems to care. A quick look at arXiv gives an idea of the direction ML research is heading. There is unprecedented competition around beating the benchmarks: if Google has BERT, then OpenAI has GPT-3. But these fancy models make up only 20% of a business problem. What differentiates a good deployment is the quality of data, since anyone can get their hands on pre-trained models or licensed APIs.
Source: Paper by Paleyes et al.
According to a study by Cambridge researchers, one of the most important yet often ignored problems is data dispersion. It arises when data is streamed from multiple sources, each of which may have different schemas, different conventions, and its own way of storing and accessing the data. Combining this information into a single dataset suitable for machine learning is a tedious process for ML engineers.
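In practice, taming dispersion means normalising each source to one target schema before combining anything. A minimal sketch with pandas, using two hypothetical sources (the column names, date conventions, and country-code mapping are all assumptions for illustration):

```python
import pandas as pd

# Two hypothetical sources describing the same customers,
# with different schemas and conventions.
crm = pd.DataFrame({
    "CustomerID": [1, 2],
    "SignupDate": ["2021-03-01", "2021-03-15"],   # ISO date strings
    "Country": ["US", "GB"],
})
billing = pd.DataFrame({
    "cust_id": [1, 2],
    "signup": ["01/03/2021", "15/03/2021"],       # day-first strings
    "country_code": ["USA", "GBR"],
})

def normalise_crm(df):
    """Map the CRM schema onto the shared target schema."""
    return pd.DataFrame({
        "customer_id": df["CustomerID"],
        "signup_date": pd.to_datetime(df["SignupDate"]),
        "country": df["Country"],
    })

def normalise_billing(df):
    """Map the billing schema onto the shared target schema."""
    iso2 = {"USA": "US", "GBR": "GB"}  # 3-letter -> 2-letter codes
    return pd.DataFrame({
        "customer_id": df["cust_id"],
        "signup_date": pd.to_datetime(df["signup"], dayfirst=True),
        "country": df["country_code"].map(iso2),
    })

# Only after both sources share one schema is concatenation safe.
combined = pd.concat(
    [normalise_crm(crm), normalise_billing(billing)],
    ignore_index=True,
).drop_duplicates()
```

The point of the per-source `normalise_*` functions is that each source's quirks stay isolated in one place, instead of leaking into every downstream training script.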
While smaller datasets suffer from noisy data, larger volumes of data can make labelling difficult. Access to experts can be another bottleneck for collecting high-quality labels. According to experts, lack of access to high-variance data is one of the main challenges in moving machine learning solutions from the lab environment to the real world.
A consumer internet software company with many users has datasets with vast numbers of training examples. Imagine deploying AI in a different setting, such as agriculture or healthcare, where there aren’t enough data points. You cannot expect to have a million tractors!
So, here are a few rules of thumb Andrew Ng has proposed to help deploy ML efficiently:
- The most important task of MLOps is to make high-quality data available.
- Labelling consistency is key. For example, check how your labellers are drawing bounding boxes. There can be multiple valid ways of labelling, and even if each is good on its own, a lack of consistency can degrade the outcome.
- Systematic improvement of data quality on a basic model is better than chasing the state-of-the-art models with low-quality data.
- In case of errors during training, take a data-centric approach.
- With a data-centric view, there is significant room for improvement in problems with smaller datasets (<10k examples).
- When working with smaller datasets, tools and services to promote data quality are critical.
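The bounding-box consistency check above can be made concrete with intersection-over-union (IoU): when two labellers box the same object, a low IoU between their boxes signals inconsistent labelling conventions. A minimal sketch (the box coordinates and the 0.8 threshold are assumptions for illustration):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical boxes from two labellers for the same object.
labeller_1 = (10, 10, 50, 50)
labeller_2 = (12, 12, 55, 48)

# Flag the example for review when agreement falls below a threshold.
IOU_THRESHOLD = 0.8
agreement = iou(labeller_1, labeller_2)
needs_review = agreement < IOU_THRESHOLD
```

Running such a check over a sample of doubly-labelled examples surfaces systematic disagreements early, before inconsistent boxes quietly degrade the model.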
According to Andrew Ng, good data is defined consistently, covers all edge cases, has timely feedback from production data and is sized appropriately. He advised against counting on engineers to chance upon the best way to improve a dataset. Instead, he hopes the ML community will develop MLOps tools that help make high-quality datasets and AI systems that are repeatable and systematic. He also said MLOps is a nascent field, and going forward, the most important objective of the MLOps teams should be to ensure a high-quality and consistent flow of data throughout all stages of a project.
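Parts of Ng's definition of good data (consistent labelling, appropriate size) can be turned into automated dataset checks rather than left to chance. A minimal sketch of such an audit; the function, thresholds, and toy data are all assumptions for illustration, not a tool Ng describes:

```python
from collections import defaultdict

def audit_dataset(examples, min_size=1000):
    """Run basic quality checks over (input, label) pairs.

    Returns a dict of findings; the thresholds are illustrative only.
    """
    findings = {}

    # Check 1: appropriately sized for the task.
    findings["too_small"] = len(examples) < min_size

    # Check 2: consistently defined - identical inputs must not
    # carry conflicting labels.
    labels_by_input = defaultdict(set)
    for x, y in examples:
        labels_by_input[x].add(y)
    findings["inconsistent_inputs"] = [
        x for x, ys in labels_by_input.items() if len(ys) > 1
    ]
    return findings

# Toy data: the same input labelled two different ways.
data = [("photo_001", "cat"), ("photo_001", "dog"), ("photo_002", "car")]
report = audit_dataset(data, min_size=2)
```

Checks like these are exactly the kind of repeatable, systematic MLOps tooling Ng hopes the community will build, replacing ad hoc hunting for dataset problems.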