Best Practices On Setting Up Development And Test Sets For ML, According To Andrew Ng

The availability of data and increased computational power have been the biggest drivers of artificial intelligence. Google’s TensorFlow has also played a large role in popularising machine learning, as it lets developers build neural networks without implementing every low-level operation themselves. It supports multiple languages, so developers can build ML models in Python and use them from other languages as well.

This article is based on Andrew Ng’s free ebook, Machine Learning Yearning, in which he gives technical direction for machine learning projects. One of the key aspects he discusses is how to set up the development and test sets.

In the book, Ng discusses what happens when a team deploys a classifier in an app after testing it only on the data it collected. For example, you build a large training set by downloading pictures of cats (positive examples) and non-cats (negative examples) from different websites. The dataset is then split 70 percent to 30 percent into training and test sets. Using this data, you build a cat detector that works well on both the training and test sets. But when the classifier is deployed in a mobile app, it performs poorly, because the pictures users take with their phones differ in nature from the website images it was trained and tested on.

Setting Up Development And Test Sets

Ng emphasises that working on machine learning applications is hard enough, and that mismatched development and test sets add uncertainty about whether improving on the development set distribution also improves test set performance. As a lesson for beginners, he states that mismatched development and test sets make it harder to figure out what is and isn’t working.

Ng affirms that it is an important research problem to develop learning algorithms that are trained on one distribution and generalise well to another. But if your goal is to make progress on a specific machine learning application rather than make research progress, he recommends choosing development and test sets that are drawn from the same distribution.
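One way to picture this recommendation is the cat-detector example: the development and test sets are both sampled from the images the app will actually see. The sketch below is only an illustration. The function name, the file names and the decision to send leftover mobile images to the training set are assumptions made here, not details from the book.

```python
import random


def build_splits(web_images, mobile_images, dev_size, test_size, seed=0):
    """Draw dev and test sets from the same (target) distribution.

    Assumption: `mobile_images` stands for the data the app will see in
    production, and `web_images` for the downloaded training data.
    """
    if dev_size + test_size > len(mobile_images):
        raise ValueError("not enough target-distribution examples")

    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(mobile_images)
    rng.shuffle(shuffled)

    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    # Illustrative choice: unused mobile images join the training data.
    train = list(web_images) + shuffled[dev_size + test_size:]
    return train, dev, test


# Hypothetical file names, for illustration only.
web = [f"web_{i}.jpg" for i in range(1000)]
mobile = [f"mobile_{i}.jpg" for i in range(100)]

train, dev, test = build_splits(web, mobile, dev_size=20, test_size=20)
print(len(train), len(dev), len(test))
```

Because the development and test sets come from the same shuffled pool, a gain measured on one should be expected to carry over to the other.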

How Large Should The Development/Test Sets Be?

The development set should be large enough to detect differences between the algorithms one is working on, states Ng. He cites an example: if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, a development set of 100 examples would not be able to detect this 0.1% difference. Ng notes that a 100-example development set is small compared with those used in other machine learning problems. Development sets with sizes from 1,000 to 10,000 examples are common. With 10,000 examples, you stand a good chance of detecting an improvement of 0.1%.
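The arithmetic behind these numbers can be sketched in a few lines of Python. The function names are hypothetical, and the binomial standard error is a rough gauge added here as an assumption. It is not a test the book prescribes.

```python
import math


def accuracy_resolution(n_examples: int) -> float:
    """Change in accuracy caused by a single example flipping its outcome."""
    return 1.0 / n_examples


def binomial_standard_error(accuracy: float, n_examples: int) -> float:
    """Rough noise level of one accuracy estimate (assumes independent examples)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


gap = 0.001  # the 0.1% difference between classifiers A and B

for n in (100, 1_000, 10_000, 100_000):
    resolution = accuracy_resolution(n)
    std_err = binomial_standard_error(0.90, n)
    examples_in_gap = gap * n  # how many examples the 0.1% gap amounts to
    print(
        f"n={n:>7,}  one example = {resolution:.4%}  "
        f"0.1% gap = {examples_in_gap:g} examples  "
        f"std. error = {std_err:.4%}"
    )
```

On 100 examples a single prediction moves accuracy by a full percentage point, so a 0.1% gap cannot even be represented. On 10,000 examples the same gap corresponds to 10 predictions. The standard error describes a single accuracy estimate. Because both classifiers are scored on the same examples, the noise in their difference is typically smaller.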

For mature and important applications such as advertising, web search and product recommendations, Ng, formerly chief scientist at Baidu and a founder of the Google Brain project, talks about teams that are highly motivated to eke out even a 0.01% improvement, since it has a direct impact on the company’s profits. In this case, the development set could be much larger than 10,000 examples in order to pick up even the smallest of improvements.

What should be the size of the test set? It should be large enough to give high confidence in the overall performance of the system. One popular heuristic has been to use 30% of your data for the test set. This works well when you have a modest number of examples, say 100 to 10,000. But in the age of big data, where machine learning problems sometimes involve more than a billion examples, the fraction of data allocated to development and test sets has been shrinking, even as the absolute number of examples in those sets has been growing. Ng emphasises that there is no need for development or test sets to be larger than what is required to evaluate the performance of your algorithms.
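A quick calculation shows how the held-out fraction shrinks as the dataset grows. This is a minimal sketch: the function name and the fixed sizes of 10,000 development and 10,000 test examples are assumptions chosen for illustration, not figures Ng prescribes.

```python
def heldout_fraction(n_total: int, dev_size: int, test_size: int) -> float:
    """Fraction of the dataset set aside when dev/test sizes are fixed."""
    if dev_size + test_size > n_total:
        raise ValueError("dev and test sets exceed the available data")
    return (dev_size + test_size) / n_total


DEV_SIZE = 10_000   # assumed size, for illustration only
TEST_SIZE = 10_000  # assumed size, for illustration only

for n_total in (100_000, 1_000_000, 1_000_000_000):
    fraction = heldout_fraction(n_total, DEV_SIZE, TEST_SIZE)
    thirty_percent = int(0.30 * n_total)  # what the 30% heuristic would hold out
    print(
        f"{n_total:>13,} examples: fixed-size hold-out = {fraction:.4%}, "
        f"30% heuristic = {thirty_percent:,} test examples"
    )
```

With a billion examples, the 30% heuristic would reserve 300 million examples for testing, far more than is needed to measure performance, while a fixed-size hold-out leaves nearly all of the data for training.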

Ng Recommends That Teams Should:

  • Choose development and test sets that reflect the data they expect to get in the future and want to do well on
  • Avoid simply setting aside 30% of the available data as the test set, especially if the future data (mobile phone images) is expected to differ in nature from the training set (website images)
  • Make the development and test sets large enough to give an accurate estimate of the model’s performance

On best practices for splitting test and development datasets, a Stanford tutorial notes that academic datasets often come with a ready-made train/test split, so that different models can be compared on a common test set. You will therefore have to build the train/development split yourself before beginning your project.
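Carving a development set out of a provided training set can be sketched as follows. The function name, the 10% fraction and the fixed seed are assumptions made for illustration. The tutorial does not specify them here.

```python
import random


def carve_dev_set(train_examples, dev_fraction=0.1, seed=42):
    """Split a provided training set into (train, dev), leaving the test set untouched."""
    rng = random.Random(seed)  # fixed seed so the split is the same on every run
    indices = list(range(len(train_examples)))
    rng.shuffle(indices)

    n_dev = max(1, int(len(train_examples) * dev_fraction))
    dev_indices = set(indices[:n_dev])

    train = [ex for i, ex in enumerate(train_examples) if i not in dev_indices]
    dev = [ex for i, ex in enumerate(train_examples) if i in dev_indices]
    return train, dev


# Stand-in for a dataset that ships with only a train/test split.
provided_train = list(range(1000))

new_train, dev = carve_dev_set(provided_train, dev_fraction=0.1)
print(len(new_train), len(dev))
```

Fixing the seed matters in practice: if the split changes between runs, results on the development set are no longer comparable across experiments.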

Data Collection

Another key tip is that teams should define the data collection process as part of their machine learning strategy. Knowing what they want to predict helps them outline what data needs to be collected. By and large, the general recommendation for beginners is to reduce the complexity of the data by understanding exactly what type of data is needed. Many business problems can be solved with simple segmentation, so it is important to understand the task or business problem and to pick the right algorithm for it. ML algorithms fall into five major categories: cluster analysis, classification, ranking, regression and generation. Segmenting an audience, for example, falls under cluster analysis.


Richa Bhatia
Richa Bhatia is a seasoned journalist with six years’ experience in reportage and news coverage and has had stints at Times of India and The Indian Express. She is an avid reader, mum to a feisty two-year-old and loves writing about the next-gen technology that is shaping our world.
