Andrew Ng Announces The Launch Of NeurIPS Data-Centric AI Workshop

If 80 per cent of machine learning work is data preparation, then ensuring data quality is the most important work of a machine learning team

DeepLearning.AI’s Andrew Ng recently announced the launch of the NeurIPS Data-Centric AI workshop. The workshop is expected to showcase some of the best academic research work related to data-centric AI. Academic researchers and practitioners can submit their research papers on or before September 30, 2021.

The organising committee includes Google Research’s Lora Aroyo, Stanford University professor Cody Coleman, Landing AI’s Greg Diamos, Harvard University professor Vijay Janapa Reddi, Eindhoven University of Technology researcher Joaquin Vanschoren, and Google’s machine learning product manager, Sharon Zhou.

What is Data-Centric AI

Data-Centric AI, or DCAI, represents the recent shift in focus from modelling to the underlying data used to train and evaluate models. DCAI aims to address the gap in tooling, best practices, and infrastructure for managing data in modern ML systems. It also looks to offer high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets cost-effective and seamless.

With this event, the team strives to cultivate the DCAI community into a vibrant interdisciplinary field and tackle practical data problems. These problems include data collection/generation, data labelling, data preprocessing/augmentation, data quality evaluation, data debt, and data governance. The team believes that many of these areas are still in the early stages and hopes to close the gaps by bringing the ML community together.

Call for Papers 

The journey of building and using datasets for AI systems is often artisanal: painstaking and expensive. The ML community lacks high-productivity, efficient open data engineering tools. Accelerating dataset creation and iteration, while increasing the efficiency of use and reuse by democratising data engineering and evaluation, remains a core challenge even to this day.

“If 80 per cent of machine learning work is data preparation, then ensuring data quality is the most important work of an ML team and therefore a vital research area,” said the NeurIPS DCAI team. Further, they said human-labelled data has increasingly become the fuel and compass of AI-based software systems, while innovative efforts have mostly focused on models and code.

However, in recent years, there has been an increased focus on the scale, speed, and cost of building and improving datasets, which has, in turn, affected quality. Major research work in this area includes ‘Response-based Learning for Grounded Machine Translation,’ ‘Crowdsourcing with Fairness, Diversity and Budget Constraints,’ ‘Excavating AI,’ ‘Data Excellence: Better Data for Better AI,’ and ‘State of the Art: Reproducibility in Artificial Intelligence,’ among others.

“We need a framework for excellence in ‘data engineering’ that does not yet exist,” said the NeurIPS DCAI team, and noted that aspects like maintainability, reproducibility, reliability, validity and fidelity of datasets are often overlooked when releasing the dataset into the market. In this event, the team plans to highlight examples, case studies, and methodologies for excellence in data collection. 

The NeurIPS DCAI team said that building an active research community focused on data-centric AI is critical for defining the core problems and creating ways to measure progress in machine learning through data quality tasks.


Interested candidates can submit papers on topics that include, but are not limited to, the following:

New Datasets in areas: 

  • Speech, vision, manufacturing, medical, recommendation/personalisation 
  • Science 

Tools and methodologies that

  • Quantify and accelerate time to source high-quality data 
  • Ensure data is labelled consistently, such as label consensus 
  • Improve data quality more systematically 
  • Automate the generation of high quality supervised learning training data from low-quality resources, such as forced alignment in speech recognition 
  • Produce uniform and low noise data samples, or remove labelling noise or inconsistencies from existing data 
  • Control what goes into the dataset and make high-level edits efficiently to very large datasets, like adding new words, languages, etc. 
  • Provide search techniques for finding suitably licensed datasets based on public resources 
  • Create training datasets for small data problems or rare classes in the long tail of big data problems 
  • Incorporate timely feedback from production systems into datasets 
  • Understand dataset coverage of important classes and edit datasets to cover newly identified important cases 
  • Import dataset by allowing easy combination and composition of existing datasets
  • Export datasets by making the data consumable for models and by interfacing with model training and inference systems, such as WebDataset 
  • Enable composition of dataset tools like MLCube, Docker, Airflow 
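
As an illustration of one item in the list above, label consensus, here is a minimal Python sketch. It is not taken from the workshop; it assumes each item was labelled by several annotators and uses a simple majority vote. The function name `label_consensus`, the `min_agreement` threshold, and the sample annotations are all hypothetical.

```python
from collections import Counter


def label_consensus(annotations, min_agreement=0.6):
    """Majority-vote consensus per item.

    Returns {item_id: (label_or_None, agreement)}. The label is None when
    the share of annotators backing the top label is below min_agreement,
    which flags the item for review or relabelling.
    """
    results = {}
    for item_id, labels in annotations.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        accepted = top_label if agreement >= min_agreement else None
        results[item_id] = (accepted, agreement)
    return results


# Hypothetical annotations: three annotators per image.
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "dog"],
    "img_003": ["cat", "dog", "bird"],
}
consensus = label_consensus(annotations)
for item_id, (label, agreement) in consensus.items():
    print(item_id, label, round(agreement, 2))
```

Items that come back with no accepted label are the ones where annotators disagree most, and they are natural candidates for clearer labelling guidelines.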

Algorithms for working with limited labelled data and improving label efficiency 

  • Data selection techniques like active learning and core-set selection for identifying the most valuable examples to label 
  • Semi-supervised learning, few-shot learning, and weak supervision techniques for maximising the power of limited labelled data 
  • Self-supervised learning and transfer learning approaches for developing powerful representations used for many downstream tasks with limited labelled data
  • Novelty and drift detection to identify and spot when more data needs to be labelled 
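
To make the data selection item above more concrete, here is a minimal Python sketch of uncertainty sampling, one common active learning strategy. It is an illustration only, not a method described by the workshop organisers. It assumes a model has already produced class probabilities for each unlabelled example; `select_for_labelling`, `budget`, and the sample predictions are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labelling(predictions, budget):
    """Return the ids of the `budget` examples the model is least sure about.

    predictions maps example id -> list of predicted class probabilities.
    Higher entropy means a more uncertain prediction.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for four unlabelled examples.
predictions = {
    "a": [0.98, 0.01, 0.01],
    "b": [0.40, 0.35, 0.25],
    "c": [0.70, 0.20, 0.10],
    "d": [0.34, 0.33, 0.33],
}
to_label = select_for_labelling(predictions, budget=2)
print(to_label)
```

Here the labelling budget goes to the examples with near-uniform predictions, while confidently classified examples such as "a" are skipped.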

Responsible AI development: 

  • Fairness, bias, diversity evaluation and analysis for dataset and algorithms/modelling 
  • Tools for ‘green AI hardware-software system’ design and evaluation 
  • Scalable, reliable training systems and methods 
  • Tools, methodologies, and techniques for private, secure ML training 
  • Efforts towards reproducible AI (data cards, model cards, etc.)

Instructions for submitting papers

  • Researchers can submit short papers (1-2 pages) and long papers (4 pages), addressing one or more of the topics  
  • Papers need to be formatted as per NeurIPS 2021 guidelines 
  • Papers will be peer-reviewed by the programme committee 
  • Accepted papers will be presented as lightning talks during the workshop 


  • Early submission deadline: 17 September 2021
  • Submission deadline: 30 September 2021 
  • Notification of acceptance: 22 October 2021 
  • Workshop: 14 December 2021 

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
