Andrew Ng Announces The Launch Of NeurIPS Data-Centric AI Workshop

If 80 per cent of machine learning work is data preparation, then ensuring data quality is the most important work of a machine learning team
DeepLearning.AI’s Andrew Ng recently announced the launch of the NeurIPS Data-Centric AI workshop. The workshop is expected to showcase some of the best academic research work related to data-centric AI. Academic researchers and practitioners can submit their research papers on or before September 30, 2021.

The organising committee includes Google Research’s Lora Aroyo, Stanford University professor Cody Coleman, Landing AI’s Greg Diamos, Harvard University professor Vijay Janapa Reddi, Eindhoven University of Technology researcher Joaquin Vanschoren, and Google’s machine learning product manager, Sharon Zhou.


What is Data-Centric AI?

Data-Centric AI, or DCAI, represents the recent shift in focus from modelling to the underlying data used to train and evaluate models. DCAI aims to address the gap in tooling, best practices, and infrastructure for managing data in modern ML systems. It also aims to offer high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets cost-effective and seamless. 

With this event, the team strives to cultivate the DCAI community into a vibrant interdisciplinary field and to tackle practical data problems, including data collection/generation, data labelling, data preprocessing/augmentation, data quality evaluation, data debt, and data governance. The team believes many of these areas are still in their early stages and hopes to bridge the gaps by bringing the ML community together. 

Call for Papers 

The journey of building and using datasets for AI systems is often artisanal — painstaking and expensive. The ML community lacks high-productivity, efficient open data engineering tools. Accelerating dataset creation and iteration, while increasing the efficiency of use and reuse by democratising data engineering and evaluation, remains a core challenge to this day. 

“If 80 per cent of machine learning work is data preparation, then ensuring data quality is the most important work of an ML team and therefore a vital research area,” said the NeurIPS DCAI team. They added that human-labelled data has increasingly become the fuel and compass of AI-based software systems, even as innovative efforts have mostly focused on models and code. 

However, in recent years, the increased focus on the scale, speed, and cost of building and improving datasets has, in turn, affected quality. Major research works in the area include ‘Response-based Learning for Grounded Machine Translation,’ ‘Crowdsourcing with Fairness, Diversity and Budget Constraints,’ ‘Excavating AI,’ ‘Data Excellence: Better Data for Better AI,’ ‘State of the Art: Reproducibility in Artificial Intelligence,’ and others. 

“We need a framework for excellence in ‘data engineering’ that does not yet exist,” said the NeurIPS DCAI team, noting that aspects like the maintainability, reproducibility, reliability, validity, and fidelity of datasets are often overlooked when datasets are released. At the event, the team plans to highlight examples, case studies, and methodologies for excellence in data collection. 

The NeurIPS DCAI team said that building an active research community focused on data-centric AI is critical for defining the core problems and creating ways to measure progress in machine learning through data quality tasks.


Interested candidates can submit papers on topics that include, but are not limited to, the following: 

New Datasets in areas: 

  • Speech, vision, manufacturing, medical, recommendation/personalisation 
  • Science 

Tools and methodologies that

  • Quantify and accelerate the time to source high-quality data 
  • Ensure data is labelled consistently, such as through label consensus 
  • Improve data quality more systematically 
  • Automate the generation of high-quality supervised learning training data from low-quality resources, such as forced alignment in speech recognition 
  • Produce uniform, low-noise data samples, or remove labelling noise or inconsistencies from existing data 
  • Control what goes into the dataset and make high-level edits efficiently to very large datasets, like adding new words or languages 
  • Search for suitably licensed datasets based on public resources 
  • Create training datasets for small-data problems or rare classes in the long tail of big-data problems 
  • Incorporate timely feedback from production systems into datasets 
  • Understand dataset coverage of important classes and edit datasets to cover newly identified important cases 
  • Import datasets by allowing easy combination and composition of existing datasets 
  • Export datasets by making the data consumable for models and interfacing with model training and inference systems, such as WebDataset 
  • Enable composition of dataset tools like MLCube, Docker, and Airflow 
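As a toy illustration of label consensus (one of the tool categories above), a minimal sketch in Python — an illustrative example, not code from the workshop — assuming each example carries labels from several annotators:

```python
from collections import Counter

def consensus_label(annotations, min_agreement=0.5):
    """Majority-vote consensus over one example's annotator labels.

    Returns (label, agreement) when the most common label reaches the
    agreement threshold, and (None, agreement) otherwise, flagging the
    example for review or re-labelling.
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(annotations)
    return (label if agreement >= min_agreement else None, agreement)

# Three of four annotators agree: consensus is "cat" at 0.75 agreement.
print(consensus_label(["cat", "cat", "dog", "cat"]))
```

Real consensus schemes often weight annotators by estimated reliability rather than counting votes equally; a plain majority vote is only the simplest baseline.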

Algorithms for working with limited labelled data and improving label efficiency 

  • Data selection techniques like active learning and core-set selection for identifying the most valuable examples to label 
  • Semi-supervised learning, few-shot learning, and weak supervision techniques for maximising the power of limited labelled data 
  • Self-supervised learning and transfer learning approaches for developing powerful representations used for many downstream tasks with limited labelled data
  • Novelty and drift detection to identify when more data needs to be labelled 
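Of the techniques above, data selection via active learning is easy to sketch. A hedged, minimal example of margin-based uncertainty sampling (the function name and data here are illustrative assumptions, not from the workshop): given a model's class probabilities for a pool of unlabelled examples, select the ones where the top two classes are closest, since those are the most informative to label next.

```python
import numpy as np

def uncertainty_sample(probs, budget):
    """Pick the `budget` unlabelled examples with the smallest margin
    between the top two predicted classes, i.e. the examples the
    model is least sure about."""
    sorted_probs = np.sort(probs, axis=1)       # ascending per row
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    return np.argsort(margin)[:budget]          # smallest margins first

# Model outputs for four unlabelled examples over three classes.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident
    [0.40, 0.35, 0.25],   # uncertain
    [0.34, 0.33, 0.33],   # most uncertain
    [0.70, 0.20, 0.10],
])
print(uncertainty_sample(probs, budget=2))  # indices of the two least-certain rows
```

Other selection criteria (entropy, least confidence, core-set distance) slot into the same loop; only the scoring line changes.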

Responsible AI development: 

  • Fairness, bias, and diversity evaluation and analysis for datasets and algorithms/modelling 
  • Tools for ‘green AI hardware-software system’ design and evaluation 
  • Scalable, reliable training systems and methods 
  • Tools, methodologies, and techniques for private, secure ML training 
  • Efforts towards reproducible AI (data cards, model cards, etc.)

Instructions for submitting papers

  • Researchers can submit short papers (1-2 pages) or long papers (4 pages) addressing one or more of the topics 
  • Papers need to be formatted as per NeurIPS 2021 guidelines 
  • Papers will be peer-reviewed by the programme committee 
  • Accepted papers will be presented as lightning talks during the workshop 


  • Early submission deadline: 17 September 2021
  • Submission deadline: 30 September 2021 
  • Notification of acceptance: 22 October 2021 
  • Workshop: 14 December 2021 


Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
