The 35th edition of NeurIPS (Neural Information Processing Systems), one of the world’s most prestigious industry and academic gatherings, recently concluded. NeurIPS 2021 received 9,122 submissions, of which 2,344 were accepted. That is an acceptance rate of about 26 per cent (with 3 per cent designated as spotlight papers), a slight increase from last year and the highest since 2013.
One of the highlights of this year’s conference was the introduction of a new award category for the Datasets and Benchmarks track. Two papers received awards in this category.
Idea behind announcing a new category
NeurIPS wrote in a blog post that the Datasets and Benchmarks track would act as a novel venue for high-quality publications and talks on valuable ML datasets and benchmarks. It would also serve as a forum for discussions on how to improve dataset development. Datasets and benchmarks are important for the development of machine learning methods but require their own reviewing guidelines. They also require additional specific checks, such as a proper description of the collected data, including its accessibility and potential bias. Submissions to this track were reviewed according to a set of criteria designed specifically for datasets and benchmarks.
The following two papers were recognised in the new category of Datasets & Benchmarks Best Paper Awards:
Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
This paper was published by a group of researchers from the University of California, Los Angeles, and Google Research. It explores the use of datasets within different machine learning subcommunities and the interaction between dataset adoption and creation. It calls for researchers to select benchmark datasets with greater care and to promote the creation of new and more diverse datasets.
The paper found that despite the foundational role of benchmarking practices in ML research, little attention has been paid to the dynamics of benchmark dataset use and reuse. The researchers studied how usage patterns differed across ML subcommunities from 2015 to 2020. They found increasing concentration on fewer datasets within task communities, adoption of datasets from other tasks, and concentration across the field on datasets introduced by researchers at a small number of elite institutions. The results of this study have implications for scientific evaluation, AI ethics, and equity and access within the field.
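To make the idea of "concentration on fewer datasets" concrete, one common way to quantify concentration is the Gini coefficient, where 0 means usage is spread evenly and values near 1 mean a few datasets dominate. The sketch below is purely illustrative: the function name `gini` and the usage counts are hypothetical and are not taken from the paper, and it should not be read as the authors' exact method.

```python
def gini(counts):
    """Gini coefficient of non-negative usage counts (0 = perfectly even)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted counts (ranks start at 1).
    weighted = sum((rank + 1) * v for rank, v in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical numbers of papers using each of four benchmark datasets.
even_usage = [5, 5, 5, 5]          # every dataset used equally often
concentrated_usage = [0, 0, 0, 20]  # one dataset used by every paper

print(gini(even_usage))
print(gini(concentrated_usage))
```

With these toy counts, the evenly used set scores lower than the set where a single dataset accounts for all usage, which is the pattern a rising concentration trend would show over time.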
ATOM3D: Tasks on Molecules in Three Dimensions
The ATOM3D database contains datasets that describe the three-dimensional structure of biomolecules, including proteins, small molecules, and nucleic acids. The datasets represent a variety of important structural, functional, and engineering challenges and serve as a benchmark for machine learning methods that operate on molecular structure. Alongside the datasets, a Python package provides processing code, utilities, models, and data loaders for common machine learning frameworks such as PyTorch. ATOM3D’s datasets are updated as the field progresses, and tasks are added according to the project’s needs.
At the moment, ATOM3D contains eight datasets, which can roughly be grouped into four categories. Together they cover a wide range of problems, from single molecular structures and interactions between biomolecules to molecular functional properties and design/engineering tasks.