10 Datasets Open-Sourced By Tech Giants In 2019


Open-sourcing has become one of the more robust ways to improve the quality of a project. According to the 2018 Open Source Program Management survey by the Linux Foundation, open-source programs are set to become a best practice for organisations in technology, telecom and finance, among others. In one of our articles, we discussed how tech giants are advocating open-source software as a vehicle of change.

In this article, we list 10 datasets that were open-sourced by tech giants in 2019.



Note: The list is in alphabetical order.

1| Coached Conversational Preference Elicitation (CCPE) and Taskmaster-1 By Google

In September, Google released two natural language dialogue datasets: Coached Conversational Preference Elicitation (CCPE) and Taskmaster-1. CCPE consists of 502 English dialogues with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. The Taskmaster-1 dataset includes 13,215 task-based dialogues spanning six domains.

Know more about this dataset here.

2| CheXpert Dataset By Stanford ML

In January, machine learning researchers from Stanford University released CheXpert (Chest eXpert), a large chest radiograph dataset with uncertainty labels and expert comparison. The dataset contains 224,316 chest radiographs of 65,240 patients, labelled for the presence of 14 common chest radiographic observations.

Know more about this dataset here.

3| Driving Dataset By Waymo

In August, Waymo, Alphabet’s autonomous driving subsidiary, open-sourced a high-quality multimodal sensor dataset for autonomous driving. The dataset is drawn from Waymo self-driving vehicles and covers a wide variety of environments, from dense urban centres to suburban landscapes. It contains 1,000 segments, each capturing 20 seconds of continuous driving, which corresponds to 200,000 frames at 10 Hz per sensor.
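
The frame count quoted above follows from simple arithmetic: 1,000 segments of 20 seconds each, sampled at 10 Hz, gives 200,000 frames per sensor. A minimal Python sketch of that calculation is shown here; the function and variable names are illustrative only and are not part of any Waymo tooling.

```python
# Figures quoted in the article for the Waymo driving dataset.
SEGMENTS = 1000            # number of driving segments
SECONDS_PER_SEGMENT = 20   # length of each segment in seconds
FRAME_RATE_HZ = 10         # frames captured per second, per sensor


def frames_per_sensor(segments: int, seconds_per_segment: int, frame_rate_hz: int) -> int:
    """Return the total number of frames a single sensor records."""
    return segments * seconds_per_segment * frame_rate_hz


total = frames_per_sensor(SEGMENTS, SECONDS_PER_SEGMENT, FRAME_RATE_HZ)
print(total)  # 200000
```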

Know more about this dataset here.

4| Deepfake Detection Challenge

In September, Facebook, the Partnership on AI, Microsoft and academics from Cornell Tech, MIT, University of Oxford, UC Berkeley, University of Maryland, College Park, and University launched the Deepfake Detection Challenge (DFDC). The researchers released four groups of datasets associated with the challenge: the training set, the public validation set, the public test set and the private test set.

Know more about this dataset here.

5| Diversity in Faces Dataset By IBM

In January, IBM released the Diversity in Faces (DiF) dataset, which aims to advance the study of fairness and accuracy in facial recognition technology. The dataset provides annotations of 1 million human facial images. Using publicly available images from the YFCC-100M Creative Commons dataset, the researchers annotated the faces using 10 well-established and independent coding schemes.

Know more about this dataset here.

6| Landmarks-v2 By Google

In May, Google released Google-Landmarks-v2, one of the largest worldwide landmark recognition datasets. It includes over 5 million images (twice as many as the first release) of more than 200,000 different landmarks.

Know more about this dataset here.

7| Level 5 Dataset By Lyft

In July, Lyft released its Level 5 dataset, a large-scale dataset featuring raw camera and LiDAR sensor inputs as perceived by a fleet of multiple high-end autonomous vehicles in a bounded geographic area. It includes over 55,000 human-labelled 3D annotated frames, a drivable surface map, and an underlying HD spatial semantic map to contextualise the data.

Know more about this dataset here.

8| Libri-Light By Facebook AI

In December, Facebook AI Research released the Libri-Light dataset, a collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It contains over 60,000 hours of audio derived from open-source audiobooks from the LibriVox project.

Know more about this dataset here.

9| Natural Questions for Question-Answering Systems By Google

Released by Google in January, Natural Questions (NQ) is a large-scale corpus for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions. The dataset consists of 300,000 naturally occurring questions, along with human-annotated answers from Wikipedia pages, to be used in training QA systems.

Know more about this dataset here.

10| Open Images V5 By Google

In May, Google open-sourced the Open Images V5 dataset, which features segmentation masks for 2.8 million object instances in 350 categories. The training set accounts for 2.68 million of these masks, which were produced through an interactive segmentation process.

Know more about this dataset here.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
