Most Popular Datasets Announced By Big Tech In 2021

Some notable players in creating datasets include Apple, Microsoft, Google, Amazon, Facebook and others.
According to MillionInsights, the global AI training dataset market, valued at $956.5 million in 2019, is projected to grow at a CAGR of 22.5 per cent through 2027. The growth is fuelled by data-driven applications such as voice recognition and image recognition. In addition, the need for human-machine interaction is expected to open new growth opportunities for market players.
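As a rough illustration of what a 22.5 per cent CAGR implies, the sketch below compounds the 2019 figure forward. It assumes the rate applies to each of the eight years from 2019 to 2027; that horizon and the helper name `project_market_size` are assumptions made for illustration, not details taken from the MillionInsights report.

```python
def project_market_size(base_value: float, cagr: float, years: int) -> float:
    """Compound `base_value` forward by `years` periods at annual rate `cagr`."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return base_value * (1.0 + cagr) ** years


BASE_2019_USD_MILLION = 956.5  # market size quoted in the article
CAGR = 0.225                   # 22.5 per cent, as quoted in the article
YEARS = 2027 - 2019            # assumption: growth compounds over 8 years

projected_2027 = project_market_size(BASE_2019_USD_MILLION, CAGR, YEARS)
print(f"Projected 2027 market size: ${projected_2027:,.1f} million")
```

Under these assumptions the market grows roughly fivefold over the period; a different base year or horizon would change the result.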

In 2020, the text segment accounted for nearly 33 per cent of the market. This is due to the heavy use of text datasets in the IT sector for automation tasks such as text classification, caption generation, and speech recognition. Meanwhile, the audio segment is expected to hold a moderate share, thanks to the wide range of audio datasets available. These include speech datasets, music datasets, speech commands, multimodal emotion lines datasets, and environmental audio datasets. The image/video segment is expected to register massive growth in the coming years.

Last year, Google launched a new AI training dataset called Google-landmarks-v2 that contains millions of images and thousands of landmarks. It also launched two challenges on Kaggle, namely Landmark Retrieval 2020 and Landmark Recognition 2020. These focus on image retrieval and instance recognition, and aim to help train better, more robust systems.

Google has launched Google Dataset Search to foster a data-sharing ecosystem that encourages data publishers to follow best practices for data storage and publication, and to showcase datasets produced by scientists. The tool came out of beta on January 23, 2020. Microsoft has also launched Microsoft Research Open Data, a collection of free datasets from Microsoft Research to advance research in NLP, computer vision, and domain-specific sciences.

Several techniques, such as semi-supervised, self-supervised, and transfer learning, have been developed in the last few years, taking the training dataset ecosystem to the next level. In addition, big tech and other players are building automated tools that help create high-quality datasets and unlock the potential of AI. Some notable players in creating datasets include Apple, Microsoft, Google, Amazon, Facebook, Scale AI, Sama, and others.

Here’s a list of the most popular datasets announced by big tech in 2021.  



Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It was built from a large repository of synthetic scenes created by professional artists and contains 77,400 images of 461 indoor scenes, with detailed per-pixel labels and corresponding ground-truth geometry.

The dataset: 

  • Relies on publicly available 3D assets
  • Includes complete scene geometry, material information, and lighting information for every scene
  • Includes dense per-pixel semantic instance segmentation for every image
  • Factors every image into diffuse reflectance, diffuse illumination, and a non-diffuse residual term that captures view-dependent lighting effects

These features make the dataset well-suited for geometric learning problems that require direct 3D supervision, multi-task learning problems that require reasoning jointly over multiple input and output modalities, and inverse rendering problems. Interestingly, it is possible to generate the entire dataset from scratch for roughly half the cost of training a SOTA natural language processing model.
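The per-image factorisation listed above can be read as a per-pixel recomposition. The following is a minimal sketch, assuming the final colour is recovered as diffuse reflectance times diffuse illumination plus the non-diffuse residual. The function `recompose` and the toy pixel values are hypothetical; the sketch does not use the dataset's actual file format or loaders.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]  # one RGB value per pixel


def recompose(
    reflectance: Sequence[Pixel],
    illumination: Sequence[Pixel],
    residual: Sequence[Pixel],
) -> List[Pixel]:
    """Rebuild colours as reflectance * illumination + residual, per channel.

    Assumption: this multiplicative-plus-residual model is how the three
    layers combine; check the dataset documentation before relying on it.
    """
    if not (len(reflectance) == len(illumination) == len(residual)):
        raise ValueError("all three layers must have the same number of pixels")
    image = []
    for refl, illum, res in zip(reflectance, illumination, residual):
        image.append(tuple(r * i + e for r, i, e in zip(refl, illum, res)))
    return image


# Toy single-pixel example (made-up values, not from the dataset).
pixels = recompose(
    reflectance=[(0.5, 0.5, 0.5)],
    illumination=[(2.0, 1.0, 0.0)],
    residual=[(0.25, 0.0, 0.125)],
)
print(pixels)
```

In an inverse rendering setting, a model would predict the three layers from the image, and a recomposition like this one could serve as a consistency check.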

Check out the code and dataset here


Stuttering Events in Podcasts (SEP-28K) is a dataset containing over 28K clips labelled with five event types: blocks, prolongations, sound repetitions, word repetitions, and interjections. The audio is collected from public podcasts, largely consisting of people who stutter interviewing other people who stutter.

Check out the code and dataset on GitHub


Indoor Location Dataset

Released along with Microsoft Indoor Location Competition 2.0, the Indoor Location Dataset consists of dense indoor signatures of WiFi, geomagnetic field, iBeacons, etc., and ground truth collected by Android smartphones from hundreds of buildings in Chinese cities. The dataset can be used for research and development on indoor spaces, including localisation and navigation.

Check out the code and dataset on GitHub

Odia Speech Data and Model

In partnership with Navana Tech, Microsoft Research India is open-sourcing 1648 hours of validated Odia speech data and a baseline model for Odia speech recognition. The dataset consists of recordings in agriculture, banking, and healthcare in four dialects of Odia collected from five different districts.



Wikipedia-Based Image Text (WIT) dataset is a large multimodal dataset created by extracting multiple text selections associated with an image from Wikipedia. The dataset comprises a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 languages. 

Check out the dataset on GitHub

Translated Wikipedia Biographies

This dataset has been designed to analyse common gender errors in machine translation, such as incorrect gender choices in pro-drop, possessives, and gender agreement. 

Each instance of the dataset represents a person, a rock band or a sports team. A long text translation represents each entity. Articles are written in native English and have been professionally translated to Spanish and German. For Spanish, translations are optimised for pronoun-drop so that the same set could analyse pro-drop (Spanish to English) and gender agreement (English to Spanish). 


Room-Across-Room (RxR) is a new dataset for vision-and-language navigation (VLN). It is the first multilingual dataset for VLN, containing 126,069 human-annotated navigation instructions in three typologically diverse languages – English, Hindi, and Telugu. Each instruction describes a path through a photorealistic simulator populated with indoor environments from the Matterport3D dataset, which includes 3D captures of offices, homes, and public buildings.

Check out the dataset on GitHub



Commonsense-Dialogues is a crowdsourced dataset of 11,000 dialogues grounded in social contexts involving the use of commonsense. The social contexts used were sourced/collected from the train split of the SocialIQA dataset, a multiple-choice question-answering based social commonsense reasoning benchmark. 

To build the dataset, each crowd worker (Turker) was presented with a social context and asked to write a dialogue of 4-6 turns between two people based on the events described in the context. In addition, the Turker was asked to alternate between an individual referenced in the context and a third-party friend.
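The collection rules described above (4-6 turns, two speakers, alternating) can be expressed as a simple check. This is a sketch only: the `Turn` and `Dialogue` classes, their field names, and the example dialogue are hypothetical and do not reflect the released file format or its contents.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str  # hypothetical field: who is talking
    text: str     # hypothetical field: what they say


@dataclass
class Dialogue:
    context: str  # the social context the dialogue is grounded in
    turns: List[Turn]


def follows_collection_rules(dialogue: Dialogue,
                             min_turns: int = 4,
                             max_turns: int = 6) -> bool:
    """Check turn count, exactly two speakers, and strict alternation."""
    turns = dialogue.turns
    if not (min_turns <= len(turns) <= max_turns):
        return False
    if len({turn.speaker for turn in turns}) != 2:
        return False
    # Adjacent turns must come from different speakers.
    return all(a.speaker != b.speaker for a, b in zip(turns, turns[1:]))


# Invented example for illustration; not taken from the dataset.
example = Dialogue(
    context="Alex missed the last bus home after work.",
    turns=[
        Turn("Alex", "I missed the last bus again."),
        Turn("Friend", "Oh no, how are you getting home?"),
        Turn("Alex", "I think I will have to walk."),
        Turn("Friend", "Stay there, I can pick you up."),
    ],
)
print(follows_collection_rules(example))
```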

Check out the dataset here


Amazon Berkeley Objects (ABO) is a large-scale dataset of product images and 3D models corresponding to real household objects. The team has used this realistic, object-centric 3D dataset to measure the domain gap for single-view 3D reconstruction networks trained on synthetic objects.

In addition, researchers can use multi-view images from ABO to measure the robustness of SOTA metric learning approaches to different camera viewpoints. Using the physically-based rendering materials in ABO, they can also perform single- and multi-view material estimation for various complex, real-world geometries.

The dataset is available for download here


Task-driven Embodied Agents that Chat (TEACh) is a dataset of over 3K human-to-human interactive dialogues to complete tasks in a simulated household environment. A Commander with access to oracle information about a task communicates in natural language with a Follower. Following this, the Follower navigates through and interacts with the environment to complete tasks varying in complexity from ‘Make Coffee’ to ‘Prepare Breakfast,’ asking questions and getting additional information from the Commander. 

Check out the dataset here



DVD, a diagnostic dataset for video-grounded dialogue, is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. Its dialogues are synthesised over multiple question turns, each injected with a set of cross-turn semantic relationships. Researchers can use DVD to analyse existing approaches, gaining insights into their abilities and limitations. It has been built from 11K CATER synthetic videos and contains ten instances of 10-round dialogues for each video, resulting in more than 100K dialogues and 1 million question-answer pairs.
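The totals quoted for DVD follow from simple multiplication. A quick sketch, assuming exactly 11,000 videos and one question-answer pair per dialogue round (both are simplifying assumptions; the article gives only rounded figures):

```python
NUM_VIDEOS = 11_000          # assumption: "11K" taken as exactly 11,000
DIALOGUES_PER_VIDEO = 10     # ten dialogue instances per video
ROUNDS_PER_DIALOGUE = 10     # each dialogue has 10 rounds

# Assumption: each round contributes one question-answer pair.
total_dialogues = NUM_VIDEOS * DIALOGUES_PER_VIDEO
total_qa_pairs = total_dialogues * ROUNDS_PER_DIALOGUE

print(f"dialogues: {total_dialogues:,}")
print(f"question-answer pairs: {total_qa_pairs:,}")
```

Both results are consistent with the "more than 100K dialogues and 1 million question-answer pairs" stated above.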

Check out the code and datasets here


OpenNeeds is the first large-scale, high-frame-rate, comprehensive and open-source dataset of non-eye (head, hand and scene) and eye (3D gaze vectors) data captured for 44 participants as they freely explored two virtual environments with many potential tasks like reading, drawing, shooting, object manipulation, etc. 


Intentonomy consists of 14K images covering a wide range of everyday scenes. These images are manually annotated with 28 intent categories derived from a social psychology taxonomy. Using this dataset, researchers can systematically study the extent to which commonly used visual information, such as objects and context, contributes to understanding human motives.

Check out the code and dataset on GitHub.

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.