Most Popular Datasets Announced By Big Tech In 2021

Some notable players in creating datasets include Apple, Microsoft, Google, Amazon, Facebook and others.

According to MillionInsights, the global AI training dataset market, valued at $956.5 million in 2019, is projected to grow at a CAGR of 22.5 per cent through 2027. The growth is fuelled by data-driven applications such as voice recognition and image recognition. In addition, the growing need for human-machine interaction is expected to open new growth opportunities for market players.
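As a rough worked example of what that growth rate implies, the sketch below compounds the 2019 figure forward. It assumes the 22.5 per cent CAGR applies to each of the eight years from 2019 to 2027, which the report summary does not state explicitly; the function and variable names are illustrative only.

```python
def project_market_size(base_value: float, cagr: float, years: int) -> float:
    """Compound base_value forward by `years` periods at annual rate `cagr`."""
    return base_value * (1.0 + cagr) ** years


# Figures quoted above: $956.5 million in 2019, 22.5 per cent CAGR.
# Assumption: the rate applies to all eight years from 2019 to 2027.
BASE_2019_MILLION_USD = 956.5
CAGR = 0.225

projected_2027 = project_market_size(BASE_2019_MILLION_USD, CAGR, 2027 - 2019)
print(f"Implied 2027 market size: ${projected_2027 / 1000:.2f} billion")
```

Under that assumption, the implied 2027 market size is a little under $5 billion.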

In 2020, the text segment accounted for nearly 33 per cent of the market. This is due to the heavy use of text datasets in the IT sector for automation tasks such as text classification and caption generation. Meanwhile, the audio segment is expected to hold a moderate share, given the wide range of audio datasets available. These include speech datasets, music datasets, speech commands, multimodal emotion lines datasets, and environmental audio datasets. The image/video segment is expected to register massive growth in the coming years.

Last year, Google launched a new AI training dataset called Google-landmarks-v2 that contains millions of images and thousands of landmarks. It also launched two challenges on Kaggle, namely Landmark Retrieval 2020 and Landmark Recognition 2020. These target image retrieval and instance recognition, with the aim of training better and more robust systems.

Google has also launched Google Dataset Search to foster a data-sharing ecosystem that encourages data publishers to follow best practices for data storage and publication, and that showcases datasets produced by scientists. It came out of beta on January 23, 2020. Microsoft, for its part, has launched Microsoft Research Open Data, a collection of free datasets from Microsoft Research to advance research in NLP, computer vision, and domain-specific sciences.

Several techniques, such as semi-supervised, self-supervised, and transfer learning, have been developed in the last few years, taking the training dataset ecosystem to the next level. In addition, big tech and other players are building automated tools to help create high-quality datasets and unleash the true potential of AI. Some notable players in creating datasets include Apple, Microsoft, Google, Amazon, Facebook, Scale AI, Sama, and others.

Here’s a list of the most popular datasets announced by big tech in 2021.  

Apple 

Hypersim

Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It was built from a large repository of synthetic scenes created by professional artists and comprises 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.

The dataset: 

  • Relies on publicly available 3D assets
  • Includes complete scene geometry, material information, and lighting information for every scene
  • Includes dense per-pixel semantic instance segmentation for every image
  • Factors every image into diffuse reflectance, diffuse illumination, and a non-diffuse residual term that captures view-dependent lighting effects.
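To illustrate the last point, here is a minimal sketch of how the three layers could be recombined into an image. It assumes the factorisation is multiplicative-plus-additive, i.e. image = reflectance × illumination + residual per pixel and channel; the function name, the list-of-tuples image layout, and the toy pixel values are all hypothetical and are not taken from the Hypersim code.

```python
from typing import List, Tuple

Pixel = Tuple[float, ...]
Image = List[List[Pixel]]


def recompose(reflectance: Image, illumination: Image, residual: Image) -> Image:
    """Recombine per-pixel layers as reflectance * illumination + residual."""
    return [
        [
            tuple(r * i + e for r, i, e in zip(refl_px, illum_px, resid_px))
            for refl_px, illum_px, resid_px in zip(refl_row, illum_row, resid_row)
        ]
        for refl_row, illum_row, resid_row in zip(reflectance, illumination, residual)
    ]


# Toy 1x2 RGB layers with made-up values, purely for illustration.
reflectance = [[(0.5, 0.5, 0.5), (0.8, 0.2, 0.2)]]
illumination = [[(1.0, 1.0, 1.0), (0.5, 0.5, 0.5)]]
residual = [[(0.0, 0.0, 0.0), (0.1, 0.1, 0.1)]]

image = recompose(reflectance, illumination, residual)
print(image)
```

Keeping the residual as a separate additive layer means view-dependent effects such as specular highlights can be studied or removed without touching the diffuse terms.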

These features make the dataset well-suited for geometric learning problems that require direct 3D supervision, multi-task learning problems that require reasoning jointly over multiple input and output modalities, and inverse rendering problems. Interestingly, the entire dataset can be generated from scratch for roughly half the cost of training a SOTA natural language processing model.

Check out the code and dataset here

SEP-28K

Stuttering Events in Podcasts (SEP-28K) is a dataset containing over 28K clips labelled with five event types, including blocks, prolongations, sound repetitions, word repetitions, and interjections. The audio is collected from public podcasts largely consisting of people who stutter interviewing other people who stutter. 

Check out the code and dataset on GitHub

Microsoft 

Indoor Location Dataset

Released along with the Microsoft Indoor Location Competition 2.0, the Indoor Location Dataset consists of dense indoor signatures of WiFi, geomagnetic field, iBeacons, etc., along with ground truth collected by Android smartphones from hundreds of buildings in Chinese cities. The dataset can be used for research and development on indoor spaces, including localisation and navigation.

Check out the code and dataset on GitHub

Odia Speech Data and Model

In partnership with Navana Tech, Microsoft Research India is open-sourcing a validated Odia speech dataset of 1648 hours, along with a baseline model for Odia speech recognition. The dataset consists of recordings on agriculture, banking, and healthcare in four dialects of Odia, collected from five different districts.

Google 

WIT

Wikipedia-Based Image Text (WIT) dataset is a large multimodal dataset created by extracting multiple text selections associated with an image from Wikipedia. The dataset comprises a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 languages. 

Check out the dataset on GitHub

Translated Wikipedia Biographies

This dataset has been designed to analyse common gender errors in machine translation, such as incorrect gender choices in pro-drop, possessives, and gender agreement. 

Each instance in the dataset represents a person, a rock band or a sports team, and each entity is represented by a long text translation. The articles were written natively in English and have been professionally translated into Spanish and German. The Spanish translations are optimised for pronoun-drop so that the same set can be used to analyse both pro-drop (Spanish to English) and gender agreement (English to Spanish).

RxR 

Room-Across-Room (RxR) is a new dataset for vision-and-language navigation (VLN). It is the first multilingual dataset for VLN, containing 126,069 human-annotated navigation instructions in three typologically diverse languages – English, Hindi, and Telugu. In the dataset, each instruction describes a path through a photorealistic simulator populated with an indoor environment from the Matterport3D dataset, which includes 3D captures of offices, homes, and public buildings. 

Check out the dataset on GitHub

Amazon 

Commonsense-Dialogues 

Commonsense-Dialogues is a crowdsourced dataset of 11,000 dialogues grounded in social contexts involving the use of commonsense. The social contexts used were sourced/collected from the train split of the SocialIQA dataset, a multiple-choice question-answering based social commonsense reasoning benchmark. 

For this dataset, each Turker was presented with a social context and asked to write a dialogue of 4-6 turns between two people based on the events described in the context. In addition, the Turker was asked to alternate between an individual referenced in the context and a third-party friend.
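The collection constraints described above (4-6 turns, two speakers, strictly alternating) can be sketched as a simple validity check. This is only an illustration: the `Turn` class, its field names, and the sample dialogue are hypothetical and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str  # hypothetical field name, not from the dataset schema
    text: str


def is_valid_dialogue(turns: List[Turn], min_turns: int = 4, max_turns: int = 6) -> bool:
    """Check the turn count, that exactly two speakers take part, and that they alternate."""
    if not min_turns <= len(turns) <= max_turns:
        return False
    if len({t.speaker for t in turns}) != 2:
        return False
    # Consecutive turns must come from different speakers.
    return all(a.speaker != b.speaker for a, b in zip(turns, turns[1:]))


# Made-up example: a person from the social context talking to a friend.
sample = [
    Turn("Addison", "I finally handed in my thesis today."),
    Turn("Friend", "That's great news! How do you feel?"),
    Turn("Addison", "Relieved, mostly. I haven't slept properly in weeks."),
    Turn("Friend", "You should take the weekend off to celebrate."),
]
print(is_valid_dialogue(sample))
```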

Check out the dataset here

ABO

Amazon Berkeley Objects (ABO) is a large-scale dataset of product images and 3D models corresponding to real household objects. The team has used this realistic, object-centric 3D dataset to measure the domain gap for single-view 3D reconstruction networks trained on synthetic objects.

In addition, researchers can use multi-view images from ABO to measure the robustness of SOTA metric learning approaches to different camera viewpoints. Using the physically-based rendering materials in ABO, they can also perform single- and multi-view material estimation for various complex, real-world geometries.

The dataset is available for download here

TEACh

Task-driven Embodied Agents that Chat (TEACh) is a dataset of over 3K human-to-human interactive dialogues to complete tasks in a simulated household environment. A Commander with access to oracle information about a task communicates in natural language with a Follower. Following this, the Follower navigates through and interacts with the environment to complete tasks varying in complexity from ‘Make Coffee’ to ‘Prepare Breakfast,’ asking questions and getting additional information from the Commander. 

Check out the dataset here

Facebook 

DVD

DVD is a diagnostic dataset for video-grounded dialogue. It is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. Its dialogues are synthesised over multiple question turns, each injected with a set of cross-turn semantic relationships. Researchers can use DVD to analyse existing approaches and gain insights into their abilities and limitations. It has been built from 11K CATER synthetic videos and contains ten instances of 10-round dialogues for each video, resulting in more than 100K dialogues and 1 million question-answer pairs.
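The totals quoted above follow from simple multiplication, as the short check below shows. It treats "11K" as exactly 11,000 videos and assumes one question-answer pair per dialogue round; both are simplifying assumptions, so the real counts may differ slightly.

```python
# Figures from the text; "11K" is treated as exactly 11,000 (assumption).
videos = 11_000
dialogues_per_video = 10
rounds_per_dialogue = 10  # assumption: one question-answer pair per round

dialogues = videos * dialogues_per_video
qa_pairs = dialogues * rounds_per_dialogue

print(f"dialogues: {dialogues:,}")
print(f"question-answer pairs: {qa_pairs:,}")
```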

Check out the code and datasets here

OpenNEEDS

OpenNEEDS is the first large-scale, high-frame-rate, comprehensive and open-source dataset of non-eye (head, hand and scene) and eye (3D gaze vectors) data. It was captured from 44 participants as they freely explored two virtual environments offering many potential tasks, such as reading, drawing, shooting, and object manipulation.

Intentonomy

Intentonomy consists of 14K images covering a wide range of everyday scenes. These images are manually annotated with 28 intent categories derived from a social psychology taxonomy. Using this dataset, researchers can systematically study the extent to which commonly used visual information, such as objects and context, contributes to human motive understanding.

Check out the code and dataset on GitHub.

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.