Popular Datasets Released By Tech Firms In 2021

This article lists some of the datasets open-sourced by big tech companies in 2021

Open source dominated the tech landscape in 2021. Companies, big and small, are showing increased trust in the open-source community and contributing actively to it. This is not surprising, since open source has been the backbone of rapid technological development and collaboration. In this article, we list some of the datasets open-sourced by big tech companies in 2021.


Multilingual Counterfactual Dataset From Amazon

Amazon released the Multilingual Counterfactual Dataset to help train machine learning models to recognise counterfactual statements. The project started because there were no large-scale datasets of counterfactual statements in product reviews covering multiple languages. The dataset annotates sentences from product reviews written in English, German and Japanese. A study revealed that only 1-2 per cent of sentences in natural language texts express counterfactuals, so simply annotating a randomly selected set of sentences would yield a highly imbalanced dataset with a sparse training signal. The Multilingual Counterfactual Dataset helps ease such complications. It is supplemented with annotation guidelines and definitions that professional linguists worked on. Amazon has also provided the list of clue words that are typical of counterfactual statements and were used for initial data filtering.
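
The clue-word filtering idea can be sketched in a few lines of Python. This is only an illustration, not Amazon's code: the clue words below and the function name `filter_candidates` are assumptions made for the example, and the actual clue word list is the one distributed with the dataset.

```python
import re

# Hypothetical English clue words, chosen for illustration only.
# The real list is the one released alongside the dataset.
CLUE_WORDS = ["wish", "if only", "would have", "should have", "could have"]

# One case-insensitive pattern matching any clue word on word boundaries.
CLUE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in CLUE_WORDS) + r")\b",
    re.IGNORECASE,
)


def filter_candidates(sentences):
    """Keep only the sentences that contain at least one clue word."""
    return [s for s in sentences if CLUE_PATTERN.search(s)]


reviews = [
    "I wish this case had a stronger zipper.",
    "The battery lasts all day.",
    "It would have been perfect with a longer cable.",
    "Arrived on time and works fine.",
]

# Only the sentences with a clue word move on to human annotation.
candidates = filter_candidates(reviews)
print(candidates)
```

A clue word does not guarantee that a sentence is counterfactual, so in a setup like this the filtered candidates would still need to be labelled by annotators. The filter only raises the share of likely counterfactuals in the pool that gets annotated.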


GoEmotions By Google

GoEmotions is a human-annotated dataset of fine-grained emotions. It contains 58,000 Reddit comments that were extracted from popular subreddits and labelled with 27 emotion categories. As per Google, the GoEmotions taxonomy was designed with both psychology and data applicability in mind. Six emotions are generally considered basic; GoEmotions, however, covers 12 positive, 11 negative and four ambiguous emotion categories, plus one label marked as neutral. Because it covers such a wide range of emotions, the dataset is suited to tasks that require telling apart subtle differences in emotional expression.


Along with the dataset, Google also released a detailed tutorial that demonstrates how to train a neural model on GoEmotions and apply it to suggest emojis based on conversational text.


ORBIT Dataset

Microsoft introduced the ORBIT dataset in partnership with City, University of London. The dataset sets a new standard for evaluating machine learning models in few-shot, high-variation learning scenarios. This will help in training models that perform better in real-world situations, specifically for people who are blind or have low vision. The dataset contains 3,822 videos of 486 objects recorded by people with low vision on their mobile phones. The benchmark reflects a practical, highly challenging recognition task and offers a rich playground for research into robustness under few-shot, high-variation conditions.


Wikipedia-Based Image Text

Wikipedia-Based Image Text (WIT) is a large multimodal, multilingual dataset from Google. It is composed of 37.6 million image-text examples with 11.5 million unique images collected across 108 Wikipedia languages. Its size makes WIT usable as a pretraining dataset for multimodal ML models. Google claimed that this dataset was the largest in terms of image-text samples at the time of publication.

The researchers started by selecting Wikipedia pages with images and then extracted image-text associations and the surrounding context. They also performed a rigorous filtering process to ensure data quality. This process included steps such as text-based filtering for caption availability, length and quality, and image-based filtering to ensure correct size and licensing.
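
To make the filtering steps concrete, here is a minimal Python sketch of such a quality gate. It is not Google's pipeline: the thresholds, the licence names, the `Example` record and the `keep` function are all assumptions invented for this illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds and licence names, chosen only for illustration.
MIN_CAPTION_CHARS = 3
MIN_IMAGE_SIDE_PX = 100
ALLOWED_LICENSES = {"cc-by", "cc-by-sa", "public-domain"}


@dataclass
class Example:
    """A simplified image-text pair (hypothetical schema)."""
    caption: str
    width: int
    height: int
    license: str


def keep(example):
    """Return True only if the example passes every filter."""
    # Text-based filter: a caption must exist and be long enough.
    has_caption = len(example.caption.strip()) >= MIN_CAPTION_CHARS
    # Image-based filters: minimum size and an accepted licence.
    big_enough = min(example.width, example.height) >= MIN_IMAGE_SIDE_PX
    licensed = example.license in ALLOWED_LICENSES
    return has_caption and big_enough and licensed


examples = [
    Example("A red fox in the snow", 640, 480, "cc-by-sa"),
    Example("", 640, 480, "cc-by"),                    # no caption
    Example("Tiny icon", 32, 32, "public-domain"),     # image too small
    Example("Stock photo", 800, 600, "proprietary"),   # licence not allowed
]

kept = [e for e in examples if keep(e)]
print(len(kept), "of", len(examples), "examples kept")
```

Each check is independent, so a real pipeline could log which filter rejected an example and tune the thresholds separately.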

Ego4D by Facebook

Ego4D is a large egocentric video dataset and benchmark suite with 3,025 hours of daily-life activity video. The footage spans hundreds of scenarios captured by 855 camera wearers at 74 different locations in nine countries. Ego4D expands the volume of egocentric video that is publicly available to the research community. Part of the footage comes with audio, eye gaze, stereo, synchronised videos, and 3D meshes of the environment. Facebook (now Meta) also introduced new benchmark challenges centred on the first-person visual experience: querying episodic memory, analysing hand-object manipulation, social interaction, audio-visual conversation, and forecasting events.

Datasets by Hugging Face

Hugging Face released Datasets, a community library for NLP. It contains 650 unique datasets and has over 250 contributors. The library had been under development for about a year and has supported many novel cross-dataset research projects and tasks. Datasets is designed to address the challenges of dataset management while supporting community culture and norms. To develop the library, Hugging Face conducted a public hackathon, which resulted in 485 commits. The library now includes continuous data types and multi-dimensional arrays for image and video data, along with an audio type.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long-term impact on individuals and societies.
