Open source has dominated the tech landscape in 2021. Companies, big and small, are showing increased trust in the open-source community and contributing actively to it. It is not surprising because open source has been the backbone of rapid technological development and collaboration. In this article, we list some of the datasets open-sourced by big tech companies in 2021.
Multilingual Counterfactual Dataset From Amazon
Amazon released the Multilingual Counterfactual Dataset to help train machine learning models to recognise counterfactual statements. The project was started when there were no large-scale datasets for counterfactual statements in product reviews made in multiple languages. This dataset annotates sentences from product reviews written in English, German and Japanese. A study revealed that only 1-2 per cent of sentences in natural language texts express counterfactuals; simply annotating a randomly selected set of sentences would therefore yield a highly imbalanced dataset with a sparse training signal. The Multilingual Counterfactual Dataset helps ease such complications. The dataset is supplemented with annotation guidelines and definitions developed with professional linguists. Amazon has also provided a list of clue words typical of counterfactual statements, which was used for initial data filtering.
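The clue-word pre-filtering step can be sketched as follows. The clue words and review sentences below are illustrative examples only, not the actual list Amazon released:

```python
# Illustrative sketch of clue-word pre-filtering for counterfactual
# candidate sentences. The clue words below are invented examples;
# Amazon's released lists are language-specific and more extensive.
CLUE_WORDS = {"if only", "wish", "would have", "should have", "could have"}

def candidate_counterfactuals(sentences):
    """Keep sentences containing at least one clue word (case-insensitive)."""
    results = []
    for sentence in sentences:
        lowered = sentence.lower()
        if any(clue in lowered for clue in CLUE_WORDS):
            results.append(sentence)
    return results

reviews = [
    "I wish the battery lasted longer.",
    "The screen is bright and sharp.",
    "It would have been perfect with a better strap.",
]
print(candidate_counterfactuals(reviews))
```

Pre-filtering in this way concentrates annotator effort on likely counterfactuals, which is how a 1-2 per cent phenomenon can yield a usable training set.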
GoEmotions By Google
GoEmotions is a dataset of fine-grained emotions — a human-annotated dataset of 58,000 Reddit comments extracted from popular subreddits and labelled with 27 emotion categories. As per Google, the GoEmotions taxonomy was designed with both psychology and data applicability in mind. Six emotions are conventionally considered basic; with GoEmotions, however, Google has considered 12 positive emotions, 11 negative, four ambiguous, and one label marked as neutral. Because such a wide range of emotions is covered, the dataset suits tasks that require distinguishing subtle differences in emotional expression.
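The taxonomy described above can be illustrated with a short sketch. Since a comment can express several emotions at once, labels are naturally represented as multi-hot vectors; the label subset below is illustrative, not the full 28-label vocabulary:

```python
# Sketch of the GoEmotions sentiment grouping: 12 positive, 11 negative
# and 4 ambiguous emotion categories, plus a neutral label (28 in total).
# A comment can carry multiple labels, so targets are multi-hot vectors.
TAXONOMY = {"positive": 12, "negative": 11, "ambiguous": 4, "neutral": 1}

def multi_hot(labels, vocab):
    """Encode a set of emotion labels as a multi-hot vector over vocab."""
    return [1 if name in labels else 0 for name in vocab]

# Illustrative subset of the label vocabulary (not all 28 labels).
vocab = ["admiration", "anger", "confusion", "joy", "neutral"]
print(sum(TAXONOMY.values()))            # total number of labels
print(multi_hot({"joy", "admiration"}, vocab))
```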
Along with the dataset, Google also released a detailed tutorial that demonstrates how to train a neural model on GoEmotions and apply it to suggesting emojis based on conversational text.
ORBIT By Microsoft
Microsoft introduced the ORBIT dataset in partnership with City, University of London. The dataset sets a new standard for evaluating machine learning models in few-shot, high-variation learning scenarios. This will help in training models for higher performance in real-world situations, specifically for people who are blind or have low vision. The dataset contains 3,822 videos of 486 objects recorded by people with low vision on their mobile phones. The benchmark reflects a practical, highly challenging recognition task and offers a rich playground for research into robustness under few-shot, high-variation conditions.
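The few-shot setting above can be sketched as episode sampling: for each user, a model is personalised on a handful of support videos per object and evaluated on held-out query videos. The data and function below are synthetic illustrations, not ORBIT's actual evaluation code:

```python
import random

# Minimal sketch of few-shot episode sampling for an ORBIT-style
# benchmark: each object's videos are split into a k-shot support set
# (for personalisation) and a query set (for evaluation).
# The video IDs below are invented placeholders.
def sample_episode(videos_by_object, k_shot, rng):
    """Split each object's videos into a k-shot support set and a query set."""
    support, query = {}, {}
    for obj, videos in videos_by_object.items():
        shuffled = list(videos)
        rng.shuffle(shuffled)
        support[obj] = shuffled[:k_shot]
        query[obj] = shuffled[k_shot:]
    return support, query

user_videos = {
    "keys": ["keys_v1", "keys_v2", "keys_v3"],
    "mug": ["mug_v1", "mug_v2", "mug_v3"],
}
support, query = sample_episode(user_videos, k_shot=2, rng=random.Random(0))
print(len(support["keys"]), len(query["keys"]))
```

Personalising per user rather than pooling all data is what makes the benchmark reflect real assistive use, where each person's objects and recording conditions differ.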
Wikipedia-Based Image Text By Google
Wikipedia-Based Image Text (WIT) is a large multimodal, multilingual dataset from Google. WIT is composed of 37.6 million image-text examples and contains 11.5 million unique images collected across 108 Wikipedia languages. This scale allows WIT to serve as a pretraining dataset for multimodal ML models. Google claimed that this dataset was the largest in terms of image-text samples at the time of publication.
The researchers started by selecting Wikipedia pages with images before extracting image-text associations and the surrounding contexts. They also performed a rigorous filtering process to ensure data quality; this included text-based filtering, checks for caption availability, image-based filtering for correct size and licensing, and checks on caption length and quality.
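A filtering pipeline of this kind can be sketched as a conjunction of per-pair checks. The thresholds, field names and licence list below are illustrative assumptions, not Google's published criteria:

```python
# Sketch of WIT-style quality filtering: a candidate image-text pair is
# kept only if it passes every check. All thresholds and fields here
# are illustrative assumptions, not Google's actual filtering rules.
MIN_CAPTION_WORDS = 3    # assumed caption-length threshold
MIN_IMAGE_SIZE = 100     # assumed minimum pixels per side
ALLOWED_LICENSES = {"cc-by", "cc-by-sa", "public-domain"}

def keep_pair(pair):
    """Apply caption, image-size and licensing filters to one pair."""
    caption_ok = len(pair["caption"].split()) >= MIN_CAPTION_WORDS
    size_ok = pair["width"] >= MIN_IMAGE_SIZE and pair["height"] >= MIN_IMAGE_SIZE
    license_ok = pair["license"] in ALLOWED_LICENSES
    return caption_ok and size_ok and license_ok

pairs = [
    {"caption": "A red fox in snow", "width": 640, "height": 480, "license": "cc-by"},
    {"caption": "Fox", "width": 640, "height": 480, "license": "cc-by"},
    {"caption": "A red fox in snow", "width": 50, "height": 50, "license": "cc-by"},
]
print([keep_pair(p) for p in pairs])
```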
Ego4D by Facebook
Ego4D is a large egocentric video dataset and benchmark suite containing 3,025 hours of daily-life activity video. The footage spans hundreds of scenarios captured by 855 camera wearers at 74 different locations in nine countries. Ego4D greatly expands the volume of egocentric video publicly available to the research community. Portions of the video are accompanied by audio, eye gaze, stereo footage, synchronised videos from multiple wearers, and 3D meshes of the environment. Facebook (now Meta) also introduced new benchmark challenges centred on the first-person visual experience: querying episodic memory, analysing hand-object manipulation, social interaction, audio-visual conversation, and forecasting events.
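To give a flavour of the episodic memory challenge, the toy sketch below answers "when did I last see X?" over timestamped narrations. The narrations are invented; Ego4D's real annotations and queries are far richer (natural-language queries, temporal windows, visual responses):

```python
# Toy sketch of an episodic-memory query over egocentric video: given
# timestamped narrations, return the most recent moment an object was
# mentioned. The narrations below are invented examples.
def last_seen(narrations, obj):
    """Return the latest timestamp (seconds) whose narration mentions obj."""
    times = [t for t, text in narrations if obj in text.lower()]
    return max(times) if times else None

narrations = [
    (12.0, "Camera wearer picks up the keys"),
    (47.5, "Camera wearer places the keys on the table"),
    (90.0, "Camera wearer opens the fridge"),
]
print(last_seen(narrations, "keys"))
```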
Datasets by Hugging Face
Hugging Face released Datasets, a community library for NLP. It contains 650 unique datasets and has over 250 contributors. The library had been under development for about a year and has supported many novel cross-dataset research projects and tasks. The Datasets library is designed to address the challenges of dataset management and to support community culture and norms. To develop it, Hugging Face conducted a public hackathon that resulted in 485 commits. The library now includes continuous data types and multi-dimensional arrays for image and video data, along with an audio type.