A Compilation Of 16 Datasets Released By Google

Ambika Choudhury

Datasets play a critical role in a wide array of areas, including research, analysis and decision-making. With the advent of emerging technologies, organisations have been shifting away from their traditional approaches and relying heavily on large volumes of data to make decisions.

Google has long been at the forefront of scientific research. In this article, we have compiled a list of 16 open-sourced datasets, in roughly alphabetical order, released by the tech giant:

1| AudioSet

About: One of the popular audio datasets, AudioSet is a large-scale collection of manually annotated audio events. It includes an expanding ontology of 632 audio event classes and 2,084,320 human-labelled 10-second sound clips drawn from YouTube videos.



Click here to download.

2| AVA Dataset

About: AVA is a video dataset of spatio-temporally localised Atomic Visual Actions (AVA) that provides audiovisual annotations of video to improve the understanding of human activity. The dataset annotates 80 atomic visual actions in 430 15-minute movie clips, where actions are localised in space and time. In total, AVA contains 1.62 million action labels, and a single person is frequently assigned multiple labels.

Click here to download.

3| Cartoon Set

About: Cartoon Set is a collection of random 2D cartoon avatar images. The cartoons vary in 10 artwork categories, four colour categories and four proportion categories, giving approximately 10^13 possible combinations. The cartoons in this dataset helped develop the technology behind the personalised stickers in Google Allo.

Click here to download.

4| Coached Conversational Preference Elicitation

About: This dataset consists of 502 English dialogues with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. The dataset has been gathered using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an ‘assistant’, while the other plays the role of a ‘user’. 



Click here to download.

5| DiscoFuse

About: DiscoFuse is a large-scale dataset for Discourse-Based Sentence Fusion (DiscoFuse) that includes approximately 60 million sentence fusion examples. Sentence fusion is the task of joining several independent sentences into a single coherent text. 

Click here to download.

6| Google’s Conceptual Captions

About: Google’s Conceptual Captions dataset consists of approximately 3.3 million images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles.

Click here to download.

7| Grasping Dataset

About: The Grasping Dataset contains more than 800,000 grasp attempts collected over two months, using between 6 and 14 robotic manipulators at any given time, with differences in camera placement and hardware.

Click here to download.

8| HDR+ Burst Photography Dataset 

About: This dataset consists of 3,640 bursts that are made up of 28,461 images in total and organised into subfolders, including the results of the image processing pipeline. Each burst consists of the raw burst input in DNG format.

Click here to download.

9| Noun Verb

About: This dataset contains 30,000 naturally occurring English sentences that feature non-trivial noun-verb ambiguity. The sentences are in CoNLL format, and each sentence has a single token that has been manually annotated as either VERB or NON-VERB.
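
To make the annotation scheme concrete, here is a minimal Python sketch of reading CoNLL-style text and pulling out the one annotated token per sentence. The two-column, tab-separated layout and the sample sentences are assumptions made for illustration; the actual column layout should be checked against the released files.

```python
from typing import Iterator, List, Tuple

# Hypothetical sample: one token per line, tab-separated, with a blank line
# between sentences. The second column holds VERB / NON-VERB for the single
# annotated token and "_" elsewhere. This layout is an assumption.
SAMPLE = (
    "They\t_\n"
    "mark\tVERB\n"
    "the\t_\n"
    "papers\t_\n"
    "\n"
    "The\t_\n"
    "mark\tNON-VERB\n"
    "faded\t_\n"
)

LABELS = {"VERB", "NON-VERB"}


def read_sentences(text: str) -> Iterator[List[Tuple[str, str]]]:
    """Yield each sentence as a list of (token, tag) pairs."""
    sentence: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        token, tag = line.split("\t")[:2]
        sentence.append((token, tag))
    if sentence:
        yield sentence


def annotated_token(sentence: List[Tuple[str, str]]) -> Tuple[int, str, str]:
    """Return (index, token, tag) for the single annotated token."""
    hits = [(i, tok, tag) for i, (tok, tag) in enumerate(sentence) if tag in LABELS]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one annotated token, found {len(hits)}")
    return hits[0]


examples = [annotated_token(s) for s in read_sentences(SAMPLE)]
for index, token, tag in examples:
    print(index, token, tag)
```

The check for exactly one hit mirrors the description above, where each sentence carries a single manually annotated token.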

Click here to download.

10| Open Images Dataset V6

About: The Open Images Dataset V6 is a popular dataset released by Google. It includes approximately 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localised narratives. The dataset contains 16 million bounding boxes for 600 object classes on 1.9 million images, which makes it the largest existing dataset with object location annotations.

Click here to download.

11| RealEstate10K

About: RealEstate10K is a large dataset of camera poses corresponding to 10 million frames derived from about 80,000 video clips, gathered from about 10,000 YouTube videos. For each clip, the poses form a trajectory where each pose specifies the camera position and orientation along the trajectory. These poses are derived by running SLAM and bundle adjustment algorithms on a large set of videos.
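
As a rough illustration of what a pose trajectory encodes, the Python sketch below recovers camera positions from a list of poses and measures the path length. It assumes each pose is a 3x4 world-to-camera matrix [R | t], so the camera centre is -R^T t; that convention and the two sample poses are assumptions for illustration and should be verified against the dataset's own documentation.

```python
import math
from typing import List, Sequence, Tuple

Pose = Sequence[Sequence[float]]  # 3 rows x 4 columns: [R | t]


def camera_centre(pose: Pose) -> Tuple[float, float, float]:
    """Camera position in world coordinates, assuming a world-to-camera pose.

    For x_cam = R @ x_world + t, the centre satisfies R @ C + t = 0,
    which gives C = -R^T @ t.
    """
    t = [row[3] for row in pose]
    centre = [-sum(pose[i][j] * t[i] for i in range(3)) for j in range(3)]
    return (centre[0], centre[1], centre[2])


def path_length(poses: List[Pose]) -> float:
    """Sum of straight-line distances between consecutive camera centres."""
    centres = [camera_centre(p) for p in poses]
    return sum(math.dist(a, b) for a, b in zip(centres, centres[1:]))


# Two made-up poses: no rotation, camera moves from the origin to (3, 4, 0).
SAMPLE_POSES: List[Pose] = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0, -3.0], [0.0, 1.0, 0.0, -4.0], [0.0, 0.0, 1.0, 0.0]],
]

print(camera_centre(SAMPLE_POSES[1]))
print(path_length(SAMPLE_POSES))
```

If the released poses turn out to be camera-to-world instead, the position is simply the fourth column and `camera_centre` would need to change accordingly.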

Click here to download.

12| Taskmaster-1

About: The Taskmaster-1 dataset consists of 13,215 task-based dialogues in English, including 5,507 spoken and 7,708 written dialogues created with two distinct methods. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up a ride service, ordering movie tickets, ordering coffee drinks, and making restaurant reservations.

Click here to download.

13| The Quick, Draw! Dataset

About: The Quick, Draw! Dataset includes 50 million drawings across 345 categories, contributed by players of the game Quick, Draw! The drawings were captured as timestamped vectors and tagged with metadata, including what the player was asked to draw and the location of the player.
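
The idea of a timestamped vector drawing with metadata can be sketched in Python as follows. The record below is made up, and the field names (`word`, `countrycode`, `recognized`, `drawing`) and the per-stroke layout of x, y and time lists are assumptions based on the commonly described raw format; confirm them against the dataset's documentation before relying on them.

```python
import json
from typing import Dict, Union

# Made-up record for illustration. Each stroke is assumed to be three
# parallel lists: x coordinates, y coordinates, and times in milliseconds.
SAMPLE_LINE = json.dumps({
    "word": "cat",
    "countrycode": "IN",
    "recognized": True,
    "drawing": [
        [[10, 20, 30], [5, 15, 5], [0, 40, 90]],
        [[30, 35], [5, 25], [300, 420]],
    ],
})


def summarise(line: str) -> Dict[str, Union[str, int]]:
    """Summarise one newline-delimited JSON drawing record."""
    record = json.loads(line)
    strokes = record["drawing"]
    # Flatten the time lists of all strokes to find the drawing's duration.
    times = [t for _, _, ts in strokes for t in ts]
    return {
        "word": record["word"],
        "country": record["countrycode"],
        "strokes": len(strokes),
        "points": sum(len(xs) for xs, _, _ in strokes),
        "duration_ms": max(times) - min(times),
    }


summary = summarise(SAMPLE_LINE)
print(summary)
```

Keeping the time axis alongside the coordinates is what distinguishes this kind of data from a plain bitmap: stroke order and drawing speed remain available to a model.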

Click here to download.

14| The MAESTRO Dataset

About: The MIDI and Audio Edited for Synchronous Tracks and Organisation – or MAESTRO – dataset is a collection of over 200 hours of virtuosic piano performances, captured with fine alignment (~3 ms) between note labels and audio waveforms.

Click here to download.

15| Taskmaster-2

About: The Taskmaster-2 dataset consists of 17,289 dialogues in seven domains: restaurants (3,276), food ordering (1,050), movies (3,047), hotels (2,355), flights (2,481), music (1,602), and sports (3,478). All dialogues in this dataset were collected using the same Wizard of Oz (WOz) system used in Taskmaster-1, where crowdsourced workers playing the “user” interacted with human operators playing the “digital assistant” using a web-based interface.
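
As a quick sanity check, the per-domain counts quoted above can be summed in a few lines of Python to confirm they add up to the stated total of 17,289 dialogues. The variable names are illustrative only.

```python
# Per-domain dialogue counts as quoted in the Taskmaster-2 description.
DOMAIN_COUNTS = {
    "restaurants": 3276,
    "food ordering": 1050,
    "movies": 3047,
    "hotels": 2355,
    "flights": 2481,
    "music": 1602,
    "sports": 3478,
}

total = sum(DOMAIN_COUNTS.values())
largest = max(DOMAIN_COUNTS, key=DOMAIN_COUNTS.get)

# Share of each domain as a percentage of all dialogues.
shares = {name: round(100 * count / total, 1) for name, count in DOMAIN_COUNTS.items()}

print(total)
print(largest)
print(shares)
```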

Click here to download.

16| YouTube-8M Segments Dataset

About: The YouTube-8M Segments dataset is an extension of the YouTube-8M dataset with human-verified segment annotations. It contains human-verified labels on about 237,000 segments across 1,000 classes from the validation set of the YouTube-8M dataset. Each video again comes with time-localised frame-level features, so classifier predictions can be made at segment-level granularity.

Click here to download.

Copyright Analytics India Magazine Pvt Ltd
