Top 10 Ready To Use Datasets on TensorFlow

In February last year, the TensorFlow team introduced TensorFlow Datasets (TFDS). It lets the machine learning community access public research datasets as tf.data.Datasets and as NumPy arrays. TFDS does all the tedious work of fetching the source data and preparing it into a common format on disk. It uses the tf.data API to build high-performance input pipelines, which are TensorFlow 2.0-ready and can be used with tf.keras models.

TensorFlow Datasets provides many public datasets as tf.data.Datasets.


Installation:




pip install tensorflow-datasets

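# Snippet:

import tensorflow as tf
import tensorflow_datasets as tfds

# Download MNIST if needed and load all splits as a dict of tf.data.Dataset objects
mnist_data = tfds.load("mnist")
mnist_train, mnist_test = mnist_data["train"], mnist_data["test"]
assert isinstance(mnist_train, tf.data.Dataset)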
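To make the "input pipeline" idea concrete, here is a minimal sketch of how a loaded dataset might be normalized, shuffled, batched and prefetched with the tf.data API. The helper names (normalize, make_pipeline, load_mnist_pipeline), the batch size and the shuffle buffer are illustrative assumptions, not part of TFDS. Stand-in tensors with MNIST's shapes are used so the sketch runs without a download.

```python
import tensorflow as tf


def normalize(image, label):
    """Scale uint8 pixels to float32 values in [0, 1]."""
    return tf.cast(image, tf.float32) / 255.0, label


def make_pipeline(dataset, batch_size=32, shuffle_buffer=1024):
    """Turn a dataset of (image, label) pairs into shuffled, batched input."""
    return (
        dataset.map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(shuffle_buffer)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )


def load_mnist_pipeline(batch_size=32):
    """Hypothetical helper: fetch MNIST through TFDS (needs network access)."""
    import tensorflow_datasets as tfds  # imported lazily, only needed here

    # as_supervised=True yields (image, label) tuples instead of dicts.
    train = tfds.load("mnist", split="train", as_supervised=True)
    return make_pipeline(train, batch_size)


# Stand-in data with MNIST's shapes, so the sketch runs offline.
fake_images = tf.zeros([64, 28, 28, 1], dtype=tf.uint8)
fake_labels = tf.zeros([64], dtype=tf.int64)
fake_ds = tf.data.Dataset.from_tensor_slices((fake_images, fake_labels))

first_images, first_labels = next(iter(make_pipeline(fake_ds, batch_size=16)))
print(first_images.shape, first_images.dtype)
```

A pipeline built this way yields (image, label) batches, which is the form that tf.keras model.fit accepts for a tf.data.Dataset.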

In the next section, we take a look at a few important datasets (h/t Lionbridge) that TensorFlow allows you to access with a single line of code:

Lsun

tfds.image.Lsun

LSUN contains around one million labeled images for each of 10 scene categories and 20 object categories. Its authors experimented with training popular convolutional networks and report substantial performance gains when the networks are trained on this dataset.

Bigearthnet

tfds.image_classification.Bigearthnet

The BigEarthNet archive was constructed by the Remote Sensing Image Analysis (RSiM) Group and the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin). The BigEarthNet database consists of 590,326 Sentinel-2 image patches. Each patch covers 1.2 x 1.2 km on the ground, and its size in pixels varies with the resolution of the channel.
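The link between patch footprint and image size is simple arithmetic. The sketch below assumes the three standard Sentinel-2 band resolutions of 10 m, 20 m and 60 m per pixel, a detail the description above does not spell out.

```python
PATCH_SIDE_M = 1200  # 1.2 km patch side, from the dataset description

# Assumption: Sentinel-2 bands are delivered at 10 m, 20 m or 60 m per pixel.
BAND_RESOLUTIONS_M = (10, 20, 60)


def patch_side_pixels(resolution_m, side_m=PATCH_SIDE_M):
    """Number of pixels along one side of a patch at a given resolution."""
    return side_m // resolution_m


sizes = {res: patch_side_pixels(res) for res in BAND_RESOLUTIONS_M}
for res, px in sizes.items():
    print(f"{res:>2} m/pixel -> {px} x {px} pixels")
```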

vgg_face2

tfds.image_classification.VggFace2

VGGFace2 is a face recognition dataset with large variations in pose, age, illumination, ethnicity and profession. Its identities span a wide range of ethnicities, accents, professions and ages. All faces are captured “in the wild”, with pose and emotion variations and various lighting and occlusion conditions.

aflw2k3d

tfds.image.Aflw2k3d

The images in the AFLW2000-3D dataset can be used to evaluate 3D facial landmark detection models. The head poses in this dataset are very diverse, and the creators claim that the faces are hard for a CNN-based face detector to detect.

Bair_robot_pushing_small

tfds.video.BairRobotPushingSmall

Created by Berkeley’s Robotics department, this dataset contains roughly 44,000 examples of robot pushing motions. It includes one training set (train) and two test sets, one with previously seen objects (testseen) and one with unseen objects (testnovel). This is the small 64×64 version.

Voxceleb

tfds.audio.Voxceleb

VoxCeleb is a large-scale dataset that is popular for speaker identification tasks. The data is collected from over 1,251 speakers, with over 150k samples in total.

Librispeech

tfds.audio.Librispeech

LibriSpeech consists of approximately 1,000 hours of read English speech sampled at 16 kHz. The data is derived from read audiobooks from the LibriVox project.
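To get a feel for that scale, the figures above can be turned into a rough sample count. The size estimate assumes uncompressed 16-bit mono audio, which is an assumption made only for this back-of-the-envelope sketch and says nothing about how the corpus is actually stored.

```python
HOURS = 1000             # approximate corpus length, from the description
SAMPLE_RATE_HZ = 16_000  # 16 kHz sampling rate, from the description
BYTES_PER_SAMPLE = 2     # assumption: 16-bit PCM, single channel


def samples_for(seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of audio samples in a clip of the given duration."""
    return int(seconds * sample_rate_hz)


total_samples = samples_for(HOURS * 3600)
raw_gigabytes = total_samples * BYTES_PER_SAMPLE / 1e9

print(f"10 s utterance: {samples_for(10):,} samples")
print(f"Whole corpus:   {total_samples:,} samples")
print(f"Raw size:       about {raw_gigabytes:.1f} GB uncompressed")
```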

Crema_d

tfds.audio.CremaD

CREMA-D is curated for training emotion recognition models. This audio-visual dataset consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). It totals 7,442 clips from 91 actors with diverse ethnic backgrounds.

C4

tfds.text.C4

C4 is a cleaned version of Common Crawl’s web crawl corpus. It was created to pre-train Google’s T5 text-to-text models. Because it is drawn from a crawl of the public web, it covers a very broad range of text. The Common Crawl Foundation was created to democratize access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analysable.

civil_comments

tfds.text.CivilComments

CivilComments is a dataset for toxicity classification, a task related to sentiment analysis. This version provides access to the primary seven labels, which were annotated by crowd workers. The toxicity and other labels are values between 0 and 1. This dataset is a replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge.
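Because the labels are continuous scores rather than classes, a classifier usually needs them converted to 0/1 targets first. The sketch below uses a cut-off of 0.5, which is an assumption chosen for illustration. The helper name binarize and the example record are made up and do not come from the dataset.

```python
def binarize(scores, threshold=0.5):
    """Map each label score in [0, 1] to 1 if it reaches the threshold, else 0."""
    return {name: int(score >= threshold) for name, score in scores.items()}


# Made-up scores for a single comment, for illustration only.
example_scores = {"toxicity": 0.8, "insult": 0.3}

targets = binarize(example_scores)
print(targets)
```

A stricter or looser cut-off changes how many comments count as positive, so the threshold is worth treating as a tunable choice.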

These are a few of the datasets for popular ML tasks in vision, speech and text. There are many more.

Check them here.
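The class paths listed above relate to the string names that tfds.load accepts: as far as we can tell, TFDS registers each dataset under the snake_case form of its class name. The helper below, registered_name, is a hypothetical illustration of that convention and not a TFDS function.

```python
import re

# Class paths exactly as listed in this article.
CLASS_PATHS = [
    "tfds.image.Lsun",
    "tfds.image_classification.Bigearthnet",
    "tfds.image_classification.VggFace2",
    "tfds.image.Aflw2k3d",
    "tfds.video.BairRobotPushingSmall",
    "tfds.audio.Voxceleb",
    "tfds.audio.Librispeech",
    "tfds.audio.CremaD",
    "tfds.text.C4",
    "tfds.text.CivilComments",
]


def registered_name(class_path):
    """Convert a CamelCase class name to the snake_case name used by tfds.load."""
    class_name = class_path.rsplit(".", 1)[-1]
    # Insert "_" before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


names = [registered_name(path) for path in CLASS_PATHS]
for path, name in zip(CLASS_PATHS, names):
    print(f'{path:40s} -> tfds.load("{name}")')
```

Some of the larger datasets may need a manual download or extra setup before tfds.load succeeds, so it is worth checking each dataset's catalog page first.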

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
