Top 10 Ready To Use Datasets on TensorFlow

In February last year, the TensorFlow team introduced TensorFlow Datasets (TFDS). It gives the machine learning community access to public research datasets as tf.data.Datasets and as NumPy arrays. TFDS does all the tedious work of fetching the source data and preparing it into a common format on disk. It uses the tf.data API to build high-performance input pipelines, which are TensorFlow 2.0-ready and can be used with tf.keras models.

TensorFlow Datasets provides many public datasets as tf.data.Datasets. It can be installed with pip:




pip install tensorflow-datasets


# Snippet:

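import tensorflow as tf
import tensorflow_datasets as tfds

# Download (on first use) and load every split of MNIST.
mnist_data = tfds.load("mnist")

mnist_train, mnist_test = mnist_data["train"], mnist_data["test"]

# Each split is a tf.data.Dataset.
assert isinstance(mnist_train, tf.data.Dataset)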
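
To make the "input pipeline" and "NumPy arrays" claims more concrete, here is a minimal sketch of how a loaded dataset might be prepared for a tf.keras model or read back as NumPy. It is an illustration under stated assumptions, not the TFDS team's reference pipeline: the helper names `split_slice`, `build_mnist_pipeline` and `first_example_as_numpy`, the shuffle buffer and the batch size are all made up for this example, while `tfds.load`, `as_supervised`, `tfds.as_numpy` and the tf.data methods are existing calls. TensorFlow is imported inside the functions so that nothing is downloaded until one of them is called.

```python
def split_slice(split: str, percent: int) -> str:
    """Build a TFDS split string such as 'train[:10%]' (hypothetical helper)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return f"{split}[:{percent}%]"


def build_mnist_pipeline(split: str = "train", batch_size: int = 32):
    """Return a batched tf.data.Dataset of (image, label) pairs from MNIST."""
    # Imported lazily so that defining this function has no side effects.
    import tensorflow as tf
    import tensorflow_datasets as tfds

    # as_supervised=True yields (image, label) tuples instead of feature dicts.
    ds = tfds.load("mnist", split=split, as_supervised=True)

    def to_float(image, label):
        # Scale uint8 pixels from [0, 255] to [0.0, 1.0].
        return tf.cast(image, tf.float32) / 255.0, label

    return (
        ds.map(to_float, num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(1024)   # illustrative buffer size
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )


def first_example_as_numpy(name: str = "mnist"):
    """Return the first (image, label) pair of a dataset as NumPy values."""
    import tensorflow_datasets as tfds

    ds = tfds.load(name, split="train", as_supervised=True)
    # tfds.as_numpy converts the tensors yielded by the dataset to NumPy.
    for image, label in tfds.as_numpy(ds.take(1)):
        return image, label
```

Calling `build_mnist_pipeline(split_slice("train", 10))` would download MNIST on first use and return batches drawn from the first 10% of the training split; the result can be passed to a tf.keras model's `fit` method.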

In the next section, we take a look at a few important datasets (h/t Lionbridge) that TensorFlow Datasets lets you access with a single line of code:



LSUN contains around one million labeled images for each of 10 scene categories and 20 object categories. Its authors report that popular convolutional networks achieve substantial performance gains when trained on this dataset.



The BigEarthNet archive was constructed by the Remote Sensing Image Analysis (RSiM) Group and the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin). It consists of 590,326 Sentinel-2 image patches. Each patch covers 1.2 x 1.2 km on the ground, so its size in pixels varies with the resolution of the channel.
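
The "variable image size" follows from simple arithmetic: a patch of fixed ground extent has fewer pixels in coarser bands. The sketch below works this out, assuming the usual Sentinel-2 ground resolutions of 10 m, 20 m and 60 m per pixel (the article itself does not list them); the function name is made up for illustration.

```python
PATCH_SIDE_M = 1200  # 1.2 km patch side, as stated in the text


def patch_side_pixels(resolution_m: int) -> int:
    """Pixels along one side of a patch for a band of the given resolution."""
    if resolution_m <= 0 or PATCH_SIDE_M % resolution_m != 0:
        raise ValueError("resolution must evenly divide the patch side")
    return PATCH_SIDE_M // resolution_m


# Assumed Sentinel-2 band resolutions, in metres per pixel.
sizes = {res: patch_side_pixels(res) for res in (10, 20, 60)}
print(sizes)  # {10: 120, 20: 60, 60: 20}
```

Under these assumptions, a single patch is 120×120 pixels in the 10 m bands, 60×60 in the 20 m bands and 20×20 in the 60 m bands.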



VGGFace2 is a face recognition dataset with large variations in pose, age, illumination, ethnicity and profession. Its identities span a wide range of ethnicities, accents, professions and ages. All faces are captured “in the wild”, with variations in pose and emotion and under varied lighting and occlusion conditions.



The images in the AFLW2000-3D dataset can be used to evaluate 3D facial landmark detection models. The head poses in this dataset are very diverse, and the creators say the faces are often hard for a CNN-based face detector to detect.


Created by Berkeley’s robotics department, this data set contains roughly 44,000 examples of robot pushing motions, split into one training set (train) and two test sets of previously seen (testseen) and unseen (testnovel) objects. This is the small 64×64 version.


VoxCeleb is a large-scale dataset that is popular for speaker identification tasks. The data is collected from over 1,251 speakers, with over 150k samples in total.


LibriSpeech consists of approximately 1,000 hours of read English speech sampled at 16 kHz. The data is derived from read audiobooks from the LibriVox project.
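
To get a feel for what those two numbers mean together, the sketch below turns hours and sampling rate into a sample count and a rough uncompressed size. The 2 bytes per sample (16-bit mono PCM) is an assumption made for this estimate, not a figure from the article, and the on-disk corpus will be smaller if it is stored compressed.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, from the text
CORPUS_HOURS = 1_000      # approximate corpus length, from the text
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM


def samples_for(seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples in a clip of the given duration."""
    return int(round(seconds * sample_rate_hz))


total_samples = samples_for(CORPUS_HOURS * 3600)
uncompressed_gb = total_samples * BYTES_PER_SAMPLE / 1e9

print(samples_for(10))   # 160000 samples in a 10-second clip
print(total_samples)     # 57600000000
print(uncompressed_gb)   # 115.2
```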


CREMA-D is curated for training emotion recognition models. This audio-visual data set consists of facial and vocal emotional expressions spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). It contains a total of 7,442 clips from 91 actors with diverse ethnic backgrounds.



C4 is a cleaned version of Common Crawl’s web crawl corpus. It was introduced to train large language models such as Google’s T5. Working with C4 is a bit like having a cleaned scrape of a large part of the public web. The Common Crawl Foundation was created to democratize access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analysable.



One of the best-known datasets for toxicity and sentiment analysis, the CivilComments dataset provides access to seven primary labels, which were annotated by crowd workers. The toxicity score and the other labels each take a value between 0 and 1. This data set replicates the data used for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge.
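
Because the labels are scores between 0 and 1 rather than classes, a classifier usually needs them turned into 0/1 targets first. The sketch below shows one way to do that. The label names are the seven feature names as I understand the TFDS version to expose them, the 0.5 cut-off is a common choice assumed here, and the example scores are invented purely for illustration.

```python
# Assumed names of the seven primary labels in the TFDS version.
LABELS = (
    "toxicity",
    "severe_toxicity",
    "obscene",
    "threat",
    "insult",
    "identity_attack",
    "sexual_explicit",
)


def binarize(scores: dict, threshold: float = 0.5) -> dict:
    """Map each score in [0, 1] to 1 if it reaches the threshold, else 0."""
    result = {}
    for name in LABELS:
        score = scores[name]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
        result[name] = int(score >= threshold)
    return result


# Invented scores for a single comment, not taken from the dataset.
example = {
    "toxicity": 0.8,
    "severe_toxicity": 0.1,
    "obscene": 0.0,
    "threat": 0.0,
    "insult": 0.6,
    "identity_attack": 0.0,
    "sexual_explicit": 0.0,
}

targets = binarize(example)
print(targets["toxicity"], targets["insult"], targets["threat"])  # 1 1 0
```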

These are a few of the datasets for popular ML tasks in vision, speech and text. There are many more.

Check them out in the TensorFlow Datasets catalog.


Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
