Top 10 Ready To Use Datasets on TensorFlow

In February last year, the TensorFlow team introduced TensorFlow Datasets (TFDS). It gives the machine learning community access to public research datasets as tf.data.Datasets and as NumPy arrays. TFDS does all the tedious work of fetching the source data and preparing it into a common format on disk. It uses the tf.data API to build high-performance input pipelines, which are TensorFlow 2.0-ready and can be used with tf.keras models.

TensorFlow Datasets provides many public datasets as tf.data.Datasets.

Installation:

pip install tensorflow-datasets

# Snippet:

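import tensorflow as tf
import tensorflow_datasets as tfds

# Downloads MNIST on first use and returns a dict of splits
mnist_data = tfds.load("mnist")

mnist_train, mnist_test = mnist_data["train"], mnist_data["test"]

# Each split is a regular tf.data.Dataset
assert isinstance(mnist_train, tf.data.Dataset)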
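The loaded splits plug straight into a tf.data input pipeline. The sketch below shows one way to prepare MNIST for a tf.keras model. The function name load_mnist_pipeline, the batch size and the shuffle buffer are illustrative choices, not anything TFDS prescribes, and tf.data.AUTOTUNE assumes a reasonably recent TensorFlow 2.x release (older releases expose it as tf.data.experimental.AUTOTUNE).

```python
def load_mnist_pipeline(batch_size=32, shuffle_buffer=10_000):
    """Return a shuffled, batched tf.data.Dataset of (image, label) pairs."""
    # Imported inside the function so that defining it does not require
    # TensorFlow to be installed; calling it does.
    import tensorflow as tf
    import tensorflow_datasets as tfds

    # as_supervised=True yields (image, label) tuples instead of dicts.
    ds = tfds.load("mnist", split="train", as_supervised=True)

    def scale(image, label):
        # Convert uint8 pixels in [0, 255] to float32 in [0, 1].
        return tf.cast(image, tf.float32) / 255.0, label

    return (
        ds.map(scale, num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(shuffle_buffer)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )
```

Calling load_mnist_pipeline() downloads MNIST on first use, and the result can be passed directly to a Keras model's fit method. Wrapping a dataset in tfds.as_numpy yields NumPy arrays instead of tensors.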

In the next section, we take a look at a few important datasets (h/t Lionbridge) that TensorFlow lets you access with a single line of code:

Lsun

tfds.image.Lsun

LSUN contains around one million labeled images for each of 10 scene categories and 20 object categories. Its authors report that popular convolutional networks achieve substantial performance gains when trained on this dataset.

Bigearthnet

tfds.image_classification.Bigearthnet

The BigEarthNet archive was constructed by the Remote Sensing Image Analysis (RSiM) Group and the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin). It consists of 590,326 Sentinel-2 image patches. Each patch covers 1.2 x 1.2 km on the ground, so its size in pixels varies with the channel resolution.
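The variable image size follows from simple arithmetic: the same 1.2 km ground footprint yields fewer pixels at coarser resolutions. The sketch below assumes Sentinel-2's 10 m, 20 m and 60 m band resolutions, which the description above does not list, and the names patch_pixels and SENTINEL2_RESOLUTIONS_M are illustrative.

```python
PATCH_SIDE_M = 1_200  # 1.2 km patch side, from the dataset description

# Assumption: Sentinel-2 bands come at 10 m, 20 m and 60 m per pixel.
SENTINEL2_RESOLUTIONS_M = (10, 20, 60)


def patch_pixels(resolution_m: int, patch_side_m: int = PATCH_SIDE_M) -> int:
    """Pixels per side of a square patch at the given ground resolution."""
    if patch_side_m % resolution_m != 0:
        raise ValueError("patch side is not a whole number of pixels")
    return patch_side_m // resolution_m


sizes = {res: patch_pixels(res) for res in SENTINEL2_RESOLUTIONS_M}
print(sizes)  # {10: 120, 20: 60, 60: 20}
```

Under these assumptions a single patch holds 120×120, 60×60 and 20×20 pixel images, so the channels need resampling before they can be stacked into one tensor.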

vgg_face2

tfds.image_classification.VggFace2

VGGFace2 is a face recognition dataset with large variations in pose, age, illumination, ethnicity and profession. Its identities span a wide range of ethnicities, accents, professions and ages. All faces are captured “in the wild”, with pose and emotion variations and various lighting and occlusion conditions.

aflw2k3d

tfds.image.Aflw2k3d

The images in the AFLW2000-3D dataset can be used to evaluate 3D facial landmark detection models. The head poses in this dataset are very diverse, and the creators note that the faces are often hard for a CNN-based face detector to detect.

Bair_robot_pushing_small

tfds.video.BairRobotPushingSmall

Berkeley’s Robotics department created this dataset, which contains roughly 44,000 examples of robot pushing motions. It includes one training set (train) and two test sets, one of previously seen objects (testseen) and one of unseen objects (testnovel). This is the small 64×64 version.

Voxceleb

tfds.audio.Voxceleb

VoxCeleb is a large-scale dataset that is popular for speaker identification tasks. The data is collected from 1,251 speakers, with over 150k samples in total.

Librispeech

tfds.audio.Librispeech

LibriSpeech consists of approximately 1,000 hours of read English speech sampled at 16 kHz. The data is derived from audiobooks from the LibriVox project.
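To get a feel for that scale, the duration and sampling rate can be turned into a raw sample count. The sketch below is a back-of-the-envelope estimate only: the helper audio_footprint is hypothetical, and the 2 bytes per sample (16-bit mono PCM) is an assumption, not a statement about how the corpus is actually stored.

```python
def audio_footprint(hours, sample_rate_hz, bytes_per_sample=2):
    """Return (total_samples, total_bytes) for mono audio of the given length."""
    total_samples = int(hours * 3600 * sample_rate_hz)
    # Assumption: uncompressed mono PCM with a fixed sample width.
    return total_samples, total_samples * bytes_per_sample


samples, size_bytes = audio_footprint(hours=1000, sample_rate_hz=16_000)
print(f"{samples:,} samples, about {size_bytes / 1e9:.1f} GB uncompressed")
# 57,600,000,000 samples, about 115.2 GB uncompressed
```

The download is smaller than this estimate if the audio is distributed in a compressed format, but the decoded waveforms a model consumes are of this order.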

Crema_d

tfds.audio.CremaD

CREMA-D is curated for training emotion recognition models. This audio-visual dataset consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). It contains a total of 7,442 clips from 91 actors with diverse ethnic backgrounds.

C4

tfds.text.C4

C4 is a cleaned version of Common Crawl’s web crawl corpus. It was used to train large models such as T5. Having C4 is like having a cleaned scrape of a large part of the public web. The Common Crawl Foundation was created to democratize access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analysable.

civil_comments

tfds.text.CivilComments

One of the best datasets for sentiment analysis, the CivilComments dataset provides access to the primary seven labels, which were annotated by crowd workers. The toxicity and other labels are values between 0 and 1, indicating the fraction of annotators who assigned that attribute to the comment. This dataset is a copy of the data used for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge.
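Because the labels are scores between 0 and 1 rather than hard classes, a classifier usually needs them thresholded first. The sketch below is a minimal illustration: the seven label names are assumed from the Kaggle challenge columns and should be checked against the dataset's feature spec, the 0.5 cut-off is a common but arbitrary choice, and the example scores are made up.

```python
# Assumption: names of the seven primary labels; verify against the dataset.
LABELS = (
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
)


def binarize(scores, threshold=0.5):
    """Map each label score in [0, 1] to True/False at the given threshold."""
    return {name: scores[name] >= threshold for name in LABELS}


# Hypothetical example: made-up scores, not taken from the dataset.
example = {
    "toxicity": 0.8, "severe_toxicity": 0.1, "obscene": 0.6, "threat": 0.0,
    "insult": 0.7, "identity_attack": 0.0, "sexual_explicit": 0.2,
}
flags = binarize(example)
print([name for name, on in flags.items() if on])
# ['toxicity', 'obscene', 'insult']
```

Keeping the raw scores around is often worthwhile too, since they can serve as soft targets or let the threshold be tuned later.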

These are a few of the datasets for popular ML tasks in vision, speech and text. Many more are listed in the TensorFlow Datasets catalog.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.