In February last year, the TensorFlow team introduced TensorFlow Datasets (TFDS), which gives the machine learning community access to public research datasets as tf.data.Datasets and as NumPy arrays. TFDS does all the tedious work of fetching the source data and preparing it into a common format on disk. It uses the tf.data API to build high-performance input pipelines, which are TensorFlow 2.0-ready and can be used with tf.keras models.
TensorFlow Datasets provides many public datasets as tf.data.Datasets:
pip install tensorflow-datasets
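import tensorflow as tf
import tensorflow_datasets as tfds
# Download MNIST on first use and load all of its splits as a dict of datasets
mnist_data = tfds.load("mnist")
mnist_train, mnist_test = mnist_data["train"], mnist_data["test"]
# Each split is a regular tf.data.Dataset
assert isinstance(mnist_train, tf.data.Dataset)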
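Because each split is a plain tf.data.Dataset, the usual pipeline steps apply to it. The following is a minimal sketch, not the TFDS team's own recipe. To run offline, it assumes a synthetic stand-in with the same feature layout as MNIST ("image" and "label" keys); the helper names `to_model_input` and `make_pipeline`, the buffer size and the batch size are all illustrative choices.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for MNIST (assumption: 28x28x1 uint8 images, int64 labels),
# used so the sketch runs without downloading anything.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(256, 28, 28, 1)).astype(np.uint8)
labels = rng.integers(0, 10, size=(256,)).astype(np.int64)
train = tf.data.Dataset.from_tensor_slices({"image": images, "label": labels})


def to_model_input(example):
    """Scale pixels to [0, 1] and return an (image, label) pair."""
    image = tf.cast(example["image"], tf.float32) / 255.0
    return image, example["label"]


def make_pipeline(dataset, batch_size=32):
    """Map, shuffle, batch and prefetch a dataset of feature dictionaries."""
    return (
        dataset.map(to_model_input, num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(buffer_size=256)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )


train_batches = make_pipeline(train)
```

With real data, `train` would be the "train" split returned by tfds.load("mnist"), and the resulting batches of (image, label) pairs could be passed to a tf.keras model's fit method.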
In the next section, we take a look at a few important datasets (h/t Lionbridge) that TensorFlow Datasets lets you access with a single line of code:
LSUN contains around one million labeled images for each of 10 scene categories and 20 object categories. Its authors experimented with training popular convolutional networks and found that they achieve substantial performance gains when trained on this dataset.
The BigEarthNet archive was constructed by the Remote Sensing Image Analysis (RSiM) Group and the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin). It consists of 590,326 Sentinel-2 image patches. Each patch covers 1.2 x 1.2 km on the ground, and its size in pixels varies with the resolution of each channel.
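To see why the image size varies, divide the patch width on the ground by the ground resolution of a band. The sketch below assumes the standard Sentinel-2 band resolutions of 10 m, 20 m and 60 m, which the article does not list; the function name `patch_side_px` is illustrative.

```python
PATCH_SIDE_M = 1200  # each patch covers 1.2 km x 1.2 km on the ground

# Assumption: Sentinel-2 bands come at 10 m, 20 m and 60 m ground resolution.
SENTINEL2_RESOLUTIONS_M = (10, 20, 60)


def patch_side_px(resolution_m: int) -> int:
    """Return the number of pixels along one side of a patch for a band."""
    if resolution_m <= 0 or PATCH_SIDE_M % resolution_m != 0:
        raise ValueError(f"unsupported resolution: {resolution_m} m")
    return PATCH_SIDE_M // resolution_m


sizes = {res: patch_side_px(res) for res in SENTINEL2_RESOLUTIONS_M}
for res, side in sizes.items():
    print(f"{res} m bands -> {side} x {side} pixels")
```

Under these assumptions, the same 1.2 km patch is stored as 120 x 120, 60 x 60 or 20 x 20 pixels depending on the band.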
VGGFace2 is a face recognition dataset whose identities span a wide range of ethnicities, professions and ages, with large variations in pose and illumination. All faces are captured “in the wild”, with pose and emotion variations and various lighting and occlusion conditions.
The images in the AFLW2000-3D dataset can be used to evaluate 3D facial landmark detection models. The head poses in this dataset are very diverse, and the creators note that the faces are often hard for a CNN-based face detector to detect.
Berkeley’s Robotics department created this data set, which contains roughly 44,000 examples of robot pushing motions. It includes one training set (train) and two test sets, one of previously seen (testseen) and one of unseen (testnovel) objects. This is the small 64×64 version.
VoxCeleb is a large-scale dataset that is popular for speaker identification tasks. The data is collected from over 1,251 speakers, with over 150k samples in total.
LibriSpeech consists of approximately 1,000 hours of read English speech with a sampling rate of 16 kHz. The data is derived from read audiobooks from the LibriVox project.
CREMA-D is curated for training emotion recognition models. This audio-visual data set consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). It contains a total of 7,442 clips from 91 actors with diverse ethnic backgrounds.
C4 is a cleaned version of Common Crawl’s web crawl corpus. It was created to train large language models such as Google’s T5. Having C4 is like having a cleaned scrape of a large part of the public web. The Common Crawl Foundation was created to democratize access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analysable.
A popular dataset for toxicity classification, the CivilComments dataset provides access to the primary seven labels, which were annotated by crowd workers. The toxicity and other labels take values between 0 and 1, indicating the fraction of annotators who assigned that attribute to the comment. This data set is a copy of the data used for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge.
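Since the labels are fractions rather than classes, a classifier usually needs them converted to binary targets first. The sketch below shows one way to do that. The 0.5 cutoff, the function name `binarize_labels` and the example scores are assumptions made for illustration, not values taken from the dataset.

```python
from typing import Dict

# Assumption: a score of 0.5 or higher counts as the positive class.
DEFAULT_THRESHOLD = 0.5


def binarize_labels(
    scores: Dict[str, float], threshold: float = DEFAULT_THRESHOLD
) -> Dict[str, int]:
    """Convert fractional annotator scores in [0, 1] into 0/1 targets."""
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
    return {name: int(score >= threshold) for name, score in scores.items()}


# Hypothetical scores for a single comment (made up for this sketch).
example_scores = {"toxicity": 0.7, "insult": 0.6, "threat": 0.0}
targets = binarize_labels(example_scores)
print(targets)
```

Keeping the threshold as a parameter makes it easy to trade precision against recall for a given moderation use case.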
These are a few of the datasets for popular ML tasks in vision, speech and text. There are many more.
Check them out in the TensorFlow Datasets catalog.