Last updated January 27, 2021
Comprehensive Guide To 9 Most Important Image Datasets For Data Scientists

In this article, we will discuss the various image datasets that are readily available for training machine learning models.

Published on January 4, 2021
by Jayita Bhattacharyya

Visual data is among the most widely used forms of data around us. Almost every industry, from fashion and streaming platforms to medicine, law and finance, uses it for one use case or another, and social media is one of the biggest examples. AI has made rapid progress on image data. Machine learning and deep learning models train well only when the data is diverse, so these algorithms are data-hungry. This has created a need for better datasets that address the biases present in these models.

Computer vision is the field in which computers work with digital images represented as pixel values. In other words, computers are made to understand images and videos as humans do. It covers processing, analyzing and transforming an image, extracting features from it, and various other operations. Earlier image processing techniques have drawbacks, as they fail to capture high-level features accurately. Deep learning algorithms have overcome many of these problems and have proven much more reliable. They are now used in almost all kinds of tasks, such as object detection, object tracking, image classification, image segmentation and localization, 3D pose estimation, video matting and many more.

Building on image datasets, GANs (generative adversarial networks) have gained prominence. They can increase the size of a dataset by adding synthetic data, and they can make that synthetic data closely imitate real-world data, deepfakes being one example. In recent years GANs have gained much attention, and more research and development is revolving around them.

MNIST AND ITS TYPES

MNIST is the handwritten digits dataset, one of the earliest of its kind, developed in the 1990s by Yann LeCun and other researchers. It is a basic dataset for beginners starting deep learning with computer vision. It is easy to work with using simple convnet architectures because it is already preprocessed: 70,000 grayscale images (60,000 for training and 10,000 for testing), each 28×28 pixels and labelled with a digit from 0 to 9.

Over the years different variants of MNIST have been released, namely binarized MNIST, KMNIST, EMNIST, QMNIST and 3D MNIST. Binarized MNIST contains a binarized version of the original MNIST digits. EMNIST, or extended MNIST, extends the original MNIST with more data, including handwritten letters. KMNIST, or Kuzushiji-MNIST, is a drop-in replacement for the original MNIST built from cursive Japanese characters, and it is also distributed in NumPy format. QMNIST, developed by Facebook AI Research, contains 50,000 additional test images on top of the original MNIST. 3D MNIST, as the name suggests, contains 3-dimensional digit representations and is smaller than MNIST. All of these datasets are open source and readily available for ML model training. TensorFlow and PyTorch provide pre-built loaders for several of them.
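As a minimal sketch of what such a loader looks like, the snippet below wraps the Keras call `tf.keras.datasets.mnist.load_data()`, which downloads the data on first use. The helper names `scale_pixels` and `load_mnist` are illustrative, and TensorFlow is assumed to be installed only when `load_mnist` is actually called.

```python
def scale_pixels(image):
    """Map 0-255 grayscale values in a nested list to floats in [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def load_mnist():
    """Return ((x_train, y_train), (x_test, y_test)) as NumPy arrays.

    Assumes TensorFlow is installed. The import is deferred so that the
    rest of this module works without it.
    """
    from tensorflow.keras.datasets import mnist

    return mnist.load_data()
```

Calling `load_mnist()` should give image arrays of shape (60000, 28, 28) and (10000, 28, 28), matching the training and test split described above.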

For implementation and other information -> 6 MNIST Image Datasets

FASHION MNIST

MNIST could not exercise many aspects of deep learning algorithms for computer vision, so Fashion MNIST was released. As the name suggests, it contains ten categories of apparel, namely T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot, with class labels 0 to 9 as in MNIST. All of the images are grayscale and 28×28 pixels each. Fashion MNIST has become a new benchmark in deep learning, and it also has pre-built loaders that can be used directly for model training. Recently Fashion MNIST has been used with GANs, which have generated promising results showing new apparel designs.
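Because the labels are plain integers, a small lookup table is usually the first thing one writes when inspecting predictions. The sketch below lists the ten categories in label order; the names `FASHION_MNIST_CLASSES` and `label_name` are illustrative, not part of any library.

```python
# Class names in label order: index 0 is "T-shirt/top", index 9 is "Ankle boot".
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def label_name(label):
    """Return the apparel category for an integer label from 0 to 9."""
    if not 0 <= label < len(FASHION_MNIST_CLASSES):
        raise ValueError(f"label must be in 0..9, got {label}")
    return FASHION_MNIST_CLASSES[label]
```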

For implementation and other information -> Fashion MNIST

MEDICAL MNIST AND TYPES

Following the MNIST-style structure, many other datasets have been released for different purposes. With neural networks finding relevance in all fields, medical science has many problems to be covered and addressed. Data science in bioinformatics is now an active research area and has produced results on problems that went unaddressed for years. Different medical MNIST datasets have evolved over the years, and MedMNIST, released in 2020, is one of the more recent benchmark datasets among them. It is a collection of 10 open-source medical datasets, namely PathMNIST, ChestMNIST, DermaMNIST, OCTMNIST, PneumoniaMNIST, RetinaMNIST, BreastMNIST and OrganMNIST (axial, coronal, sagittal). These datasets have been benchmarked with machine learning and AutoML methods.

The rest consist of Medical MNIST, Skin Cancer MNIST and Colorectal Histology MNIST. Medical MNIST consists of 6 classes: ChestCT, BreastMRI, CXR, Hand, HeadCT and AbdomenCT. Colorectal Histology MNIST is a multiclass classification dataset for texture analysis covering 8 classes of tissue. Skin Cancer MNIST contains 7 classes: melanocytic nevi, melanoma, benign keratosis-like lesions, basal cell carcinoma, actinic keratoses, vascular lesions and dermatofibroma. Different libraries have been built around these datasets, so they can be readily used for medical research projects.

For implementation and other information -> Medical MNIST

SIGN LANGUAGE MNIST

Sign Language MNIST was released to help people with hearing and speech impairments convey messages through hand gestures. It is similar in structure to the original MNIST in pixel dimensions and some other parameters. There are 24 classes covering the letters A to Z except J and Z, whose signs involve motion. It is distributed in CSV format, with a label and the pixel values for each image. It is derived from the American Sign Language letter database.
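The CSV layout can be turned back into images with a few lines of standard-library Python. This is a sketch under stated assumptions: each file starts with a header row, each data row holds a label followed by 784 pixel values (28×28, row by row), and labels index the alphabet from 0 for A. The function name `parse_rows` is illustrative, so check the column layout against the file you actually download.

```python
import csv
import io
import string

IMAGE_SIDE = 28  # assumed to match the original MNIST dimensions


def parse_rows(csv_text):
    """Yield (letter, image) pairs, where image is a 28x28 list of ints."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the assumed header row
    for row in reader:
        label = int(row[0])
        pixels = [int(value) for value in row[1:]]
        if len(pixels) != IMAGE_SIDE * IMAGE_SIDE:
            raise ValueError(f"expected 784 pixel values, got {len(pixels)}")
        # Cut the flat pixel list into 28 rows of 28 values each.
        image = [
            pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE]
            for r in range(IMAGE_SIDE)
        ]
        # Assumption: label 0 is A, 1 is B, and so on.
        yield string.ascii_uppercase[label], image
```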

For implementation and other information -> Sign Language MNIST

GOOGLE OPEN IMAGES

Google maintains a huge open-source vision dataset, Open Images, which serves many purposes. Along with the images it contains image-level annotations, object relationships, bounding boxes for object detection, segmentation masks and the more recently released localized narratives. It has gone through 6 versions, and v6 is the current one at the time of writing. The data can be downloaded from the Open Images website. The images have been crowdsourced and validated by professional annotators. Two of its most notable applications have been artistic style transfer and Deep Dream.
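When working with the bounding boxes, a common first step is converting them to pixel coordinates. The sketch below assumes the boxes are stored as `XMin`, `XMax`, `YMin`, `YMax` values normalized to the range 0 to 1 relative to image width and height; that layout is an assumption to verify against the official annotation documentation, and `to_pixel_box` is a hypothetical helper name.

```python
def to_pixel_box(x_min, x_max, y_min, y_max, width, height):
    """Convert a normalized box to integer (left, top, right, bottom) pixels.

    Assumes all four coordinates are fractions of the image size in [0, 1].
    """
    for value in (x_min, x_max, y_min, y_max):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"normalized coordinate out of range: {value}")
    return (
        round(x_min * width),
        round(y_min * height),
        round(x_max * width),
        round(y_max * height),
    )
```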

For implementation and other information -> Open Images

IMAGENET AND VARIANTS

ImageNet is one of the greatest achievements in computer vision. It is one of the biggest image datasets, with over 14 million images spread across roughly 20,000 classes. For several years ImageNet ran an annual competition (ILSVRC) on the dataset, in which different deep learning models competed. The error rates fell with every passing year, and remarkably the best models have gone below the average human error rate. ImageNet2012 grew out of the project started by Fei-Fei Li and was later enhanced by many other researchers. Thereafter many variants arrived as drop-in replacements for, or companions to, the original ImageNet, namely Imagenet2012_real, Imagenet2012_subset, Mini Imagenet, Imagenet_A and Imagenet_O, Imagenet_R, and Imagenet_resized. These datasets were released along with research papers specifying their relevance. All of them have pre-built loaders that can be used directly in model training.
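Error rates on ImageNet are commonly reported as top-5 error, meaning a prediction counts as correct if the true class is among the model's five highest-scoring classes. Reading the error rates above as top-5 error is an assumption, since the text does not name the metric. A minimal sketch of the computation, with an illustrative function name:

```python
def top_k_error(scores, true_labels, k=5):
    """Return the fraction of samples whose true label is not in the top k.

    scores: one list of per-class scores for each sample.
    true_labels: the correct class index for each sample.
    """
    if len(scores) != len(true_labels):
        raise ValueError("scores and true_labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    misses = 0
    for class_scores, truth in zip(scores, true_labels):
        # Class indices ordered from highest score to lowest.
        ranked = sorted(
            range(len(class_scores)),
            key=lambda c: class_scores[c],
            reverse=True,
        )
        if truth not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)
```

Setting `k=1` gives the ordinary classification error, which is always at least as large as the top-5 error on the same predictions.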

For implementation and other information -> Imagenet

CIFAR 10 & 100

CIFAR-10 and CIFAR-100 are labelled subsets of the 80 Million Tiny Images dataset. CIFAR-10 contains 10 object classes, namely aeroplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The images are 32×32 pixels in RGB format. CIFAR-100 is similar to CIFAR-10 but has more classes: it contains 100 object classes grouped into 20 superclasses, namely aquatic mammals, fish, flowers, food containers, fruit and vegetables, household electrical devices, household furniture, insects, large carnivores, large man-made outdoor things, large natural outdoor scenes, large omnivores and herbivores, medium-sized mammals, non-insect invertebrates, people, reptiles, small mammals, trees, vehicles 1 and vehicles 2. Both datasets have loaders in the major deep learning libraries.
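A 32×32 RGB image holds 32 × 32 × 3 = 3,072 values. The sketch below assumes the flat layout used by the Python version of the CIFAR files, where each image is a row of 3,072 values with the full red plane first, then green, then blue, each in row-major order. Treat that layout as an assumption to confirm against the files you download; `row_to_image` is an illustrative name.

```python
SIDE = 32
PLANE = SIDE * SIDE  # 1,024 values per colour channel


def row_to_image(flat):
    """Turn a flat 3,072-value row into 32 rows of 32 (r, g, b) tuples."""
    if len(flat) != 3 * PLANE:
        raise ValueError(f"expected {3 * PLANE} values, got {len(flat)}")
    red = flat[:PLANE]
    green = flat[PLANE:2 * PLANE]
    blue = flat[2 * PLANE:]
    return [
        [
            (red[r * SIDE + c], green[r * SIDE + c], blue[r * SIDE + c])
            for c in range(SIDE)
        ]
        for r in range(SIDE)
    ]
```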

For implementation and other information -> CIFAR10 & CIFAR100

STL 10

The STL10 dataset was inspired by the CIFAR-10 dataset and is mainly used for unsupervised learning. It is divided into 10 classes: aeroplane, bird, car, cat, deer, dog, horse, monkey, ship and truck. The images are 96×96 pixels in RGB. There are 13,000 labelled images in total, divided into 5,000 training and 8,000 test images, alongside a much larger set of unlabelled images intended for unsupervised training. It has loaders in the deep learning libraries TensorFlow and PyTorch.

For implementation and other information -> STL10

CALTECH DATASETS

Caltech provides 4 different datasets: Caltech 101 (101 object classes of common everyday items such as fans, cars, boats and lamps, plus 1 background clutter class), Caltech 256 (an extension of Caltech 101 with more classes and a larger background clutter class for testing), Caltech Birds 2010 (200 bird species) and Caltech Birds 2011 (an extension of Caltech Birds 2010). Several of these datasets come with annotations such as bounding boxes and other information. They have loaders in the major deep learning libraries.

For implementation and other information -> Caltech

Jayita Bhattacharyya

Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.
