Guide To MNIST Datasets For Fashion And Medical Applications

MNIST is the famous dataset of handwritten digits that most people use to get started with computer vision in deep learning, and it serves as a benchmark in many deep learning applications. Taking this a step further, many institutions and researchers have collaborated to create MNIST-like datasets from other kinds of data, such as fashion items, medical images, sign language, colorectal cancer histology and skin cancer.

MNIST alone is not enough to tackle all kinds of computer vision problems. It is so well pre-processed that beginners learn little from working with it. Even a simple ConvNet architecture achieves more than 90% accuracy, since many MNIST digits can be told apart by the value of a single pixel. As a result, the dataset does little to exercise more advanced deep learning algorithms, and it was time to move on to more demanding use cases. Many drop-in replacements for MNIST were therefore created to serve data science practitioners better.

Continuing our dataset discussion, today we’ll look at the MNIST-like datasets that have proven very handy for data science practitioners.

FASHION MNIST

Fashion MNIST was developed in 2017 by Kashif Rasul, Han Xiao and Roland Vollgraf at Zalando Research. The images are grayscale and 28 x 28 pixels in size. The dataset contains 70000 images, of which 60000 are training images and 10000 are testing images. There are 10 classes, labelled from 0 to 9: 0 – T-shirt/top, 1 – Trouser, 2 – Pullover, 3 – Dress, 4 – Coat, 5 – Sandal, 6 – Shirt, 7 – Sneaker, 8 – Bag, 9 – Ankle Boot.
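The label-to-class mapping described above can be kept in a small lookup table so that predictions print as readable names. This is a minimal sketch in plain Python; the names `FASHION_MNIST_CLASSES` and `label_to_name` are hypothetical helpers introduced here for illustration and are not part of any library.

```python
# Class names in label order (list index = label), as listed in the text above.
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle Boot",
]


def label_to_name(label: int) -> str:
    """Return the class name for an integer label in the range 0-9."""
    if not 0 <= label < len(FASHION_MNIST_CLASSES):
        raise ValueError(f"label must be in 0-9, got {label}")
    return FASHION_MNIST_CLASSES[label]


if __name__ == "__main__":
    # Print a few example labels with their class names.
    for label in (0, 5, 9):
        print(label, label_to_name(label))
```

The same list can be indexed with the integer labels returned by either loader shown below.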

Dataset size: 36.42 MiB

Fashion MNIST was built because there are many modern computer vision problems that MNIST cannot address.

Code Snippet

Using TensorFlow

import tensorflow_datasets as tfds
train,test = tfds.load('fashion_mnist', split=['train', 'test'])

Using PyTorch

import torch
import torchvision
from torchvision import transforms, datasets

# Training split; '' stores the downloaded files under the current directory
train = datasets.FashionMNIST('', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor()  # PIL image -> float tensor in [0, 1]
                       ]))
# Test split, with the same transform
test = datasets.FashionMNIST('', train=False, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor()
                       ]))

You can visit this website to check various performance measures of Fashion MNIST.

MedMNIST

MedMNIST is one of the more recent datasets, developed in 2020 by Jiancheng Yang, Rui Shi, Bingbing Ni and Bilian Ke. It is a collection of 10 open medical image datasets. All images are 28 x 28 pixels, which makes the collection usable with any kind of machine learning algorithm, as well as with AutoML tools, for medical image analysis and classification. The ten datasets are PathMNIST, ChestMNIST, DermaMNIST, OCTMNIST, PneumoniaMNIST, RetinaMNIST, BreastMNIST and OrganMNIST in three views (axial, coronal, sagittal). ResNet-18 and ResNet-50 baseline models have been trained on the datasets. For AutoML, the datasets have been benchmarked with AutoKeras, Auto-sklearn and Google AutoML Vision.

For the entire code by the MedMNIST creators, you can check this GitHub.

MEDICAL MNIST

Medical MNIST was developed in 2017 by Arturo Polanco Lozano. It is also known as the MedNIST dataset for radiology and medical imaging. The images have been gathered from several sources: datasets at TCIA, the RSNA Bone Age Challenge and the NIH Chest X-ray dataset.

The dataset contains 58954 medical images belonging to 6 classes: ChestCT (10000 images), BreastMRI (8954 images), CXR (10000 images), Hand (10000 images), HeadCT (10000 images) and AbdomenCT (10000 images). Each image is 64 x 64 pixels.
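As a quick sanity check on the figures above, the per-class counts can be summed and turned into class shares, which shows that the dataset is nearly balanced apart from BreastMRI. This is only an illustrative sketch; `MEDNIST_COUNTS` and `class_shares` are hypothetical names, and the counts are copied from the text rather than read from the files.

```python
# Per-class image counts as stated in the text above.
MEDNIST_COUNTS = {
    "ChestCT": 10000,
    "BreastMRI": 8954,
    "CXR": 10000,
    "Hand": 10000,
    "HeadCT": 10000,
    "AbdomenCT": 10000,
}


def class_shares(counts: dict) -> dict:
    """Return each class's share of the whole dataset as a fraction."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    print("total images:", sum(MEDNIST_COUNTS.values()))
    for name, share in class_shares(MEDNIST_COUNTS).items():
        print(f"{name}: {share:.1%}")
```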

Dataset size: 75.98 MB

For the entire code by the NVIDIA Deep Learning Institute, you can check this notebook.

SIGN LANGUAGE MNIST

Developed in 2017, this dataset contains images of American Sign Language (ASL) hand gestures in almost the same format as MNIST: 28 x 28 pixels in grayscale. It contains 27455 training examples and 7172 testing examples, to be classified into 24 classes. Each example carries a label from 0 to 25 that maps one-to-one to an alphabetic letter A-Z, but labels 9 (J) and 25 (Z) never occur, so the letters actually present run from A to Y without J. The dataset is available on Kaggle in CSV format, where each row stores one image as 784 pixel values (columns pixel1 to pixel784).
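The label scheme and the flat CSV layout described above can be sketched with the standard library alone. The code assumes that the label column is named `label` and that pixels are stored row by row (row-major); both are assumptions, not something stated in this article. The helpers `label_to_letter` and `parse_row` are hypothetical, and the CSV used in the demo is synthetic.

```python
import csv
import io
import string

SIDE = 28  # images are 28 x 28 pixels


def label_to_letter(label: int) -> str:
    """Map a label 0-25 to A-Z; labels 9 (J) and 25 (Z) never occur in the data."""
    if label in (9, 25) or not 0 <= label <= 25:
        raise ValueError(f"label {label} is not used in this dataset")
    return string.ascii_uppercase[label]


def parse_row(row: dict):
    """Split one CSV row into (letter, 28 x 28 nested list of pixel values)."""
    pixels = [int(row[f"pixel{i}"]) for i in range(1, SIDE * SIDE + 1)]
    # Assumed row-major layout: the first 28 values form the top image row.
    image = [pixels[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]
    return label_to_letter(int(row["label"])), image


if __name__ == "__main__":
    # Synthetic one-row CSV standing in for the real Kaggle file.
    header = ["label"] + [f"pixel{i}" for i in range(1, SIDE * SIDE + 1)]
    values = ["2"] + [str(i % 256) for i in range(SIDE * SIDE)]
    fake_csv = io.StringIO(",".join(header) + "\n" + ",".join(values) + "\n")
    for row in csv.DictReader(fake_csv):
        letter, image = parse_row(row)
        print(letter, len(image), len(image[0]))
```

To use it on the real data, the `io.StringIO` object would be replaced by an opened CSV file from the Kaggle download.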

Dataset Size: 100.9 MB

An implementation using the Keras library is available in this notebook.

Colorectal Histology MNIST

Developed in 2016 by Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M. Melchers, Lothar R. Schad, Timo Gaiser, Alexander Marx and Frank Gerrit Zöllner, this dataset supports multiclass texture analysis in colorectal cancer histology, with images belonging to 8 classes of tissue. There are two sets: ColorectalHistology contains 5000 RGB images of 150 x 150 x 3, while ColorectalHistologyLarge contains 10 large images of 5000 x 5000 pixels, each showing more than one type of tissue.

Dataset Size: 1.14 GB

Code Snippet

Using TensorFlow

For ColorectalHistology,

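import tensorflow_datasets as tfds
# The TFDS name is snake_case, and this dataset ships a single 'train' split,
# so a test subset is carved out by slicing (the 80/20 ratio is an arbitrary choice)
train, test = tfds.load('colorectal_histology', split=['train[:80%]', 'train[80%:]'])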

For ColorectalHistologyLarge,

import tensorflow_datasets as tfds
# The large variant provides only a 'test' split containing the 10 large images
test = tfds.load('colorectal_histology_large', split='test')

Skin Cancer MNIST

Created in 2018, this dataset contains dermatoscopic images of pigmented lesions collected from different sources. It was developed by multiple authors: Philipp Tschandl, Noel Codella, Veronica Rotemberg, M. Emre Celebi, Aadi Kalloo, Konstantinos Liopyris, Stephen Dusza, David Gutman, Brian Helba, Michael Marchetti, Harald Kittler and Allan Halpern.

The dataset is released as HAM10000 (“Human Against Machine with 10000 training images”). It contains 10015 dermatoscopic images, published as a training set for academic machine learning research and available at the ISIC archive.

It has 7 different diagnostic classes of pigmented skin lesions, not all of which are cancerous: 1 – Melanocytic nevi, 2 – Melanoma, 3 – Benign keratosis-like lesions, 4 – Basal cell carcinoma, 5 – Actinic keratoses, 6 – Vascular lesions, 7 – Dermatofibroma.
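When training a classifier on these seven classes, the diagnosis strings usually need to be turned into integer targets. The sketch below assumes the short diagnosis codes used in the `dx` column of the HAM10000 metadata file (`nv`, `mel`, `bkl`, `bcc`, `akiec`, `vasc`, `df`); these codes do not appear in this article, so treat them as an assumption and check them against your copy of the metadata. `HAM10000_CLASSES`, `CODE_TO_INDEX` and `encode` are hypothetical names.

```python
# Assumed short codes from the metadata `dx` column, in the order the classes
# are listed in the text above.
HAM10000_CLASSES = {
    "nv": "Melanocytic nevi",
    "mel": "Melanoma",
    "bkl": "Benign keratosis-like lesions",
    "bcc": "Basal cell carcinoma",
    "akiec": "Actinic keratoses",
    "vasc": "Vascular lesions",
    "df": "Dermatofibroma",
}

# 0-based integer targets (the text numbers the classes from 1).
CODE_TO_INDEX = {code: i for i, code in enumerate(HAM10000_CLASSES)}


def encode(dx_code: str) -> int:
    """Return the 0-based class index for a diagnosis code."""
    try:
        return CODE_TO_INDEX[dx_code.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown diagnosis code: {dx_code!r}") from None


if __name__ == "__main__":
    for code in ("nv", "mel", "df"):
        print(code, encode(code), HAM10000_CLASSES[code])
```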

Dataset Size: 2.7 GB

An implementation of the above can be found in this notebook.

Jayita Bhattacharyya

Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.
