Kannada Makes It To The Machine Learning World With A Full-Fledged MNIST Dataset

No machine learning enthusiast can avoid the term MNIST. The dataset is special in many ways: it is highly popular and hence widely explored and studied, it is open and easily available, and it is not at all complicated. MNIST is one of the very first datasets anyone beginning in ML turns to.

A recent research paper introduces a new addition to the MNIST family. Originating in India, the new dataset consists of handwritten digits in Kannada, one of the 22 scheduled languages of India, spoken by almost 57 million people.

What Is Kannada-MNIST

The dataset consists of images of handwritten Kannada digits, with 60,000 images in the training set and 10,000 images in the test set.

Kannada numerals from 1 to 10

In addition to the training and test sets, there is another set of 10,240 images called the Dig-MNIST dataset. Unlike Kannada-MNIST, whose digits were handwritten by people who use Kannada as a means of communication, Dig-MNIST was handwritten by non-Kannadigas, making it a more challenging test set. The images in Dig-MNIST are also noisier, with smudges and grid borders.

Data set dimensions:

  • Training set: 60,000 × 28 × 28
  • Test set: 10,000 × 28 × 28
  • Dig-MNIST: 10,240 × 28 × 28
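
To get a feel for these sizes, the arithmetic can be sketched as follows. The snippet assumes 8-bit grayscale pixels (one byte each), as in the original MNIST; the names `SPLITS` and `split_stats` are illustrative and not part of the dataset's tooling.

```python
# Image counts per split, taken from the list above.
SPLITS = {"train": 60_000, "test": 10_000, "dig_mnist": 10_240}
HEIGHT, WIDTH = 28, 28


def split_stats(num_images, height=HEIGHT, width=WIDTH):
    """Return (features per image, total pixel values) for one split."""
    features = height * width  # length of one flattened image
    return features, num_images * features


for name, count in SPLITS.items():
    features, total = split_stats(count)
    # Assumption: 1 byte per pixel (uint8 grayscale), so values == bytes.
    print(f"{name}: {features} features/image, {total / 1e6:.1f} MB uncompressed")
```

Each image flattens to 784 features, which is why Kannada-MNIST can be fed to code written for the original MNIST without changing input shapes.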

Kannada-MNIST is intended as a drop-in replacement for the original MNIST dataset. Although there have been numerous ML works on Kannada digits, Kannada-MNIST specifically addresses the scarcity of data: it matches the original MNIST dataset in size and adds the Dig-MNIST dataset on top.

Kannada-MNIST vs MNIST

The paper also compares Kannada-MNIST with the original MNIST dataset along two lines: morphology and dimensionality reduction. The morphological comparison looks at the pixel intensities of the images in the two datasets. Kannada-MNIST has a maximal mean pixel intensity of ∼0.3, compared with ∼0.6 for MNIST. The statistics of morphological traits were obtained using the Morpho-MNIST framework.
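
One way to read the "maximal mean pixel intensity" statistic is to average all images pixel by pixel into a mean image and then take the brightest pixel of that mean image. The sketch below follows that reading, which is an assumption and not the paper's stated procedure. It uses plain Python lists, assumes 8-bit pixels scaled to [0, 1], and the function names are illustrative.

```python
def mean_image(images, max_value=255):
    """Average equally sized 2-D images pixel by pixel, scaled to [0, 1]."""
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [
        [sum(img[r][c] for img in images) / (n * max_value) for c in range(cols)]
        for r in range(rows)
    ]


def maximal_mean_intensity(images, max_value=255):
    """Brightest pixel of the mean image (assumed reading of the statistic)."""
    return max(max(row) for row in mean_image(images, max_value))


# Two toy 2x2 "images"; real data would be 28x28 arrays.
toy = [
    [[0, 255], [0, 0]],
    [[0, 51], [0, 0]],
]
print(maximal_mean_intensity(toy))  # (255 + 51) / (2 * 255) = 0.6
```

Under this reading, a lower value means that no single pixel position is inked consistently across the dataset.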

Principal Component Analysis (PCA) was used to study how the explained variance is distributed across components. The top 50 principal components explain 83% of the total variance for MNIST but only 63% for Kannada-MNIST.
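
The "explained variance" figure is the share of total variance carried by the largest components. A minimal sketch of that ratio is shown below, assuming the per-component variances (the eigenvalues of the data covariance matrix) have already been computed by some PCA routine. The function name and the toy values are hypothetical and not taken from the paper.

```python
def explained_variance(component_variances, k):
    """Fraction of total variance captured by the k largest components."""
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("total variance is zero")
    return sum(ordered[:k]) / total


# Toy spectrum with 5 components (hypothetical values, not from the paper).
toy_variances = [4.0, 3.0, 2.0, 0.5, 0.5]
print(explained_variance(toy_variances, k=2))  # 7.0 / 10.0 = 0.7
```

For 28 × 28 images there are 784 components in total, so the reported figures correspond to `k=50` out of 784.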

Classification Results

The research also studies how a standard Convolutional Neural Network performs on the Kannada-MNIST dataset.


With an off-the-shelf Keras CNN using the Adadelta optimizer with a learning rate of 1.0 and ρ = 0.95, the model attained 97.3% accuracy on the test set. The same model reached 76.2% accuracy on the Dig-MNIST dataset.
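
To show what the two Adadelta hyperparameters do, here is a plain-Python sketch of the Adadelta update rule applied to a single scalar parameter. It is not the paper's training code: the toy objective, the function name and the `eps` value are assumptions made for illustration, and the learning rate is treated as a simple multiplier on each update.

```python
import math


def adadelta_minimize(grad, x, steps, lr=1.0, rho=0.95, eps=1e-6):
    """Run Adadelta on a single scalar parameter and return its final value."""
    avg_sq_grad = 0.0   # running average of squared gradients
    avg_sq_delta = 0.0  # running average of squared updates
    for _ in range(steps):
        g = grad(x)
        # rho controls how quickly old gradients are forgotten.
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # The step is the gradient rescaled by the ratio of the two RMS values.
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        x += lr * delta
    return x


# Toy objective f(x) = x**2 with gradient 2x (illustrative only).
x_final = adadelta_minimize(lambda x: 2 * x, x=3.0, steps=200)
print(x_final)
```

Because the step size is derived from running averages and not set by hand, a learning rate of 1.0 leaves the adaptive step unscaled.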

It Is Open

The work by Vinay Uday Prabhu has been open-sourced to promote future studies on Kannada-MNIST as well as on other languages. The paper also poses some interesting open problems and challenges for the wider ML community to take up using the Kannada-MNIST dataset.

Click here to read the full paper

Click here to go to the official Git.

What To Expect

MNIST has already become the standard go-to dataset for beginners in machine learning, especially in computer vision. As more studies and papers build on MNIST, we can expect the scripts of many more languages to join the MNIST family, bringing new challenges as well as new discoveries to the ML community.

Amal Nair
A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies. Contact: amal.nair@analyticsindiamag.com
