Top NLP Libraries & Datasets For Indian Languages

Natural language processing has the potential to broaden online access for Indian citizens, thanks to advances in GPU computing, high-speed internet availability and the growing use of smartphones. In one consumer survey on chatbots, 55% of respondents named getting answers to simple questions as a significant benefit. In India, however, even that is challenging, because Indian languages are anything but simple.

Because Indian languages pose many challenges for NLP, such as ambiguity, complexity, grammar, translation problems and the difficulty of obtaining the right data for NLP algorithms, they also create many opportunities for NLP projects in India.

Below we look at some of the top NLP resources for Indian Languages:


Top NLP libraries for Indian Languages

iNLTK (Natural Language Toolkit for Indic Languages)

iNLTK provides support for various NLP applications in Indic languages. The languages supported are Hindi (hi), Punjabi (pa), Sanskrit (sa), Gujarati (gu), Kannada (kn), Malayalam (ml), Nepali (ne), Odia (or), Marathi (mr), Bengali (bn), Tamil (ta), Urdu (ur), English (en).

iNLTK is modelled on the NLTK Python package. It provides NLP tasks such as tokenisation and vector embeddings for input text behind a simple API.


First, install a CPU build of PyTorch (iNLTK's documentation points pip at the PyTorch wheel index):

pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

Then install iNLTK using pip:

pip install inltk
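Once installed, every iNLTK call takes a language code from the list above ("hi", "bn" and so on). As a quick illustration of how those codes line up with scripts, here is a small stdlib-only helper (not part of iNLTK) that guesses a code from the Unicode block of the input. Note that the Devanagari block is shared by Hindi, Marathi, Sanskrit and Nepali, so "hi" is only a default guess:

```python
# Unicode block ranges for some Indic scripts, mapped to iNLTK-style codes.
# Devanagari is shared by several languages, so "hi" is just a default.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "bn": (0x0980, 0x09FF),  # Bengali
    "pa": (0x0A00, 0x0A7F),  # Gurmukhi (Punjabi)
    "gu": (0x0A80, 0x0AFF),  # Gujarati
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "kn": (0x0C80, 0x0CFF),  # Kannada
    "ml": (0x0D00, 0x0D7F),  # Malayalam
}

def guess_language_code(text):
    """Return the code of the first Indic script found, else 'en'."""
    for ch in text:
        for code, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= ord(ch) <= hi:
                return code
    return "en"  # no Indic script detected

print(guess_language_code("नमस्ते"))   # hi
print(guess_language_code("নমস্কার"))  # bn
```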

Indic NLP Library

Indian languages share a great deal in terms of script, phonology and syntax, and this library aims to provide general solutions that exploit those similarities.

Indic NLP Library provides functionalities like text normalisation, script normalisation, tokenisation, word segmentation, romanisation, indicisation, script conversion, transliteration and translation.

Languages supported:

  • Indo-Aryan: Assamese (asm), Bengali (ben), Gujarati (guj), Hindi/Urdu (hin/urd), Marathi (mar), Nepali (nep), Odia (ori), Punjabi (pan), Sindhi (snd), Sinhala (sin), Sanskrit (san), Konkani (kok).
  • Dravidian: Kannada (kan), Malayalam (mal), Telugu (tel), Tamil (tam).
  • Others: English (eng).

Tasks handled:

  • Monolingual tasks: supported for languages such as Konkani, Sindhi and Telugu, which the iNLTK library does not cover.
  • Bilingual tasks: script conversion among the languages listed above, except Urdu and English; transliteration among the 18 languages above; and translation among ten of them.
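Script conversion among these languages works because the Brahmi-derived scripts occupy parallel 128-codepoint Unicode blocks, so most characters map by a fixed offset. A stdlib-only sketch of the idea; the library itself handles the many exceptions, such as characters that have no counterpart in Tamil:

```python
# Start of the Unicode block for each script (per the Unicode charts).
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali":    0x0980,
    "gurmukhi":   0x0A00,
    "gujarati":   0x0A80,
    "tamil":      0x0B80,
    "telugu":     0x0C00,
    "kannada":    0x0C80,
    "malayalam":  0x0D00,
}

def convert_script(text, src, dst):
    """Shift each character from the src block to the dst block,
    leaving characters outside the source block (spaces, digits) as-is."""
    src_base, dst_base = SCRIPT_BASE[src], SCRIPT_BASE[dst]
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        out.append(chr(dst_base + offset) if 0 <= offset < 0x80 else ch)
    return "".join(out)

print(convert_script("नमस्ते", "devanagari", "bengali"))  # নমস্তে
```

The same offset trick underlies transliteration-oriented tooling for these scripts; the library adds per-script exception handling on top.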

The library needs Python 2.7+, Indic NLP Resources (only for some modules) and the Morfessor 2.0 Python library. Install it using pip:


pip install indic-nlp-library

Next, download the resources folder, which contains the models for the different languages.

# download the resources
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
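One reason Indic text needs its own tokenisation is punctuation such as the danda (।, U+0964), which ends sentences in Devanagari text. A minimal stdlib sketch of the idea; the library's own sentence splitter handles many more cases:

```python
import re

def split_sentences(text):
    """Split on whitespace that follows a sentence terminator,
    treating the danda '।' like '.', '?' and '!'."""
    parts = re.split(r"(?<=[।?!.])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("यह पहला वाक्य है। यह दूसरा वाक्य है।"))
```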


StanfordNLP

StanfordNLP contains tools that convert a string of human-language text into lists of sentences and words, generate the base forms of those words along with their parts of speech and morphological features, and produce a syntactic dependency parse. These dependency parses are designed to be parallel across more than 70 languages, using the Universal Dependencies formalism.

The library inherits additional functionality from the CoreNLP Java package, such as constituency parsing, linguistic pattern matching and coreference resolution.

The modules are built on top of PyTorch, and the package combines software based on the Stanford entry in the CoNLL 2018 Shared Task on Universal Dependency Parsing with the Java Stanford CoreNLP software.

StanfordNLP offers features like:

  • Easy Native Python Implementation.
  • A complete neural network pipeline for text analytics, including tokenisation, multi-word token (MWT) expansion, part-of-speech (POS) tagging, lemmatisation, morphological feature tagging and dependency parsing.
  • Stable Python interface to CoreNLP.
  • Neural network models supporting 53 human languages featured in 73 treebanks.

Install it using pip:

pip install stanfordnlp
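What a dependency parse returns is, per word, a head index and a relation label. A toy illustration of that structure for the sentence "She reads books" (rows hand-written here for illustration, not produced by StanfordNLP; 0 marks the root):

```python
# (index, word, head index, relation) rows, in the general shape
# dependency parsers report; head 0 means the word is the root.
parse = [
    (1, "She",   2, "nsubj"),
    (2, "reads", 0, "root"),
    (3, "books", 2, "obj"),
]

def children_of(parse, head_index):
    """Return the words whose head is the word at head_index."""
    return [word for idx, word, head, rel in parse if head == head_index]

root = next(word for idx, word, head, rel in parse if head == 0)
print(root)                   # the syntactic root of the sentence
print(children_of(parse, 2))  # dependents of "reads"
```

Because the Universal Dependencies label set is shared across languages, the same downstream code can consume parses of Hindi, Tamil or English alike.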

Top datasets for NLP (Indian languages)

Semantic Relations from Wikipedia: Contains semantic relations automatically extracted from a multilingual Wikipedia corpus.

HC Corpora (Old Newspapers): This dataset is a subset of the HC Corpora newspapers, containing around 16,806,041 sentences and paragraphs in 67 languages, including Hindi.

Sentiment Lexicons for 81 Languages: This dataset contains positive and negative sentiment lexicons for 81 languages, including Hindi.
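Such lexicons are typically used by simple counting: score a token list as positive matches minus negative matches. A sketch with a few illustrative Hindi entries (chosen for the example, not copied from the dataset):

```python
# Tiny illustrative lexicons; the real dataset provides full word lists.
POSITIVE = {"अच्छा", "सुंदर", "खुश"}   # good, beautiful, happy
NEGATIVE = {"बुरा", "दुखी"}            # bad, sad

def sentiment_score(tokens):
    """Positive hits minus negative hits; > 0 leans positive."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(sentiment_score(["यह", "अच्छा", "है"]))  # 1
```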

IIT Bombay English-Hindi Parallel Corpus: This dataset contains a parallel corpus for English-Hindi as well as a monolingual Hindi corpus. It was developed at the Center for Indian Language Technology.
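Parallel corpora like this are commonly distributed as two aligned files, one sentence per line, where line i of one file translates line i of the other. A sketch of pairing them up (the file contents below are illustrative, not taken from the corpus):

```python
import io

# Stand-ins for the two aligned corpus files, one sentence per line.
english = io.StringIO("Hello\nThank you\n")
hindi = io.StringIO("नमस्ते\nधन्यवाद\n")

# zip keeps the line-level alignment; strip removes trailing newlines.
pairs = [
    (en.strip(), hi.strip())
    for en, hi in zip(english, hindi)
    if en.strip() and hi.strip()
]
print(pairs)
```

For the real corpus, the `io.StringIO` stand-ins would be replaced by `open(...)` on the downloaded English and Hindi files.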

Indic Languages Multilingual Parallel Corpus: This parallel corpus covers seven Indic languages in addition to English: Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese and Urdu.

Microsoft Speech Corpus (Indian languages)(Audio dataset): This corpus contains conversational, phrasal training and test data for Telugu, Gujarati and Tamil.

Hindi Speech Recognition Corpus(Audio Dataset): This is a corpus collected in India consisting of voices of 200 different speakers from different regions of the country. It also contains 100 pairs of daily spontaneous conversational speech data.


Sameer Balaganur
Sameer is an aspiring Content Writer. Occasionally writes poems, loves food and is head over heels with Basketball.


