Top NLP Libraries & Datasets For Indian Languages

Natural language processing (NLP) has the potential to broaden online access for Indian citizens, helped by significant advances in GPU computing, high-speed internet availability and increased smartphone use. According to a survey, 55% of consumers named getting answers to simple questions as one of the significant benefits of chatbots. In India, however, that is challenging because Indian languages are far from simple.

Indian languages pose many challenges for NLP, including ambiguity, complexity, grammar, translation problems and obtaining the right data for NLP algorithms. These challenges also create many opportunities for NLP projects in India.

Below we look at some of the top NLP resources for Indian Languages:

Top NLP libraries for Indian Languages

iNLTK (Natural Language Toolkit for Indic Languages)

iNLTK provides support for various NLP applications in Indic languages. The languages supported are Hindi (hi), Punjabi (pa), Sanskrit (sa), Gujarati (gu), Kannada (kn), Malayalam (ml), Nepali (ne), Odia (or), Marathi (mr), Bengali (bn), Tamil (ta), Urdu (ur), English (en).

iNLTK is similar to the NLTK Python package. It offers an easy-to-use API for NLP tasks such as tokenisation and generating vector embeddings for input text.


First, install PyTorch:

pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

Next, install iNLTK using pip:

pip install inltk
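Once installed, tokenising text might look like the minimal sketch below. It assumes the `setup` and `tokenize` functions exposed by `inltk.inltk`; the `INLTK_LANGUAGES` table and the `tokenize_with_inltk` wrapper are hypothetical helpers written for this example, not part of iNLTK. The wrapper only checks the language code against the list above before handing the text to the library.

```python
# Language codes listed above for iNLTK (this table is a helper for the
# sketch, not something the library exports).
INLTK_LANGUAGES = {
    "hi": "Hindi", "pa": "Punjabi", "sa": "Sanskrit", "gu": "Gujarati",
    "kn": "Kannada", "ml": "Malayalam", "ne": "Nepali", "or": "Odia",
    "mr": "Marathi", "bn": "Bengali", "ta": "Tamil", "ur": "Urdu",
    "en": "English",
}


def tokenize_with_inltk(text, lang):
    """Hypothetical wrapper: validate `lang`, then call iNLTK's tokenizer."""
    if lang not in INLTK_LANGUAGES:
        raise ValueError(f"unsupported language code: {lang!r}")
    # Imported lazily so the validation above works without iNLTK installed.
    from inltk.inltk import tokenize
    return tokenize(text, lang)


# One-time preparation per language (downloads the language model):
#     from inltk.inltk import setup
#     setup("hi")
# After that, a call could look like:
#     tokenize_with_inltk("भारत एक विशाल देश है", "hi")
```

The language model has to be downloaded once per language before tokenising, which is why the `setup` call is shown separately rather than inside the wrapper.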

Indic NLP Library:

Indian languages share a lot of similarity in terms of script, phonology, syntax, etc., and this library attempts to provide a general solution to commonly required tools for Indian language text.

Indic NLP Library provides functionalities like text normalisation, script normalisation, tokenisation, word segmentation, romanisation, indicisation, script conversion, transliteration and translation.

Languages supported:

  • Indo-Aryan:

Assamese (asm), Bengali (ben), Gujarati (guj), Hindi/Urdu (hin/urd), Marathi (mar), Nepali (nep), Odia (ori), Punjabi (pan), Sindhi (snd), Sinhala (sin), Sanskrit (san), Konkani (kok).

  • Dravidian:

Kannada (kan), Malayalam (mal), Telugu (tel), Tamil (tam).

  • Others:

English (eng).

Tasks handled:

  • Bilingual tasks such as script conversion for the languages mentioned above, except Urdu and English.
  • Monolingual tasks:
  • This library supports languages such as Konkani, Sindhi and Telugu, and some others that aren’t supported by the iNLTK library.
  • Transliteration amongst the 18 languages mentioned above.
  • Translation amongst ten languages.
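To give a feel for why script conversion between these languages is tractable, here is a minimal sketch using only the Python standard library. It relies on the fact that the major Indic scripts occupy parallel 128-code-point blocks in Unicode, so corresponding characters usually sit at the same offset within each block. This illustrates the general idea and is not the Indic NLP Library's own API; the `SCRIPT_BASE` table, its two-letter keys and the `convert_script` function are names invented for this example. The correspondence is not perfect for every character, so the sketch leaves a character unchanged when the target slot is unassigned.

```python
import unicodedata

# First code point of each script's Unicode block. The two-letter keys are
# labels chosen for this sketch.
SCRIPT_BASE = {
    "hi": 0x0900,  # Devanagari
    "bn": 0x0980,  # Bengali
    "pa": 0x0A00,  # Gurmukhi
    "gu": 0x0A80,  # Gujarati
    "or": 0x0B00,  # Odia
    "ta": 0x0B80,  # Tamil
    "te": 0x0C00,  # Telugu
    "kn": 0x0C80,  # Kannada
    "ml": 0x0D00,  # Malayalam
}
BLOCK_SIZE = 0x80  # each of these blocks spans 128 code points


def convert_script(text, src, tgt):
    """Map characters from the `src` script block to the `tgt` script block."""
    src_base, tgt_base = SCRIPT_BASE[src], SCRIPT_BASE[tgt]
    converted = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE:
            candidate = chr(tgt_base + offset)
            # Keep the original character if the target slot is unassigned.
            if unicodedata.name(candidate, ""):
                ch = candidate
        converted.append(ch)
    return "".join(converted)


print(convert_script("भारत", "hi", "gu"))  # ભારત
print(convert_script("भारत", "hi", "bn"))  # ভারত
```

Characters outside the source block (Latin letters, spaces, punctuation) pass through untouched, which keeps mixed-script text intact.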

The library needs Python 2.7+, the Indic NLP Resources (only for some modules) and the Morfessor 2.0 Python library.

Installation:

pip install indic-nlp-library

Next, download the resources folder which contains the models for different languages. 

# download the resource

git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git

StanfordNLP:

StanfordNLP contains tools that convert a string of human language text into lists of sentences and words, generate the base forms of those words along with their parts of speech and morphological features, and produce a syntactic dependency parse. The dependency parse is designed to be parallel across more than 70 languages using the Universal Dependencies formalism.

The library also inherits additional functionality from the CoreNLP Java package, such as constituency parsing, linguistic pattern matching and coreference resolution.

The modules are built on top of PyTorch, and the package combines software based on the Stanford entry in the CoNLL 2018 Shared Task on Universal Dependency Parsing with the Java Stanford CoreNLP software.

StanfordNLP offers features like:

  • A native Python implementation that is easy to set up.
  • A complete neural network pipeline for text analytics, including tokenisation, multi-word token (MWT) expansion, lemmatisation, parts-of-speech (POS) and morphological features tagging, and dependency parsing.
  • A stable Python interface to CoreNLP.
  • Neural network models that support 53 human languages featured in 73 treebanks.

Install using pip:

pip install stanfordnlp
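The per-word analysis described above for StanfordNLP (base form, part of speech, morphological features, dependency head and relation) is conventionally written out in the CoNLL-U format used by Universal Dependencies. The sketch below reads such output with the standard library only. The `Word` tuple and `parse_conllu` function are names made up for this example, and the sample sentence with its annotations is hand-written for illustration, not actual StanfordNLP output.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    index: int   # position in the sentence, starting at 1
    text: str    # surface form
    lemma: str   # base form
    upos: str    # universal part-of-speech tag
    feats: str   # morphological features, "_" when absent
    head: int    # index of the governing word, 0 for the root
    deprel: str  # dependency relation to the head


def parse_conllu(block: str) -> List[Word]:
    """Parse one sentence in CoNLL-U format (10 tab-separated columns)."""
    words = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multi-word token ranges such as "1-2"
        words.append(Word(int(cols[0]), cols[1], cols[2], cols[3],
                          cols[5], int(cols[6]), cols[7]))
    return words


# Hand-written illustrative annotation of a short Hindi sentence.
SAMPLE = "\n".join("\t".join(row) for row in [
    ("1", "राम", "राम", "PROPN", "_", "_", "3", "nsubj", "_", "_"),
    ("2", "घर", "घर", "NOUN", "_", "_", "3", "obl", "_", "_"),
    ("3", "गया", "जा", "VERB", "_", "_", "0", "root", "_", "_"),
])

for word in parse_conllu(SAMPLE):
    print(word.index, word.text, word.lemma, word.upos, word.head, word.deprel)
```

Because the same columns and tag sets are used for every language, code like this works unchanged whether the analysed text is Hindi, Tamil or English.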

Top datasets for NLP (Indian languages)

Semantic Relations from Wikipedia: Contains automatically extracted semantic relations from multilingual Wikipedia corpus.

HC Corpora (Old Newspapers): This dataset is a subset of the HC Corpora newspapers containing 16,806,041 sentences and paragraphs in 67 languages, including Hindi.

Sentiment Lexicons for 81 Languages: This dataset contains positive and negative sentiment lexicons for 81 languages, including Hindi.

IIT Bombay English-Hindi Parallel Corpus: This dataset contains a parallel corpus for English-Hindi and a monolingual Hindi corpus. It was developed at the Center for Indian Language Technology.

Indic Languages Multilingual Parallel Corpus: This parallel corpus covers 7 Indic languages in addition to English: Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese and Urdu.

Microsoft Speech Corpus (Indian languages) (Audio Dataset): This corpus contains conversational, phrasal training and test data for Telugu, Gujarati and Tamil.

Hindi Speech Recognition Corpus (Audio Dataset): This is a corpus collected in India consisting of voices of 200 different speakers from different regions of the country. It also contains 100 pairs of daily spontaneous conversational speech data.

Sameer Balaganur
Sameer is an aspiring Content Writer. Occasionally writes poems, loves food and is head over heels with Basketball.
