Top NLP Libraries & Datasets For Indian Languages

Natural language processing (NLP) has the potential to broaden online access for Indian citizens, helped by advances in GPU computing, the availability of high-speed internet and the growing use of smartphones. In one survey on the benefits of chatbots, 55% of consumers named getting answers to simple questions as a significant benefit. In India, however, delivering that is challenging because the country’s languages are far from simple.

Indian languages pose many challenges for NLP, such as ambiguity, complexity, grammar, translation problems and the difficulty of obtaining the right data for NLP algorithms. These same challenges create many opportunities for NLP projects in India.

Below we look at some of the top NLP resources for Indian Languages:

Top NLP libraries for Indian Languages

iNLTK (Natural Language Toolkit for Indic Languages)

iNLTK provides support for various NLP applications in Indic languages. The languages supported are Hindi (hi), Punjabi (pa), Sanskrit (sa), Gujarati (gu), Kannada (kn), Malayalam (ml), Nepali (ne), Odia (or), Marathi (mr), Bengali (bn), Tamil (ta), Urdu (ur), English (en).

iNLTK is similar to the NLTK Python package. It offers an easy-to-use API for NLP tasks such as tokenisation and generating vector embeddings for input text.

First, install PyTorch:

pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

Then install iNLTK using pip:

pip install inltk

Indic NLP Library:

Indian languages share many similarities in script, phonology, syntax and so on, and this library aims to provide a general solution to the text-processing problems they have in common.

Indic NLP Library provides functionalities like text normalisation, script normalisation, tokenisation, word segmentation, romanisation, indicisation, script conversion, transliteration and translation.
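To make text normalisation and tokenisation more concrete, here is a minimal sketch that uses only the Python standard library. It is not the Indic NLP Library's implementation or API: the function names normalise and tokenise are hypothetical, and the punctuation set is an assumption chosen for illustration.

```python
import re
import unicodedata

# Punctuation to split off as separate tokens (assumed set for this sketch).
# U+0964 is the danda and U+0965 the double danda, used as sentence
# terminators in Hindi and several other Indic languages.
PUNCTUATION = re.compile(r"([\u0964\u0965.,!?;:()])")


def normalise(text):
    """Bring text to Unicode NFC form and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenise(text):
    """Split text into tokens, treating each punctuation mark as a token."""
    spaced = PUNCTUATION.sub(r" \1 ", normalise(text))
    return spaced.split()


if __name__ == "__main__":
    # The same letter can be encoded in more than one way, for example as one
    # precomposed code point or as a base letter plus a nukta sign.
    # Normalisation maps both spellings to a single form.
    precomposed = "\u0958"
    decomposed = "\u0915\u093c"
    print(normalise(precomposed) == normalise(decomposed))

    print(tokenise("राम घर गया। सीता आई।"))
```

Normalising before tokenising means that visually identical words compare as equal later on, which matters for vocabulary building and lookups.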

Languages supported:

  • Indo-Aryan:

Assamese (asm), Bengali (ben), Gujarati (guj), Hindi/Urdu (hin/urd), Marathi (mar), Nepali (nep), Odia (ori), Punjabi (pan), Sindhi (snd), Sinhala (sin), Sanskrit (san), Konkani (kok).

  • Dravidian:

Kannada (kan), Malayalam (mal), Telugu (tel), Tamil (tam).

  • Others:

English (eng).

Tasks handled:

  • Bilingual tasks: script conversion between the languages mentioned above, except Urdu and English.
  • Monolingual tasks: the library supports languages such as Konkani, Sindhi and Telugu, which aren’t supported by the iNLTK library.
  • Transliteration among the 18 languages mentioned above.
  • Translation among ten languages.
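Script conversion between many Indic scripts is possible in part because Unicode lays out the major Brahmi-derived script blocks in parallel, so the same letter tends to sit at the same offset within each 128-code-point block. The sketch below illustrates that idea with the standard library only. It is a simplified illustration, not the Indic NLP Library's code or API: the names SCRIPT_BASE and convert_script are hypothetical, and letters without a counterpart in the target script are simply left unchanged.

```python
import unicodedata

# Starting code point of each script's Unicode block.
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

# The danda and double danda live in the Devanagari block but are shared by
# the other scripts, so they must not be shifted.
SHARED = {"\u0964", "\u0965"}


def convert_script(text, source, target):
    """Map each character of `source` script to the same offset in `target`."""
    src_base = SCRIPT_BASE[source]
    tgt_base = SCRIPT_BASE[target]
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        if ch in SHARED or not 0 <= offset < BLOCK_SIZE:
            out.append(ch)  # outside the source block: copy unchanged
            continue
        candidate = chr(tgt_base + offset)
        # "Cn" marks an unassigned code point, meaning the target script has
        # no character at this offset; keep the original in that case.
        if unicodedata.category(candidate) == "Cn":
            out.append(ch)
        else:
            out.append(candidate)
    return "".join(out)


if __name__ == "__main__":
    print(convert_script("कमल", "devanagari", "bengali"))
```

A production converter needs script-specific exceptions on top of this offset rule, which is the kind of detail a dedicated library takes care of.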

The library needs Python 2.7+, the Indic NLP Resources (required only for some modules) and the Morfessor 2.0 Python library.

Installation:

pip install indic-nlp-library

Next, download the resources folder which contains the models for different languages. 

# download the resource

git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git

StanfordNLP:

StanfordNLP contains tools that convert a string of human-language text into lists of sentences and words, generate the base forms, parts of speech and morphological features of those words, and produce a syntactic dependency parse. The dependency parse is designed to be parallel across more than 70 languages, using the Universal Dependencies formalism.

The library also inherits additional functionality from the CoreNLP Java package, such as constituency parsing, linguistic pattern matching and coreference resolution.

The modules are built on top of PyTorch. The package combines software based on the Stanford entry in the CoNLL 2018 Shared Task on Universal Dependency Parsing with the Java Stanford CoreNLP software.

StanfordNLP offers features like:

  • Easy Native Python Implementation.
  • Complete neural network pipeline for better and easy text analytics which includes multi-word token (MWT) expansion, tokenisation, parts-of-speech (POS), lemmatisation, morphological features tagging and dependency parsing.
  • Stable Python interface to CoreNLP.
  • The neural network model has support for 53 human languages featured in 73 treebanks.
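Annotations of this kind (base forms, parts of speech and dependency relations) are conventionally exchanged in the ten-column, tab-separated CoNLL-U format used by Universal Dependencies. The sketch below reads such a block with the standard library only, to show what a dependency parse contains. It does not call StanfordNLP: the Word class and both functions are hypothetical names, and the Hindi sample is a hand-written illustrative annotation, not the output of any model.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    index: int    # 1-based position in the sentence
    text: str     # surface form
    lemma: str    # base form
    upos: str     # universal part-of-speech tag
    head: int     # index of the governing word, 0 for the root
    deprel: str   # dependency relation to the head


# Hand-written example for the sentence "राम घर गया" ("Ram went home").
SAMPLE = "\n".join([
    "# text = राम घर गया",
    "1\tराम\tराम\tPROPN\t_\t_\t3\tnsubj\t_\t_",
    "2\tघर\tघर\tNOUN\t_\t_\t3\tobl\t_\t_",
    "3\tगया\tजा\tVERB\t_\t_\t0\troot\t_\t_",
])


def parse_conllu(block: str) -> List[Word]:
    """Parse one CoNLL-U sentence block into a list of Word records."""
    words = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multi-word token ranges such as "1-2"
        words.append(Word(int(cols[0]), cols[1], cols[2], cols[3],
                          int(cols[6]), cols[7]))
    return words


def dependencies(words: List[Word]) -> List[Tuple[str, str, str]]:
    """Return (word, head word, relation) triples; the root's head is 'ROOT'."""
    by_index = {w.index: w.text for w in words}
    return [(w.text, by_index.get(w.head, "ROOT"), w.deprel) for w in words]


if __name__ == "__main__":
    for triple in dependencies(parse_conllu(SAMPLE)):
        print(triple)
```

Each triple links a word to the word that governs it, which is the structure a dependency parser predicts for every sentence.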

Install using pip:

pip install stanfordnlp

Top datasets for NLP (Indian languages)

Semantic Relations from Wikipedia: Contains automatically extracted semantic relations from multilingual Wikipedia corpus.

HC Corpora (Old Newspapers): This dataset is a subset of HC Corpora newspapers containing around 16,806,041 sentences and paragraphs in 67 languages including Hindi.

Sentiment Lexicons for 81 Languages: This dataset contains positive and negative sentiment lexicons for 81 languages which also includes Hindi.

IIT Bombay English-Hindi Parallel Corpus: This dataset contains a parallel corpus for English-Hindi and a monolingual Hindi corpus. It was developed at the Center for Indian Language Technology.

Indic Languages Multilingual Parallel Corpus: This parallel corpus covers 7 Indic languages (in addition to English): Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese and Urdu.

Microsoft Speech Corpus (Indian languages) (Audio dataset): This corpus contains conversational and phrasal training and test data for Telugu, Gujarati and Tamil.

Hindi Speech Recognition Corpus (Audio dataset): This is a corpus collected in India consisting of voices of 200 different speakers from different regions of the country. It also contains 100 pairs of daily spontaneous conversational speech data.

PS: The story was written using a keyboard.
Sameer Balaganur

Sameer is an aspiring Content Writer. Occasionally writes poems, loves food and is head over heels with Basketball.