Top NLP Libraries & Datasets For Indian Languages

Natural language processing (NLP) has the potential to broaden online access for Indian citizens, helped by advances in GPU computing, the availability of high-speed internet and the growing use of smartphones. In one survey of consumers on the benefits of chatbots, 55% of respondents named getting answers to simple questions as a significant benefit. Delivering that in India is challenging, however, because Indian languages are far from simple.

Indian languages pose many challenges for NLP, including ambiguity, complexity, grammar, translation problems and the difficulty of obtaining the right data for NLP algorithms. These challenges in turn create many opportunities for NLP projects in India.

Below we look at some of the top NLP resources for Indian Languages:

Top NLP libraries for Indian Languages

iNLTK (Natural Language Toolkit for Indic Languages)

iNLTK provides support for various NLP applications in Indic languages. The languages supported are Hindi (hi), Punjabi (pa), Sanskrit (sa), Gujarati (gu), Kannada (kn), Malayalam (ml), Nepali (ne), Odia (or), Marathi (mr), Bengali (bn), Tamil (ta), Urdu (ur), English (en).

iNLTK is similar to the NLTK Python package. It offers an easy-to-use API for NLP tasks such as tokenisation and generating vector embeddings for input text.

First, install PyTorch:

pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

Next, install iNLTK using pip:

pip install inltk
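
Once both packages are installed, tokenisation can be tried along the following lines. This is a minimal sketch rather than official usage: `tokenize` is the function name that iNLTK's documentation describes, while the `INLTK_LANGUAGE_CODES` table (copied from the language list above) and the `tokenize_text` wrapper are illustrative names introduced here.

```python
# Language codes copied from the list of supported languages above.
INLTK_LANGUAGE_CODES = {
    "hi": "Hindi", "pa": "Punjabi", "sa": "Sanskrit", "gu": "Gujarati",
    "kn": "Kannada", "ml": "Malayalam", "ne": "Nepali", "or": "Odia",
    "mr": "Marathi", "bn": "Bengali", "ta": "Tamil", "ur": "Urdu",
    "en": "English",
}


def tokenize_text(text, lang_code):
    """Tokenise `text` with iNLTK (hypothetical convenience wrapper)."""
    if lang_code not in INLTK_LANGUAGE_CODES:
        raise ValueError(f"unsupported language code: {lang_code!r}")
    # Imported lazily so the check above works even before iNLTK is installed.
    from inltk.inltk import tokenize
    return tokenize(text, lang_code)
```

Before the first call for a language, iNLTK expects a one-time `setup("hi")` (imported from `inltk.inltk`) to download that language's model. After that, `tokenize_text(some_hindi_text, "hi")` should return a list of tokens.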

Indic NLP Library:

Indian languages share a lot of similarity in terms of script, phonology, language syntax, etc., and this library uses that common ground to provide a general solution for commonly needed text-processing tasks.

Indic NLP Library provides functionalities like text normalisation, script normalisation, tokenisation, word segmentation, romanisation, indicisation, script conversion, transliteration and translation.

Languages supported:

  • Indo-Aryan:

Assamese (asm), Bengali (ben), Gujarati (guj), Hindi/Urdu (hin/urd), Marathi (mar), Nepali (nep), Odia (ori), Punjabi (pan), Sindhi (snd), Sinhala (sin), Sanskrit (san), Konkani (kok).

  • Dravidian:

Kannada (kan), Malayalam (mal), Telugu (tel), Tamil (tam).

  • Others:

English (eng).

Tasks handled:

  • Bilingual tasks such as script conversion for the languages mentioned above, except Urdu and English.
  • Monolingual tasks, including for languages like Konkani, Sindhi and Telugu that the iNLTK library does not support.
  • Transliteration amongst the 18 languages mentioned above.
  • Translation amongst ten languages.
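
Script conversion between these languages is practical because Unicode lays out the major Indic scripts in parallel 128-code-point blocks, so a character can often be moved to another script by shifting its code point. The sketch below illustrates that idea in plain Python. It is a simplification and not the library's own implementation: the `SCRIPT_BASE` table and the `convert_script` function are names assumed here, and characters that have no counterpart in the target script are not handled.

```python
# First code point of each script's Unicode block, keyed by the
# three-letter language codes used in the list above (illustrative table).
SCRIPT_BASE = {
    "hin": 0x0900,  # Devanagari
    "ben": 0x0980,  # Bengali
    "pan": 0x0A00,  # Gurmukhi
    "guj": 0x0A80,  # Gujarati
    "ori": 0x0B00,  # Odia
    "tam": 0x0B80,  # Tamil
    "tel": 0x0C00,  # Telugu
    "kan": 0x0C80,  # Kannada
    "mal": 0x0D00,  # Malayalam
}
BLOCK_SIZE = 0x80  # each of these blocks spans 128 code points


def convert_script(text, src_lang, tgt_lang):
    """Shift every character of the source script into the target block."""
    src_base = SCRIPT_BASE[src_lang]
    tgt_base = SCRIPT_BASE[tgt_lang]
    converted = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE:
            # Same position in the block, different script.
            converted.append(chr(tgt_base + offset))
        else:
            # Spaces, digits, Latin letters etc. pass through unchanged.
            converted.append(ch)
    return "".join(converted)


print(convert_script("नमस्ते", "hin", "guj"))
```

The final line should print the same word written in Gujarati script. A production tool needs extra rules on top of this, for example for Tamil, which lacks several consonants that other scripts have.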

The library needs Python 2.7+, Indic NLP Resources (only for some modules) and the Morfessor 2.0 Python library.

Installation:

pip install indic-nlp-library

Next, download the resources folder which contains the models for different languages. 

# download the resource

git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
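
With the resources cloned, the library has to be told where they live before the modules that depend on them are used. The following sketch assumes the calls shown in the library's documentation (`common.set_resources_path`, `loader.load` and `indic_tokenize.trivial_tokenize`). The wrapper function names are made up for this example, and the two-letter code `"hi"` follows the library's own examples rather than the three-letter codes listed above.

```python
import os


def init_indic_nlp(resources_dir):
    """Point the Indic NLP Library at a local clone of indic_nlp_resources."""
    if not os.path.isdir(resources_dir):
        raise FileNotFoundError(f"resources folder not found: {resources_dir}")
    # Imported lazily so the path check works without the library installed.
    from indicnlp import common, loader
    common.set_resources_path(resources_dir)
    loader.load()


def tokenize_indic(text, lang="hi"):
    """Split `text` into word and punctuation tokens."""
    from indicnlp.tokenize import indic_tokenize
    return indic_tokenize.trivial_tokenize(text, lang)
```

A typical session would call `init_indic_nlp("indic_nlp_resources")` once, from the directory where the clone was made, and then `tokenize_indic(...)` as needed. Simple tokenisation may work without the resources, which are needed only for some modules.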

StanfordNLP:

StanfordNLP contains tools that convert a string of human language text into lists of sentences and words, generate the base forms of those words along with their parts of speech and morphological features, and produce a syntactic dependency parse. The dependency parse is designed to be parallel across more than 70 languages, using the Universal Dependencies formalism.

The library also inherits additional functionality from the CoreNLP Java package, such as constituency parsing, linguistic pattern matching and coreference resolution.

The modules are built on top of PyTorch, and the package is a combination of software based on the Stanford entry in the CoNLL 2018 Shared Task on Universal Dependency Parsing and the Java Stanford CoreNLP software.

StanfordNLP offers features like:

  • Easy Native Python Implementation.
  • Complete neural network pipeline for easier text analytics, which includes tokenisation, multi-word token (MWT) expansion, lemmatisation, part-of-speech (POS) and morphological features tagging, and dependency parsing.
  • Stable Python interface to CoreNLP.
  • The neural network model has support for 53 human languages featured in 73 treebanks.

Install using pip:

pip install stanfordnlp
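
After installation, a pipeline can be run roughly as follows. This is a sketch that assumes the Hindi models have already been fetched with `stanfordnlp.download('hi')`. The helpers `parse_text` and `format_rows` are names introduced here, while `Pipeline`, `sentences`, `words` and the word attributes are used as the StanfordNLP documentation describes them.

```python
def parse_text(text, lang="hi"):
    """Run the neural pipeline and return one tuple per word."""
    # Imported lazily so format_rows below can be used without the library.
    import stanfordnlp

    nlp = stanfordnlp.Pipeline(lang=lang)  # loads the downloaded models
    doc = nlp(text)
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            rows.append((word.text, word.lemma, word.upos,
                         word.governor, word.dependency_relation))
    return rows


def format_rows(rows):
    """Render the tuples as tab-separated lines, one word per line."""
    return "\n".join("\t".join(str(value) for value in row) for row in rows)
```

Calling `print(format_rows(parse_text(some_hindi_text)))` would then list each word with its lemma, universal POS tag, the index of its governing word and the dependency relation. Building the pipeline is slow, so real code would create it once and reuse it.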

Top datasets for NLP (Indian languages)

Semantic Relations from Wikipedia: Contains semantic relations automatically extracted from a multilingual Wikipedia corpus.

HC Corpora (Old Newspapers): This dataset is a subset of the HC Corpora newspapers, containing 16,806,041 sentences and paragraphs in 67 languages, including Hindi.

Sentiment Lexicons for 81 Languages: This dataset contains positive and negative sentiment lexicons for 81 languages, including Hindi.

IIT Bombay English-Hindi Parallel Corpus: This dataset contains a parallel corpus for English-Hindi and a monolingual Hindi corpus. It was developed at the Center for Indian Language Technology.

Indic Languages Multilingual Parallel Corpus: This parallel corpus covers seven Indic languages in addition to English: Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese and Urdu.

Microsoft Speech Corpus (Indian languages) (Audio dataset): This corpus contains conversational and phrasal training and test data for Telugu, Gujarati and Tamil.

Hindi Speech Recognition Corpus (Audio dataset): This corpus was collected in India and consists of the voices of 200 different speakers from different regions of the country. It also contains 100 pairs of daily spontaneous conversational speech data.

Copyright Analytics India Magazine Pvt Ltd
