Top 10 R Packages For Natural Language Processing (NLP)

Ambika Choudhury

R is one of the most popular languages for statistical computing among developers and statisticians. According to our latest report, R is the second most-preferred programming language among data scientists and practitioners, after Python. The language ruled the preference scale, with a combined figure of 81.9 percent utilisation for statistical modelling among those surveyed.

Below are the top ten R packages for NLP that one should know.

(The list is in alphabetical order.)

1| koRpus

koRpus is an R package for analysing texts. It includes a diverse collection of functions, among them automatic language detection and several indices of lexical diversity, such as type-token ratio and MTLD. koRpus also provides a plugin for RKWard, an R GUI and IDE, which offers graphical dialogs for its basic features.
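To make the type-token ratio concrete, here is a minimal sketch in base R. It does not use koRpus itself; the function name `type_token_ratio` and the simple letter-based tokeniser are assumptions made for illustration only.

```r
# Base-R sketch of the type-token ratio (TTR); this is not koRpus code.
# type_token_ratio() is a hypothetical helper written for this example.
type_token_ratio <- function(text) {
  # Lower-case the text and split on anything that is not a letter or apostrophe.
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]          # drop empty strings left by the split
  length(unique(tokens)) / length(tokens)   # distinct words / total words
}

sample_text <- "The cat sat on the mat. The dog sat on the log."
ttr <- type_token_ratio(sample_text)
print(ttr)
```

The sample has 12 tokens and 7 distinct types, so the ratio is 7/12. A higher value indicates a more varied vocabulary.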

2| lsa

Latent Semantic Analysis, or lsa, is an R package that provides routines for performing latent semantic analysis with R. The basic idea behind the package is that texts have a higher-order (latent semantic) structure which is obscured by word usage, e.g. through the use of synonyms or polysemy.
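The idea of a latent structure hidden by synonyms can be sketched with a truncated singular value decomposition in base R. This is an illustration of the technique on toy data, not the lsa package's own API; the matrix, the helper `cosine_sim` and the choice of `k = 2` are all assumptions.

```r
# Term-document matrix: rows are terms, columns are documents (toy data).
tdm <- matrix(
  c(1, 0, 0,   # car
    0, 1, 0,   # automobile
    1, 1, 0,   # engine
    0, 0, 1,   # flower
    0, 0, 1),  # petal
  nrow = 5, byrow = TRUE,
  dimnames = list(c("car", "automobile", "engine", "flower", "petal"),
                  c("d1", "d2", "d3"))
)

# Hypothetical helper: cosine similarity between two numeric vectors.
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Truncated SVD: keep only the k largest singular values.
k <- 2
decomp <- svd(tdm)
doc_vectors <- diag(decomp$d[1:k]) %*% t(decomp$v[, 1:k])  # k x documents
colnames(doc_vectors) <- colnames(tdm)

# d1 says "car", d2 says "automobile"; they only share the word "engine".
raw_sim    <- cosine_sim(tdm[, "d1"], tdm[, "d2"])                  # on raw counts
latent_sim <- cosine_sim(doc_vectors[, "d1"], doc_vectors[, "d2"])  # in latent space
print(c(raw = raw_sim, latent = latent_sim))
```

On the raw counts the two documents have a cosine similarity of 0.5, while in the two-dimensional latent space they coincide, because the reduced space merges the "car" and "automobile" usage patterns.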

3| openNLP

openNLP provides an R interface to Apache OpenNLP, which is a collection of natural language processing tools written in Java. Apache OpenNLP supports common natural language processing tasks such as tokenisation, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing and coreference resolution.

4| quanteda

quanteda is an R package for managing and analysing text. It is a fast, flexible and comprehensive framework for quantitative text analysis in R. quanteda provides functionality for corpus management, creating and manipulating tokens and n-grams, exploring keywords in context, forming and manipulating sparse matrices of documents by features, and more.
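The features listed above map onto a handful of quanteda functions. The following is a minimal sketch, assuming the quanteda package is installed; the two sample sentences and the variable names are made up for illustration.

```r
library(quanteda)

# Toy input: a named character vector, one element per document.
texts <- c(doc1 = "Text analysis in R is fast.",
           doc2 = "R makes text analysis flexible.")

corp    <- corpus(texts)                         # corpus management
toks    <- tokens(corp, remove_punct = TRUE)     # tokenisation
bigrams <- tokens_ngrams(toks, n = 2)            # n-grams built from the tokens
context <- kwic(toks, pattern = "analysis", window = 2)  # keywords in context
doc_features <- dfm(toks)                        # sparse document-feature matrix

print(context)
print(doc_features)
```

The document-feature matrix is stored sparsely, which keeps memory use manageable for larger corpora.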

5| RWeka

RWeka is an R interface to Weka, which is a collection of machine learning algorithms for data mining tasks written in Java. Weka contains tools for data pre-processing, clustering, association rules, visualisation and more. The RWeka package contains the interface code, while the Weka jar itself resides in a separate package called ‘RWekajars’.

6| spacyr

spacyr is an R wrapper around the Python spaCy NLP library. The package is designed to provide easy access to the functionality of the spaCy library in a simple format. One of the easiest ways to install spaCy and spacyr is through the spacyr function spacy_install().

7| stringr

stringr is a simple, easy-to-use R package that provides consistent wrappers for the stringi package and therefore simplifies the manipulation of character strings in R. It includes a set of internally consistent tools for working with character strings, i.e. sequences of characters surrounded by quotation marks.
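A few typical string operations show what the consistent interface looks like in practice. This is a minimal sketch, assuming the stringr package is installed; the input strings and variable names are made up for illustration.

```r
library(stringr)

x <- c("  Natural Language ", "processing in R")

trimmed <- str_trim(x)                        # strip leading/trailing whitespace
lower   <- str_to_lower(trimmed)              # convert to lower case
has_r   <- str_detect(lower, "\\br\\b")       # does "r" occur as a whole word?
swapped <- str_replace_all(lower, " ", "_")   # replace every space with "_"
lengths <- str_length(trimmed)                # number of characters per string

print(swapped)
```

Every function takes the string vector as its first argument and starts with `str_`, which is what makes the calls easy to chain and to remember.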

8| text2vec

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP). Its design goals include making complex tasks easy to solve, maximising efficiency per thread, scaling transparently to multiple threads on multicore machines, and using streams and iterators, among others.

9| tm

tm, or the Text Mining Package, is a framework for text mining applications within R. The package provides a set of predefined sources, such as DirSource, VectorSource and DataframeSource, which handle a directory, a vector interpreting each component as a document, and data-frame-like structures (such as CSV files) respectively, and more.
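As an illustration of how a source feeds a corpus, the sketch below builds a small corpus from a character vector and turns it into a document-term matrix. It assumes the tm package is installed; the sample sentences and variable names are made up for illustration.

```r
library(tm)

docs <- c("Text mining with R.", "Mining text from a vector source.")

# VectorSource treats each element of the vector as one document.
corp <- VCorpus(VectorSource(docs))
corp <- tm_map(corp, content_transformer(tolower))  # lower-case every document
corp <- tm_map(corp, removePunctuation)             # strip punctuation

dtm <- DocumentTermMatrix(corp)   # rows are documents, columns are terms
print(Terms(dtm))
```

With the default settings, terms shorter than three characters (here "r" and "a") are left out of the matrix.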

10| wordcloud

wordcloud is an R package that creates pretty word clouds, visualises differences and similarities between documents, and avoids overplotting in scatter plots with text. The word cloud is a commonly used plot for visualising a speech or a set of documents in a clear way.

