Stemming and lemmatization are text normalisation techniques used in NLP. Both reduce words to a root form. Why do they matter in NLP, and how do they differ? Let’s find out.
Lemmatization reduces a word to its canonical, or dictionary, form, which is called a ‘lemma’. The method groups the inflected forms of a word so that they can be analysed as a single item; for example, ‘studies’ and ‘studying’ both map to ‘study’. The process is similar to stemming, but the resulting root words are valid words with a meaning.
Lemmatization has applications in:
- Biomedicine: Using lemmatization to parse biomedicine literature may increase the efficiency of data retrieval tasks.
- Search engines
- Compact indexing: Lemmatization is an efficient method for storing data in the form of index values.
For example, NLTK provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus. The class uses the morphy() function of the WordNetCorpusReader class to find the root word, or lemma.
Lemmatizers need far more information about the structure of a language than stemmers do, which makes building a lemmatizer harder than writing a stemming algorithm.
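To make the dictionary-based idea concrete, here is a minimal sketch of a lemmatizer that first consults a table of irregular forms and then tries suffix rules, accepting a candidate only if it appears in a lexicon. The names `LEXICON`, `EXCEPTIONS`, `RULES` and `lemmatize` are hypothetical, the word lists are tiny stand-ins for a real dictionary, and the sketch ignores part of speech. It is not NLTK's implementation.

```python
# Hypothetical mini lexicon; a real lemmatizer relies on a full dictionary such as WordNet.
LEXICON = {"study", "corpus", "run", "mouse", "leaf"}

# Irregular forms that no suffix rule can recover (illustrative entries only).
EXCEPTIONS = {"mice": "mouse", "ran": "run", "corpora": "corpus"}

# (suffix to detach, replacement) pairs, tried in order.
RULES = [("ies", "y"), ("ves", "f"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]


def lemmatize(word):
    """Return a dictionary form of `word`, or the word unchanged if none is found."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in LEXICON:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Accept the candidate only if it is a known dictionary word.
            if candidate in LEXICON:
                return candidate
    return word


for token in ("studies", "leaves", "mice", "runs", "xyzzing"):
    print(token, "->", lemmatize(token))
```

Because every candidate is checked against the lexicon, the function returns either a known word or the input unchanged; it never invents a truncated form. That check is also why a lemmatizer needs far more language data than a stemmer.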
Stemming is a rule-based approach that reduces the inflected or derived variants of a word to a common stem. This heuristic process is the simpler of the two, as it involves indiscriminately cutting off the ends of words, so the resulting stem need not be a valid word. Stemming helps to shorten look-ups and normalise sentences for easier processing. The process has two main challenges:
- Overstemming: The inflected word is cut back so much that the resulting stem is nonsensical. Overstemming can also cause different words with different meanings to share the same stem. For example, “universal”, “university” and “universe” are all reduced to “univers”. Although these three words are etymologically related, their modern meanings differ widely, so treating them as synonyms in a search engine leads to inferior search results.
- Understemming: Here, inflected words that should share a stem are reduced to different stems. The issue crops up when we have several words that are actually forms of one another. An example of understemming in the Porter stemmer is “alumnus” → “alumnu”, “alumni” → “alumni”, “alumna”/“alumnae” → “alumna”. These English words keep their Latin morphology, so the near-synonyms are not combined.
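Both challenges can be reproduced with a very small suffix-stripping stemmer. The sketch below is a toy, not the Porter algorithm: the suffix list `SUFFIXES`, the function `naive_stem` and the `min_stem_length` parameter are assumptions made for illustration.

```python
# Hypothetical suffix list, checked in order; real stemmers use far richer rules.
SUFFIXES = ("ing", "ity", "ies", "al", "ed", "es", "s", "e")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, with no dictionary check at all."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word


# Overstemming: words with different meanings collapse to one stem.
overstemmed = {w: naive_stem(w) for w in ("universal", "university", "universe")}

# Understemming: forms of the same word end up with different stems.
understemmed = {w: naive_stem(w) for w in ("alumnus", "alumni", "alumna", "alumnae")}

print(overstemmed)
print(understemmed)
```

With this toy rule set, the first group collapses to the single stem "univers", while the second group yields three different stems ("alumnu", "alumni" and "alumna"), so a search for one form would miss the others.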
Lemmatization versus stemming
Both procedures share the same goal: reducing the inflectional forms of each word to a common base or root. However, they differ in how they work and, therefore, in the result each returns.
- Stemming is faster than lemmatization because it chops off word endings irrespective of context, whereas lemmatization is context-dependent.
- Stemming is a rule-based approach, whereas lemmatization is a dictionary-based approach.
- Lemmatization is generally more accurate than stemming, because its output is a valid dictionary word.
- Lemmatization is preferred for context analysis, whereas stemming is recommended when the context is not important.
To assess the performance of the processes, stemming and lemmatization were compared against the baseline technique used in NLP models provided with the CACM collection. Mean Average Precision (MAP) was used to evaluate document relevancy at the top 10 and 20 document levels.
Both stemming and lemmatization outperformed the baseline technique at both document levels. This indicates that queries processed using language modeling techniques retrieve more relevant documents than queries that are not processed.
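For readers unfamiliar with the metric, the following sketch shows one common way to compute Mean Average Precision at a cutoff k. It assumes the variant that normalises by min(k, number of relevant documents) and that each ranking contains no duplicate IDs; the evaluation described above may have used a different variant. The function names, document IDs and relevance judgements are hypothetical.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Average of the precision values at each rank (up to k) holding a relevant document."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumed normalisation: the best achievable number of hits within the cutoff.
    return precision_sum / min(k, len(relevant))


def mean_average_precision(runs, k):
    """Mean of AP@k over (ranking, relevant set) pairs, one pair per query."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical results for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
map_at_10 = mean_average_precision(runs, k=10)
print(round(map_at_10, 4))
```

Evaluating at the top 10 and top 20 document levels amounts to calling `mean_average_precision` with `k=10` and `k=20` on the same set of ranked results.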
While stemmers are quick to create and run, lemmatizers provide a better quality of results or lower margins of error.
The key applications of these methodologies include:
- Information retrieval: Stemming and lemmatization can be used to map documents to general topics and provide search results by indexing.
- Document clustering: Stemming and lemmatization reduce the number of distinct tokens needed to convey the same information. Features are then estimated from the frequency of each token, and clustering methods are applied.
- Sentiment analysis
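As a rough illustration of the document clustering point, the sketch below counts token frequencies with and without normalisation. The lookup table `NORMAL_FORMS` is a hypothetical stand-in for a real stemmer or lemmatizer, and `token_counts` is an assumed helper name; clustering itself is not shown.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real stemmer or lemmatizer.
NORMAL_FORMS = {"cats": "cat", "running": "run", "runs": "run", "ran": "run"}


def token_counts(text, normalise=False):
    """Count whitespace-separated tokens, optionally mapping each to its normal form."""
    tokens = text.lower().split()
    if normalise:
        tokens = [NORMAL_FORMS.get(token, token) for token in tokens]
    return Counter(tokens)


doc = "cat runs cats ran running cat"
raw = token_counts(doc)
normalised = token_counts(doc, normalise=True)

print(len(raw), dict(raw))
print(len(normalised), dict(normalised))
```

Here the raw text has five distinct tokens, while the normalised text has only two, so the frequency features passed to a clustering method are fewer and denser.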