Naive Bayes – Why Is It Favoured For Text Related Tasks?

We receive a large number of emails every day, a mix of spam and legitimate messages. Automatically separating spam from non-spam is therefore an important application of text classification. Another common one is sentiment analysis: determining the sentiment expressed in a body of text to understand how the general public perceives a product or an event. Twitter sentiment analysis is one of the most commonly seen examples, used to gauge consumer sentiment after a recent event or the launch of a product.

The applications mentioned above are binary or multiclass classification problems, i.e. each input has two or more possible outcomes. Such problems call for classification algorithms like Logistic Regression, tree-based algorithms, Support Vector Machines or Naive Bayes. In applications like spam filtering and sentiment analysis, the data consists mainly of text, such as reviews or the contents of an email, and in practice Naive Bayes often gives strong results on this kind of data while remaining simple and fast to train.

The Naive Bayesian algorithm is a simple classification algorithm that uses the probabilities of events to make its predictions. It is based on Bayes' Theorem, combined with the assumption that the features are independent of one another given the class. For example, a fruit may be considered a banana if it is yellow/green in colour, banana-shaped and 1-2cm in radius. Each of these properties is treated as contributing independently to the probability that the fruit is a banana, even if in reality they are related, and this simplifying assumption is why the algorithm is called “Naive”. Because of this assumption, the classifier has relatively few parameters to estimate, so it can be trained on comparatively little data and tends to be fairly tolerant of some mislabeled examples.
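
Written out for the fruit example, the independence assumption means the joint likelihood of the features factors into a product of per-feature terms. The feature names below are just shorthand for the properties of the banana described above:

```latex
P(\text{colour}, \text{shape}, \text{radius} \mid \text{banana})
  = P(\text{colour} \mid \text{banana})
    \cdot P(\text{shape} \mid \text{banana})
    \cdot P(\text{radius} \mid \text{banana})
```

Each factor can be estimated on its own from the training data, which is what keeps the model small.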

Bayes' Theorem is given by the following formula:

$$P(A \mid B) = \frac{P(A) \times P(B \mid A)}{P(B)}$$

Here we are calculating P(A|B), the posterior probability of class A given predictor B. P(A) is the prior probability of the class. P(B|A) is the likelihood, i.e. the probability of predictor B given class A. P(B) is the prior probability of the predictor B. In text classification the predictors are the words, so these quantities let us calculate how probable each class is given the words in a text.
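
As a small worked example of the formula, the sketch below computes the probability that an email is spam given that it contains one particular word. All of the numbers are hypothetical and chosen only for illustration; they do not come from any real dataset.

```python
# Hypothetical probabilities, for illustration only.
p_spam = 0.4               # P(A): prior probability that an email is spam
p_word_given_spam = 0.25   # P(B|A): probability the word appears in a spam email
p_word_given_ham = 0.05    # probability the word appears in a non-spam email

# P(B): overall probability of seeing the word, summed over both classes.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' Theorem: posterior = prior * likelihood / evidence.
p_spam_given_word = p_spam * p_word_given_spam / p_word

print(round(p_spam_given_word, 3))  # 0.769
```

Seeing the word raises the probability of spam from the prior of 0.4 to roughly 0.77.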

Bayesian statistics differs from classical (frequentist) statistics in how probability is interpreted. In the frequentist view, a probability is the long-run frequency of a random event over a repeated number of trials, while Bayesian statistics works with prior and posterior probabilities. This makes it possible to update a probability as evidence arrives: the prior is the probability before a certain event is observed and the posterior is the probability after it.

The Naive Bayesian classifier performs the following steps –

  • Create a frequency table of the words for each class
  • Calculate the likelihood of the words for each of the classes based on the frequency table
  • Calculate the posterior probability for each class
  • The class with the highest posterior probability is the outcome of the prediction
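
The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function names and the tiny training set are hypothetical, and two details not mentioned in the article are assumed, namely add-one smoothing (so that an unseen word does not force a probability of zero) and log probabilities (to avoid numerical underflow). P(B) is left out because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs is a list of (text, label) pairs; returns the counts needed to predict."""
    class_counts = Counter()             # number of documents per class
    word_counts = defaultdict(Counter)   # frequency table: class -> word -> count
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Add the log likelihood of the word, with add-one smoothing (assumed).
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    # The class with the highest (unnormalized) log posterior is the prediction.
    return max(scores, key=scores.get)

# Hypothetical training data.
training_data = [
    ("win a free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training_data)
print(predict("free prize offer", *model))  # spam
print(predict("monday meeting", *model))    # ham
```

Training is a single pass of counting over the data, which is why this kind of classifier is quick to fit.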

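All these probabilities are calculated using Bayes' Theorem. Because of the “Naive” independence assumption, training reduces to counting, and the probability for each feature can be estimated separately. This makes the Naive Bayes classifier much faster with its probability calculations than algorithms like Logistic Regression or tree-based algorithms, which need iterative optimization or repeated splitting of the data, and on text data its accuracy is often competitive with theirs.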

Different kinds of Naive Bayesian implementations exist – 

  • Gaussian Naive Bayes

This variant is used when the features are continuous valued and each feature is assumed to follow a normal (Gaussian) distribution within each class. The assumption is that there is no covariance between the features once the class is known.
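
A minimal sketch of the Gaussian variant in plain Python is shown below, assuming each feature is modelled by its own per-class mean and variance. The function names and the fruit measurements are hypothetical, and the small constant added to the variance is an assumption of this sketch to avoid division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior and a per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, group in grouped.items():
        n = len(group)
        columns = list(zip(*group))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # 1e-9 guards against zero variance (an assumption of this sketch).
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_gaussian_nb(model, row):
    scores = {}
    for label, (prior, means, variances) in model.items():
        # Independence: sum the per-feature log densities, with no covariance terms.
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
    return max(scores, key=scores.get)

# Hypothetical measurements: [length in cm, radius in cm].
rows = [[18.0, 1.6], [20.0, 1.8], [19.0, 1.5], [7.0, 3.5], [8.0, 3.8], [7.5, 3.6]]
labels = ["banana", "banana", "banana", "apple", "apple", "apple"]

model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [19.5, 1.7]))  # banana
print(predict_gaussian_nb(model, [7.2, 3.7]))   # apple
```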

  • Multinomial Naive Bayes

It is generally used where the features are discrete (for example, word counts in a text classification problem). It works with the integer counts generated as the frequency of each word, and the counts within each class are assumed to follow a multinomial distribution. In practice, fractional values such as TF-IDF (Term Frequency, Inverse Document Frequency) weights also work.

  • Bernoulli Naive Bayes

This classifier also works with discrete data. The major difference between the two is that Multinomial Naive Bayes works with occurrence counts while Bernoulli Naive Bayes works with binary/boolean features, i.e. feature values of the form true/false, yes/no, 1/0 etc. In text classification this means recording whether a word occurs in a document rather than how many times it occurs. The distribution of such features across the classes is best visualized with the help of a histogram.
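
The difference between the two feature representations can be illustrated with a short sketch. The helper names and the vocabulary are hypothetical.

```python
from collections import Counter

def count_features(text, vocabulary):
    """Multinomial-style features: how many times each vocabulary word occurs."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def binary_features(text, vocabulary):
    """Bernoulli-style features: 1 if the word occurs at all, otherwise 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

vocabulary = ["free", "prize", "meeting"]  # hypothetical vocabulary
message = "free free prize"

print(count_features(message, vocabulary))   # [2, 1, 0]
print(binary_features(message, vocabulary))  # [1, 1, 0]
```

The repeated word "free" is counted twice in the first representation but only marked as present in the second.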

All variations of the Naive Bayes classifier work on the same assumption of independence of features. The different types of Naive Bayesian classifiers are designed in a way that works well on many kinds of text-related problems. Document classification is one such text classification problem, and it can be solved using both Multinomial and Bernoulli Naive Bayes. The simple and fast calculation of probabilities is the major reason this algorithm is well suited to text classification and widely popular. The classifier is also widely used for real-time predictions and, along with collaborative filtering, in recommendation systems.

Ekta Shah
An engineer at the core, data science is my passion. I have a Masters in Data Science from NMIMS. I have worked on machine learning problems, image classification and reinforcement learning problems. Solving complex problems and thinking of easy solutions is what I practice. Avid reader and writer describe me the best.
