# Naive Bayes – Why Is It Favoured For Text-Related Tasks?

We receive a large number of emails every day, a mix of spam and non-spam. Automated email filtering that classifies messages as spam or non-spam is therefore an important application. Another common text application is sentiment analysis: determining the sentiment of a body of text to gauge how the general public perceives a product or an event. Twitter sentiment analysis, for instance, is one of the most widely seen ways to measure consumer sentiment after a recent event or a product launch.

Applications like these are binary or multiclass classification problems, i.e. an event has two or more possible outcomes. This points us towards classification algorithms such as Logistic Regression, tree-based algorithms, Support Vector Machines and Naive Bayes. In practice, Naive Bayes often delivers very good results on these problems: in applications like spam filtering and sentiment analysis, the data consists mostly of text, in the form of reviews or the contents of emails.


The Naive Bayes algorithm is a simple classification algorithm that uses the probabilities of events to make predictions. It is based on Bayes' Theorem, combined with the assumption that there is no interdependence among the variables. For example, a fruit might be classified as a banana if it is yellow/green in colour, elongated in shape and roughly 1–2 cm in radius. Each of these properties contributes individually to the probability that the fruit is a banana, which is why the features are referred to as "naive". Because it treats the feature set as independent, a Naive Bayes model can be trained with relatively little training data and tolerates some mislabelled data.

The Bayes Theorem is based on the following formula :

P(A|B) = P(A) × P(B|A) / P(B)


Here we are calculating the posterior probability of class A given predictor B, i.e. P(A|B). P(A) is the prior probability of the class, P(B|A) is the likelihood of predictor B given class A, and P(B) is the prior probability of the predictor B. These quantities let us compute the probability of each class given the words in a text.
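To make the formula concrete, here is a minimal worked example in Python. The probabilities are hypothetical numbers invented for illustration, not drawn from any real dataset: suppose 40% of emails are spam, the word "offer" appears in 70% of spam emails, and in 35% of all emails.

```python
# Hypothetical spam-filter numbers, chosen only to illustrate Bayes' Theorem.
p_spam = 0.4             # P(A): prior probability that an email is spam
p_word_given_spam = 0.7  # P(B|A): probability "offer" appears, given spam
p_word = 0.35            # P(B): prior probability "offer" appears at all

# Posterior P(A|B): probability an email is spam given it contains "offer"
p_spam_given_word = p_spam * p_word_given_spam / p_word
print(round(p_spam_given_word, 2))  # 0.8
```

Seeing the word raises the spam probability from the 0.4 prior to a 0.8 posterior, which is exactly the kind of evidence-driven update the classifier performs for every word.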

Bayesian statistics differs from classical (frequentist) statistics in how probability is interpreted: a frequentist probability describes the long-run frequency of random events over repeated trials, while Bayesian statistics works with prior and posterior probabilities. This gives Bayesian methods the leverage of updating probabilities before and after a certain event is observed.

The Naive Bayesian classifier consists of performing the below steps –

• Create a frequency table based on the words
• Calculate the likelihood for each of the classes based on the frequency table
• Calculate the posterior probability for each class
• The highest posterior probability is the outcome of the prediction experiment
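The steps above can be sketched from scratch in a few lines of Python. The tiny labelled corpus below is made up for illustration, and Laplace (add-one) smoothing is used so that unseen words do not zero out a class's probability — a standard detail the step list glosses over.

```python
import math
from collections import Counter, defaultdict

# Toy labelled corpus (invented for illustration)
docs = [("win money now", "spam"),
        ("limited offer win", "spam"),
        ("meeting at noon", "ham"),
        ("project meeting notes", "ham")]

# Step 1: frequency table of words per class
freq = defaultdict(Counter)
class_counts = Counter()
for text, label in docs:
    class_counts[label] += 1
    freq[label].update(text.split())

vocab = {w for counter in freq.values() for w in counter}

def predict(text):
    best, best_logp = None, float("-inf")
    for label in class_counts:
        # prior probability of the class, in log space to avoid underflow
        logp = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(freq[label].values())
        for w in text.split():
            # Step 2: word likelihood with Laplace (add-one) smoothing
            logp += math.log((freq[label][w] + 1) / (total + len(vocab)))
        # Steps 3–4: keep the class with the highest posterior
        if logp > best_logp:
            best, best_logp = label, logp
    return best

print(predict("win a limited offer"))  # spam
```

Note that the word "a" never occurs in the training corpus; smoothing simply assigns it a small equal likelihood under both classes, so the seen words decide the outcome.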

All these probabilities are calculated using Bayes' Theorem. Because of the "naive" independence assumption, the calculation decomposes into simple per-word terms, so the Naive Bayes classifier is much faster with its probability calculations than algorithms like Logistic Regression or tree-based methods, and it often performs competitively on text despite its simplicity.

Different kinds of Naive Bayesian implementations exist –

• Gaussian Naive Bayes

This variant is used when all features are continuous-valued and assumed to follow a normal distribution. A further assumption is that there is no covariance between the independent features.
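A minimal sketch of the Gaussian variant, using only the standard library: each class stores a mean and variance per continuous feature, and the likelihood is the normal density at the observed value. The class statistics below (for a single hypothetical feature such as email length) are invented for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal density of feature value x under per-class mean/variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Made-up per-class statistics for one continuous feature (e.g. email length)
stats = {"spam": (50.0, 100.0), "ham": (120.0, 400.0)}
priors = {"spam": 0.5, "ham": 0.5}

x = 60.0  # observed feature value
posteriors = {c: priors[c] * gaussian_pdf(x, m, v) for c, (m, v) in stats.items()}
print(max(posteriors, key=posteriors.get))  # spam
```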

• Multinomial Naive Bayes

It is generally used where there are discrete features (for example, word counts in a text classification problem). It works with the integer counts generated as the frequency of each word, and the features are assumed to follow a multinomial distribution. TF-IDF (Term Frequency–Inverse Document Frequency) weights also work well in such cases.
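The snippet below, a sketch on a made-up three-document corpus, shows the two feature encodings just mentioned: integer word-count vectors (the input a multinomial model consumes) and a TF-IDF weight computed for one word as an alternative.

```python
import math
from collections import Counter

# Tiny made-up corpus
docs = ["free offer click now", "meeting notes attached", "free free prize"]
vocab = sorted({w for d in docs for w in d.split()})

# Integer word-count vectors: the discrete features a multinomial model uses
counts = [[Counter(d.split())[w] for w in vocab] for d in docs]

# TF-IDF weight for "free" in the third document, as an alternative feature
tf = 2 / 3                     # "free" is 2 of the 3 tokens in that document
idf = math.log(len(docs) / 2)  # "free" appears in 2 of the 3 documents
print(counts[2], round(tf * idf, 3))
```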

• Bernoulli Naive Bayes

This classifier also works with discrete data. The major difference between Multinomial and Bernoulli Naive Bayes is that the multinomial variant works with occurrence counts while Bernoulli works with binary/boolean features, i.e. values of the form true/false, yes/no or 1/0 that record only whether a word is present.
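The count-versus-presence distinction is easiest to see by encoding the same short document both ways; the three-word vocabulary here is made up for the example.

```python
# The same toy document encoded two ways (hypothetical vocabulary)
vocab = ["free", "offer", "meeting"]
tokens = "free free offer".split()

multinomial_features = [tokens.count(w) for w in vocab]  # occurrence counts
bernoulli_features = [int(w in tokens) for w in vocab]   # presence/absence

print(multinomial_features)  # [2, 1, 0]
print(bernoulli_features)    # [1, 1, 0]
```

The repeated "free" is counted twice by the multinomial encoding but collapses to a single 1 in the Bernoulli encoding.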

All variations of the Naive Bayes classifier rest on the same assumption of feature independence, and by design they work very well on many kinds of text-related problems. Document classification is one example of a text classification problem that can be solved using both Multinomial and Bernoulli Naive Bayes. The cheap, decomposable probability calculations are the major reason this algorithm is so friendly to text classification and such a favourite in practice. It is also widely used for real-time prediction and in recommendation systems alongside collaborative filtering.
