The performance of a machine learning model is established by testing it. We rely on many statistical tests, but no statistical test is perfect. Some errors in models are easy to understand but hard to detect, and the base rate fallacy is one of them. The concept of the base rate fallacy is taken from behavioral science. In this article, we are going to discuss this fallacy and how it applies to machine learning. The major points to be discussed in the article are listed below.
Table of contents
- What is the base rate?
- What is the base rate fallacy?
- Base rate fallacy in machine learning
- Why does the base rate fallacy happen?
- How to avoid the base rate fallacy?
Let’s start by understanding the base rate first.
What is the base rate?
In statistics, the base rate is the probability of each class, unconditioned on any evidence from features. We may also think of base rates as prior probabilities. We can understand it using the example of engineers in the world: if 2% of the people in this world are engineers, then the base rate of engineers is simply 2%.
In many statistical analyses, a raw count is hard to interpret without the base rate. Let’s say 2000 people beat COVID-19 using some treatment. That seems like a good figure until we look at the whole population that went through the same treatment. Let’s say we find out that the treatment was applied to 100000 people, so the base rate of success is only 2000/100000 = 1/50. This is a crucial figure, and it is the base rate that gives us a clearer report about the treatment.
From the above example, we can understand how important base-rate information is when performing statistical analysis. Neglecting the base rate in a statistical analysis is called the base rate fallacy. Let’s see what the base rate fallacy is.
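As a minimal sketch of the idea above, base rates can be computed directly from a list of outcomes or class labels. The helper name `base_rates` and the label strings are illustrative assumptions, not part of any library; the counts are the ones from the treatment example.

```python
from collections import Counter


def base_rates(labels):
    """Return the unconditioned (prior) probability of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Treatment example from the text: 2000 successes among 100000 treated people.
# The label strings "success" and "failure" are hypothetical names.
outcomes = ["success"] * 2000 + ["failure"] * 98000
rates = base_rates(outcomes)
print(rates["success"])  # 0.02, i.e. 1 in 50
```

The same helper works for class labels in a training set, which is usually the first place to check how rare the class of interest is.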
What is the base rate fallacy?
In general, a fallacy is the use of faulty reasoning or invalid moves while building an argument. An argument built this way can seem stronger than it actually is.
The base rate fallacy, also known as base rate bias or base rate neglect, arises when we are given both base rate information and specific (individuating) information. We tend to ignore the base rate data in favor of the individuating data. We can also consider the fallacy a form of extension neglect.
Base rate fallacy in machine learning
As discussed above, this fallacy comes from ignoring information, and machine learning models work entirely from information (that is, data). Let’s take the example of classification models, where we use the confusion matrix to describe the performance of the model.
A confusion matrix is built by testing the model on the test data, and it tells us the number of right and wrong predictions made by the model. In the confusion matrix, the false-negative paradox and the false-positive paradox are examples of the base rate fallacy.
Let’s say that a machine learning model for facial recognition of happy people produces more false positives than true positives. Suppose the model is 99% accurate and analyzes 1000 people every day. If happy people are rare in that population, the high accuracy is outweighed by the sheer number of tests, and the final result contains far more false positives than true positives.
The probability that a positive result is correct depends on both the accuracy of the test and the composition of the sampled population. In summary, if the proportion of the population that has the condition is lower than the false positive rate, the test will give more false positives than true positives, and overlooking this is the base rate fallacy.
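The facial-recognition scenario can be sketched numerically. The source does not say how rare the positive class is, so the 0.5% prevalence below is purely an assumption for illustration, and "99% accurate" is read here as 99% sensitivity with a 1% false positive rate. The function name `expected_positives` is hypothetical.

```python
def expected_positives(n, prevalence, sensitivity, false_positive_rate):
    """Expected true and false positive counts when n samples are tested."""
    with_condition = n * prevalence
    without_condition = n - with_condition
    true_positives = with_condition * sensitivity
    false_positives = without_condition * false_positive_rate
    return true_positives, false_positives


# Assumed numbers: 1000 people per day, 0.5% prevalence (not from the article),
# 99% sensitivity and 1% false positive rate.
tp, fp = expected_positives(1000, 0.005, 0.99, 0.01)
print(f"expected true positives:  {tp:.2f}")
print(f"expected false positives: {fp:.2f}")
```

Under these assumptions the prevalence (0.5%) is below the false positive rate (1%), so the expected false positives outnumber the true positives roughly two to one.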
Let’s understand it with an example in which a model classifies a population of 1000 samples, 40% of which belong to class A. The model has a false positive rate of 5% and a false negative rate of zero.
Class A samples (actual positives)
1000 × (40/100) = 400; all of these samples receive a true positive
Class B samples (actual negatives)
1000 × [(100 – 40)/100] × 0.05 = 30; these samples receive a false positive
So 1000 – (400 + 30) = 570 samples are true negatives
The precision, that is, the share of class A predictions that are correct, will be
400/(30 + 400) ≈ 93%
The confusion matrix will look as follows:
|Predicted class|Actually class A|Actually class B|Total|
|---|---|---|---|
|Class A|400 (true positive)|30 (false positive)|430|
|Class B|0 (false negative)|570 (true negative)|570|
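The arithmetic of this first example can be reproduced with a short sketch. The function names (`confusion_counts`, `precision`, `accuracy`) are illustrative choices, and the counts are expected values derived from the stated rates rather than the output of a real model.

```python
def confusion_counts(n, base_rate, false_positive_rate, false_negative_rate=0.0):
    """Expected confusion-matrix cells for n samples at the given rates."""
    actual_pos = n * base_rate
    actual_neg = n - actual_pos
    fn = actual_pos * false_negative_rate
    tp = actual_pos - fn
    fp = actual_neg * false_positive_rate
    tn = actual_neg - fp
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def precision(c):
    """Share of positive predictions that are actually positive."""
    return c["tp"] / (c["tp"] + c["fp"])


def accuracy(c):
    """Share of all predictions that are correct."""
    return (c["tp"] + c["tn"]) / sum(c.values())


# Example from the text: 1000 samples, 40% class A, 5% false positive rate.
counts = confusion_counts(1000, 0.40, 0.05)
print(f"precision: {precision(counts):.3f}")
print(f"accuracy:  {accuracy(counts):.3f}")
```

Note that 400/(400 + 30) is the precision of the class A predictions; the overall accuracy, (400 + 570)/1000, is a different quantity, and keeping the two apart helps when the base rate changes.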
Let’s say the same model is applied to a different set of 1000 samples in which only 2% belong to class A. Then the confusion matrix will look as follows:
|Predicted class|Actually class A|Actually class B|Total|
|---|---|---|---|
|Class A|20 (true positive)|49 (false positive)|69|
|Class B|0 (false negative)|931 (true negative)|931|
In this case, only 20 of the 69 samples predicted as class A actually belong to class A. So the probability that a class A prediction is correct is just 29%, from the same test that gave 93% on the earlier population.
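The two figures can also be derived from Bayes' theorem. In the sketch below, the notation is an assumption introduced here: "+" means the model predicts class A, P(+ | A) = 1 because the false negative rate is zero, and P(+ | B) = 0.05 is the false positive rate.

```latex
\begin{aligned}
P(A \mid +) &= \frac{P(+ \mid A)\,P(A)}{P(+ \mid A)\,P(A) + P(+ \mid B)\,P(B)} \\[4pt]
\text{base rate } 40\%:\quad P(A \mid +) &= \frac{1 \times 0.40}{1 \times 0.40 + 0.05 \times 0.60} = \frac{0.40}{0.43} \approx 0.93 \\[4pt]
\text{base rate } 2\%:\quad P(A \mid +) &= \frac{1 \times 0.02}{1 \times 0.02 + 0.05 \times 0.98} = \frac{0.02}{0.069} \approx 0.29
\end{aligned}
```

The test itself is identical in both rows; only the prior P(A) changes, and that alone moves the answer from 93% to 29%.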
Why does the base rate fallacy happen?
Studies point to a number of reasons behind this fallacy, and they all come down to a matter of relevance: we ignore the base rate information because we judge it to be irrelevant. Most of the time, base rate information is classified as irrelevant and is therefore left out of the analysis. Sometimes the representativeness heuristic is also the reason for the base rate fallacy.
How to avoid the base rate fallacy?
As discussed above, ignoring base rate information causes the base rate fallacy, so we can avoid it by paying attention to the base rate information. We may also need to recognize which samples are not as reliable as predictors as we think they are.
We need to put in more effort when measuring the probability of an event occurring. Bayesian methods, which explicitly combine the prior (the base rate) with the observed evidence, help us measure the probability distribution of a quantity such as uplift, and they are one way to reduce the base rate fallacy.
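One simple way to keep the base rate in the calculation is a Bayesian update that starts from the base rate as the prior. The sketch below is not a full Bayesian uplift model; it only illustrates the update step, using the 2% base rate and 5% false positive rate from the earlier example. The function name `bayes_update` is hypothetical, and treating two positive results as independent is an assumption that may not hold in practice.

```python
def bayes_update(prior, sensitivity, false_positive_rate):
    """Posterior probability of the positive class after one positive result."""
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence


# Start from the base rate as the prior, then update on each positive result.
# Assumption: the two positive results are independent given the true class.
belief = 0.02
for test_number in (1, 2):
    belief = bayes_update(belief, sensitivity=1.0, false_positive_rate=0.05)
    print(f"after positive result {test_number}: {belief:.3f}")
# after positive result 1: 0.290
# after positive result 2: 0.891
```

Starting from the prior makes the effect of a rare class explicit: a single positive result moves the belief only to about 29%, and it takes further independent evidence to become confident.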
In this article, we have discussed the base rate fallacy, which can appear in the results of models used for making predictions and occurs because base rate information is ignored. Along with this, we have discussed why this fallacy occurs and how we can avoid it.