Classification is one of the most common Supervised Learning tasks whenever the output takes one of a discrete set of values. However, training a classification model and tuning it to our specific needs can be a daunting task which often leaves us asking for help. In general, classification can be multi-class (having >2 classes), but here, for the sake of simplicity, we will stick to binary classification; the concepts can then be extended to multi-class classification. The topics that we will discuss in this article are the following:
Table of Contents
- How to judge a classification model?
- The need for ROC Curve
- How to read an ROC Curve
- ROC Curve of a No Skill Model
- ROC Curve of a Highly Skilled Model
- ROC Curve of a Perfect Model
- Finding optimal threshold from ROC Curve
How to judge a classification model?
In Binary Classification, we have an input (x) and an output in {0, 1}. Most classification models give out a pair of values between 0 and 1 (both included), which stand for the probabilities that the input (x) belongs to class 0 and class 1 respectively. (The 2 values always sum to 1, since we’re making the model classify only between 2 classes and there can’t be any 3rd class that the input may belong to.)
The model later applies a threshold (default threshold is 0.5) on the probability value and converts the continuous output in [0, 1] to a discrete output in {0, 1}. It is important to know this because, in certain problems, the default threshold of 0.5 doesn’t make sense and we might need a different threshold for the model to be useful.
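As a minimal sketch of this thresholding step: the function name `apply_threshold` and the sample probabilities are made up for illustration, and treating a probability exactly equal to the threshold as class 1 is an assumed convention.

```python
def apply_threshold(prob_class_1, threshold=0.5):
    """Turn class 1 probabilities into hard 0/1 labels."""
    # Assumed convention: a probability equal to the threshold counts as class 1.
    return [1 if p >= threshold else 0 for p in prob_class_1]


probs = [0.10, 0.45, 0.50, 0.75, 0.95]  # illustrative class 1 probabilities

labels_default = apply_threshold(probs)                # default threshold of 0.5
labels_strict = apply_threshold(probs, threshold=0.8)  # stricter threshold

print(labels_default)
print(labels_strict)
```

Raising the threshold from 0.5 to 0.8 flips the items with probabilities 0.50 and 0.75 from class 1 to class 0, while the model itself has not changed at all.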
Accuracy is a reliable way to judge a binary classification model only if the data is balanced (roughly equal population in both classes). For example, when 90% of the data is in class 0, a model that classifies all the input items as class 0 has an accuracy of 90% (since it is wrong only 10% of the time), even though it never identifies a single class 1 item. On imbalanced data, accuracy is therefore very misleading and shouldn’t be relied upon.
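The 90% example can be reproduced in a few lines. This is only a sketch on a made-up dataset of 100 items; the variable names are illustrative.

```python
# Illustrative imbalanced dataset: 90 items of class 0 and 10 items of class 1.
y_true = [0] * 90 + [1] * 10

# A "model" that ignores its input and always predicts class 0.
y_pred = [0] * len(y_true)

# Accuracy: fraction of items whose predicted class matches the actual class.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of the actual class 1 items that the model classified as class 1.
found_class_1 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
share_class_1_found = found_class_1 / sum(y_true)

print(accuracy, share_class_1_found)
```

The accuracy comes out at 0.9, yet the share of class 1 items found is 0.0, which is exactly the gap that accuracy hides.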
What we actually want to measure is: of the items that are actually class 1, what % were classified as class 1 (similarly for class 0); and of the items that the model classified as class 1, what % were actually class 1 (similarly for class 0).
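Therefore, better measures of performance are Precision and Recall. In case you’ve never heard of them before, here is what they mean (showing only for class 1):
Precision = True Positives / Predicted Class 1 = True Positives / (True Positives + False Positives)
Recall = True Positives / Actual Class 1 = True Positives / (True Positives + False Negatives)
Recall is also known as the True Positive Rate (TPR).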
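Here is a small sketch of how Precision and Recall for class 1 could be computed from actual and predicted labels. The helper name `precision_recall` and the labels are hypothetical, and returning 0.0 when a denominator is empty is an assumed convention.

```python
def precision_recall(y_true, y_pred):
    """Precision and Recall for class 1, from actual and predicted 0/1 labels."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_class_1 = sum(y_pred)  # everything the model labelled as class 1
    actual_class_1 = sum(y_true)     # everything that really is class 1
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = true_positives / predicted_class_1 if predicted_class_1 else 0.0
    recall = true_positives / actual_class_1 if actual_class_1 else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # illustrative actual classes
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]  # illustrative predicted classes

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

In this example the model predicts class 1 five times and is right three times (Precision 0.6), and it finds three of the four actual class 1 items (Recall 0.75).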
The need for ROC Curve
As you might’ve figured out by now from the formulae above, “Predicted Class 1” depends entirely on what threshold you choose on the probability values. For instance, a class 1 probability of 0.75 will be classified as class 1 at a threshold of 0.5 and as class 0 at a threshold of 0.8.
Now you might ask, isn’t a threshold of 0.8 too high to consider?
The answer is, ‘depends on what problem you’re solving!’
Let’s take an example where we might need such a high threshold. Say you have to classify a transaction as Fraud (class 0) or Not Fraud (class 1).
Here, misclassifying a Fraud transaction as Not Fraud has a far higher cost than misclassifying a Not Fraud transaction as Fraud. Catching as many of the Fraud transactions as possible is much more important to us, even if some Not Fraud transactions get misclassified.
Naturally, then, we want to keep a high threshold for class 1 (and a correspondingly low bar for class 0), so that we still catch the fraud transaction even if our model gave out the probability of class 1 (Not Fraud) as 0.75. So here, a threshold of 0.8 might actually be optimal.
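To make the trade-off concrete, here is a sketch that counts both kinds of mistakes at thresholds 0.5 and 0.8. The probabilities, the ground truth and the helper `count_errors` are all invented for illustration and do not come from any real fraud dataset.

```python
# Hypothetical model outputs: probability that each transaction is Not Fraud (class 1).
p_not_fraud = [0.95, 0.90, 0.75, 0.60, 0.85, 0.30, 0.78]
# Hypothetical ground truth: True means the transaction really is Fraud (class 0).
is_fraud = [False, False, True, True, False, True, False]


def count_errors(threshold):
    """Return (frauds missed, legitimate transactions flagged) at a threshold."""
    missed_fraud = 0
    flagged_legit = 0
    for p, fraud in zip(p_not_fraud, is_fraud):
        predicted_not_fraud = p >= threshold
        if fraud and predicted_not_fraud:
            missed_fraud += 1      # costly mistake: Fraud passed as Not Fraud
        elif not fraud and not predicted_not_fraud:
            flagged_legit += 1     # cheaper mistake: Not Fraud flagged as Fraud
    return missed_fraud, flagged_legit


errors_at_05 = count_errors(0.5)
errors_at_08 = count_errors(0.8)
print(errors_at_05, errors_at_08)
```

On this toy data, the threshold of 0.5 lets two frauds through, while the threshold of 0.8 catches all of them at the price of flagging one legitimate transaction.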
Now coming to the point: the ROC (Receiver Operating Characteristic) Curve helps us find this optimal threshold. It is a plot of the True Positive Rate (Recall) on the y-axis against the False Positive Rate (FPR) on the x-axis, for all the different threshold values.
False Positive Rate = False Positives / Total Negatives = False Positives / (False Positives + True Negatives)
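The two rates can be computed for any single threshold as sketched below. The helper name `tpr_fpr`, the sample data and the "probability >= threshold means class 1" rule are assumptions made for this example.

```python
def tpr_fpr(y_true, prob_class_1, threshold):
    """True Positive Rate and False Positive Rate at one threshold."""
    tp = fp = fn = tn = 0
    for actual, p in zip(y_true, prob_class_1):
        predicted = 1 if p >= threshold else 0  # assumed thresholding rule
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # TP / actual positives
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # FP / actual negatives
    return tpr, fpr


y_true = [0, 0, 1, 1]                # illustrative actual classes
prob_class_1 = [0.1, 0.4, 0.35, 0.8]  # illustrative class 1 probabilities

at_05 = tpr_fpr(y_true, prob_class_1, 0.5)
at_03 = tpr_fpr(y_true, prob_class_1, 0.3)
print(at_05, at_03)
```

Each threshold yields one (TPR, FPR) pair: lowering the threshold from 0.5 to 0.3 raises the TPR from 0.5 to 1.0, but also raises the FPR from 0.0 to 0.5.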
How to read an ROC Curve
- On the ROC curve, each point corresponds to a different threshold, and its location corresponds to the resulting TPR and FPR when we choose that threshold.
- Note that there exists only a single ROC Curve for a model-dataset pair.
- A common misconception is to ask for separate ROC Curves for different thresholds. That doesn’t make sense: a single ROC Curve already encompasses “all possible thresholds”, which is exactly why it can be used to compare 2 different models’ performances on the same dataset!
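Some important pointers on the curve:
- Point 1 corresponds to the threshold of 0: everything is classified as class 1, so TPR = FPR = 1 and the point sits at (1, 1)
- Point 3 corresponds to the threshold of 1: everything is classified as class 0, so TPR = FPR = 0 and the point sits at (0, 0)
- The rest of the points (like Point 2) on the curve belong to other thresholds in the range (0, 1)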
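The pointers above can be checked by sweeping a few thresholds over one set of predictions and collecting the (FPR, TPR) points. This is a hand-rolled sketch with made-up data and a hypothetical helper `roc_points`; in practice a library routine such as scikit-learn's `roc_curve` would normally do this sweep for you.

```python
def roc_points(y_true, prob_class_1, thresholds):
    """Return one (FPR, TPR) point per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for thr in thresholds:
        # Assumed rule: probability >= threshold is classified as class 1.
        tp = sum(1 for t, p in zip(y_true, prob_class_1) if t == 1 and p >= thr)
        fp = sum(1 for t, p in zip(y_true, prob_class_1) if t == 0 and p >= thr)
        points.append((fp / negatives, tp / positives))
    return points


y_true = [0, 0, 1, 1]                 # illustrative actual classes
prob_class_1 = [0.1, 0.4, 0.35, 0.8]  # illustrative class 1 probabilities
thresholds = [0.0, 0.3, 0.4, 0.5, 1.0]

points = roc_points(y_true, prob_class_1, thresholds)
for thr, point in zip(thresholds, points):
    print(thr, point)
```

The threshold of 0 lands on (1, 1), the threshold of 1 lands on (0, 0), and the thresholds in between trace out the rest of the single curve for this model-dataset pair.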
The Area under the ROC Curve (aka ROC-AUC) is a single-number metric that helps us compare 2 similar-looking but different ROC curves. The higher the AUC, the better the performance (we’ll see why that is, a bit ahead).
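One common way to get this area from a list of ROC points is the trapezoidal rule, sketched below. The function name `auc_trapezoid` and the example points are illustrative; the points are assumed to be (FPR, TPR) pairs that already include (0, 0) and (1, 1).

```python
def auc_trapezoid(points):
    """Area under a ROC curve given as (FPR, TPR) points, by the trapezoidal rule."""
    ordered = sorted(points)  # walk the curve from left to right
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        # Width of the strip times the average height of its two ends.
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Illustrative ROC points for some model, including both end points.
roc = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

auc = auc_trapezoid(roc)
print(auc)
```

Vertical segments have zero width and contribute nothing, so the area here is simply 0.5 * 0.5 + 0.5 * 1.0 = 0.75.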
ROC Curve of a ‘No Skill’ Model
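As you can see in the figure, the ROC Curve of a No Skill Model (a model which gives 50% probability for all input items, hence the name, No Skill) is a straight line from (0, 0) to (1, 1). (Actually, since every item gets the same probability, thresholding can produce only 2 distinct points: (1, 1) when the threshold is at or below 0.5 and everything is classified as class 1, and (0, 0) when the threshold is above 0.5 and everything is classified as class 0. The diagonal is the straight line joining them.)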
As evident, the AUC of this curve is the area of the triangle it forms (with the points (0, 0), (1, 1), (1, 0)). Hence the area is ½ * (base) * (height) = ½ * 1 * 1 = 0.5
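The same 0.5 can be obtained numerically. The sketch below feeds a constant probability of 0.5 through a threshold sweep and applies the trapezoidal rule; the labels, the threshold grid and the helper `roc_point` are assumptions for illustration.

```python
y_true = [0, 1, 0, 1, 1, 0]      # illustrative actual classes
scores = [0.5] * len(y_true)     # No Skill model: 50% for every input item

positives = sum(y_true)
negatives = len(y_true) - positives


def roc_point(threshold):
    """(FPR, TPR) when items with score >= threshold are classified as class 1."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return (fp / negatives, tp / positives)


# Distinct points reached over a grid of thresholds, ordered left to right.
points = sorted({roc_point(thr) for thr in [0.0, 0.25, 0.5, 0.75, 1.0]})

# Trapezoidal rule over consecutive points.
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(points, auc)
```

In this sketch the sweep reaches the corner points (0, 0) and (1, 1), and the area under the line joining them is 0.5.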
ROC Curve of a Highly Skilled Model
As you can see in the figure, the ROC Curve of a Highly Skilled Model (a model which is able to correctly predict the classes for most of the input items, hence the name, Highly Skilled) is a curve from (0, 0) to (1, 1) bending towards (0, 1).
As evident, the AUC of this curve is close to the area of the square (with the points (0, 0), (1, 1), (1, 0), (0, 1)). Hence the area is close to (length)^2 = 1 * 1 = 1
The actual ROC-AUC of a highly skilled model may be around 0.8-0.9, depending on the actual curve.
ROC Curve of a Perfect Model
As you can see in the figure, the ROC Curve of a Perfect Model (a model which is correct all the time) consists of just 3 points, namely, (0, 0), (0, 1), (1, 1).
As discussed earlier, Point 3 corresponds to threshold = 1 (meaning, we classify all the points as class 0, which makes both TPR and FPR 0, hence the location of the point).
Point 1 corresponds to threshold = 0 (meaning, we classify all the points as class 1, which makes both TPR and FPR as 1, hence the location of the point).
Point 2 corresponds to all the other thresholds in the range (0, 1) (since it’s a perfect model, for any threshold between 0 and 1 we get correct outputs, and hence TPR = 1 while FPR = 0).
Although this is technically not a ‘curve’ per se, joining the 3 points gives a path that runs up the left edge of the unit square and then along its top edge, so the area under it is exactly 1. Equivalently, it is the limit of highly skilled curves approaching the point (0, 1).
Hence, the ideal ROC-AUC for any model-dataset pair is 1 (rarely attainable in practical situations, due to random noise in real-life data).
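As a quick check of the perfect case, the sketch below uses made-up probabilities that separate the two classes completely (every class 0 item scores below every class 1 item) and computes the ROC points and the area. The data, the threshold grid and the helper `roc_point` are illustrative assumptions.

```python
y_true = [0, 0, 0, 1, 1]                  # illustrative actual classes
scores = [0.05, 0.10, 0.20, 0.90, 0.95]   # perfectly separated class 1 probabilities

positives = sum(y_true)
negatives = len(y_true) - positives


def roc_point(threshold):
    """(FPR, TPR) when items with score >= threshold are classified as class 1."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return (fp / negatives, tp / positives)


# Distinct points reached over a grid of thresholds, ordered left to right.
points = sorted({roc_point(thr) for thr in [0.0, 0.25, 0.5, 0.75, 1.0]})

# Trapezoidal rule over consecutive points.
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(points, auc)
```

The thresholds 0.25, 0.5 and 0.75 all collapse onto the single point (0, 1), and the area under the resulting path is 1.0.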
Finding optimal threshold from ROC Curve
Since we now know that a highly skilled model’s ROC curve approaches the Perfect model point, i.e. the point (0, 1), it is quite intuitive that the optimal threshold for our model corresponds to the point on the curve which is closest to (0, 1).
Say this optimal threshold is t*.
A threshold a little higher than t* will get you a better (lower) FPR, but this decrease in FPR is going to be much smaller than the decrease in TPR that comes with it. (Notice that when you move to a higher threshold than t*, you are moving down the curve towards (0, 0), and TPR falls much faster than FPR.)
A threshold a little lower than t* will get you a better (higher) TPR, but this increase in TPR is going to be much smaller than the increase in FPR that comes with it. (Notice that when you move to a lower threshold than t*, you are moving along the curve towards (1, 1), and FPR rises much faster than TPR.)
Hence, t* is the sweet spot that balances TPR and FPR nicely.
I’ll tell you the easiest way to find this t*.
Maximize (TPR – FPR): the threshold for which (TPR – FPR) is maximum is your t*. (This quantity is known as Youden’s J statistic. It is not strictly the same criterion as the smallest distance to (0, 1), but the two often pick the same or a nearby threshold.)
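A minimal sketch of this search, on made-up data: it tries every distinct predicted probability as a candidate threshold, then picks the one with the largest (TPR – FPR) and, for comparison, the one whose ROC point is nearest to (0, 1). The helper `roc_rows` and the data are hypothetical.

```python
import math


def roc_rows(y_true, scores):
    """One (threshold, TPR, FPR) row per distinct score used as a threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    rows = []
    for thr in sorted(set(scores)):
        # Assumed rule: score >= threshold is classified as class 1.
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        rows.append((thr, tp / positives, fp / negatives))
    return rows


y_true = [0, 0, 0, 1, 1, 0, 1, 1, 1]                      # illustrative classes
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.6, 0.7, 0.9]   # illustrative probabilities

rows = roc_rows(y_true, scores)

# Criterion 1: largest gap between TPR and FPR.
best_by_gap = max(rows, key=lambda r: r[1] - r[2])
# Criterion 2: smallest Euclidean distance from (FPR, TPR) to the point (0, 1).
best_by_distance = min(rows, key=lambda r: math.hypot(r[2], 1 - r[1]))

print(best_by_gap, best_by_distance)
```

On this toy data both criteria select the threshold 0.4, where TPR = 1.0 and FPR = 0.25. On other data they can disagree, so it is worth checking both when the choice matters.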
Conclusion
Today, we’ve learnt all the basics of the ROC Curve and how it is used in classification problems and its utility! I hope now you won’t have any problem working with ROC Curves in your future classification models.