
Complete Guide to Understanding Precision and Recall Curves



Precision and Recall are two of the most important metrics to look at when evaluating an imbalanced classification model. They tell us what fraction of the actual positives were classified correctly and, among the ones classified as positive, what fraction were actually positive. In what follows, we will explore the significance of these metrics, and of classification thresholds, in detail.

The topics that we will discuss in this article are the following:

  • What are Precision and Recall?
  • The need for a Precision-Recall curve
  • How to read a PR curve
  • The baseline of a PR curve
  • PR curves of no-skill, perfect, good, and bad models
  • Finding the optimal threshold from a PR curve

What are Precision and Recall?

Precision is defined as the fraction of retrieved items that are relevant, i.e., of everything the model flags as positive, how much actually is positive. Recall is defined as the fraction of relevant items that get retrieved, i.e., of all the actual positives, how many the model catches.

To understand this with an example, let’s imagine we’re casting a fishing net in a lake, hoping to catch some fish. However, there are also some stones in the lake, and it’s likely that our net will end up catching some of them too. Now there are a few different ways to look at this situation:

  1. We may want to catch as many fish as possible from the lake, no matter how many stones we catch along with them.
  2. We may want to catch only the fish and minimize the number of stones that we catch, no matter how few fish we catch.
  3. Or, we may want to catch most of the fish present in the lake and at the same time minimize the number of stones caught.

Think of the fishing net as the model, which gives out some outputs (preferably, fish). However, just like everything in life, no model is perfect, and hence our net catches some stones along with the fish. The fish present in the lake are the Relevant items. The Retrieved set contains some fish and some stones.

Precision = fraction of fish among the retrieved stuff

Recall = fraction of fish retrieved from the lake

In Case 1, we want to maximize Recall and ignore Precision.

In Case 2, we want to maximize Precision and ignore Recall.

In Case 3, we want to keep a balance between Precision and Recall and try to maximize both at the same time. This is an ideal situation.

Now we can formally define Precision and Recall as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

where TP, FP, and FN are the counts of true positives, false positives, and false negatives, respectively.
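
As a quick sanity check, here is a minimal Python sketch (the labels are invented for illustration) that computes both metrics from raw counts and cross-checks them against scikit-learn:

```python
# A minimal sketch (labels invented for illustration): compute precision
# and recall from raw counts, then cross-check against scikit-learn.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # actual classes
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]  # predicted classes

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

print(tp / (tp + fp), precision_score(y_true, y_pred))  # 0.75 0.75
print(tp / (tp + fn), recall_score(y_true, y_pred))     # 0.75 0.75
```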

What is the Need for a Precision-Recall Curve?

To classify an output into either class (0 or 1), we need to apply a threshold filter (just like the fishing net). For example, a default threshold of 0.5 is commonly used: any output >= 0.5 is assigned to class 1. Different thresholds can be useful for different kinds of problems.
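
Here is a small illustrative sketch of that threshold filter in action; the scores are made up for the example:

```python
# A small sketch of the threshold filter: the same scores yield different
# class labels as the threshold moves. Scores are invented for the example.
import numpy as np

scores = np.array([0.1, 0.4, 0.55, 0.8, 0.95])  # predicted probabilities

for threshold in (0.3, 0.5, 0.7):
    labels = (scores >= threshold).astype(int)  # >= threshold -> class 1
    print(threshold, labels)
# 0.3 [0 1 1 1 1]
# 0.5 [0 0 1 1 1]
# 0.7 [0 0 0 1 1]
```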

This brings us to the question: why do we need a PR curve when we already have the ROC curve (which plots the True Positive Rate against the False Positive Rate at different thresholds)? Here are two reasons (a quick comparison follows the list):

  1. The ROC curve provides an overly optimistic picture of performance, compared to the PR curve, when it comes to imbalanced classification.
  2. When the class distribution changes, the ROC curve doesn’t change, whereas the PR curve does reflect the change.
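
Here is a hedged sketch of reason 1; the dataset, the 99:1 imbalance, and the logistic-regression model are all arbitrary illustrative choices:

```python
# A hedged sketch of reason 1: on a 99:1 imbalanced dataset (an arbitrary
# choice), a plain logistic regression's ROC AUC often looks flattering
# while its PR-based average precision stays much more modest.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC:", roc_auc_score(y_te, probs))            # typically near 1
print("PR AUC :", average_precision_score(y_te, probs))  # noticeably lower
```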

How to read a PR Curve

[Fig 1: A PR curve with four labeled threshold points]

On the curves, each point corresponds to a different threshold, and its location corresponds to the resulting Precision and Recall when we choose that threshold.

Some important pointers on the curve:

  • Point 1 corresponds to the threshold of 1
  • Point 3 corresponds to the threshold of 0
  • Point 4 corresponds to a threshold somewhere in the range (0, 1)
  • Point 2 corresponds to a Perfect model (along with Point 3)

The Area under the PR Curve (AUC) is a metric that helps us compare two similar-looking curves: the higher the AUC, the better the performance.
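
As an example, assuming the y_te and probs from the previous sketch (any labels and scores will do), the curve and its area can be traced with scikit-learn:

```python
# A minimal sketch of tracing a PR curve and its area, assuming the y_te
# and probs from the previous snippet (any labels and scores will do).
from sklearn.metrics import auc, precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_te, probs)
print(len(thresholds), "thresholds traced")
print("PR AUC:", auc(recall, precision))  # area under the PR curve
```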

The Baseline of a PR Curve

[Fig 2: PR-curve baseline for a balanced dataset]

[Fig 3: PR-curve baseline for a dataset with a 10% positive class]

The baseline of a PR curve changes with class imbalance, unlike that of a ROC curve. This is because the precision of a no-skill model (one that gives a 0.5 score for every output) depends directly on the class imbalance.

Fig 2 shows the baseline corresponding to a balanced dataset, whereas Fig 3 shows the baseline corresponding to a dataset with a 10% positive class.

From here on, we will stick to a dataset with a 10% positive class (as is the case for many real-life datasets).
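
This baseline is nothing more than the positive-class fraction of the labels; a tiny sketch with synthetic labels matching that 10% imbalance:

```python
# The PR-curve baseline equals the positive-class fraction: this is the
# precision a no-skill model gets at any usable threshold. The labels
# below are synthetic, matching the 10% imbalance used in this article.
import numpy as np

y = np.array([1] * 100 + [0] * 900)  # 100 positives, 900 negatives
print(y.mean())                      # 0.1 -> the PR-curve baseline
```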

PR Curve of a ‘No Skill’ Model

[Fig 4: PR curve of a no-skill model]

The PR curve of a no-skill model (which gives a 0.5 output for every data point) consists of 2 points (a small simulation follows the list):

  • Point 1 corresponds to threshold = 0.5
  • Point 2 corresponds to threshold ∈ [0, 0.5)
  • Precision isn’t defined for thresholds ∈ (0.5, 1] for a no-skill model, due to division by zero.
  • You may notice that Precision is constant here, i.e., 0.1 (= the class imbalance)
  • AUC = 0.1
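
Here is a small simulation of that behaviour, assuming the constant-0.5 no-skill model and a synthetic 10%-positive dataset:

```python
# A small simulation of the no-skill curve: a model scoring every point
# 0.5 on a synthetic 10%-positive dataset. For any threshold <= 0.5 it
# predicts everything positive, so precision sticks at 0.1 and recall at 1.
import numpy as np

y = np.array([1] * 100 + [0] * 900)  # 10% positive class
scores = np.full(1000, 0.5)          # constant 0.5 output

for threshold in (0.0, 0.25, 0.5):
    pred = (scores >= threshold).astype(int)
    tp = ((pred == 1) & (y == 1)).sum()  # true positives
    fp = ((pred == 1) & (y == 0)).sum()  # false positives
    print(threshold, "precision:", tp / (tp + fp), "recall:", tp / 100)
# every threshold prints precision 0.1 and recall 1.0
```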

PR Curve of a Perfect Model

[Fig 5: PR curve of a perfect model]

The PR curve of a perfect model also consists of 2 points:

  • Point 1 corresponds to threshold ∈ (0, 1]
  • Point 2 corresponds to threshold = 0
  • AUC = 1

PR Curve of a Good Model

[Fig 6: PR curve of a good model]

The PR curve of a good model has as many points as there are thresholds that yield a distinct (precision, recall) pair on the dataset:

  • Point 1 corresponds to threshold = 1
  • Point 3 corresponds to threshold = 0
  • Point 4 corresponds to threshold ∈ (0, 1)
  • AUC ∈ (0.1, 1)

PR Curve of a Bad Model

[Fig 7: PR curve of a bad model]

The PR curve of a bad model dips below the baseline, indicating that the model performs even worse than a no-skill model.

  • An obvious way to improve the performance of this model, without tweaking anything, is to simply reverse the output (Class 0 <-> Class 1) that the model gives. This automatically results in higher-than-baseline performance (see the sketch after this list).
  • Usually, a PR curve like this indicates that something is definitely wrong in the pipeline.
    • It can be the case that the data is too random.
    • Or the model just can’t grasp ANY trend in the data (in other words, the model is too simple for it). An example is a basic linear model trying to fit a complex non-linear dataset.
  • AUC ∈ (0, 0.1)
  • There might be a hybrid case where a bad model works better than the baseline for certain thresholds.
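
Here is a hedged sketch of the label-reversal trick, on synthetic scores that deliberately rank the classes backwards (all numbers invented for illustration):

```python
# A hedged sketch of the label-reversal trick, on synthetic scores that
# deliberately rank the classes backwards (all numbers invented here).
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.1).astype(int)  # roughly 10% positives
# positives get low scores, negatives get high ones -> a "bad" model
scores = np.where(y == 1, rng.uniform(0.0, 0.4, 1000), rng.uniform(0.3, 1.0, 1000))

print(average_precision_score(y, scores))      # below the ~0.1 baseline
print(average_precision_score(y, 1 - scores))  # well above it
```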

Finding the Optimal Threshold from the PR Curve

[Fig 8: Locating the optimal threshold on the PR curve]

Since we now know that a good model’s PR curve approaches the perfect-model point (Point 2, where recall = 1 and precision = 1), it is quite intuitive that the optimal threshold for our model corresponds to the point on the curve closest to Point 2.

Here are two ways to find the optimal threshold (a sketch of both follows the list):

  1. Find the Euclidean distance from (1, 1) of every point on the curve, where each point is (recall, precision) at a corresponding threshold.
    1. Pick the point, and the corresponding threshold, for which the distance is minimum.
  2. Find the F1 score for each (recall, precision) point; the point with the maximum F1 score is the desired optimal point.
    1. You may recall (pun intended) that the F1 score is the harmonic mean of Precision and Recall.
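
Here is a minimal sketch of both methods, again assuming the y_te and probs from the earlier snippets (any labels and scores will do):

```python
# A minimal sketch of both methods, again assuming the y_te and probs
# from the earlier snippets (any labels and scores will do).
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_te, probs)
p, r = precision[:-1], recall[:-1]  # align with thresholds (one fewer entry)

# Method 1: the point closest to the perfect corner (recall=1, precision=1)
dist = np.sqrt((1 - r) ** 2 + (1 - p) ** 2)
print("min-distance threshold:", thresholds[np.argmin(dist)])

# Method 2: the point with the highest F1 (harmonic mean of P and R)
f1 = 2 * p * r / (p + r + 1e-12)  # tiny epsilon guards against 0/0
print("max-F1 threshold:", thresholds[np.argmax(f1)])
```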

Conclusion

Some key pointers worth noting:

  1. The recall of a no-skill model lies in the set {0, 1} irrespective of the class imbalance.
  2. The precision of a no-skill model is equal to the fraction of the positive class in the dataset.
  3. It is possible to get a model that performs worse than a no-skill model, especially when the data is too complex for it.
  4. Just like the ROC curve, a PR curve can be used to find the optimal threshold.
  5. The PR curve works better than the ROC curve in cases of imbalanced data.

Today, we’ve learnt the basics of the PR curve, how it is used in classification problems, and its utility. I hope you won’t have any trouble working with PR curves in your future classification models!
