Python Code for Evaluation Metrics in ML/AI for Classification Problems

Evaluation of a machine learning model is crucial to measure its performance. Numerous metrics are used in the evaluation of a machine learning model. Selection of the most suitable metrics is important to fine-tune a model based on its performance. In this article, we discuss the mathematical background and application of evaluation metrics in classification problems.

We can start discussing evaluation metrics by building a machine learning classification model. Here breast cancer data from sklearn’s in-built datasets is used to build a random forest binary classification model. 

Import necessary libraries and packages to prepare the required environment.

 from sklearn.datasets import load_breast_cancer
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.model_selection import train_test_split
 from sklearn import metrics
 import pandas as pd
 import numpy as np
 from matplotlib import pyplot as plt
 import seaborn as sns

Load data, split it into train-test set, build and train the model, and make predictions on test data.

 # choose a binary classification problem
 data = load_breast_cancer()
 # develop predictors X and target y dataframes
 X = pd.DataFrame(data['data'], columns=data['feature_names'])
 # flip sklearn's encoding (0 = malignant, 1 = benign) so that 1 = malignant (positive)
 y = abs(pd.Series(data['target'])-1)
 # split data into train and test set in 80:20 ratio
 X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
 # build a RF model with default parameters
 model = RandomForestClassifier(random_state=1)
 # train the model on the training set
 model.fit(X_train, y_train)
 preds = model.predict(X_test) 

Confusion Matrix

Without a clear understanding of the confusion matrix, it is hard to proceed with any of the classification evaluation metrics. The confusion matrix provides the basis on which the other evaluation metrics are defined. Before discussing the confusion matrix, it is important to know the classes in the dataset and their distribution.

 y.value_counts().plot.pie(ylabel=' ', autopct = '%0.1f%%')
 plt.title('0 - Not cancerous (negative)\n1 - Cancerous (positive)', size=14, c='green')

There are two classes in the dataset. 0 refers to ‘Benign’, a non-cancerous state, which we simply denote as ‘negative’. 1 refers to ‘Malignant’, a cancerous state, which we simply denote as ‘positive’. In the dataset, there are 357 negative cases and 212 positive cases, roughly a 63:37 split. The class distribution is therefore imbalanced.

The following terms are needed to proceed further with the metrics.

True Positive: Actually positive (ground truth), predicted as positive (correctly classified)

True Negative: Actually negative (ground truth), predicted as negative (correctly classified)

False Positive: Actually negative (ground truth), predicted as positive (misclassified)

False Negative: Actually positive (ground truth), predicted as negative (misclassified)
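
The four terms above can be counted directly from pairs of true and predicted labels. The following is a minimal sketch in plain Python; the label lists are made up for illustration and are not taken from the breast cancer data.

```python
# Toy ground-truth and predicted labels (illustrative only; 1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # actually positive, predicted positive
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # actually negative, predicted negative
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # actually negative, predicted positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # actually positive, predicted negative

print(tp, tn, fp, fn)
```

Every sample falls into exactly one of the four groups, so the four counts always add up to the number of samples.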

To plot a confusion matrix,

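 # plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
 metrics.ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, display_labels=['Negative', 'Positive'])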


Here out of 114 total test samples, 72 are True Negatives (TN), 37 are True Positives (TP), 5 are False Negatives (FN), and there are no False Positives (FP).

 confusion = metrics.confusion_matrix(y_test, preds)

yields the output array([[72,  0], [ 5, 37]]), where rows are the actual classes and columns are the predicted classes.

Most of the evaluation metrics are defined with the terms found in the confusion matrix. 
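
As a small sketch of how the matrix maps onto those terms: for a binary problem scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]. The snippet below uses a plain nested list holding the counts reported above; with the NumPy array returned by confusion_matrix, confusion.ravel() gives the same four values in the same order.

```python
# Counts reported above, in scikit-learn's binary layout [[TN, FP], [FN, TP]]
confusion = [[72, 0],
             [5, 37]]

# Unpack row by row: first row is the actual negatives, second row the actual positives
(tn, fp), (fn, tp) = confusion

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```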


Accuracy is the ratio of the number of correctly classified cases to the total number of cases under evaluation, that is, (TP + TN) / (TP + TN + FP + FN). The best value of accuracy is 1 and the worst value is 0.

In Python, the following code calculates the accuracy of the machine learning model.

 accuracy = metrics.accuracy_score(y_test, preds)

It gives 0.956 as output. However, care should be taken when using accuracy as a metric because it can be misleading for data with imbalanced classes: a model that always predicts the majority class can still score well. Since our data is imbalanced, the accuracy score should be read alongside the other metrics.
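
To make the caveat concrete, here is a short sketch that recomputes accuracy from the confusion matrix counts reported earlier and compares it with a hypothetical baseline that labels every test sample as negative. The baseline is an illustration only and is not part of the article's model.

```python
# Confusion matrix counts reported for the test set
tn, fp, fn, tp = 72, 0, 5, 37
total = tn + fp + fn + tp

# Accuracy: correctly classified cases over all cases
accuracy = (tp + tn) / total

# Hypothetical baseline that always predicts "negative":
# it is right on every actual negative (tn + fp) and wrong on every actual positive
baseline_accuracy = (tn + fp) / total

print(round(accuracy, 3), round(baseline_accuracy, 3))
```

The baseline reaches about 0.632 without detecting a single cancerous case, which is why accuracy alone says little on imbalanced data.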


Precision can be defined with respect to either of the classes. The precision of the positive class, TP / (TP + FP), is intuitively the ability of the classifier not to label as positive a sample that is negative. The precision of the negative class, TN / (TN + FN), is intuitively the ability of the classifier not to label as negative a sample that is positive. The best value of precision is 1 and the worst value is 0.

In Python, precision can be calculated using the code,

 precision_positive = metrics.precision_score(y_test, preds, pos_label=1)
 precision_negative = metrics.precision_score(y_test, preds, pos_label=0)
 precision_positive, precision_negative 

which gives (1.000, 0.935) as output.
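
As a cross-check, both values can be reproduced by hand from the confusion matrix counts reported earlier. This is a sketch of the arithmetic, not a replacement for precision_score.

```python
# Confusion matrix counts reported for the test set
tn, fp, fn, tp = 72, 0, 5, 37

# Positive class: of everything predicted positive, how much is truly positive
precision_positive = tp / (tp + fp)

# Negative class: of everything predicted negative, how much is truly negative
precision_negative = tn / (tn + fn)

print(round(precision_positive, 3), round(precision_negative, 3))
```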


Recall can also be defined with respect to either of the classes. Recall of the positive class is also termed sensitivity and is defined as the ratio of true positives to the number of actual positive cases, TP / (TP + FN). It can intuitively be expressed as the ability of the classifier to capture all the positive cases. It is also called the True Positive Rate (TPR).

Recall of the negative class is also termed specificity and is defined as the ratio of true negatives to the number of actual negative cases, TN / (TN + FP). It can intuitively be expressed as the ability of the classifier to capture all the negative cases. It is also called the True Negative Rate (TNR).

In Python, sensitivity and specificity can be calculated as

 recall_sensitivity = metrics.recall_score(y_test, preds, pos_label=1)
 recall_specificity = metrics.recall_score(y_test, preds, pos_label=0)
 recall_sensitivity, recall_specificity 

which gives (0.881, 1.000) as output. The best value of recall is 1 and the worst value is 0. 
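
The same two numbers follow directly from the confusion matrix counts reported earlier; a minimal sketch of the arithmetic:

```python
# Confusion matrix counts reported for the test set
tn, fp, fn, tp = 72, 0, 5, 37

# Sensitivity (TPR): share of actual positives that were captured
sensitivity = tp / (tp + fn)

# Specificity (TNR): share of actual negatives that were captured
specificity = tn / (tn + fp)

print(round(sensitivity, 3), round(specificity, 3))
```

Here the 5 false negatives lower sensitivity, while the absence of false positives keeps specificity at 1.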


F1-score is considered one of the more robust metrics for classification models when the classes are imbalanced. F1-score is the harmonic mean of the recall and precision of the respective class, that is, 2 × precision × recall / (precision + recall). Its best value is 1 and the worst value is 0.

In Python, F1-score can be determined for a classification model using

 f1_positive = metrics.f1_score(y_test, preds, pos_label=1)
 f1_negative = metrics.f1_score(y_test, preds, pos_label=0)
 f1_positive, f1_negative 

It gives an output of (0.937, 0.966).
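
A short sketch of the harmonic-mean calculation, using precision and recall derived from the confusion matrix counts reported earlier. The helper name f1 is introduced here for illustration.

```python
# Confusion matrix counts reported for the test set
tn, fp, fn, tp = 72, 0, 5, 37

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Positive class: precision TP/(TP+FP), recall TP/(TP+FN)
f1_positive = f1(tp / (tp + fp), tp / (tp + fn))

# Negative class: precision TN/(TN+FN), recall TN/(TN+FP)
f1_negative = f1(tn / (tn + fn), tn / (tn + fp))

print(round(f1_positive, 3), round(f1_negative, 3))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a class cannot reach a high F1-score unless both its precision and its recall are high.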

Accuracy, precision, recall, and F1-score can all be calculated together using the function classification_report:

print(metrics.classification_report(y_test, preds))

Here, the macro average of any metric is calculated as the mean of respective values of all classes by giving equal weightage to all classes. On the other hand, the weighted average of any metric is calculated by giving weightage based on the number of data points in respective classes. In the above output, numbers 0 and 1 denote negative and positive classes respectively and column support refers to the number of data points in those classes.
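
The two averaging schemes can be sketched with the per-class F1-scores and supports implied by the confusion matrix reported earlier (72 negative and 42 positive test samples). The dictionary names are illustrative.

```python
# Per-class F1-scores (exact fractions from the confusion matrix) and supports
f1_scores = {0: 144 / 149, 1: 74 / 79}   # approx. 0.966 and 0.937
support = {0: 72, 1: 42}

# Macro average: plain mean, every class counts equally
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class weighted by its number of data points
weighted_f1 = sum(f1_scores[c] * support[c] for c in f1_scores) / sum(support.values())

print(round(macro_f1, 3), round(weighted_f1, 3))
```

The weighted average sits closer to the score of the negative class because that class has more samples.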

ROC and AUC score

ROC is the short form of Receiver Operating Characteristic. The ROC curve helps determine the optimum threshold value for classification. The threshold is the cut-off applied to the predicted probability of the positive class, forming the boundary between the two classes. Here in our model, any predicted output above the threshold is classified as class 1 and below it is classified as class 0.

ROC is realized by visualizing it in a plot, with the false positive rate on the x-axis and the true positive rate on the y-axis. The area under the ROC curve, known as AUC, is used as a metric to evaluate the classification model. The best value of AUC is 1 and the worst value is 0. However, an AUC of 0.5, which corresponds to random guessing, is generally considered the bottom reference of a classification model.

In Python, ROC can be plotted by calculating the true positive rate and the false positive rate. roc_curve computes both rates at a series of thresholds taken from the predicted probabilities, from the highest down to the lowest.

 preds_train = model.predict(X_train)
 # calculate prediction probability of the positive class (column 1 is already 1-D)
 prob_train = model.predict_proba(X_train)[:,1]
 prob_test = model.predict_proba(X_test)[:,1]
 # false positive rate, true positive rate, thresholds
 fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, prob_test)
 fpr2, tpr2, thresholds2 = metrics.roc_curve(y_train, prob_train)
 # auc score
 auc1 = metrics.auc(fpr1, tpr1)
 auc2 = metrics.auc(fpr2, tpr2)
 # plot auc 
 plt.plot(fpr1, tpr1, color='blue', label='Test ROC curve area = %0.2f'%auc1)
 plt.plot(fpr2, tpr2, color='green', label='Train ROC curve area = %0.2f'%auc2)
 plt.plot([0,1],[0,1], 'r--')
 plt.xlim([-0.1, 1.1])
 plt.ylim([-0.1, 1.1])
 plt.xlabel('False Positive Rate', size=14)
 plt.ylabel('True Positive Rate', size=14)
 plt.legend(loc='lower right') 
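
To show what happens behind roc_curve and auc, here is a minimal pure-Python sketch on four made-up samples. The helper names roc_points and trapezoid_auc are hypothetical, and the sketch ignores refinements such as dropping collinear points.

```python
# Toy labels and predicted probabilities of the positive class (illustrative only)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, sweeping the threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing is predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

fpr, tpr = roc_points(y_true, scores)
auc = trapezoid_auc(fpr, tpr)
print(fpr, tpr, auc)
```

Each distinct score acts as one candidate threshold, so the curve has at most one point per distinct predicted probability, plus the starting point at (0, 0).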

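Tuning ROC to find the optimum threshold value: the following code looks for a suitable threshold (cut-off) at the point where the true positive rate is closest to the true negative rate (1 - fpr), that is, where tf is closest to zero.

 # creating index
 i = np.arange(len(tpr1))
 # extracting roc values against different thresholds; tf = tpr - (1 - fpr)
 roc = pd.DataFrame({'fpr':fpr1, 'tpr':tpr1, 'tf':(tpr1-1+fpr1), 'thresholds':thresholds1}, index=i)
 # top 5 best roc occurrences: rows where tf is closest to zero
 # (the selection line is missing in the source; this completion is an assumption)
 roc.loc[roc['tf'].abs().sort_values().index[:5]]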
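
The selection rule can be sketched without pandas. The ROC points below are hypothetical values chosen for illustration, not output from the article's model; the rule picks the threshold where tf = tpr - (1 - fpr) is closest to zero.

```python
# Hypothetical ROC points (illustrative values only)
thresholds = [0.8, 0.4, 0.35, 0.1]
fpr = [0.0, 0.5, 0.5, 1.0]
tpr = [0.5, 0.5, 1.0, 1.0]

# tf is zero where sensitivity (tpr) equals specificity (1 - fpr)
tf = [t - (1 - f) for f, t in zip(fpr, tpr)]

# Index of the point whose tf is closest to zero
best = min(range(len(tf)), key=lambda i: abs(tf[i]))
best_threshold = thresholds[best]

print(tf, best_threshold)
```

This balances the two error types equally. Where false negatives are costlier than false positives, as in cancer screening, a lower threshold may be preferable.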

Precision-Recall Curve

To find the best threshold value based on the trade-off between precision and recall, both are plotted against the threshold using precision_recall_curve.

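 pre, rec, thr = metrics.precision_recall_curve(y_test, prob_test)
 # pre and rec have one more element than thr; the final pair (1, 0) has no threshold
 plt.plot(thr, pre[:-1], label='precision')
 plt.plot(thr, rec[:-1], label='recall')
 plt.xlabel('Threshold')
 plt.legend()
 plt.title('Precision & Recall vs Threshold', c='r', size=16)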
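
The trade-off can be seen on a toy example by computing precision and recall at two different thresholds. The data and the helper precision_recall_at are made up for illustration; returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
# Toy labels and predicted probabilities of the positive class (illustrative only)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

def precision_recall_at(threshold):
    """Precision and recall of the positive class when predicting 1 for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # assumed convention for no positive predictions
    recall = tp / (tp + fn)
    return precision, recall

low = precision_recall_at(0.3)   # permissive threshold
high = precision_recall_at(0.5)  # strict threshold
print(low, high)
```

Raising the threshold from 0.3 to 0.5 lifts precision from 2/3 to 1 but cuts recall from 1 to 0.5, which is the trade-off the plot visualizes.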

The trade-off made by our random forest model between precision and recall can be visualized using the following code:

 fig, ax = plt.subplots(1,1, figsize=(8,8))
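 # plot_precision_recall_curve was removed in scikit-learn 1.2; PrecisionRecallDisplay replaces it
 metrics.PrecisionRecallDisplay.from_estimator(model, X_test, y_test, ax=ax)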

Hamming Loss

Hamming loss is the fraction of targets that are misclassified. The best value of the hamming loss is 0 and the worst value is 1. It can be calculated as 

 hamming_loss = metrics.hamming_loss(y_test, preds)

to give an output of 0.044.
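
For a single-label binary problem like this one, the Hamming loss is simply the share of misclassified samples, which a short sketch with the confusion matrix counts reported earlier confirms:

```python
# Confusion matrix counts reported for the test set
tn, fp, fn, tp = 72, 0, 5, 37
total = tn + fp + fn + tp

# Fraction of misclassified samples
hamming = (fp + fn) / total

# In this single-label setting it is the complement of accuracy
accuracy = (tp + tn) / total

print(round(hamming, 3), round(1 - accuracy, 3))
```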

Jaccard Score

Jaccard score is defined as the ratio of the size of the intersection to the size of the union of the predicted labels and the ground truth labels. For the positive class of a binary problem, this is TP / (TP + FP + FN). It is considered a similarity coefficient to compare the predicted classes and true classes. The value of 1 denotes the best classification and 0 denotes the worst. Jaccard score is considered a poor choice if the class distribution is imbalanced.

 jaccard = metrics.jaccard_score(y_test, preds)

which gives an output of 0.881.
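
A minimal sketch of the same calculation from the confusion matrix counts reported earlier. The intersection is the set of samples that are positive in both the predictions and the ground truth, and the union is the set that is positive in either.

```python
# Confusion matrix counts reported for the test set
tn, fp, fn, tp = 72, 0, 5, 37

# Intersection: positive in both truth and prediction (TP)
# Union: positive in truth or prediction (TP + FP + FN)
jaccard = tp / (tp + fp + fn)

print(round(jaccard, 3))
```

True negatives do not appear in the formula at all. With no false positives here, the value coincides with the recall of the positive class.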

Cross-entropy loss

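Cross-entropy loss, also known as log loss, is popular in deep neural networks, partly because it helps mitigate vanishing gradient problems when paired with sigmoid or softmax outputs. Unlike the metrics above, it is computed from predicted probabilities rather than predicted labels. It measures the impurity (uncertainty) of the predicted probabilities relative to the true labels. The cross-entropy loss is calculated as the negative mean of the logarithm of the probability that the model assigns to the true class of each data point, so confident wrong predictions are penalized heavily.

In Python, cross-entropy loss can be calculated using the code,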

 # Entropy loss
 cross_entropy_loss = metrics.log_loss(y_test, prob_test)

which gives an output of 0.463, where 0 denotes perfect classification or zero impurity.
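
The calculation can be sketched in a few lines of plain Python. The labels and probabilities below are made up for illustration. The function name binary_log_loss is hypothetical, and clipping probabilities with a small eps to avoid log(0) is a common safeguard that is assumed here.

```python
import math

# Toy labels and predicted probabilities of the positive class (illustrative only)
y_true = [1, 0, 1, 0]
prob_pos = [0.9, 0.2, 0.6, 0.4]

def binary_log_loss(y_true, prob_pos, eps=1e-15):
    """Negative mean log-probability assigned to the true class."""
    total = 0.0
    for t, p in zip(y_true, prob_pos):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

loss = binary_log_loss(y_true, prob_pos)

# A single fully confident wrong prediction dominates the loss
confident_wrong = binary_log_loss([1], [0.0])

print(round(loss, 4), round(confident_wrong, 2))
```

Every data point contributes to the loss, not only the misclassified ones, and a prediction of probability 0 for the true class is penalized far more than a hesitant one.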

Other metrics

Though we have covered most of the evaluation metrics for classification in this article, a few metrics meant only for multi-class classification are left untouched. Interested readers can refer to the official documentation of the metrics provided by Scikit-Learn, TensorFlow, and PyTorch.

Copyright Analytics India Magazine Pvt Ltd