10 Evaluation Metrics for Machine Learning Models

Check your machine learning models’ accuracy and performance based on these evaluation metrics.
Building a machine learning model is not a one-off task. Without measuring it, you cannot know whether a freshly built model works at all, or whether it works as well as expected. Model building follows a cycle: get feedback from a metric, implement the right improvements, and rebuild until the model reaches the desired accuracy.

However, choosing the right metric to evaluate the performance and accuracy of your model is a task in itself. So, once you have finished your classification or regression model, here is a list of evaluation metrics that can help you test its accuracy and reliability.

Confusion Matrix

At its simplest, the confusion matrix is a 2×2 matrix for binary classification, with one axis holding the actual values and the other the predicted values. The matrix grows in size with the number of classes being predicted.

Otherwise known as the ‘error matrix’, it is a tabular visual representation of the predictions of the model against the ground truth labels. 

True Positive: the model correctly predicts the positive class.

True Negative: the model correctly predicts the negative class.

False Positive: the model predicts positive when the actual class is negative.

False Negative: the model predicts negative when the actual class is positive.

With these values, we can calculate the rate of each prediction category using simple equations.
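The four cells can be counted directly from paired label lists. A minimal sketch in plain Python, using toy labels invented for illustration (1 = positive class, 0 = negative class):

```python
# Toy labels, not from the article; 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
```

Every later metric in this article can be expressed in terms of these four counts.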

Classification Accuracy

The simplest metric, classification accuracy is the number of correct predictions divided by the total number of predictions, multiplied by 100 to express it as a percentage.


If the class distribution is imbalanced, classification accuracy is not a good indicator of model performance. For a class-specific view, we instead use precision, calculated as True Positives divided by the sum of True Positives and False Positives, i.e., the fraction of positive predictions that are actually positive.


Recall is the fraction of samples from one class that the model predicts correctly. It is calculated as True Positives divided by the sum of True Positives and False Negatives.
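Both definitions reduce to ratios of the confusion-matrix counts. A minimal sketch, again on made-up toy labels:

```python
# Toy labels invented for illustration; 1 = positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # share of positive predictions that are right
recall = tp / (tp + fn)     # share of actual positives that were found
```

On this data the model finds 3 of 4 actual positives (recall 0.75) but only 3 of its 5 positive calls are right (precision 0.6), showing how the two metrics can diverge.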

Click here to read more about evaluation metrics for classification problems.

F1 Score

Now that we know what precision and recall are for classification problems, we can capture both simultaneously with the F1 score, the harmonic mean of the two, which also performs well on imbalanced datasets.

The F1 score gives the same importance to both recall and precision: F1 = 2 × (precision × recall) / (precision + recall). If we want to give more weight to one of them, a weighted variant can be used: Fβ = (1 + β²) × (precision × recall) / (β² × precision + recall), where the weight β expresses how many times more important recall is than precision.
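The weighted formula can be sketched as a small helper. The precision and recall values below are illustrative inputs, not results from the article:

```python
def f_beta(precision, recall, beta=1.0):
    # Weighted harmonic mean: beta > 1 favours recall, beta < 1 favours precision.
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(0.6, 0.75)           # beta = 1: the plain F1 score
f2 = f_beta(0.6, 0.75, beta=2)   # recall counted as twice as important
```

With β = 2, the score moves closer to the (higher) recall value, as intended.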


AUC – ROC

Area under the curve (AUC) is independent of changes in the proportion of responders. A probabilistic model produces a different confusion matrix at every classification threshold, i.e., for each true positive rate (sensitivity) we get a different false positive rate (1 − specificity). Plotting these pairs across thresholds gives the Receiver Operating Characteristic (ROC) curve, and the AUC is the area under that curve.

Since the area is measured within the unit square, the AUC always lies between 0 and 1. The closer it is to 1, the better the model.

Root Mean Square Error (RMSE)

One of the most popular metrics for regression problems, RMSE assumes that the errors are unbiased and follow a normal distribution. The higher the number of samples, the more reliably RMSE reconstructs the error distribution. The metric is given by RMSE = √( (1/N) × Σ (yᵢ − ŷᵢ)² ), where yᵢ is the actual value, ŷᵢ the predicted value, and N the number of samples.
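The formula translates directly into code. A sketch using made-up regression targets and predictions:

```python
import math

# Toy regression data invented for illustration.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

# Square each error, average, then take the root.
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

Because errors are squared before averaging, a single large miss (the 7.0 → 8.0 prediction here) dominates the score more than several small ones.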

Click here for a more detailed explanation of different evaluation metrics.

Cross-entropy Loss 

Otherwise known as ‘Log Loss’, cross-entropy loss is popular in deep neural networks, as it helps avoid the vanishing-gradient problems that squared-error losses suffer with sigmoid or softmax outputs. It is calculated as the negative average of the logarithm of the probability the model assigns to the true class, so confident wrong predictions are penalised heavily.
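For binary labels this can be sketched as below; the probability clipping is a common practical safeguard (an assumption here, not from the article) to avoid taking log(0):

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    # Mean negative log-likelihood of the true class; probs are P(class = 1).
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)
```

For example, two confident correct predictions (`log_loss([1, 0], [0.9, 0.1])`) give a small loss of about 0.105, while mispredicting with the same confidence would give about 2.303.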

Gini Coefficient

Used for classification problems, the Gini Coefficient is derived from the AUC–ROC value. It measures how far the ROC curve sits above the diagonal (random-guess) line. If the Gini Coefficient is above 60%, the model is generally considered good. The formula is:

Gini = 2*AUC – 1
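This rescaling is trivial to apply once an AUC is in hand; the 0.8 below is an illustrative AUC value, not from the article:

```python
def gini(auc):
    # Rescale AUC so that 0.5 (random guessing) maps to 0 and 1.0 to 1.
    return 2 * auc - 1

score = gini(0.8)  # an AUC of 0.8 corresponds to a Gini of 0.6
```

So a model needs an AUC of at least 0.8 to clear the 60% Gini threshold mentioned above.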

Jaccard Score

The Jaccard score is a similarity measure between two sets of data. It ranges from 0 to 1, with 1 being the best. To calculate it, we divide the number of observations present in both sets (the intersection) by the number of observations present in either set (the union):

J(A, B) = |A∩B| / |A∪B|
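The set formula maps directly onto Python's set operators. A minimal sketch with toy sets invented for illustration:

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| for any two iterables of hashable items.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

similarity = jaccard([1, 2, 3], [2, 3, 4])  # 2 shared of 4 total items
```

In classification, the same idea is applied per class to the sets of samples labelled positive by the ground truth and by the model.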

Here is a practical guide for evaluation metrics for machine learning models.

Mohit Pandey
Mohit is a technology journalist who dives deep into the Artificial Intelligence and Machine Learning world to bring out information in simple and explainable words for the readers. He also holds a keen interest in photography, filmmaking, and the gaming industry.