Building a machine learning model is not a one-off task. You might not know whether the model works at all, or, if it does, whether it works as well as expected. Model building follows a cycle of getting feedback from a metric, implementing the right improvements, and rebuilding until the desired accuracy is reached.
However, choosing the right metric to evaluate the performance and accuracy of your model is a task in itself. So, once you have finished your predictive, classification, or regression model, here is a list of evaluation metrics that can help you test its accuracy and robustness.
Confusion Matrix
In its simplest form, it is a 2×2 matrix for binary classification, with one axis holding the actual values and the other the predicted values. The matrix grows with the number of classes being predicted.
Otherwise known as the ‘error matrix’, it is a tabular representation of the model’s predictions against the ground-truth labels.
True Positive (TP) is a positive sample correctly predicted as positive by the model.
True Negative (TN) is a negative sample correctly predicted as negative by the model.
False Positive (FP) is a negative sample wrongly predicted as positive by the model.
False Negative (FN) is a positive sample wrongly predicted as negative by the model.
With these values we can calculate the rate of each prediction category by a simple equation.
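As a minimal sketch with scikit-learn (the labels here are made up purely for illustration), the four values can be read straight off the matrix:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels for a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix in the order
# TN, FP, FN, TP (rows = actual class, columns = predicted class)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```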
Classification Accuracy
The simplest metric, it is calculated by dividing the number of correct predictions by the total number of predictions and multiplying by 100:
Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100
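A quick sanity check in plain Python, using the same hypothetical labels as above:

```python
# Classification accuracy: correct predictions / total predictions, x 100
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true) * 100  # 6 of 8 correct
```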
Precision
If the class distribution is imbalanced, classification accuracy isn’t the best indicator of the model’s performance. To tackle a class-specific problem, we need precision, which is calculated as True Positives divided by the sum of True Positives and False Positives:
Precision = TP / (TP + FP)
(Precision should not be confused with specificity, which is the True Negative rate, TN / (TN + FP).)
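With scikit-learn, this is a one-liner (labels again made up for illustration):

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = TP / (TP + FP); here TP = 3 and FP = 1
precision = precision_score(y_true, y_pred)
```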
Recall/Sensitivity
Recall is the fraction of samples from one class that the model predicts correctly. It is calculated as True Positives divided by the sum of True Positives and False Negatives:
Recall = TP / (TP + FN)
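The scikit-learn equivalent, on the same illustrative labels:

```python
from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Recall = TP / (TP + FN); here TP = 3 and FN = 1
recall = recall_score(y_true, y_pred)
```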
F1 Score
Now that we know what precision and recall are for classification problems, the F1 score lets us capture both at once: it is the harmonic mean of the two, and it also performs well on an imbalanced dataset:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The F1 score gives the same importance to both recall and precision. If we want to give more weight to one of them, a weighted Fβ score can be calculated, where β expresses how many times more important recall is than precision:
Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
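Both variants are available in scikit-learn; as a sketch (labels made up, beta=2 treats recall as twice as important as precision):

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

f1 = f1_score(y_true, y_pred)             # equal weight to precision and recall
f2 = fbeta_score(y_true, y_pred, beta=2)  # beta > 1 favours recall
```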
AUC – ROC
Area under the curve (AUC) is independent of changes in the proportion of responders. In a probabilistic model, each classification threshold yields a different pair of True Positive rate (sensitivity) and False Positive rate (1 − specificity). Plotting these pairs produces the Receiver Operating Characteristic (ROC) curve, and the area under that curve summarises the model in a single number.
Since the area is measured between the curve and the axes, it always lies between 0 and 1. The closer it is to 1, the better the model.
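scikit-learn computes the area directly from predicted scores; a small sketch with invented scores:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the probability that a randomly chosen positive sample
# is ranked above a randomly chosen negative one
auc = roc_auc_score(y_true, y_score)
```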
Root Mean Square Error (RMSE)
One of the most popular metrics used in regression problems, RMSE assumes that the errors are unbiased and follow a normal distribution. The higher the number of samples, the more reliably RMSE reconstructs the error distribution. The equation of the metric is:
RMSE = √( Σ (yᵢ − ŷᵢ)² / N )
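The formula translates directly into stdlib Python (the targets and predictions below are invented):

```python
import math

# Hypothetical regression targets and model predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Root of the mean of the squared errors
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```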
Cross-entropy Loss
Otherwise known as ‘Log Loss’, cross-entropy loss is widely used in deep neural networks because, combined with sigmoid or softmax outputs, it mitigates the vanishing-gradient problem. It is calculated as the negative average of the logarithm of the probability the model assigns to the true class of each data point; for binary labels:
Log Loss = −(1/N) Σ [ yᵢ × log(pᵢ) + (1 − yᵢ) × log(1 − pᵢ) ]
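A small sketch with scikit-learn, using invented labels and predicted probabilities of the positive class:

```python
import math
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.1, 0.8, 0.6]  # predicted P(class = 1)

loss = log_loss(y_true, y_prob)

# Same value computed by hand from the binary cross-entropy formula
by_hand = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for y, p in zip(y_true, y_prob)
) / len(y_true)
```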
Gini Coefficient
Used for classification problems, the Gini Coefficient is derived from the AUC–ROC number: it is the ratio of the area between the ROC curve and the diagonal line to the area of the triangle above that diagonal. If the Gini Coefficient is above 60%, the model is generally considered good. The formula used for this is:
Gini = 2*AUC – 1
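So, reusing the illustrative AUC example from above, the coefficient is one extra line:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, as in the AUC example
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1  # rescales AUC from [0.5, 1] to [0, 1]
```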
Jaccard Score
The Jaccard score is a similarity measure between two sets of data. It ranges between 0 and 1, with 1 being the best. To calculate the Jaccard score, we divide the number of observations common to both sets (the intersection) by the number of observations in either set (the union):
J(A, B) = |A∩B| / |A∪B|
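The definition maps directly onto Python’s set operators (the sets below are made up):

```python
# Two hypothetical sets of observations
a = {1, 2, 3, 4}
b = {3, 4, 5}

# |A ∩ B| / |A ∪ B|
jaccard = len(a & b) / len(a | b)
```

For label vectors rather than sets, scikit-learn offers the equivalent `sklearn.metrics.jaccard_score`.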