Whenever we perform a classification task, whether binary or multi-class, we usually follow the same approach: preprocess the data, split it into train and test sets with the train_test_split function of the scikit-learn library, train the model, and read off an accuracy. But is that the best accuracy the model can achieve, and will the model perform as well at deployment time? This is where performance metrics and data-splitting techniques become important. In this article, we will see how Stratified K-Fold and various performance metrics can help us build a robust machine-learning model.

You might have observed that when you change the random_state passed to train_test_split, the model’s accuracy changes. Because it changes with every value of random_state, we cannot tell what the model's true accuracy is. By default, the split is a purely random sample, drawn without taking care of the distribution of classes. Say you are dealing with a binary classification problem in which 70% of the dataset belongs to class 0 and the remaining 30% to class 1; with this imbalance, random sampling has a high chance of producing different class distributions in the training and test sets. Training on such a split can give poor and misleading accuracy.
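To make this concrete, here is a minimal sketch of how the share of the positive class in the test set moves with random_state. The labels are synthetic (70 samples of class 0 and 30 of class 1, mirroring the example above), and the helper name `positive_share_in_test` is ours, not part of scikit-learn. For comparison, the same split is repeated with the `stratify` argument of train_test_split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (an assumption for illustration): 70% class 0, 30% class 1
y = np.array([0] * 70 + [1] * 30)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

def positive_share_in_test(seed, stratify=None):
    """Share of class 1 in a 20% test split drawn with the given random_state."""
    _, _, _, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=stratify
    )
    return float(y_test.mean())

# Purely random splits: the positive share may differ from seed to seed
random_shares = [positive_share_in_test(seed) for seed in range(5)]
# Stratified splits: the positive share stays at the dataset's 30%
stratified_shares = [positive_share_in_test(seed, stratify=y) for seed in range(5)]

print("random split:    ", random_shares)
print("stratified split:", stratified_shares)
```

With the random split, each seed is free to produce a different class balance in the test set; with stratification, every seed yields the same 30% share.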

The most used validation technique is K-Fold cross-validation, which splits the training dataset into k folds. In each round, k-1 folds are used for training and the remaining fold is held out for testing; this is repeated k times so that every fold serves as the test set exactly once. A total of k models are fit and evaluated, and the mean accuracy across these folds is reported. This process works well for balanced classification tasks, but it can fail for imbalanced classes, because plain K-Fold splits the data without taking care of the class labels.
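The procedure can be sketched as follows on synthetic labels with a 70/30 imbalance. This is only an illustration: the data is made up, and `shuffle=True` with a fixed `random_state` is our choice so that the folds are drawn randomly.

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic labels (an assumption for illustration): 70% class 0, 30% class 1
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((len(y), 1))  # placeholder features, only the indices matter here

kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
fold_positive_shares = []
for fold_no, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # k-1 folds (80 rows) go to training, the remaining fold (20 rows) to testing
    fold_sizes.append((len(train_idx), len(test_idx)))
    share = float(y[test_idx].mean())
    fold_positive_shares.append(share)
    print(f"Fold {fold_no}: train={len(train_idx)}, test={len(test_idx)}, "
          f"positive share in test={share:.2f}")
```

Every fold has the same size, but nothing forces the positive share of each test fold to match the 30% of the full dataset.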

So the solution is to split the data not randomly but in a stratified manner. Stratified K-fold cross-validation is an extension of the cross-validation technique for classification problems. It maintains the same class ratio in each of the K folds as in the original dataset. For example, if you are dealing with diabetes prediction with a class ratio of 70/30, stratified K-fold preserves that same ratio in every fold.
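A small sketch on synthetic 70/30 labels shows the effect. The data here is made up for illustration; only `StratifiedKFold` and its `split` method come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (an assumption for illustration): 70% class 0, 30% class 1
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=10)

stratified_shares = []
for fold_no, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # split() needs y so that it can balance the classes within each fold
    share = float(y[test_idx].mean())
    stratified_shares.append(share)
    print(f"Fold {fold_no}: test size={len(test_idx)}, positive share={share:.2f}")
```

Each of the 10 test folds holds 7 samples of class 0 and 3 of class 1, so the 70/30 ratio of the full dataset is reproduced in every fold.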

Next in the article, we will implement Stratified K-Fold cross-validation and analyse its importance using several metrics. The Python code below shows how one can use Stratified K-Fold cross-validation for a classification problem. After training our classifier, we will evaluate its performance against the following metrics:

- Confusion Matrix
- ROC AUC Curve
- F-1 Score
- Brier Score

## Implementing Stratified K-fold Cross-Validation in Python

Now let’s take a look at the practical implementation of Stratified K-fold. Here, the dataset we are working with tells us whether a particular patient has diabetes, based on seven input features.

We will define a 10-fold strategy with the StratifiedKFold class from the scikit-learn package, which will preserve the class ratio in each fold.

Let’s start by importing all dependencies;

```
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
```

Load and take a look at the dataset;

```
dataset = pd.read_csv('/content/diabetes.csv')
dataset.head(10)
```

Before proceeding further, check the class distribution; the percentage below shows the share of the positive class, i.e. label 1;

`print('Out of 100%, nearly {}% belongs to positive class'.format(round(sum(dataset.Outcome/len(dataset.Outcome)*100))))`

So 35% of the patients have tested positive for diabetes, which means we have a class ratio of 35/65.

To keep the same ratio throughout the 10 folds, we initialise the StratifiedKFold class below, which returns 10 folds with the same class distribution.

`skf = StratifiedKFold(n_splits=10)`

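Now we use logistic regression with the newton-cg solver to avoid convergence issues, and separate the target variable from the dataset as below. Note that x still contains the Outcome column; it is only used to generate the split indices, and the column is dropped before training;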

```
model = LogisticRegression(solver='newton-cg')
x = dataset
y = dataset.Outcome
```

The user-defined function below streamlines the training process so that we do not need to train the model manually for every fold;

```
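def training(train, test, fold_no):
    # Separate features and target for the training and test parts of this fold
    x_train = train.drop(['Outcome'], axis=1)
    y_train = train.Outcome
    x_test = test.drop(['Outcome'], axis=1)
    y_test = test.Outcome
    # Fit on the training folds and report accuracy on the held-out fold
    model.fit(x_train, y_train)
    score = model.score(x_test, y_test)
    print('For Fold {} the accuracy is {}'.format(str(fold_no), score))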
```

Try the complete code together;

```
dataset = pd.read_csv('/content/diabetes.csv')
skf = StratifiedKFold(n_splits=10)
model = LogisticRegression(solver='newton-cg')
x = dataset
y = dataset.Outcome

def training(train, test, fold_no):
    x_train = train.drop(['Outcome'], axis=1)
    y_train = train.Outcome
    x_test = test.drop(['Outcome'], axis=1)
    y_test = test.Outcome
    model.fit(x_train, y_train)
    score = model.score(x_test, y_test)
    print('For Fold {} the accuracy is {}'.format(str(fold_no), score))

# Train and evaluate once per fold; skf.split yields row indices for each fold
fold_no = 1
for train_index, test_index in skf.split(x, y):
    train = dataset.iloc[train_index, :]
    test = dataset.iloc[test_index, :]
    training(train, test, fold_no)
    fold_no += 1
```
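The loop above prints one accuracy per fold. If you also want the mean across folds, a more compact route is `cross_val_score`, sketched below. Since the diabetes CSV is not bundled here, the sketch assumes a synthetic stand-in dataset from `make_classification` with a roughly 65/35 class ratio, so its numbers will not match the ones in this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the diabetes data (an assumption, not the real dataset)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.65, 0.35], random_state=0
)

skf = StratifiedKFold(n_splits=10)
model = LogisticRegression(solver='newton-cg')

# One accuracy value per fold; the model is refit from scratch for each fold
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

print('Per-fold accuracy:', scores.round(3))
print('Mean accuracy: {:.3f}'.format(scores.mean()))
```

Passing the `StratifiedKFold` object as `cv` keeps the same splitting strategy as the manual loop while leaving the bookkeeping to scikit-learn.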

We now have a model that has been trained and evaluated on 10 folds, with a mean accuracy of nearly 78%. However, we should not consider this the final model, because so far we have only focused on overall accuracy, which on its own is not a sufficient measure for an imbalanced problem. The model should be evaluated against further performance metrics to ensure its robustness.

Now that our classifier has been trained on the different folds, we are going to check the performance of our model on test data, as mentioned earlier, and try to understand why each of the metrics below is important for classification problems.

#### Confusion Matrix:

It is a performance measurement for ML classification models where the output can be two or more classes. For binary classification, it is a table of the four combinations of predicted and actual values: true positives, false positives, true negatives and false negatives. It is extremely useful for computing Recall, Precision, Specificity and Accuracy, and the same counts underlie curves like ROC AUC.

Let’s plot the confusion matrix for our model;

```
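from sklearn.metrics import ConfusionMatrixDisplay
# ConfusionMatrixDisplay replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
X_test, y_test = test.drop(['Outcome'], axis=1), test.Outcome  # held-out rows of the last fold
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)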
```
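To see how Recall, Precision, Specificity and Accuracy fall out of the four cells, here is a small worked sketch. The labels and predictions are made up for illustration and are not output from our diabetes model.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels and predictions (hypothetical, for illustration only)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                   # of predicted positives, how many are right
recall = tp / (tp + fn)                      # of actual positives, how many are found
specificity = tn / (tn + fp)                 # of actual negatives, how many are found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall share of correct predictions

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} accuracy={accuracy:.2f}")
```

Here the four cells are TN=4, FP=2, FN=1 and TP=3, giving a precision of 0.60, a recall of 0.75, a specificity of about 0.67 and an accuracy of 0.70.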

#### ROC AUC Curve:

ROC stands for Receiver Operating Characteristic; the ROC curve is a plot that summarizes a model’s binary classification performance on the positive class across decision thresholds. The X-axis indicates the False Positive Rate and the Y-axis indicates the True Positive Rate. AUC stands for Area Under the Curve and condenses the plot into a single number. Together they show how well the model separates correct classifications from wrong ones.

Let’s see the ROC curve of our model;

```
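from sklearn.metrics import RocCurveDisplay
# RocCurveDisplay replaces plot_roc_curve, which was removed in scikit-learn 1.2
X_test, y_test = test.drop(['Outcome'], axis=1), test.Outcome  # held-out rows of the last fold
RocCurveDisplay.from_estimator(model, X_test, y_test)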
```
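If you want the numbers behind the plot, the sketch below computes the curve points and the area on a tiny hand-made example. The labels and scores are hypothetical and not taken from our model.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Area under the curve, computed two equivalent ways
area_from_points = auc(fpr, tpr)
area_direct = roc_auc_score(y_true, y_scores)

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", area_direct)
```

For this example the AUC is 0.75: three of the four positive/negative pairs are ranked in the right order by the scores.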

#### F1 Score:

The F-score, F-measure or F1 score is a measure of a test’s accuracy, calculated as the harmonic mean of Precision and Recall. Its value varies between 0 and 1, and the best value is 1.

```
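from sklearn.metrics import f1_score
# Evaluate on the held-out rows of the last fold from the loop above
X_test, y_test = test.drop(['Outcome'], axis=1), test.Outcome
print('F1 score is {}'.format(f1_score(y_test, model.predict(X_test))))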
```

For our model F1 score is 0.63 which is a decent score.
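As a quick check of the definition, the sketch below computes F1 by hand as the harmonic mean of precision and recall and compares it with `f1_score`. The labels and predictions are made up for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hand-made labels and predictions (hypothetical, for illustration only)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # 3 true positives out of 5 predicted positives
recall = recall_score(y_true, y_pred)        # 3 true positives out of 4 actual positives

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_library = f1_score(y_true, y_pred)

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"F1 by hand={f1_manual:.4f}  F1 from scikit-learn={f1_library:.4f}")
```

With a precision of 0.60 and a recall of 0.75, both routes give an F1 of about 0.667. The harmonic mean sits closer to the smaller of the two values, which is why F1 penalises a model that is strong on only one of them.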

#### Brier Score:

The Brier score is the mean squared error between the predicted probabilities and the actual outcomes. Thus, the score summarizes the magnitude of error in the probability forecasts. For binary outcomes, the score is always between 0 and 1, and a model with perfect skill has a score of 0.

Let’s check the Brier score for our model;

```
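from sklearn.metrics import brier_score_loss
# Evaluate on the held-out rows of the last fold from the loop above
X_test, y_test = test.drop(['Outcome'], axis=1), test.Outcome
probs = model.predict_proba(X_test)
# keeping the prediction for class 1
probs = probs[:, 1]
print('Brier loss: ', brier_score_loss(y_test, probs))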
```

Output: Brier loss: 0.17

Lower is better for this metric, so our model has a reasonably good Brier score.
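To make the definition tangible, here is a minimal sketch that computes the Brier score by hand as a mean squared error and compares it with `brier_score_loss`. The outcomes and probabilities are hypothetical, not from our model.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical actual outcomes and predicted probabilities of class 1
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, 0.8, 0.3])

# Mean squared difference between predicted probability and actual outcome
brier_manual = float(np.mean((y_prob - y_true) ** 2))
brier_library = brier_score_loss(y_true, y_prob)

print(f"Brier by hand={brier_manual:.4f}  Brier from scikit-learn={brier_library:.4f}")
```

The squared errors are 0.01, 0.01, 0.04 and 0.09, so both routes give 0.0375. Confident and correct probabilities keep the score near 0, while confident and wrong ones push it towards 1.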

## Conclusion:

In this article, we have seen how a common method such as the train_test_split() function can give a misleading picture of model performance on an imbalanced classification task. To counter this issue, we have seen a practical use case of Stratified K-fold cross-validation, which splits the data into K folds while preserving the class ratio of the original dataset, and we trained our classifier on these stratified folds. Finally, we evaluated our classifier with the help of several performance metrics.