Classification models can be evaluated in several ways, and Cumulative Accuracy Profile (CAP) curve analysis is one of them. This article uses CAP curve analysis to evaluate and compare four classifiers on the same task: predicting whether a user will buy a product after seeing its advertisement on a social network. The CAP curves let us identify which of the four models performs best on this prediction task.
Cumulative Accuracy Profile
The Cumulative Accuracy Profile (CAP) is a tool used in machine learning to visualize the discriminative power of a classification model. The CAP of a model plots the cumulative number of positive outcomes on the y-axis against the corresponding cumulative number of a classifying parameter on the x-axis. The CAP differs from the Receiver Operating Characteristic (ROC) curve, which plots the true-positive rate against the false-positive rate.
CAP curve analysis evaluates a classification model by comparing it with a perfect classification model and a random classification model. The model's curve is compared to the perfect CAP, in which the maximum number of positive outcomes is reached directly, and to the random CAP, in which the positive outcomes are distributed equally. A good model has a CAP between the random CAP and the perfect CAP, and the closer its curve lies to the perfect CAP, the better the model.
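The three curves described above can be sketched on a small example. The labels and scores below are made up purely for illustration, and the variable names are hypothetical; this is a minimal sketch of the idea, not the code used later in the article.

```python
import numpy as np

# Toy data, made up for illustration: 1 = positive outcome.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
# Hypothetical model scores (higher = more likely positive).
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

total = len(y_true)
positives = int(y_true.sum())

# Model CAP: rank observations by score, then accumulate the positives found.
order = np.argsort(-scores, kind="stable")
model_cap = np.concatenate(([0], np.cumsum(y_true[order])))

# Perfect CAP: every positive is found before any negative.
perfect_cap = np.minimum(np.arange(total + 1), positives)

# Random CAP: positives are found at a constant rate.
random_cap = np.arange(total + 1) * positives / total

print(model_cap)
print(perfect_cap)
print(random_cap)
```

In this toy case the model curve stays between the random line and the perfect curve, which is the behaviour expected of a reasonable classifier.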
Social Network Ads Prediction
In this experiment, we use the social network ads data set that is publicly available on Kaggle. A company placed an advertisement for its product on a social networking site and recorded details of the people who clicked on the advertisement, along with whether or not they bought the product. The data set contains customer features, namely user id, gender, age and estimated salary, together with a label indicating whether the customer purchased the product. Classification models use the other attributes to predict whether a person will purchase the advertised product. The data set has 400 observations, of which 300 are used to train the classification models and the remaining 100 are used to test them.
Prediction and Performance Analysis
Four classification models are used to predict whether a customer will purchase the product. The reason for using four models is to compare them and find the best one. The models are Random Forest, Logistic Regression, K-Nearest Neighbors and Naive Bayes. Once trained, each model is tested by making predictions on new data, and its performance on this test data is analyzed using CAP curve analysis. The curves of all four models are drawn in a single plot together with the random model and the perfect model. Let us look at the Python code for this task.
```python
# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import the dataset and define the input and output features
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Split the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Define the Random Forest model, train the model and make prediction on test data
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
rf_classifier.fit(X_train, y_train)
y_pred_rf = rf_classifier.predict(X_test)

# Define the Logistic Regression model, train the model and make prediction on test data
from sklearn.linear_model import LogisticRegression
lr_classifier = LogisticRegression()
lr_classifier.fit(X_train, y_train)
y_pred_lr = lr_classifier.predict(X_test)

# Define the KNN model, train the model and make prediction on test data
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn_classifier.fit(X_train, y_train)
y_pred_knn = knn_classifier.predict(X_test)

# Define the Naive Bayes model, train the model and make prediction on test data
from sklearn.naive_bayes import GaussianNB
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)

# Visualize the CAP Curve Analysis including all 4 classification models
total = len(y_test)
one_count = np.sum(y_test)
zero_count = total - one_count

lm_rf = [y for _, y in sorted(zip(y_pred_rf, y_test), reverse = True)]
lm_lr = [y for _, y in sorted(zip(y_pred_lr, y_test), reverse = True)]
lm_knn = [y for _, y in sorted(zip(y_pred_knn, y_test), reverse = True)]
lm_nb = [y for _, y in sorted(zip(y_pred_nb, y_test), reverse = True)]

x = np.arange(0, total + 1)
y_rf = np.append([0], np.cumsum(lm_rf))
y_lr = np.append([0], np.cumsum(lm_lr))
y_knn = np.append([0], np.cumsum(lm_knn))
y_nb = np.append([0], np.cumsum(lm_nb))

plt.figure(figsize = (10, 6))
plt.plot([0, total], [0, one_count], c = 'b', linestyle = '--', label = 'Random Model')
plt.plot([0, one_count, total], [0, one_count, one_count], c = 'grey', linewidth = 2, label = 'Perfect Model')
plt.title('CAP Curve of Classifiers')
plt.plot(x, y_rf, c = 'b', label = 'RF classifier', linewidth = 2)
plt.plot(x, y_lr, c = 'r', label = 'LR classifier', linewidth = 2)
plt.plot(x, y_knn, c = 'y', label = 'KNN classifier', linewidth = 2)
plt.plot(x, y_nb, c = 'm', label = 'NB classifier', linewidth = 2)
plt.legend()
```
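One detail of the ranking step is worth a closer look. The listing orders test observations by sorting `(prediction, label)` pairs, and because the predictions are hard 0/1 labels, many pairs tie on the prediction. Python then compares the second element of the tuple, the true label, to break the tie. The sketch below uses made-up arrays to show the effect; the variable names `by_pair` and `by_pred_only` are hypothetical.

```python
import numpy as np

# Hypothetical labels and hard predictions, made up for illustration.
y_test = np.array([0, 1, 1, 0])
y_pred = np.array([1, 1, 0, 0])

# Sorting (prediction, label) pairs compares the label when predictions tie,
# so true positives move ahead of true negatives within each predicted class.
by_pair = [int(y) for _, y in sorted(zip(y_pred, y_test), reverse=True)]

# A stable sort on the prediction alone keeps the original order inside ties.
order = np.argsort(-y_pred, kind="stable")
by_pred_only = y_test[order].tolist()

print(np.cumsum(by_pair))       # [1 1 2 2]
print(np.cumsum(by_pred_only))  # [0 1 2 2]
```

Ranking by a continuous score leaves far fewer ties to break. The four scikit-learn classifiers used here all provide a `predict_proba` method, so one option, offered as a suggestion rather than as what the article does, is to sort by the predicted probability of the positive class instead of the hard prediction. The resulting curves may differ from the ones discussed in this article.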
As the resulting plot shows, the K-NN model appears to be the best of the four because its curve lies closest to the perfect model. The logistic regression model appears to be the worst of the four, as its curve lies farthest from the perfect model. In this way, we can compare the performance of various classification models on the same data set and pick the best one for the task at hand.
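Judging which curve is "closest" to the perfect model by eye can be hard when curves cross or lie near each other. One common way to put a number on it is the ratio of two areas: the area between the model CAP and the random CAP, divided by the area between the perfect CAP and the random CAP. This is often called the accuracy ratio. The sketch below is an illustration under assumptions, not part of the original experiment: the function names and the toy data are hypothetical.

```python
import numpy as np

def cap_curve(y_true, scores):
    """Cumulative positives after ranking observations by descending score."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return np.concatenate(([0], np.cumsum(y_true[order])))

def area_under(curve):
    """Trapezoidal area under a curve sampled at x = 0, 1, 2, ..."""
    curve = np.asarray(curve, dtype=float)
    return float(np.sum((curve[1:] + curve[:-1]) / 2.0))

def cap_area_ratio(y_true, scores):
    """Area between model and random CAP over area between perfect and random CAP.

    Assumes y_true contains at least one positive and one negative label;
    otherwise the denominator is zero.
    """
    y_true = np.asarray(y_true)
    total, positives = len(y_true), int(y_true.sum())
    x = np.arange(total + 1)
    random_area = area_under(x * positives / total)
    perfect_area = area_under(np.minimum(x, positives))
    model_area = area_under(cap_curve(y_true, scores))
    return (model_area - random_area) / (perfect_area - random_area)

# Toy data, made up for illustration.
y_true = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.2, 0.1])

print(cap_area_ratio(y_true, scores))  # between 0 and 1 for this toy model
print(cap_area_ratio(y_true, y_true))  # a perfect ranking gives 1.0
```

A value near 1 means the curve is close to the perfect model and a value near 0 means it is close to the random model, so the models could be ranked by this number rather than by inspecting the plot.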