Cumulative Accuracy Profile (CAP) Curve Analysis to Evaluate Classification Models in Social Network Ads Prediction

This article discusses the Cumulative Accuracy Profile (CAP) curve and uses it to evaluate and compare four classifiers on the same classification task.
CAP Curve Analysis

There are various methods for evaluating the performance of a classification model, and Cumulative Accuracy Profile (CAP) curve analysis is one of them. Here, four classifiers predict whether a user will buy a product after being shown a social network advertisement for it. Through CAP curve analysis, we identify which of the four models performs best on this prediction task.

Cumulative Accuracy Profile

The Cumulative Accuracy Profile (CAP) is a tool used in machine learning to visualize the discriminative power of a classification model. The CAP of a model plots the cumulative number of positive outcomes on the y-axis against the cumulative number of cases examined on the x-axis, with the cases ordered from the highest to the lowest model score. The CAP differs from the Receiver Operating Characteristic (ROC) curve, which plots the true-positive rate against the false-positive rate.
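To make the difference between the two x-axes concrete, here is a minimal sketch on a made-up ranking of ten cases. The labels and the helper name `cap_and_roc_point` are illustrative assumptions and are not part of the experiment in this article. For the top-k ranked cases, the helper returns the CAP point (share of all cases examined, share of positives captured) and the ROC point (false-positive rate, true-positive rate), with both axes normalized to fractions.

```python
def cap_and_roc_point(ranked_labels, k):
    """Return the CAP and ROC points after examining the top-k ranked cases.

    ranked_labels: true 0/1 labels, ordered from highest to lowest model score.
    Assumes both classes are present in ranked_labels.
    """
    total = len(ranked_labels)
    positives = sum(ranked_labels)
    negatives = total - positives

    captured = sum(ranked_labels[:k])   # positives found in the top k
    false_alarms = k - captured         # negatives found in the top k

    tpr = captured / positives
    cap_point = (k / total, tpr)                   # x = share of all cases
    roc_point = (false_alarms / negatives, tpr)    # x = false-positive rate
    return cap_point, roc_point


# Hypothetical ranking: 4 positives among 10 cases.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
cap_pt, roc_pt = cap_and_roc_point(ranked, k=5)
print(cap_pt)  # (0.5, 0.75)
print(roc_pt)  # x is 2/6, y is 0.75
```

Both points share the same height, the fraction of positives captured; only the horizontal coordinate differs.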

CAP curve analysis compares a classification model with a perfect model and a random model. In the perfect CAP, every positive outcome is captured before any negative one, so the curve rises as steeply as possible and then stays flat. In the random CAP, the positive outcomes are spread evenly, so the curve is a straight diagonal line. A useful model has a CAP between the two, and the closer it lies to the perfect CAP, the better the model.
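The three curves can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `cap_curves` and the toy labels and scores are hypothetical, and the curves are expressed as cumulative counts of positives.

```python
from itertools import accumulate


def cap_curves(y_true, scores):
    """Return cumulative positive counts for the model, perfect and random CAPs.

    Each list has len(y_true) + 1 entries; entry k is the number of positives
    among the first k cases examined.
    """
    total = len(y_true)
    positives = sum(y_true)

    # Model CAP: examine cases from highest to lowest score.
    order = sorted(range(total), key=lambda i: scores[i], reverse=True)
    model = [0] + list(accumulate(y_true[i] for i in order))

    # Perfect CAP: every positive is examined before any negative.
    perfect = [min(k, positives) for k in range(total + 1)]

    # Random CAP: positives are found at the base rate.
    random_ = [k * positives / total for k in range(total + 1)]
    return model, perfect, random_


# Hypothetical toy data: 3 positives among 6 cases.
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
model, perfect, random_ = cap_curves(y_true, scores)
print(model)    # [0, 1, 1, 2, 2, 3, 3]
print(perfect)  # [0, 1, 2, 3, 3, 3, 3]
print(random_)  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
```

In this toy case the model curve sits between the random and perfect curves at every step, which is the pattern described above for a useful model.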


Social Network Ads Prediction

In this experiment, we use the Social Network Ads data set, which is publicly available on Kaggle. A company placed an advertisement for its product on a social networking site and recorded details of the people who clicked on it, along with whether or not they bought the product. The data set contains the user ID, gender, age and estimated salary of each customer, together with a label indicating whether they purchased the product. The classification models predict this label from the other attributes. There are 400 observations in the data set, of which 300 are used to train the models and the remaining 100 are used to test them.

Prediction and Performance Analysis

Four classification models are used to predict whether a customer will purchase the product: Random Forest, Logistic Regression, K-Nearest Neighbors and Naive Bayes. Using four models lets us compare them and find the best among them. Once trained, each model makes predictions on the held-out test data, and these predictions are analyzed with CAP curves. The four curves are drawn in a single plot together with the random model and the perfect model. The Python code for this task follows.

# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import the dataset and define the input and output features
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Split the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Define the Random Forest model, train it and make predictions on the test data
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
rf_classifier.fit(X_train, y_train)
y_pred_rf = rf_classifier.predict(X_test)

# Define the Logistic Regression model, train it and make predictions on the test data
from sklearn.linear_model import LogisticRegression
lr_classifier = LogisticRegression()
lr_classifier.fit(X_train, y_train)
y_pred_lr = lr_classifier.predict(X_test)

# Define the KNN model, train it and make predictions on the test data
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn_classifier.fit(X_train, y_train)
y_pred_knn = knn_classifier.predict(X_test)

# Define the Naive Bayes model, train it and make predictions on the test data
from sklearn.naive_bayes import GaussianNB
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)

# Visualize the CAP curves of all 4 classification models
total = len(y_test)             # number of test observations
one_count = np.sum(y_test)      # number of actual purchases in the test set

# Order the true labels by predicted label, predicted positives first
# (rows with the same prediction are ordered by their true label)
lm_rf = [y for _, y in sorted(zip(y_pred_rf, y_test), reverse = True)]
lm_lr = [y for _, y in sorted(zip(y_pred_lr, y_test), reverse = True)]
lm_knn = [y for _, y in sorted(zip(y_pred_knn, y_test), reverse = True)]
lm_nb = [y for _, y in sorted(zip(y_pred_nb, y_test), reverse = True)]

# Cumulative number of purchases captured after each observation
x = np.arange(0, total + 1)
y_rf = np.append([0], np.cumsum(lm_rf))
y_lr = np.append([0], np.cumsum(lm_lr))
y_knn = np.append([0], np.cumsum(lm_knn))
y_nb = np.append([0], np.cumsum(lm_nb))

plt.figure(figsize = (10, 6))
# Reference curves: random model (diagonal) and perfect model
plt.plot([0, total], [0, one_count], c = 'k', linestyle = '--', label = 'Random Model')
plt.plot([0, one_count, total], [0, one_count, one_count], c = 'grey', linewidth = 2, label = 'Perfect Model')
# CAP curves of the four classifiers
plt.plot(x, y_rf, c = 'b', label = 'RF classifier', linewidth = 2)
plt.plot(x, y_lr, c = 'r', label = 'LR classifier', linewidth = 2)
plt.plot(x, y_knn, c = 'y', label = 'KNN classifier', linewidth = 2)
plt.plot(x, y_nb, c = 'm', label = 'NB classifier', linewidth = 2)
plt.title('CAP Curve of Classifiers')
plt.xlabel('Total observations')
plt.ylabel('Purchased observations')
plt.legend()
plt.show()


In the plot produced by the code above, the K-NN model appears to be the best of the four because its curve lies closest to the perfect model. The Logistic Regression model appears to be the worst, as its curve lies farthest from the perfect model. In this way, we can compare the performance of several classification models on the same data set and pick the best one for the task at hand.
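Judging closeness to the perfect model by eye can be complemented with a single number. One common choice is the accuracy ratio: the area between the model CAP and the random CAP, divided by the area between the perfect CAP and the random CAP. The sketch below is an illustration and not part of the original experiment; the function name `accuracy_ratio` and the toy data are hypothetical. The `scores` argument can be the hard 0/1 predictions used above, in which case many test rows tie in the ranking, or, as an assumption about how one might extend the experiment, the positive-class probabilities from scikit-learn's `predict_proba(X_test)[:, 1]`.

```python
from itertools import accumulate


def accuracy_ratio(y_true, scores):
    """Area between model and random CAP over area between perfect and random CAP.

    Assumes y_true holds 0/1 labels and that both classes are present.
    """
    total = len(y_true)
    positives = sum(y_true)

    # Model CAP: cumulative positives when cases are ranked by descending score.
    order = sorted(range(total), key=lambda i: scores[i], reverse=True)
    model = [0] + list(accumulate(y_true[i] for i in order))
    # Perfect CAP: all positives ranked first.
    perfect = [min(k, positives) for k in range(total + 1)]

    def area(curve):
        # Trapezoidal rule with unit spacing on the x-axis.
        return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))

    random_area = total * positives / 2   # triangle under the diagonal
    return (area(model) - random_area) / (area(perfect) - random_area)


# Hypothetical toy data: 3 positives among 6 cases.
y_true = [1, 0, 1, 0, 1, 0]
ar_partial = accuracy_ratio(y_true, [0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
ar_perfect = accuracy_ratio(y_true, [0.9, 0.1, 0.8, 0.2, 0.7, 0.3])
print(ar_partial)  # about 0.33
print(ar_perfect)  # 1.0
```

A value of 1 corresponds to the perfect model and a value of 0 to the random model, so a higher accuracy ratio indicates a curve closer to the perfect CAP.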

Dr. Vaibhav Kumar
Vaibhav Kumar has experience in the field of Data Science and Machine Learning, including research and development. He holds a PhD degree in which he has worked in the area of Deep Learning for Stock Market Prediction. He has published/presented more than 15 research papers in international journals and conferences. He has an interest in writing articles related to data science, machine learning and artificial intelligence.
