Cumulative Accuracy Profile (CAP) Curve Analysis to Evaluate Classification Models in Social Network Ads Prediction

In this article, CAP curve analysis is used to evaluate and compare the performance of four different classifiers on the same classification task.
CAP Curve Analysis

There are various methods for evaluating the performance of a classification model, and Cumulative Accuracy Profile (CAP) curve analysis is one of them. Here, it is used to evaluate and compare the performance of four different classifiers on the task of predicting whether a user will buy a product after being shown a social network advertisement for it. Through the CAP curve analysis, we will identify the best of the four models for this prediction task.

Cumulative Accuracy Profile

The Cumulative Accuracy Profile (CAP) is a tool used in machine learning to visualize the discriminative power of a classification model. The CAP of a model plots the cumulative number of positive outcomes along the y-axis against the cumulative number of observations, ordered from the highest to the lowest predicted score, along the x-axis. The CAP differs from the Receiver Operating Characteristic (ROC) curve, which plots the true-positive rate against the false-positive rate.

In analyzing a classification model, CAP curve analysis compares the model's curve with those of a perfect classifier and a random classifier. The perfect CAP reaches the maximum number of positive outcomes immediately, while the random CAP accumulates positive outcomes uniformly. A good model has a CAP between the two, and the better the model, the closer its curve lies to the perfect CAP.
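To make this construction concrete, here is a minimal sketch of how the three curves relate on toy data. The cap_curve helper below is our own illustrative function, not part of any library: it ranks observations by predicted score and accumulates the actual positives captured.

# A minimal, self-contained CAP construction (illustrative helper, not a library function)
import numpy as np
import matplotlib.pyplot as plt

def cap_curve(y_true, y_score):
    # Rank observations by predicted score, highest first,
    # then count the actual positives captured so far
    order = np.argsort(y_score)[::-1]
    y_sorted = np.asarray(y_true)[order]
    x = np.arange(0, len(y_true) + 1)
    y = np.append([0], np.cumsum(y_sorted))
    return x, y

# Toy data: 8 observations, 3 of them positive
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.3, 0.2, 0.8, 0.4, 0.6, 0.1, 0.5])

x, y = cap_curve(y_true, y_score)
total, positives = len(y_true), y_true.sum()

plt.plot([0, total], [0, positives], c = 'k', linestyle = '--', label = 'Random Model')
plt.plot([0, positives, total], [0, positives, positives], c = 'grey', label = 'Perfect Model')
plt.plot(x, y, c = 'b', label = 'Model')
plt.xlabel('Total observations')
plt.ylabel('Positive outcomes')
plt.legend()
plt.show()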

Social Network Ads Prediction

In this experiment, we use the Social Network Ads data set, which is publicly available on Kaggle. A company placed an advertisement for its product on a social networking site and recorded, for each person who clicked on the advertisement, whether or not they bought the product. The data set contains customer attributes, including user ID, gender, age and estimated salary, along with whether the product was purchased. On the basis of these attributes, the classification models predict whether a person will purchase the advertised product. There are 400 observations in the data set, of which 300 are used to train the classification models and the remaining 100 to test them.
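As a quick sanity check, the snippet below inspects the file before modelling. It assumes the standard Kaggle layout with the columns User ID, Gender, Age, EstimatedSalary and Purchased; only Age and EstimatedSalary are used as input features in the code that follows.

import pandas as pd

# Assuming the standard Kaggle file layout:
# User ID, Gender, Age, EstimatedSalary, Purchased
dataset = pd.read_csv('Social_Network_Ads.csv')
print(dataset.columns.tolist())
print(dataset.shape)                        # expected: (400, 5)
print(dataset['Purchased'].value_counts())  # class balance of the target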

Prediction and Performance Analysis

The task of predicting whether a customer will purchase the product is performed using four classification models; training four models on the same data lets us compare them and find the best among them. The four models used are Random Forest, Logistic Regression, K-Nearest Neighbors and Naive Bayes. Once trained, each model makes predictions on the unseen test data, and this prediction performance is analyzed using CAP curve analysis: the curves of all four models are visualized in a single plot together with the random model and the perfect model. Let us see the Python code for this task.

# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


# Import the dataset and define the input and output features
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values   # Age and EstimatedSalary
y = dataset.iloc[:, 4].values        # Purchased


# Split the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)


# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


# Define the Random Forest model, train the model and make prediction on test data
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
rf_classifier.fit(X_train, y_train)
y_pred_rf = rf_classifier.predict(X_test)


# Define the Logistic Regression model, train the model and make prediction on test data
from sklearn.linear_model import LogisticRegression
lr_classifier = LogisticRegression()
lr_classifier.fit(X_train, y_train)
y_pred_lr = lr_classifier.predict(X_test)


# Define the KNN model, train the model and make prediction on test data
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn_classifier.fit(X_train, y_train)
y_pred_knn = knn_classifier.predict(X_test)


# Define the Naive Bayes model, train the model and make prediction on test data
from sklearn.naive_bayes import GaussianNB
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)


# Visualize the CAP curve analysis including all four classification models
total = len(y_test)                  # total number of test observations
one_count = np.sum(y_test)           # number of actual positive outcomes

# Sort the actual outcomes by each model's predictions, best-ranked first
lm_rf = [y for _, y in sorted(zip(y_pred_rf, y_test), reverse = True)]
lm_lr = [y for _, y in sorted(zip(y_pred_lr, y_test), reverse = True)]
lm_knn = [y for _, y in sorted(zip(y_pred_knn, y_test), reverse = True)]
lm_nb = [y for _, y in sorted(zip(y_pred_nb, y_test), reverse = True)]

# Cumulative count of positives captured as more observations are examined
x = np.arange(0, total + 1)
y_rf = np.append([0], np.cumsum(lm_rf))
y_lr = np.append([0], np.cumsum(lm_lr))
y_knn = np.append([0], np.cumsum(lm_knn))
y_nb = np.append([0], np.cumsum(lm_nb))

plt.figure(figsize = (10, 6))
plt.plot([0, total], [0, one_count], c = 'k', linestyle = '--', label = 'Random Model')
plt.plot([0, one_count, total], [0, one_count, one_count], c = 'grey', linewidth = 2, label = 'Perfect Model')
plt.plot(x, y_rf, c = 'b', label = 'RF classifier', linewidth = 2)
plt.plot(x, y_lr, c = 'r', label = 'LR classifier', linewidth = 2)
plt.plot(x, y_knn, c = 'y', label = 'KNN classifier', linewidth = 2)
plt.plot(x, y_nb, c = 'm', label = 'NB classifier', linewidth = 2)
plt.title('CAP Curve of Classifiers')
plt.xlabel('Total observations')
plt.ylabel('Positive outcomes')
plt.legend()
plt.show()

[Plot: CAP curves of the four classifiers together with the random and perfect models on the test data]

As we can see in the above plot, the K-NN model appears to be the best of the four because its curve lies closest to the perfect model, while the Logistic Regression model appears to be the worst because its curve lies farthest from it. In this way, we can compare the performance of several classification models on the same data set and find the best one for the task at hand.
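Beyond visual inspection, a CAP curve is commonly summarized by the accuracy ratio (AR): the area between a model's CAP and the random CAP divided by the area between the perfect CAP and the random CAP, so that 0 corresponds to a random model and 1 to a perfect one. The accuracy_ratio helper below is a minimal sketch of our own, reusing the x, y_rf, y_lr, y_knn, y_nb, total and one_count variables computed above.

# Summarize each CAP with the accuracy ratio (AR): 0 = random, 1 = perfect
def accuracy_ratio(y_model):
    area_model = np.trapz(y_model, x)                                       # area under the model's CAP
    area_random = 0.5 * total * one_count                                   # area under the random CAP
    area_perfect = 0.5 * one_count ** 2 + (total - one_count) * one_count   # area under the perfect CAP
    return (area_model - area_random) / (area_perfect - area_random)

for name, y_model in [('RF', y_rf), ('LR', y_lr), ('KNN', y_knn), ('NB', y_nb)]:
    print(f'{name}: AR = {accuracy_ratio(y_model):.3f}')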

Dr. Vaibhav Kumar
Dr. Vaibhav Kumar is a seasoned data science professional with extensive experience in machine learning and deep learning. He has a strong research background, having published several papers in reputed international journals and presented at reputed international conferences. He has worked across industry and academia and has led many research and development projects in AI and machine learning. Along with his current role, he is associated with several reputed research labs and universities as a visiting researcher and professor.
