The strength of a linear relationship between two quantitative variables can be measured using correlation. It is a statistical measure that is easy to calculate and to interpret. It is generally represented by ‘r’, known as the coefficient of correlation.
This ease of use is also the reason it is so often misused by professionals: correlation does not imply causation. Two correlated variables do not necessarily depend on each other, and two variables with no correlation may still be related, for example through a non-linear relationship. This is where PPS (Predictive Power Score) comes into play.
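To make the second point concrete, here is a minimal sketch in plain Python (no ppscore involved). The helper pearson_r and the toy values are assumptions made for illustration only; the sketch computes the coefficient of correlation for a linear relationship and for a perfectly deterministic but non-linear one.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (illustrative values, not taken from any dataset in this article)
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_linear = [2 * x + 1 for x in xs]   # y depends linearly on x
ys_squared = [x ** 2 for x in xs]     # y is fully determined by x, but not linearly

r_linear = pearson_r(xs, ys_linear)
r_squared = pearson_r(xs, ys_squared)

print("r for y = 2x + 1:", r_linear)
print("r for y = x^2:   ", r_squared)
```

For the linear case r comes out as 1, while for y = x^2 it comes out as 0 even though y can be computed exactly from x. This is the kind of relationship that correlation alone misses.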
Predictive Power Score works similarly to the coefficient of correlation but has some additional capabilities:
- It works on both Linear and Non-Linear Relationships
- Can be applied to both Numeric and Categorical columns
- It finds more patterns in the data.
In this article, we will explore how we can use the Predictive Power Score to replace correlation.
Implementation:
PPS is available as an open-source Python library called ppscore, so we will install it like any other Python library using pip install ppscore.
- Importing required libraries
We will import ppscore along with pandas to load a dataset that we will work on.
import ppscore as pps
import pandas as pd
- Loading the Dataset
We will be using different datasets to explore different functionalities of PPS. We will first import an advertising dataset of an MNC, in which ‘Sales’ is the target variable and columns like ‘TV’, ‘Radio’, etc. are the features.
df = pd.read_csv('advertising.csv')
df.head()
- Finding Relation using PPScore
We will use some basic functions defined in ppscore.
- Finding the Relationship score
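PP Score lies between 0 (no predictive power) and 1 (perfect predictive power). In this step we will find the PP Score of a feature for the target variable in the given dataset. Note that pps.score takes the feature column first and the target column second, and the score is not symmetric, so the order of the arguments matters.
pps.score(df, "TV", "Sales")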
Here we can see that along with the ppscore, the output provides more information, such as the model used for finding the score, the score of the model, the evaluation metric used, etc.
Similarly, we can find the PP Score of all the features for the target variable, which is ‘Sales’ in our case, using the predictors function.
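The following is a rough sketch of how a model score and a baseline score can be combined into a value between 0 and 1 for a numeric target. It assumes the approach described in the ppscore documentation, where the model's mean absolute error is compared with that of a naive baseline that always predicts the median of the target. The function names and the numbers are hypothetical, and the real library fits a model with cross-validation, which this sketch skips.

```python
from statistics import median

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def normalized_score(actual, model_predictions):
    """Score in [0, 1]: 0 means no better than the median baseline, 1 means perfect."""
    baseline_prediction = median(actual)
    baseline_mae = mean_absolute_error(actual, [baseline_prediction] * len(actual))
    model_mae = mean_absolute_error(actual, model_predictions)
    if baseline_mae == 0:
        # Constant target: there is nothing to predict, so report no predictive power
        return 0.0
    # Clip at 0 so that a model worse than the baseline is not given a negative score
    return max(0.0, 1 - model_mae / baseline_mae)

# Illustrative target values and three hypothetical sets of predictions
actual = [10, 12, 14, 16, 18]
perfect_predictions = [10, 12, 14, 16, 18]
median_predictions = [14, 14, 14, 14, 14]
decent_predictions = [11, 13, 14, 15, 17]

score_perfect = normalized_score(actual, perfect_predictions)
score_median = normalized_score(actual, median_predictions)
score_decent = normalized_score(actual, decent_predictions)

print(score_perfect, score_median, score_decent)
```

Perfect predictions give a score of 1, predictions that only repeat the median give 0, and the third set lands in between.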
pps.predictors(df, "Sales")
Here we can see that we found the predictive power score for all the features/predictors.
- Visualizing the correlation
Normally we create a correlation matrix and visualize it using a heatmap. PPS also has a matrix function, which is similar to the correlation matrix. Let us create a PPS matrix and visualize it.
For visualization, we will be using seaborn and we need to import it.
import seaborn as sns
matrix_df = pps.matrix(df).pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, annot=True)
This is how we can visualize the ppscore relationship between different attributes of the dataset.
Now let us explore one more dataset, which can be downloaded from Kaggle. It contains attributes of different used cars and has mixed data, i.e. both numeric and categorical columns. We will remove all the feature-named columns to reduce the number of columns.
df1 = pd.read_csv('cars.csv')
df1.head()
We have already seen how we can find the ppscore, so we will now compare the visualization using the normal correlation matrix and the ppscore matrix.
- Correlation
from matplotlib.pyplot import figure
figure(figsize=(12,8))
sns.heatmap(df1.corr(), annot=True)
Here we can see that only 9 attributes appear, which are the numerical columns, because correlation only finds relationships between numerical columns.
- PPScore Matrix Visualization
figure(figsize=(12,8))
a = pps.matrix(df1).pivot(columns='x', index='y', values='ppscore')
sns.heatmap(a, annot=True)
Here we can see that it takes into account all the columns in the dataset, including the categorical ones, which makes it more useful and powerful than correlation.
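To give an intuition for how a categorical column can receive a score at all, here is a simplified sketch in plain Python. It is not what ppscore does internally (the library fits a model and evaluates it with cross-validation); as a stand-in, the sketch predicts the target with the median of each category and compares the error with a median-only baseline. The column names body_type and price, the random_label column, and all values are made up for illustration.

```python
from collections import defaultdict
from statistics import median

def group_median_score(categories, target):
    """Score a categorical column by predicting each row with its category's median."""
    groups = defaultdict(list)
    for category, value in zip(categories, target):
        groups[category].append(value)
    group_medians = {category: median(values) for category, values in groups.items()}

    overall_median = median(target)
    baseline_mae = sum(abs(t - overall_median) for t in target) / len(target)
    model_mae = sum(
        abs(t - group_medians[c]) for c, t in zip(categories, target)
    ) / len(target)
    if baseline_mae == 0:
        return 0.0
    # Evaluated on the same rows it was fitted on, so the result is optimistic
    return max(0.0, 1 - model_mae / baseline_mae)

# Hypothetical columns with made-up values
price = [10, 12, 30, 32, 5, 7]
body_type = ["sedan", "sedan", "suv", "suv", "hatchback", "hatchback"]
random_label = ["a", "b", "a", "b", "a", "b"]

informative_score = group_median_score(body_type, price)
uninformative_score = group_median_score(random_label, price)

print("body_type -> price:   ", round(informative_score, 3))
print("random_label -> price:", round(uninformative_score, 3))
```

The column that groups similar prices together scores close to 1, while the label that is unrelated to price scores close to 0. A correlation matrix cannot express either case, because neither column is numeric.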
Conclusion:
In this article, we saw how correlation can be replaced using ppscore, an open-source Python library used for finding relationships in both numerical and categorical columns. We also visualized the relationships found by correlation and by ppscore to see the difference between them.