Chi-Square Test: An Analysis Of Relationship Between Variables


Let’s think of a scenario: we are looking to build a predictive model that will predict the probability of telecom customer attrition. One of the variables in our data is a binary variable (two categories, 0 and 1) that indicates whether the customer has internet services or not. While exploring the data, one of the statistical tests we can perform between churn and internet services is chi-square, a test of the relationship between two variables, to find out whether internet services could be one of the strong predictors of churn.

To understand the details of this concept, let’s start from the very beginning. When we analyze data we derive statistical reports, and these take two forms: descriptive and inferential. Descriptive statistics make it easy to describe our data sets. They let us, as analysts or as curious folks, look at highly complex data sets and learn a lot about them at a single glance. This includes looking at the mean, median or mode along with graphical plots of the information the data frame holds. Inferential statistics, however, help in understanding how the various variables are related and whether the relationship among them is significant or not. They help us draw conclusions from data and generalize them from samples to larger populations.

Chi-square is an inferential statistic used to test whether two or more variables depend on one another. It works well for nominal (categorical) variables and can also be applied to ordinal variables. Before we go deeper into this concept, there are a couple of things to keep in mind while working with this method.

  1. The test can be applied only to categorical variables. Continuous variables like height and distance can’t be tested with chi-square.
  2. The sample should be large, and the expected count in each cell should be 5 or more.
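The cell-count condition is usually stated in terms of expected counts, so it can be checked before running the test. The sketch below is one way to do that; the function names and both tables are hypothetical and only illustrate the check.

```python
def expected_counts(table):
    """Expected cell counts under independence for a list-of-rows table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def meets_rule_of_thumb(table, minimum=5):
    """True when every expected cell count is at least `minimum`."""
    return all(e >= minimum for row in expected_counts(table) for e in row)


# Hypothetical contingency tables, used only to illustrate the check
large_sample = [[30, 20], [25, 25]]
small_sample = [[3, 2], [4, 1]]

print(meets_rule_of_thumb(large_sample))
print(meets_rule_of_thumb(small_sample))
```

When the check fails, the chi-square approximation may be unreliable for that table.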

Now that we are clear about the limitations of the test, let’s move ahead and apply it to some data.

Suppose we have data on the preference of men and women for the field of data science.

H0 (Null Hypothesis): Preference for data science does not depend on gender

Proportion of male vs. female: the number of males (or females) divided by the total. 1750/2800 = 0.625 for male; 1050/2800 = 0.375 for female.

Data Science vs. No Data Science: the total number of respondents who prefer data science and the total number who do not.

Expected = B multiplied by C (the gender proportion times the preference total)

Residual = A minus D (the observed count minus the expected count)

Here, the chi-square value, which is the sum over all cells of the squared residual divided by the expected count, is 140.
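The calculation above can be sketched in a few lines of Python. The four cell counts below are an assumption: they are not given in the text and were chosen only because they agree with the stated totals (1750 men, 1050 women) and with the chi-square value of 140.

```python
# Assumed observed counts (NOT from the article); they match the stated
# totals of 1750 men and 1050 women and give a chi-square of 140.
observed = {
    ("Male", "Data Science"): 850,
    ("Male", "No Data Science"): 900,
    ("Female", "Data Science"): 750,
    ("Female", "No Data Science"): 300,
}
genders = ["Male", "Female"]
preferences = ["Data Science", "No Data Science"]

total = sum(observed.values())
row_totals = {g: sum(observed[(g, p)] for p in preferences) for g in genders}
col_totals = {p: sum(observed[(g, p)] for g in genders) for p in preferences}

chi_square = 0.0
for g in genders:
    for p in preferences:
        proportion = row_totals[g] / total        # e.g. 1750/2800 = 0.625
        expected = proportion * col_totals[p]     # proportion times preference total
        residual = observed[(g, p)] - expected    # observed minus expected
        chi_square += residual ** 2 / expected
        print(f"{g:6s} {p:16s} expected={expected:7.1f} residual={residual:7.1f}")

print(f"chi-square = {chi_square:.1f}")
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so it would report a slightly smaller statistic for the same counts unless it is called with `correction=False`.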

In order to make an inference from the chi-square statistic, we need these three values:

  1. Probability value
  2. Degree of freedom
  3. Critical values

To convert this value into a probability, we need the degrees of freedom.

dof = (2–1) × (2–1) = 1, since we have a 2×2 matrix: there are two categories for each variable.

We can use a chi-square distribution table to get the critical values. For this example, with 1 degree of freedom and a significance level of 0.05, the critical value is 3.841.

Since the chi-square value (140) is greater than the critical value of 3.841, we reject the null hypothesis, meaning there is a dependency between gender and data science preference. In this data, of the total population of data scientists, the majority (53%) are male.
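Instead of reading the critical value from a printed table, it can be computed from the chi-square distribution. The following is a minimal sketch using `scipy.stats.chi2`, with the statistic of 140 taken from the text; the variable names are illustrative.

```python
from scipy.stats import chi2

chi_square = 140.0          # statistic reported in the text
alpha = 0.05                # significance level
dof = (2 - 1) * (2 - 1)     # (rows - 1) * (columns - 1) for a 2x2 table

critical = chi2.ppf(1 - alpha, dof)   # value a distribution table would give
p_value = chi2.sf(chi_square, dof)    # probability of a statistic at least this large under H0
reject_null = chi_square > critical

print(f"dof={dof}, critical={critical:.3f}, p-value={p_value:.3g}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with alpha lead to the same decision.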

Below is the Python code to calculate chi-square:

# Import the required libraries
import pandas as pd
import scipy.stats

# Create a dataframe that gives us a 3x2 contingency matrix
subscription = {
    'Age Group': ['21 and below', '21-35 Age', '35 and above', '21 and below', '21-35 Age', '35 and above'],
    'Subscription': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No'],
    'Subscriber': [700, 300, 200, 500, 700, 600],
}

df = pd.DataFrame(subscription, columns=['Age Group', 'Subscription', 'Subscriber'])

# Steps to calculate chi-square

# Build the contingency table of subscriber counts (age group x subscription)
cnt = pd.crosstab(df['Age Group'], df['Subscription'], values=df['Subscriber'], aggfunc='sum').round(0)
stat, p, dof, expected = scipy.stats.chi2_contingency(cnt)
print(stat)
prob = 0.95  # confidence level
critical = scipy.stats.chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f \n' % (prob, critical, stat))
# The critical value comes from the chi-square distribution (the same value a lookup table gives)

# H0 (Null Hypothesis): Netflix subscription is not dependent on the age group the subscriber belongs to
# H1 (Alternate Hypothesis): Netflix subscription is dependent on the age group the subscriber belongs to

# There are two methods to evaluate chi-square
# Method 1: compare the statistic with the critical value
if abs(stat) >= critical:
    print('Dependent (reject Null Hypothesis)')
else:
    print('Independent (fail to reject Null Hypothesis)')

# Method 2: interpret the p-value
alpha = 1.0 - prob
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
    print('Dependent (reject Null Hypothesis)')
else:
    print('Independent (fail to reject Null Hypothesis)')

Chi-square is an important statistic and can help reveal important relationships between variables. Get your hands on it, try working on a couple of problems and see for yourself!

(This article originally appeared here. Re-published with author’s permission.)


Neha Wadhawan
Neha Wadhwan is a part of AIM Writers Programme. She has done her Masters in Economics from Delhi School of Economics and has over 12 years of experience in Analytics in sectors like Banking, Telecom, media and e-commerce. Analytics is her passion and she loves playing with numbers and creating a story out of data
