# Chi-Square Test: An Analysis Of Relationship Between Variables

Consider a scenario: we want to build a predictive model that estimates the probability of attrition (churn) for a telecom customer. One of the variables in our data is binary (two categories, 0 and 1) and indicates whether the customer has internet services. While exploring the data, one statistical test we can run between churn and internet services is the chi-square test, a test of the relationship between two categorical variables, to see whether internet services are associated with churn and could therefore be a useful predictor.

To understand the details of this concept, let’s start from the beginning. When we analyze data we derive statistical summaries, and these come in two forms: descriptive and inferential. Descriptive statistics make it easy to describe a data set. They let us, as analysts or as curious folks, look at a highly complex data set and learn a lot about it at a glance, through measures such as the mean, median and mode, along with graphical plots of the information the data frame contains. Inferential statistics, however, help us understand how variables are related and whether the relationship between them is significant. They let us draw conclusions from data and generalize them, moving from samples to large population groups.

Chi-square is an inferential statistic used to check whether two or more variables are independent of each other. It works well for categorical (nominal) variables and can also be applied to ordinal variables. Before going deeper into the concept, there are a couple of things to keep in mind when working with this method.

1. The test can be applied only to categorical variables. Continuous variables like height and distance can’t be tested with chi-square unless they are first grouped into categories.
2. The sample should be large, and the expected count in each cell of the table should be 5 or more.
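The second condition is commonly checked against the expected count in each cell rather than the raw entries. A minimal sketch of such a check, assuming SciPy is available; the helper name `expected_counts_ok` and both small tables are made up for illustration:

```python
import numpy as np
from scipy.stats.contingency import expected_freq


def expected_counts_ok(observed, minimum=5):
    """Return True if every expected cell count is at least `minimum`."""
    # expected_freq computes row_total * column_total / grand_total per cell
    expected = expected_freq(np.asarray(observed))
    return bool((expected >= minimum).all())


# Hypothetical tables, for illustration only
large_table = [[30, 20], [25, 25]]
small_table = [[3, 2], [1, 4]]

print(expected_counts_ok(large_table))
print(expected_counts_ok(small_table))
```

When the check fails, a common alternative for small 2×2 tables is Fisher's exact test.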

Now that the limitations of the test are clear, let’s apply it to some data.

Suppose we have data on whether men and women prefer the field of data science.

H0 (null hypothesis): Preference for data science is independent of gender.

H1 (alternative hypothesis): Preference for data science depends on gender.

Proportion of male vs. female: the number of respondents of each gender divided by the total. 1750/2800 = 0.625 for male; 1050/2800 = 0.375 for female.

Data science vs. no data science: the total number of respondents in each preference category.

Expected = B multiplied by C (the gender proportion multiplied by the preference total)

Residual = A minus D (the observed count minus the expected count)
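The expected and residual steps can be sketched with NumPy. The row totals (1750 male, 1050 female) come from the example above, but the individual cell counts below are invented for illustration, so this sketch is not expected to reproduce the article's chi-square value:

```python
import numpy as np

# Rows: male, female. Columns: data science, no data science.
# Cell counts are hypothetical; only the row totals match the article.
observed = np.array([[1000, 750],
                     [700, 350]])

n = observed.sum()                          # 2800 respondents in total
row_proportion = observed.sum(axis=1) / n   # 0.625 male, 0.375 female
col_totals = observed.sum(axis=0)           # totals per preference category

# Expected count = gender proportion * preference total
expected = np.outer(row_proportion, col_totals)

# Residual = observed count - expected count
residual = observed - expected

print(expected)
print(residual)
```

Under independence the residuals would all be close to zero; large residuals are what push the chi-square value up.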

Here, the chi-square value is 140.
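The article does not spell out how the residuals are combined into a single number. The standard Pearson chi-square statistic does it as follows, where the symbols $O_{ij}$ (observed count), $E_{ij}$ (expected count), $R_i$ (row total), $C_j$ (column total) and $N$ (grand total) are notation introduced here:

```latex
\chi^2 = \sum_{i}\sum_{j} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
```

Each cell contributes its squared residual scaled by the expected count, so the statistic grows as the observed table moves away from what independence would predict.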

To make an inference from the chi-square statistic, we need these three values:

1. Probability value (p-value)
2. Degrees of freedom
3. Critical value

To convert the chi-square value into a probability, we first need the degrees of freedom.

dof = (2 − 1) × (2 − 1) = 1, since we have a 2×2 table: there are two categories for each variable.

We can use a chi-square distribution table to look up the critical value. For this example, with 1 degree of freedom and a significance level of 0.05, the critical value is 3.841.
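Instead of a printed table, the same critical value can be looked up from the chi-square distribution in SciPy. A minimal sketch; the variable names are illustrative:

```python
from scipy.stats import chi2

rows, cols = 2, 2                  # two categories for each variable
dof = (rows - 1) * (cols - 1)      # degrees of freedom for a contingency table
alpha = 0.05                       # significance level

# ppf is the inverse CDF: the value below which (1 - alpha) of the distribution lies
critical = chi2.ppf(1 - alpha, dof)

print(f"dof={dof}, critical={critical:.3f}")  # dof=1, critical=3.841
```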

Since the chi-square value (140) is greater than the critical value of 3.841, we reject the null hypothesis, meaning there is a dependency between gender and data science preference. In this data, males make up the majority (53%) of those who prefer data science.
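The same decision can be reached through the p-value, the first of the three values listed earlier. A minimal sketch, assuming SciPy, that plugs in the statistic (140) and the single degree of freedom from this example:

```python
from scipy.stats import chi2

stat = 140.0    # chi-square value from the worked example
dof = 1         # degrees of freedom for the 2x2 table
alpha = 0.05    # significance level

# sf is the survival function: P(chi-square >= stat) under the null hypothesis
p_value = chi2.sf(stat, dof)
reject_null = p_value <= alpha

print(f"p-value={p_value:.3g}, reject null: {reject_null}")
```

Comparing the p-value with the significance level and comparing the statistic with the critical value always lead to the same conclusion.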

Below is the Python code to calculate chi-square:

```python
# Import the required libraries
import pandas as pd
import scipy.stats

# Create a dataframe of subscriber counts that will give us a 3x2 table
subscription = {
    'Age Group': ['21 and below', '21-35 Age', '35 and above',
                  '21 and below', '21-35 Age', '35 and above'],
    'Subscription': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No'],
    'Subscriber': [700, 300, 200, 500, 700, 600],
}

df = pd.DataFrame(subscription, columns=['Age Group', 'Subscription', 'Subscriber'])

# Steps to calculate chi-square

# Build the contingency table: age groups as rows, Yes/No as columns
cnt = pd.crosstab(df['Age Group'], df['Subscription'],
                  values=df['Subscriber'], aggfunc='sum').round(0)
stat, p, dof, expected = scipy.stats.chi2_contingency(cnt)
print(stat)

prob = 0.95  # confidence level
# The critical value is taken from the chi-square distribution
critical = scipy.stats.chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f \n' % (prob, critical, stat))

# H0 (Null Hypothesis): Netflix subscription is not dependent on the age group the subscriber belongs to
# H1 (Alternate Hypothesis): Netflix subscription is dependent on the age group the subscriber belongs to

# There are two methods to evaluate chi-square
# Method 1: compare the statistic with the critical value
if abs(stat) >= critical:
    print('Dependent (reject Null Hypothesis)')
else:
    print('Independent (fail to reject Null Hypothesis)')

# Method 2: interpret the p-value
alpha = 1.0 - prob
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
    print('Dependent (reject Null Hypothesis)')
else:
    print('Independent (fail to reject Null Hypothesis)')
```
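To see what `chi2_contingency` is doing, the statistic for the same subscriber counts can be recomputed by hand from the observed and expected tables. This is a sketch for checking understanding, not a replacement for the library call; the array below simply restates the counts from the dataframe above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: '21 and below', '21-35 Age', '35 and above'. Columns: No, Yes.
observed = np.array([[500, 700],
                     [700, 300],
                     [600, 200]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / n

# Sum of squared residuals scaled by the expected counts
manual_stat = ((observed - expected) ** 2 / expected).sum()

stat, p, dof, scipy_expected = chi2_contingency(observed)
print(f"manual={manual_stat:.3f}, scipy={stat:.3f}, dof={dof}")
```

The two values agree here because the table has 2 degrees of freedom. For a 2×2 table, `chi2_contingency` applies Yates' continuity correction by default, so a hand calculation would match only with `correction=False`.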

Chi-square is an important statistic and can help reveal important relationships between variables. Get your hands on it, try working through a couple of problems and see for yourself!


Neha Wadhwan is a part of the AIM Writers Programme. She holds a Master’s in Economics from the Delhi School of Economics and has over 12 years of experience in analytics in sectors like banking, telecom, media and e-commerce. Analytics is her passion, and she loves playing with numbers and creating a story out of data.
