Let’s consider a scenario: we want to build a predictive model that estimates the probability of telecom customer attrition (churn). One of the variables in our data is a binary variable (two categories, 0 and 1) indicating whether the customer has internet services. While exploring the data, one statistical test we can run between churn and internet services is the chi-square test, a test of the relationship between two categorical variables, to see whether internet services could be a strong predictor of churn.

To understand the details of this concept, let’s start from the beginning. When we analyze data, we rely on two forms of statistics: descriptive and inferential. Descriptive statistics summarize a data set so that analysts, or anyone curious, can learn a lot about even a highly complex data set at a glance. This includes measures such as the mean, median and mode, along with graphical plots of the data. Inferential statistics, by contrast, help us understand how variables are related and whether the relationship between them is significant. They let us draw conclusions from a sample and generalize them to the larger population.

Chi-square is one of the inferential statistics used to test the interdependence of two or more variables. It works well for categorical (nominal) variables and can also be applied to ordinal variables. Before we go deeper, there are a couple of things to keep in mind when working with this method.

- The test can be applied only to categorical variables. Continuous variables such as height and distance can’t be tested with chi-square directly.
- The sample should be large, and the expected count in each cell of the contingency table should be 5 or more.
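The sample-size condition can be checked before running the test. Below is a minimal sketch, assuming the data is already summarized as a table of counts; the helper name `expected_counts_ok` and the counts are made up for illustration, and the threshold of 5 is the usual rule of thumb rather than a hard limit.

```python
import numpy as np
from scipy.stats.contingency import expected_freq


def expected_counts_ok(observed, minimum=5):
    """Return True if every expected cell count is at least `minimum`.

    `observed` is a contingency table of counts (hypothetical helper,
    not part of SciPy).
    """
    # Expected count under independence: row total * column total / grand total
    expected = expected_freq(np.asarray(observed))
    return bool((expected >= minimum).all())


# Made-up counts, for illustration only
large_enough = [[12, 8], [3, 7]]
too_sparse = [[12, 8], [1, 3]]

print(expected_counts_ok(large_enough))
print(expected_counts_ok(too_sparse))
```

If the check fails, merging sparse categories or collecting more data are common options before relying on the chi-square result.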

Now that we are clear about the limitations of the test, let’s apply it to some data.

Suppose we have data on the preference of men and women for the field of data science.

H0 (Null Hypothesis): Preference for data science does not depend on gender

H1 (Alternate Hypothesis): Preference for data science depends on gender

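**Proportion of male vs. female**: the number in each gender divided by the overall total. 1750/2800 = 0.625 for male; 1050/2800 = 0.375 for female.

**Data Science vs. No Data Science**: the total number who prefer data science vs. the total number who do not.

**Expected** = B multiplied by C (the gender proportion multiplied by the data science or no data science total)

**Residual** = A minus D (the observed count minus the expected count)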
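The column calculations above can be sketched in a few lines of Python. Note that the four cell counts below are hypothetical: only the row totals (1750 male, 1050 female) come from the example, so the resulting statistic will not match the value quoted in this article. The statistic itself is the sum of squared residuals divided by the expected counts.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical cell counts; only the row totals (1750, 1050) are from the article
observed = pd.DataFrame(
    {"Data Science": [900, 700], "No Data Science": [850, 350]},
    index=["Male", "Female"],
)

total = observed.to_numpy().sum()
row_proportion = observed.sum(axis=1) / total   # share of male / female
column_total = observed.sum(axis=0)             # data science vs. no data science

# Expected = proportion of the row multiplied by the column total
expected = pd.DataFrame(
    np.outer(row_proportion, column_total),
    index=observed.index,
    columns=observed.columns,
)

# Residual = observed minus expected
residual = observed - expected

# Chi-square = sum over all cells of residual^2 / expected
chi_square = ((residual ** 2) / expected).to_numpy().sum()
print(expected)
print(residual)
print("chi-square = %.3f" % chi_square)

# Cross-check with SciPy; correction=False turns off the Yates continuity
# correction that chi2_contingency applies to 2x2 tables by default
stat, p, dof, _ = chi2_contingency(observed.to_numpy(), correction=False)
print("scipy stat = %.3f, dof = %d" % (stat, dof))
```

The manual result and the SciPy result agree only when the continuity correction is disabled, which is worth remembering when comparing hand calculations with library output on 2×2 tables.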

Here, the chi-square value is 140

In order to make an inference from the chi-square statistics, we need these three values:

- Probability value
- Degrees of freedom
- Critical values

To convert this value into a probability, we need the degrees of freedom.

dof = (2 − 1) × (2 − 1) = 1, since we have a 2×2 matrix: there are two categories for each variable.

We can use a chi-square distribution table to get the critical value. For this example, with 1 degree of freedom and a significance level of 0.05, the critical value is 3.841.

Since the chi-square value (140) is greater than the critical value of 3.841, we reject the null hypothesis, meaning there is a dependency between gender and data science preference. In this data, the majority (53%) of the data science population is male.
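Instead of reading a printed table, the critical value and the p-value can be looked up with SciPy. This is a small sketch using the numbers from the example (statistic 140, 2×2 table, significance level 0.05); the variable names are chosen here for illustration.

```python
from scipy.stats import chi2

chi_square = 140            # statistic from the example above
dof = (2 - 1) * (2 - 1)     # (rows - 1) * (columns - 1)
alpha = 0.05                # significance level

# Critical value: the point the statistic must exceed to reject H0
critical = chi2.ppf(1 - alpha, dof)

# p-value: probability of a statistic at least this large if H0 were true
p_value = chi2.sf(chi_square, dof)

print("critical = %.3f" % critical)
print("p-value = %.3g" % p_value)

if chi_square >= critical:
    print("Dependent (reject Null Hypothesis)")
else:
    print("Independent (fail to reject Null Hypothesis)")
```

Comparing the statistic with the critical value and comparing the p-value with the significance level always lead to the same decision, so either can be used.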

Below is the Python code to calculate chi-square:

    # Import the required libraries
    import pandas as pd
    import scipy.stats

    # Create a dataframe that gives us a 3x2 contingency table
    subscription = {
        'Age Group': ['21 and below', '21-35 Age', '35 and above',
                      '21 and below', '21-35 Age', '35 and above'],
        'Subscription': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No'],
        'Subscriber': [700, 300, 200, 500, 700, 600],
    }
    df = pd.DataFrame(subscription, columns=['Age Group', 'Subscription', 'Subscriber'])

    # Steps to calculate chi-square
    cnt = pd.crosstab(df['Age Group'], df['Subscription'],
                      values=df['Subscriber'], aggfunc='sum').round(0)
    stat, p, dof, expected = scipy.stats.chi2_contingency(cnt)
    print(stat)

    prob = 0.95  # confidence level
    # The critical value is taken from the chi-square distribution
    critical = scipy.stats.chi2.ppf(prob, dof)
    print('probability=%.3f, critical=%.3f, stat=%.3f \n' % (prob, critical, stat))

    # H0 (Null Hypothesis): Netflix subscription is not dependent on the age group the subscriber belongs to
    # H1 (Alternate Hypothesis): Netflix subscription is dependent on the age group the subscriber belongs to

    # There are two methods to evaluate chi-square

    # Method 1: compare the statistic with the critical value
    if abs(stat) >= critical:
        print('Dependent (reject Null Hypothesis)')
    else:
        print('Independent (fail to reject Null Hypothesis)')

    # Method 2: interpret the p-value
    alpha = 1.0 - prob
    print('significance=%.3f, p=%.3f' % (alpha, p))
    if p <= alpha:
        print('Dependent (reject Null Hypothesis)')
    else:
        print('Independent (fail to reject Null Hypothesis)')

Chi-square is an important statistic and can help reveal important relationships between variables. Get your hands on it, try working through a couple of problems and see for yourself!

*(This article originally appeared here. Re-published with author’s permission.)*