# A beginner’s guide to the Chi-square test in Python from scratch

The Chi-square test is one of the statistical tests we can use to decide whether there is a correlation between the categorical variables.

Suppose you are exploring a dataset and want to know whether two categorical variables are related. The Chi-square test is well suited to this question: it compares the observed counts with the counts we would expect if the variables were independent. In this article, we will discuss the Chi-square test and implement it in Python from scratch on randomly generated data. The major points to be discussed are listed below.

1. What is chi-square?
2. Where is chi-square used?
3. Introduction to data
4. Implement chi-square
5. Yates’ correction

Let’s begin by understanding what the Chi-square test is.

## What is Chi-square?

The Chi-square test is a statistical test used to determine whether there is an association between categorical variables (columns) in a dataset. It applies to variables that take discrete categories rather than continuous values.

How to use the Chi-square test?

To use the chi-square test, we can take the following steps:

1. Define the null (H0) and alternative (H1) hypotheses.
2. Choose the significance level alpha (𝞪) according to the domain you are working in. A common choice is 𝞪 = 0.05, which means you accept a 5% risk of rejecting the null hypothesis when it is actually true.
3. Check the data for NaNs or other kinds of errors.
4. Check the assumptions of the test.
5. Finally, perform the test and conclude whether to reject or fail to reject the null hypothesis (H0).
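The steps above can be sketched as one small function. This is a minimal illustration, not a definitive implementation: the helper name `run_chi_square_test`, its parameters, and the toy data are all hypothetical, and the "every expected count is at least 5" check is a common rule of thumb assumed here for the assumptions step.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def run_chi_square_test(data, col_a, col_b, alpha=0.05):
    """Hypothetical helper: test H0 'col_a and col_b are independent'."""
    # Step 3: drop rows with missing values in either column
    clean = data[[col_a, col_b]].dropna()

    # Build the contingency table of observed counts
    observed = pd.crosstab(clean[col_a], clean[col_b])

    # Step 5: run the test (correction=False gives the plain Pearson statistic)
    stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

    # Step 4: rule of thumb (an assumption here): expected counts of at least 5
    assumption_ok = bool((expected >= 5).all())

    return {
        "statistic": stat,
        "p_value": p_value,
        "dof": dof,
        "assumption_ok": assumption_ok,
        "reject_h0": bool(p_value < alpha),
    }


# Toy data, invented purely for illustration (it contains missing values)
toy = pd.DataFrame({
    "colour": ["red", "blue", "red", "blue", None, "red", "blue", "red"] * 10,
    "size": ["S", "L", "L", "S", "S", "S", "L", np.nan] * 10,
})
result = run_chi_square_test(toy, "colour", "size")
print(result)
```

Returning a dictionary keeps the decision (`reject_h0`) next to the evidence behind it, so the result can be checked against the chosen alpha later.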

The formula for the Chi-square test is given as:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Where,

$\chi^2$ = chi-square

$O_i$ = observed value

$E_i$ = expected value

The Chi-square formula is a statistical method to compare observed counts with expected counts across two or more data samples. It is used with data that consist of variables distributed across various categories and is denoted by χ².
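As a small worked example, the statistic sums (observed - expected)² / expected over all cells. The numbers below are made up for illustration, and the variable names are assumptions rather than part of the article's dataset.

```python
import numpy as np

# Toy observed and expected counts for four cells (illustrative numbers only)
observed_counts = np.array([20.0, 30.0, 25.0, 25.0])
expected_counts = np.array([25.0, 25.0, 25.0, 25.0])

# Per-cell contribution (O - E)^2 / E, then the sum over all cells
contributions = (observed_counts - expected_counts) ** 2 / expected_counts
chi_square = contributions.sum()

print(contributions)  # [1. 1. 0. 0.]
print(chi_square)     # 2.0
```

Cells where the observed count matches the expected count contribute nothing, so the statistic grows only where the data depart from what independence would predict.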

## Where is the Chi-square test used?

Pearson’s chi-squared test is a hypothesis test that is used to determine whether there is a significant association between two categorical variables in the data. The test involves two hypotheses (H0 & H1):

• H0 : The two categorical variables have no relationship (independent)
• H1 : There is a relationship (dependent) between two categorical variables

The null hypothesis is the default position: the two variables are independent. The alternative hypothesis is that they are associated. When selecting features for a machine learning model, independence between two categorical columns is often the preferred outcome, because strongly associated columns carry redundant information and may contribute to overfitting.

In the next section of the article, I will introduce the data on which we are going to implement the chi-square test to see whether the variables are related.

## Introduction to data

We will create a random table for implementing the Chi-square test and check whether there is a relation between the columns. We will follow these steps to create a contingency table:

Step 1: Importing libraries to create an array and data frame:

`import numpy as np`
`import pandas as pd`

Step 2: Creating an array and converting that array to the data frame:

```
np.random.seed(10)

# Sample bottle types at random with fixed probabilities
type_bottle = np.random.choice(a=["paper", "cans", "glass", "others", "plastic"],
                               p=[0.05, 0.15, 0.25, 0.05, 0.5],
                               size=1000)

# Sample months at random with fixed probabilities
month = np.random.choice(a=["January", "February", "March"],
                         p=[0.4, 0.2, 0.4],
                         size=1000)

bottles = pd.DataFrame({"types": type_bottle,
                        "months": month})

bottles_tab = pd.crosstab(bottles.types, bottles.months, margins=True)

# crosstab sorts labels alphabetically, so rename only the "All" margins
# instead of overwriting every label (which would mislabel rows and columns)
bottles_tab = bottles_tab.rename(columns={"All": "row_totals"},
                                 index={"All": "col_totals"})

observed = bottles_tab.iloc[0:5, 0:3]   # Get table without totals for later use
bottles_tab
```

Executing the code above displays the contingency table. It shows the different types of bottles (category) as the index, counted by the month in which they were recycled (January, February, or March). The row_totals column and the col_totals row give the total for each bottle type and for each month, respectively.

## Implementing the Chi-square test

Let’s implement the Chi-square test and check whether the two columns are independent. The formula discussed above needs two things: observed values and expected values. The observed values are simply the counts in the original data, so what remains is to calculate the expected values.

How to calculate expected values?

To get the expected count for a cell, multiply its row total by its column total and divide by the total number of observations. To calculate all the expected counts at once in Python, we can apply the np.outer() function to the row totals and column totals, and then divide the result by the number of observations.

Now let us write this in Python.

```
expected = np.outer(bottles_tab["row_totals"].iloc[0:5],
                    bottles_tab.loc["col_totals"].iloc[0:3]) / 1000

# Reuse the labels of the observed table so the two tables align cell by cell
expected = pd.DataFrame(expected,
                        index=observed.index,
                        columns=observed.columns)

expected
```
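To make the expected-count calculation concrete, here is a minimal sketch on a small made-up table (not the bottle data), cross-checked against SciPy's `expected_freq` helper. The variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Small illustrative table of observed counts (rows x columns)
toy_observed = np.array([[10, 20],
                         [30, 40]])

row_totals = toy_observed.sum(axis=1)   # [30, 70]
col_totals = toy_observed.sum(axis=0)   # [40, 60]
grand_total = toy_observed.sum()        # 100

# Expected count for each cell = row total * column total / grand total
toy_expected = np.outer(row_totals, col_totals) / grand_total
print(toy_expected)
# [[12. 18.]
#  [28. 42.]]

# Cross-check against SciPy's helper for the same quantity
assert np.allclose(toy_expected, expected_freq(toy_observed))
```

Dividing by the grand total computed from the table, rather than a hard-coded number, keeps the calculation correct if the sample size changes.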

Here, we will write the formula in Python to calculate the chi-square statistic.

`chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()`
`print(chi_squared_stat)`

Note: We call .sum() twice, once to get the column sums and a second time to add the column sums together, returning the sum of the entire 2D table.

Next, we will calculate the critical value and the p-value, which help us decide whether to reject the null hypothesis.

```
from scipy.stats import chi2

critical_value = chi2.ppf(q=0.95,  # Find the critical value for 95% confidence
                          df=8)    # df = degrees of freedom

print("Critical value:", critical_value)

p_value = 1 - chi2.cdf(x=chi_squared_stat,  # Find the p-value
                       df=8)
print("P value:", p_value)
```

Note: The degrees of freedom for a test of independence equal (number of rows - 1) × (number of columns - 1), counting only the categories and not the totals. In this case we have a 5×3 table, so df = 4×2 = 8.
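The degrees of freedom can also be derived from the table shape instead of being typed in by hand. The sketch below assumes a 5×3 table as in the note above; the variable names are illustrative.

```python
from scipy.stats import chi2

n_rows, n_cols = 5, 3                  # shape of the table without totals
dof = (n_rows - 1) * (n_cols - 1)      # (5 - 1) * (3 - 1) = 8

alpha = 0.05
critical_value = chi2.ppf(1 - alpha, df=dof)
print(dof, round(critical_value, 3))   # 8 15.507
```

A chi-square statistic above this critical value leads to rejecting the null hypothesis at the 5% level, which is the same decision as a p-value below 0.05.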

We have now seen how the chi-square test works step by step. The whole procedure can also be run with a single line of code, given below, without doing all of the above steps:

`import scipy.stats`
`scipy.stats.chi2_contingency(observed=observed)`

Finally, we get a p-value of 0.51847, which is greater than 0.05. Therefore, we fail to reject the null hypothesis that there is no relationship between the features. The test does not detect a significant relationship between the variables.
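To see that the one-line call agrees with the manual steps, the sketch below runs both routes on a small made-up 3×2 table (not the bottle data) and compares them. The table and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Illustrative 3x2 table of observed counts (not the bottle data)
table = np.array([[12, 18],
                  [20, 15],
                  [8, 27]])

# Manual route: expected counts, statistic, degrees of freedom, p-value
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
manual_stat = ((table - expected) ** 2 / expected).sum()
manual_dof = (table.shape[0] - 1) * (table.shape[1] - 1)
manual_p = chi2.sf(manual_stat, df=manual_dof)   # sf(x) = 1 - cdf(x)

# One-line route: returns statistic, p-value, dof, and expected counts
stat, p_value, dof, scipy_expected = chi2_contingency(table)

assert np.isclose(manual_stat, stat)
assert np.isclose(manual_p, p_value)
assert manual_dof == dof

alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Using `chi2.sf` rather than `1 - chi2.cdf` gives the same p-value and is more accurate when the p-value is very small.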

## Yates’ Correction

Pearson’s chi-square statistic approximates a discrete distribution of counts with the continuous chi-square distribution, which can overstate significance when counts are small. Frank Yates proposed an adjustment for this, known as Yates’ correction or Yates’ chi-square. Yates’ chi-square formula is given as:

$$\chi^2_{\text{Yates}} = \sum_{i=1}^{N} \frac{(|O_i - E_i| - 0.5)^2}{E_i}$$

$O_i$ = an observed frequency

$E_i$ = an expected (theoretical) frequency

$N$ = number of distinct events

Yates corrected Pearson’s chi-square formula by subtracting 0.5 from the absolute difference between each observed and expected value; the rest is the same as before.

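To use Yates’ chi-square test, we can write the following line of code. Note that SciPy applies the correction only when the table has one degree of freedom (a 2×2 table), and that correction=True is the default:

`scipy.stats.chi2_contingency(observed, correction=True)  # correction=True applies Yates' correction`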
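The effect of the correction can be checked on a small 2×2 table. The sketch below uses made-up counts and illustrative variable names; it computes the corrected statistic by hand, compares it with SciPy, and shows that a larger table is unaffected.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative 2x2 table (df = 1), where SciPy applies the correction
table_2x2 = np.array([[12, 5],
                      [7, 9]])

expected = np.outer(table_2x2.sum(axis=1), table_2x2.sum(axis=0)) / table_2x2.sum()

# Yates' statistic: shrink each |O - E| by 0.5 before squaring
yates_stat = ((np.abs(table_2x2 - expected) - 0.5) ** 2 / expected).sum()

corrected_stat, _, _, _ = chi2_contingency(table_2x2, correction=True)
plain_stat, _, _, _ = chi2_contingency(table_2x2, correction=False)

assert np.isclose(yates_stat, corrected_stat)
assert corrected_stat < plain_stat   # the correction lowers the statistic

# On a table with more than one degree of freedom the flag changes nothing
table_3x2 = np.array([[12, 18],
                      [20, 15],
                      [8, 27]])
with_flag, _, _, _ = chi2_contingency(table_3x2, correction=True)
without_flag, _, _, _ = chi2_contingency(table_3x2, correction=False)
assert np.isclose(with_flag, without_flag)

print(corrected_stat, plain_stat)
```

Because the correction makes the test more conservative, it matters most for small 2×2 tables; for a 5×3 table like the one in this article, the result is the same with or without it.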

## Conclusion

In this article, we conducted a Chi-square test of independence to analyze the relationship between categorical variables in a dataset. We generated the data from scratch, computed the test step by step, and then ran it with SciPy. Along the way, we saw how to implement the test in practice and how to draw inferences from its result.

Sourabh has worked as a full-time data scientist for an ISP organisation, experienced in analysing patterns and their implementation in product development. He has a keen interest in developing solutions for real-time problems with the help of data both in this universe and metaverse.
