
Complete Guide to Goodness-of-Fit Test using Python


A good Data Scientist knows how to handle raw data correctly and never makes improper assumptions while performing data analytics or machine learning modeling; this discipline is one of the traits that sets a successful Data Scientist apart. For instance, the ANOVA test begins with an assumption that the data is normally distributed, and Maximum Likelihood Estimation makes an a-priori assumption about the data distribution before finding the most likely parameters. What if these assumptions about the data distribution are incorrect? We might jump to wrong conclusions and take further data analysis or machine learning modeling in the wrong direction. Faced with unexpected results, we might try to fine-tune the hyper-parameters of the model to improve performance, while the real mistake lies in the assumed data distribution.

One of the traditional statistical approaches, the Goodness-of-Fit test, offers a way to validate our theoretical assumptions about data distributions. This article discusses the Goodness-of-Fit test for some common data distributions using Python code. Let’s dive in with examples.


Import necessary libraries and modules to create the Python environment.

 # create the environment
 import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt
 import seaborn as sns
 from scipy import stats  

Uniform Distribution

Let us assume we have a die in our hand. A die has six faces and six distinct possible outcomes, ranging from 1 to 6, per toss. An unbiased die has equal probabilities for all possible outcomes. To check whether the die in our hand is unbiased, we toss it 90 times (a larger number of trials makes the test more reliable) and note down the counts of the outcomes.

 path = 'https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/Tabular/uniform_dice.csv'
 dice = pd.read_csv(path)
 dice 

Output:

[Table: the six die faces with their observed frequencies]

Since each face of the die is assumed to have equal probability, the outcomes must be uniformly distributed. Hence we can express the null hypothesis, at a 5% level of significance, as follows:

The die is unbiased and its outcomes follow a uniform distribution.

Under an ideal uniform distribution, the expected frequencies are derived by giving equal weightage to each outcome.

 # Total frequency
 total_freq = dice['observed'].sum()
 print('Total Frequency : ', total_freq)
 # Expected frequency
 expected_freq = total_freq / 6
 print('Expected Frequency : ', expected_freq) 

Output:

 Total Frequency :  90
 Expected Frequency :  15.0

 # build up dataframe with expected frequency
 dice['expected'] = expected_freq
 dice 

Output:

Let us visualize the data distribution.

 sns.set_style('darkgrid')
 plt.figure(figsize = (6,6))

 # plot observed frequency
 plt.subplot(211)
 plt.bar(dice['face'], dice['observed'])
 plt.ylabel('Observed Frequency')
 plt.ylim([0,20])
 # plot expected frequency
 plt.subplot(212)
 plt.bar(dice['face'], dice['expected'])
 plt.ylabel('Expected Frequency')
 plt.xlabel('Face of dice')
 plt.ylim([0,20])
 plt.show() 

Output:

[Bar charts: observed (top) vs. expected (bottom) frequency for each die face]

It is the right time for us to discuss how the Goodness-of-Fit test works. Under ideal conditions, the observed frequencies should be identical to the expected frequencies. In practice, the observed frequencies differ a little from the expected ones. The Goodness-of-Fit test evaluates whether this deviation is acceptably small. In other words, it tests how well the observed data fit the expected distribution.

This closeness in fit (goodness-of-fit) is measured with a statistic called Chi-Square. Mathematically, it is expressed as:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the observed frequency and Eᵢ is the expected frequency of the i-th outcome.

The greater the deviation between the observed and expected frequencies, the larger the Chi-Square value; if the observed frequencies match the expected frequencies exactly, it is zero. Therefore, a value close to zero denotes a closer fit. For instance, if a face is observed 18 times while 15 are expected, that cell contributes (18 − 15)²/15 = 0.6 to the statistic.

We can define a helper function to calculate the Chi-Square value.

 # a helper function to calculate the Chi-Square value
 def Chi_Square(obs_freq, exp_freq):
   count = len(obs_freq)
   chi_sq = 0
  for i in range(count):
     x = (obs_freq[i] - exp_freq[i]) ** 2
     x = x / exp_freq[i]
     chi_sq += x
   return chi_sq 

The Chi-Square value for our example is calculated as follows.

 # calculate using the helper function
 Chi_Square(dice['observed'], dice['expected']) 

Output:

2.8

Note that SciPy’s stats module can calculate the same value, as shown below.

 # calculate using the stats module of SciPy library
 stats.chisquare(dice['observed'], dice['expected']) 

Output:

[Output: statistic = 2.8, p-value ≈ 0.73]

To conclude on the null hypothesis, we have to compare the calculated Chi-Square value with the critical Chi-Square value. The critical Chi-Square value can be obtained from SciPy’s stats module, whose percent point function takes as arguments (1 − level of significance, degrees of freedom). The degrees of freedom (DOF) for Chi-Square are calculated as:

DOF = Number of outcomes - p - 1

Here, p refers to the number of parameters of the distribution that are estimated from the data. For the uniform distribution, p = 0; for the Poisson distribution, p = 1; for the normal distribution, p = 2 (mean and standard deviation).

The critical Chi-Square value is determined using the code below.

 # critical Chi-Square - percent point function
 p = 0
 DOF = len(dice['observed']) - p - 1
 stats.chi2.ppf(0.95, DOF) 

Output:

11.07

If the calculated Chi-Square value is more than or equal to the critical value, the null hypothesis should be rejected. On the other hand, if the calculated Chi-Square value is less than the critical value, the null hypothesis should not be rejected.

Here, for our problem, the calculated value of 2.8 is much less than the critical value of 11.07. Hence, we cannot reject the null hypothesis, i.e., the observed outcomes are consistent with a uniform distribution.
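
Equivalently, the decision can be automated with the p-value returned by stats.chisquare. Below is a minimal sketch, assuming the same 5% level of significance used above:

 # reject the null hypothesis when the p-value falls below alpha
 statistic, pvalue = stats.chisquare(dice['observed'], dice['expected'])
 alpha = 0.05
 if pvalue < alpha:
   print('Reject the null hypothesis')
 else:
   print('Fail to reject the null hypothesis')

Both roads lead to the same conclusion here: a p-value of about 0.73 is far above 0.05, so the null hypothesis stands.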

An important condition imposed by the Goodness-of-Fit test is that the expected frequency of every outcome must be at least 5. If any outcome has an expected frequency less than 5, it should be combined (added) with an adjacent outcome so that the pooled expected frequency reaches that threshold, as sketched below.
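
As an illustration, the following sketch pools a sparse category into its neighbour before running the test. The counts here are hypothetical and are not taken from the datasets used in this article:

 # hypothetical observed and expected counts; category 1 has an expected frequency below 5
 obs = np.array([12, 3, 16, 14, 15, 30])
 exp = np.array([13, 4, 15, 14, 14, 30])
 # pool category 1 with its neighbour (category 2) in both arrays
 obs_pooled = np.concatenate(([obs[0], obs[1] + obs[2]], obs[3:]))
 exp_pooled = np.concatenate(([exp[0], exp[1] + exp[2]], exp[3:]))
 # note: pooling reduces the number of categories and hence the degrees of freedom
 stats.chisquare(obs_pooled, exp_pooled)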

Normal Distribution

A bulb manufacturer wants to know whether the life of its bulbs follows the normal distribution. Forty bulbs are randomly sampled, and their lifetimes, in months, are recorded.

 path = 'https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/Tabular/bulb_life.csv'
 data = pd.read_csv(path)
 data.head(10) 

Output:

We can visualize the data using Seaborn’s histplot method.

 sns.histplot(data=data, x='life', bins=8)
 plt.show() 

Output:

[Histogram of the observed bulb life, with 8 bins]

With the bare eye, we cannot be sure that the data is normally distributed. We know that a random variable that follows a normal distribution is continuous. Hence, we can define bin intervals such that each bin has an expected frequency of at least five. Here, there are 40 sample bulbs; to have five expected samples in each bin, we should have exactly 40/5 = 8 bins in total.

Find the bin intervals that give five expected frequencies per bin.

 # mean and standard deviation of the given data
 mean = np.mean(data['life'])
 std = np.std(data['life'])
 bins = 8
 interval = []
 for i in range(1, bins + 1):
   # the i-th boundary is the (i/8)-quantile of the fitted normal
   val = stats.norm.ppf(i/bins, mean, std)
   interval.append(val)
 interval

Output:

A normal distribution ranges from negative infinity to positive infinity. The last boundary computed above, stats.norm.ppf(1.0), is already positive infinity, so we only need to insert negative infinity at the start of the list.

 interval.insert(0, -np.inf)
 interval 

Output:

[Output: list of nine bin boundaries, from -inf to +inf]

To calculate the observed frequencies, we can simply count the number of outcomes that fall within each interval. First, create a data frame with the 8 intervals, as below.

 df = pd.DataFrame({'lower_limit':interval[:-1], 'upper_limit':interval[1:]})
 df 

Output:

[Table: lower_limit and upper_limit for each of the eight bins]

Create two columns, one for the observed and one for the expected frequency. Use Pandas’ apply method to count the observed outcomes falling within each interval.

 life_values = sorted(data['life'])
 df['obs_freq'] = df.apply(lambda x:sum([i>x['lower_limit'] and i<=x['upper_limit'] for i in life_values]), axis=1)
 df['exp_freq'] = 5
 df 

Output:

[Table: the eight bins with their obs_freq and exp_freq columns]
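
As a side note, the same counting can be done more compactly with Pandas’ cut method; a minimal equivalent sketch:

 # bin the lifetimes into the same intervals and count outcomes per bin
 counts = pd.cut(data['life'], bins=interval).value_counts(sort=False)
 counts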

We are now ready to perform the Goodness-of-Fit test. We can state our null hypothesis at a 5% level of significance as:

The bulb life follows normal distribution.

Calculate the actual Chi-Square value using the chisquare method available in SciPy’s stats module.

 stats.chisquare(df['obs_freq'], df['exp_freq'])

Output:

[Output: statistic = 6.4, p-value ≈ 0.49]
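
Note that stats.chisquare computes its p-value with k − 1 degrees of freedom by default (7 here, for 8 bins). Since the mean and standard deviation were estimated from the data, the ddof argument can account for the two estimated parameters so that the p-value is based on 8 − 2 − 1 = 5 degrees of freedom:

 # delta degrees of freedom: 2 parameters estimated from the data
 stats.chisquare(df['obs_freq'], df['exp_freq'], ddof=2)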

Calculate the critical Chi-Square value using the chi2.ppf method available in SciPy’s stats module.

 p = 2    # number of parameters estimated (mean and standard deviation)
 DOF = len(df['obs_freq']) - p - 1
 stats.chi2.ppf(0.95, DOF)

Output:

11.07

It is observed that the calculated Chi-Square value of 6.4 is less than the critical value of 11.07. Hence, the null hypothesis cannot be rejected; in other words, the lifetimes of the bulbs are consistent with a normal distribution.

Find the Colab Notebook with the above code implementation here.

Find the CSV datasets used above here.

Wrapping Up

The Goodness-of-Fit test is a handy approach for arriving at a statistical decision about a data distribution. It can be applied to any kind of distribution and random variable, whether continuous or discrete. This article discussed two practical examples from two different distributions; in both cases, the assumed distribution was supported by the Goodness-of-Fit test. If the test rejects the assumption, the assumed distribution should be revised suitably and the Goodness-of-Fit test run again.

It is your turn to find the true distribution of your data!
