Extracting useful insights from raw data is vital to a successful machine learning project. Choosing the right machine learning algorithm and tuning the model parameters for better performance are possible only with proper data analysis at the pre-processing stage. Traditional statistical analysis is a simple and powerful way to extract the essence of raw data.

Statistical software packages make this analysis quick and reliable. Python, a popular general-purpose language, has a rich collection of libraries and modules for statistical analysis. In this article, we discuss a widely used statistical tool called ANOVA with hands-on Python code.

ANOVA is a statistical tool that helps determine whether the means of two or more data samples differ significantly. Consider a scenario: we have different samples collected independently from the same dataset for cross-validation, and we wish to know whether the means of the collected samples differ significantly. Another scenario: we have developed three different machine learning models and obtained a set of results, and we wish to know whether the models' performance differs significantly. There are many such scenarios in practical applications where we may need ANOVA as part of data analysis.

ANOVA is the acronym for Analysis of Variance. It compares the variation among different groups with the variation within those groups of a dataset (technically termed a population). However, the data must satisfy some assumptions before ANOVA can be used. They are as follows:

- The data in each group follow a normal distribution.
- The variance of the data is the same for all groups.
- The data in different groups are independent of each other.
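The first two assumptions can be checked before running ANOVA. The following is a minimal sketch, assuming SciPy's Shapiro-Wilk test for normality and Levene's test for equal variances are acceptable checks; the groups, the seed and the variable names are hypothetical and only mirror the synthetic data used later in this article.

```python
import numpy as np
from scipy import stats

# Seeded generator so the sketch is repeatable (the seed value is arbitrary).
rng = np.random.default_rng(0)

# Hypothetical groups: four methods, common standard deviation of 3, 10 points each.
groups = [rng.normal(mean, 3, 10) for mean in (10, 11, 12, 13)]

# Normality: Shapiro-Wilk test per group (null hypothesis: the group is normal).
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

# Equal variances: Levene's test across groups (null hypothesis: equal variances).
levene_stat, levene_p = stats.levene(*groups)

for i, p in enumerate(shapiro_p, start=1):
    print(f"group {i}: Shapiro-Wilk p-value = {p:.3f}")
print(f"Levene p-value = {levene_p:.3f}")
```

A p-value above the chosen level of significance in these tests gives no evidence against the corresponding assumption. Independence cannot be tested this way; it has to come from how the data were collected.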

The math behind ANOVA and its usage can be explored with the following hands-on Python example.

## Comparing Means using ANOVA

Import the necessary libraries to create the environment.

    # import libraries
    import numpy as np
    import pandas as pd
    import scipy.stats  # loads the stats submodule used later as scipy.stats.f
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    from matplotlib import pyplot as plt
    import seaborn as sns

Generate some normally distributed synthetic data using NumPy’s random module. While generating the data, we keep the standard deviation the same for all methods so that the equal-variance assumption holds.

    # numpy random normal arguments: (mean, std_dev, size)
    method_1 = np.random.normal(10,3,10)
    method_2 = np.random.normal(11,3,10)
    method_3 = np.random.normal(12,3,10)
    method_4 = np.random.normal(13,3,10)

    # build a pandas dataframe
    data = pd.DataFrame({'method_1':method_1,
                         'method_2':method_2,
                         'method_3':method_3,
                         'method_4':method_4})
    data.head()


Before proceeding further into ANOVA, we should establish a null hypothesis. Whenever we cannot make a decision from the raw numbers alone, we turn to hypothesis testing, and ANOVA is a hypothesis test. Our null hypothesis (common to most ANOVA problems) can be expressed as:

The means of all four methods are the same.

We know that the underlying means are not the same: we set 10, 11, 12 and 13 as the means of the four methods while generating the data. But from a statistical point of view, we make decisions at some level of significance. We set the most common level of significance, 0.05 (i.e. a 5% risk of rejecting the null hypothesis when it is actually true).

In other words, a level of significance of zero would amount to a purely mathematical decision in which no error is permitted. In our case, we could reject the null hypothesis without any analysis, because we know that the means differ from each other. However, with random variation and many other factors affecting real data, we should leave some room to treat small deviations among groups as statistically insignificant.
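To make the 5% risk concrete, here is a small simulation sketch. It assumes SciPy's `f_oneway` as a stand-in for the ANOVA test and uses hypothetical settings (2000 trials, an arbitrary seed). Every group is drawn with the same true mean, so each rejection is a false one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed, only for repeatability
alpha = 0.05
n_trials = 2000

false_rejections = 0
for trial in range(n_trials):
    # All four groups share the same true mean, so the null hypothesis holds.
    groups = [rng.normal(10, 3, 10) for _ in range(4)]
    _, p_value = stats.f_oneway(*groups)
    if p_value <= alpha:
        false_rejections += 1

false_positive_rate = false_rejections / n_trials
print(f"Share of false rejections: {false_positive_rate:.3f}")
```

The printed share should land close to the chosen level of significance, which is what a 5% risk of rejecting a true null hypothesis means in practice.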

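ANOVA uses an F-test. The F-statistic is the ratio of the variance between groups to the variance within groups. If the p-value of the F-statistic (the probability of observing an F-statistic at least this large when the null hypothesis is true) is less than or equal to the level of significance (0.05 here), we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis (informally, we accept it).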
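The F-test above can be reproduced by hand. The sketch below computes the F-statistic as the between-group mean square divided by the within-group mean square and cross-checks it against SciPy's `f_oneway`. The data, the seed and the variable names are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
groups = [rng.normal(mean, 3, 10) for mean in (10, 11, 12, 13)]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sum of squares between groups: spread of the group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Sum of squares within groups: spread of each value around its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1        # numerator degrees of freedom
df_within = n_total - k   # denominator degrees of freedom

f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)  # upper-tail probability

# Cross-check with SciPy's one-way ANOVA.
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.4f}, SciPy F = {f_scipy:.4f}")
print(f"manual p = {p_manual:.4f}, SciPy p = {p_scipy:.4f}")
```

With 4 groups of 10 observations, the degrees of freedom come out as 3 and 36.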

Reshape the data frame so that all values sit in a single column, using pandas’ melt function.

    df = pd.melt(data, value_vars=['method_1', 'method_2', 'method_3', 'method_4'])
    df.columns = ['treatment', 'value']  # treatment refers to the method
    df.sample(10)


Develop an Ordinary Least Squares model with the melted data.

    model = ols('value~C(treatment)', data=df).fit()
    model.summary()


We can already draw a conclusion at this step. The p-value, reported as Prob (F-statistic) in the summary, is 0.135, which is greater than 0.05. Hence, we fail to reject the null hypothesis. In other words, the means of the four methods do not differ significantly. However, an ANOVA table presents the result more clearly. Obtain the ANOVA table with the following code.

    anova = sm.stats.anova_lm(model, typ=1)
    anova


Note that the terms groups and methods are used interchangeably in this example.

We reached the conclusion based on the p-value. However, we can also arrive at it based on the F-statistic. With 4 groups and 40 observations, the numerator and denominator degrees of freedom are 4 - 1 = 3 and 40 - 4 = 36. We can calculate the critical value of the F-statistic with the following code.

    # arguments: f(numerator degrees of freedom, denominator degrees of freedom)
    # arguments: ppf(1 - level of significance)
    scipy.stats.f(3,36).ppf(0.95)


If the observed F-statistic is greater than or equal to its critical value, we reject the null hypothesis. Otherwise, we fail to reject it. Here the observed value, 1.975314, is less than the critical value, 2.86626. Therefore, we fail to reject the null hypothesis.
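The two decision rules, one based on the p-value and one based on the critical value, always agree. The sketch below illustrates this with the observed F-statistic quoted above, assuming degrees of freedom of 3 and 36; the variable names are hypothetical.

```python
from scipy import stats

alpha = 0.05
df_between, df_within = 3, 36
f_observed = 1.975314  # observed F-statistic quoted in the text

# Critical value: the point that cuts off the upper 5% of the F distribution.
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
# p-value: probability of an F-statistic at least as large as the observed one.
p_value = stats.f.sf(f_observed, df_between, df_within)

reject_by_f = f_observed >= f_critical
reject_by_p = p_value <= alpha

print(f"critical value = {f_critical:.5f}, p-value = {p_value:.3f}")
print(f"reject by F: {reject_by_f}, reject by p-value: {reject_by_p}")
```

Both rules lead to the same decision because the p-value falls below the level of significance exactly when the observed F-statistic exceeds the critical value.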

We can visualize the actual data to understand it better.

    sns.set_style('darkgrid')
    data.plot()
    plt.xlabel('Data points')
    plt.ylabel('Data value')
    plt.show()


We can see a great deal of overlap among the groups. This is exactly why we cannot draw conclusions by simply inspecting the numbers. Statistical tools help make sound business decisions in such tough scenarios.

How do the means vary among the groups? Let’s visualize that too.

    data.mean(axis=0).plot(kind='bar')
    plt.xlabel('Methods')
    plt.ylabel('Mean value')
    plt.show()


Though we can see some differences in the mean values by eye, the statistics say those differences are not significant!

## A Major Limitation of ANOVA

There is a big problem with the ANOVA method when we reject the null hypothesis. Let’s study that with some code examples. Increase the mean value of method_4 from 13 to 15.

    # Alter the mean value of method_4
    method_1 = np.random.normal(10,3,10)
    method_2 = np.random.normal(11,3,10)
    method_3 = np.random.normal(12,3,10)
    method_4 = np.random.normal(15,3,10)

    data = pd.DataFrame({'method_1':method_1,
                         'method_2':method_2,
                         'method_3':method_3,
                         'method_4':method_4})
    data.head()


Melt the data to have single-columned values.

    df = pd.melt(data, value_vars=['method_1', 'method_2', 'method_3', 'method_4'])
    df.columns = ['treatment', 'value']
    df.sample(10)


Develop the Ordinary Least Squares model.

    model = ols('value~C(treatment)', data=df).fit()
    model.summary()


Obtain the ANOVA table.

    anova = sm.stats.anova_lm(model, typ=1)
    anova


Since the p-value is less than the level of significance, 0.05, we reject the null hypothesis. It means that at least one mean differs from the others. But ANOVA alone cannot identify the method or methods whose means differ. This is where ANOVA needs other methods to shed light on its decision.

This issue can be tackled with the help of Post Hoc Analysis.

## Post Hoc Analysis

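Post hoc analysis refers to the follow-up comparisons made after an ANOVA test. A widely used post hoc method is Tukey's test, also known as Tukey's HSD (honestly significant difference) test or the Tukey-Kramer method, which is a multiple-comparison test. Whenever we reject the null hypothesis in an ANOVA test, we explore individual comparisons among the mean values of different groups (methods) using post hoc analysis.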

Import the necessary module from the statsmodels library.

    from statsmodels.stats.multicomp import MultiComparison

    comparison = MultiComparison(df['value'], df['treatment'])
    tukey = comparison.tukeyhsd(0.05)  # 0.05 is the level of significance
    tukey.summary()


This method compares the means of every possible pair of groups while controlling the overall error rate across all the comparisons. It yields an individual decision with a p-value for each pair.

Here, we fail to reject the null hypothesis (the means do not differ significantly) for the pairs:

- method_1 and method_2
- method_1 and method_3
- method_2 and method_3

On the other hand, the null hypothesis is rejected (the means differ significantly) for the pairs:

- method_1 and method_4
- method_2 and method_4
- method_3 and method_4

Hence, we can conclude that the means of methods 1, 2 and 3 do not differ significantly from one another, while method 4 differs from them all.
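The pairwise decisions can also be collected programmatically. The sketch below is an alternative to the statsmodels call used above: it assumes SciPy 1.8 or later, which provides `scipy.stats.tukey_hsd`, and uses hypothetical seeded data, so its decisions need not match the ones listed here.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # arbitrary seed
names = ["method_1", "method_2", "method_3", "method_4"]
samples = [rng.normal(mean, 3, 10) for mean in (10, 11, 12, 15)]

# Tukey's HSD test on all groups at once; result.pvalue is a 4 x 4 matrix
# holding one p-value per pair of groups.
result = stats.tukey_hsd(*samples)

alpha = 0.05
decisions = {}
for i, j in combinations(range(len(names)), 2):
    p = result.pvalue[i, j]
    decisions[(names[i], names[j])] = bool(p <= alpha)  # True means reject
    print(f"{names[i]} vs {names[j]}: p = {p:.3f}, reject = {p <= alpha}")
```

Four groups give six pairs, so the dictionary holds six decisions, one per comparison.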

Note: We have generated data with NumPy’s random module without any seed value. Hence, the values and results in these examples are not reproducible.
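If reproducible results are needed, the data can be generated from a seeded random generator. The sketch below assumes NumPy's `default_rng` interface, which differs from the `np.random.normal` calls used in this article; the function name `make_data` and the seed value are hypothetical.

```python
import numpy as np
import pandas as pd


def make_data(seed):
    """Build the four-method data frame from a seeded random generator."""
    rng = np.random.default_rng(seed)
    means = {"method_1": 10, "method_2": 11, "method_3": 12, "method_4": 15}
    # Same standard deviation (3) and sample size (10) for every method.
    return pd.DataFrame({name: rng.normal(mean, 3, 10) for name, mean in means.items()})


first = make_data(42)
second = make_data(42)

# The same seed yields the same data, so every later result can be repeated.
print(first.equals(second))  # prints True
```

Calling `np.random.seed` once before the original `np.random.normal` calls would have a similar effect with the older interface.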

This Colab Notebook has the above code implementation.

## Wrapping up

In this article, we discussed the importance of statistical tools, especially ANOVA. We discussed the concepts of ANOVA with hands-on Python code. We also studied a limitation of ANOVA and the post hoc analysis that overcomes it. Now, it is your turn to perform ANOVA with the raw data in your hand!