As data science professionals, we have all heard the buzzword hypothesis testing. A hypothesis is an assumption we make about a population parameter, and hypothesis testing is the procedure for checking that assumption against sample data. We should know which fundamental test to use for a given statistical analysis.
This article looks at the conditions under which a Z-test or a T-test is appropriate, and then implements these tests in Python.
The dataset is downloaded from here.
Null Hypothesis: The population mean is equal to the hypothesized value μ
Alternate Hypothesis: The population mean is not equal to the hypothesized value μ
Using the formula below, we can calculate the z-statistic:
z = (x̄ - μ) / (σ / √n)
x̄ = sample mean
μ = hypothesized population mean
σ = standard deviation of the population
n = sample size
σ / √n = standard error of the mean
If the p-value is lower than 0.05, reject the null hypothesis; otherwise, fail to reject it.
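The formula and the decision rule above can be sketched directly in Python. This is a minimal illustration, not the article's original code: the function name one_sample_z, the readings, and the population standard deviation of 5.0 are all hypothetical values chosen for the example.

```python
import numpy as np
from scipy import stats


def one_sample_z(sample, mu0, sigma):
    """Return (z, two-sided p) for H0: population mean == mu0, with known sigma."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    se = sigma / np.sqrt(n)            # standard error of the mean
    z = (sample.mean() - mu0) / se     # z = (x_bar - mu) / (sigma / sqrt(n))
    p = 2 * stats.norm.sf(abs(z))      # two-sided p-value from the standard normal
    return z, p


# Hypothetical blood pressure readings (illustrative only)
readings = [150, 160, 155, 158, 162, 149, 157, 161, 153]
z, p = one_sample_z(readings, mu0=156, sigma=5.0)
print(f"z = {z:.3f}, p = {p:.3f}")
if p < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
```

Here the sample mean is very close to the hypothesized value, so the p-value is large and the null hypothesis is not rejected.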
One-Sample Z test
Let’s take a hypothesized population mean of 156 for this blood pressure dataset.
Null Hypothesis: There is no difference in the mean
Alternate Hypothesis: Means are different
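import pandas as pd
from statsmodels.stats import weightstats as stests

# Assumption: the blood pressure data is the "ztest.csv" file that the
# paired t-test code later in this article loads (columns bp_before, bp_after)
df = pd.read_csv("ztest.csv")

# One-sample z-test of H0: the mean of bp_before equals 156
ztest, pval = stests.ztest(df['bp_before'], x2=None, value=156)
print(float(pval))
if pval < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
From the above result, the p-value is greater than 0.05, so we fail to reject the null hypothesis.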
Two Sample Z-test
H0: mean of two samples is the same
H1: mean of two samples is not the same
# Two-sample z-test of H0: the difference between the two means is 0
ztest, pval1 = stests.ztest(df['bp_before'], x2=df['bp_after'], value=0, alternative='two-sided')
print(float(pval1))
if pval1 < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
The p-value is less than 0.05, so the null hypothesis is rejected. There is a significant difference between the means of the two groups.
The dataset can be downloaded from here.
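The T-test is used to compare the means of two given groups, or to compare a sample mean against a known value. It assumes the sample comes from an approximately Gaussian distribution, and it is used when population parameters such as the standard deviation are not known.
For two independent samples with an assumed common variance, we can calculate the t-statistic with the formula
t = (x̄1 - x̄2) / (sp · √(1/n1 + 1/n2))
x̄1 = sample 1 mean
x̄2 = sample 2 mean
n1 = sample 1 size
n2 = sample 2 size
sp = pooled standard deviation, where sp² = ((n1 - 1)·s1² + (n2 - 1)·s2²) / (n1 + n2 - 2) and s1, s2 are the sample standard deviations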
The masses of a sample of n = 20 acorns are m = 8.8, 6.6, 9.5, 11.2, 10.2, 7.4, 8.0, 9.6, 9.9, 9.0, 7.6, 7.4, 10.4, 11.1, 8.5, 10.0, 11.6, 10.7, 10.3, and 7.0 g. We need to check whether the average mass of this sample differs from the average mass of all acorns, μ = 10.0 g.
Null Hypothesis: x̄ – μ = 0, that is there is no significant difference.
Alternate Hypothesis: x̄ – μ ≠ 0 (two-sided test)
t-critical for alpha = 0.05 (two-sided, n - 1 = 19 degrees of freedom): t* = 2.093
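The critical value above does not have to be read from a table. As a small sketch, assuming a two-sided test at alpha = 0.05, it can be computed with scipy's t distribution, and compared with the corresponding z critical value to see why the choice between the two tests matters for small samples:

```python
from scipy import stats

alpha = 0.05          # significance level (two-sided test assumed)
n = 20                # sample size of the acorn example
dof = n - 1           # degrees of freedom for a one-sample t-test

# Upper critical values: reject H0 when |statistic| exceeds them
t_star = stats.t.ppf(1 - alpha / 2, dof)
z_star = stats.norm.ppf(1 - alpha / 2)

print("t* =", round(t_star, 3))
print("z* =", round(z_star, 3))
```

The t critical value is larger than the z critical value because the t distribution has heavier tails when the standard deviation is estimated from a small sample.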
import numpy as np
import scipy.stats as stats

# One-sample t-test
x = [8.8, 6.6, 9.5, 11.2, 10.2, 7.4, 8.0, 9.6, 9.9, 9.0, 7.6, 7.4, 10.4, 11.1, 8.5, 10.0, 11.6, 10.7, 10.3, 7.0]
mu = 10
t_critical = 2.093
x_bar = np.mean(x)
s = np.std(x, ddof=1)  # ddof=1 divides by N - 1, giving the sample standard deviation
N = len(x)
SE = s / np.sqrt(N)    # standard error of the mean
t = (x_bar - mu) / SE
print("t-statistic: ", t)

# A one-sample t-test that gives you the p-value too can be done with scipy as follows:
t, p = stats.ttest_1samp(x, mu)
print("t = ", t, ", p = ", p)
if p < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
The p-value is less than 0.05, so we reject the null hypothesis. There is a statistically significant difference between the sample mean and the population mean of 10 g.
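To connect the hand-computed t-statistic with the p-value that scipy reports, the p-value can be derived from the t distribution directly. This is a sketch using the acorn masses from the example; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

masses = [8.8, 6.6, 9.5, 11.2, 10.2, 7.4, 8.0, 9.6, 9.9, 9.0,
          7.6, 7.4, 10.4, 11.1, 8.5, 10.0, 11.6, 10.7, 10.3, 7.0]
mu0 = 10.0
n = len(masses)

# t = (x_bar - mu) / (s / sqrt(n)), with s the sample standard deviation
se = np.std(masses, ddof=1) / np.sqrt(n)
t_manual = (np.mean(masses) - mu0) / se

# Two-sided p-value: probability of a |t| at least this large under H0
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# scipy's built-in one-sample t-test should agree
t_scipy, p_scipy = stats.ttest_1samp(masses, mu0)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

Comparing |t| with the critical value 2.093 and comparing p with 0.05 are two views of the same decision.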
The masses of N1 = 20 acorns from oak trees upwind and N2 = 30 acorns from oak trees downwind of the same coal power plant are measured.
Null Hypothesis: x̄1 = x̄2, or x̄2 – x̄1 = 0, that is, there is no difference between the sample means
Alternate Hypothesis: x̄2 < x̄1, or x̄2 – x̄1 < 0, that is, the downwind mean is smaller than the upwind mean (one-sided test)
import numpy as np
import scipy.stats as stats

# sample up wind
x1 = [10.8, 10.0, 8.2, 9.9, 11.6, 10.1, 11.3, 10.3, 10.7, 9.7, 7.8, 9.6, 9.7, 11.6, 10.3, 9.8, 12.3, 11.0, 10.4, 10.4]
# sample down wind
x2 = [7.8, 7.5, 9.5, 11.7, 8.1, 8.8, 8.8, 7.7, 9.7, 7.0, 9.0, 9.7, 11.3, 8.7, 8.8, 10.9, 10.3, 9.6, 8.4, 6.6, 7.2, 7.6, 11.5, 6.6, 8.6, 10.5, 8.4, 8.5, 10.2, 9.2]

# unequal sample sizes, assuming equal population variance
t_critical = 1.677  # one-sided, alpha = 0.05, 48 degrees of freedom
n1 = len(x1)
n2 = len(x2)
dof1 = n1 - 1
dof2 = n2 - 1
dof = dof1 + dof2
s1 = np.std(x1, ddof=1)
s2 = np.std(x2, ddof=1)
x1_bar = np.mean(x1)
x2_bar = np.mean(x2)
sp = np.sqrt((dof1 * s1**2 + dof2 * s2**2) / dof)  # pooled standard deviation
se = sp * np.sqrt(1 / n1 + 1 / n2)
t = (x2_bar - x1_bar) / se
print("t-statistic", t)

# a two-sample independent t-test is done with scipy as follows
# NOTE: the p-value given is two-sided, so the one-sided p-value is p/2
# (valid here because t has the sign predicted by the alternate hypothesis)
t, p_twosided = stats.ttest_ind(x2, x1, equal_var=True)
print("t = ", t, ", p_twosided = ", p_twosided, ", p_onesided =", p_twosided / 2)
The one-sided p-value is less than 0.05, so we reject the null hypothesis. There is a statistically significant difference between the means of the two samples.
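Instead of halving the two-sided p-value by hand, recent SciPy versions (1.6 and later) accept an alternative argument in ttest_ind. The sketch below assumes such a version is installed; it reuses the acorn data from the example and also reproduces the one-sided critical value for 48 degrees of freedom.

```python
from scipy import stats

upwind = [10.8, 10.0, 8.2, 9.9, 11.6, 10.1, 11.3, 10.3, 10.7, 9.7,
          7.8, 9.6, 9.7, 11.6, 10.3, 9.8, 12.3, 11.0, 10.4, 10.4]
downwind = [7.8, 7.5, 9.5, 11.7, 8.1, 8.8, 8.8, 7.7, 9.7, 7.0,
            9.0, 9.7, 11.3, 8.7, 8.8, 10.9, 10.3, 9.6, 8.4, 6.6,
            7.2, 7.6, 11.5, 6.6, 8.6, 10.5, 8.4, 8.5, 10.2, 9.2]

# H1: mean(downwind) < mean(upwind), pooled-variance (equal_var=True) test
t_stat, p_less = stats.ttest_ind(downwind, upwind, equal_var=True,
                                 alternative="less")

# Two-sided result for comparison
_, p_two = stats.ttest_ind(downwind, upwind, equal_var=True)

# Lower-tail critical value for alpha = 0.05
dof = len(upwind) + len(downwind) - 2
t_crit = stats.t.ppf(0.05, dof)

print("t =", t_stat, ", one-sided p =", p_less, ", two-sided p / 2 =", p_two / 2)
print("reject H0 when t <", t_crit)
```

Because the t-statistic is negative, as the alternate hypothesis predicts, the one-sided p-value equals half of the two-sided one.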
The paired sample t-test is also called a dependent sample t-test. Let’s take an example from a blood pressure dataset. We need to compare the mean blood pressure of the same individuals before and after treatment.
H0: The mean difference between the two samples is 0
H1: The mean difference between the two samples is not 0
import pandas as pd
from scipy import stats

df = pd.read_csv("ztest.csv")
print(df[['bp_before', 'bp_after']].describe())

# Paired (dependent) t-test on the before/after measurements
ttest, pval = stats.ttest_rel(df['bp_before'], df['bp_after'])
print(pval)
if pval < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
As the p-value is less than 0.05, we reject the null hypothesis: the mean values of the two groups are not the same.
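It may help to see why the paired test differs from the independent two-sample test: it is equivalent to a one-sample t-test on the per-individual differences against 0. The sketch below uses small hypothetical before/after readings, not the article's dataset, to show that both routes give the same result.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements for 8 individuals (illustrative only)
before = np.array([150, 160, 145, 155, 170, 165, 158, 149], dtype=float)
after = np.array([145, 155, 146, 150, 160, 162, 150, 147], dtype=float)

# Paired t-test on the two columns
t_rel, p_rel = stats.ttest_rel(before, after)

# Same test expressed as a one-sample t-test on the differences, H0: mean difference = 0
diff = before - after
t_one, p_one = stats.ttest_1samp(diff, 0.0)

print("paired:     t =", t_rel, ", p =", p_rel)
print("one-sample: t =", t_one, ", p =", p_one)
```

Pairing removes the variation between individuals, which is why it is the appropriate choice for before/after measurements on the same subjects.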
This post gave an overview of when to use the z-test and the t-test in statistical analysis. The analysis can be extended with other statistical tests such as ANOVA and the Chi-Square test. I hope this article has helped.
The complete code of the above implementation is available at the AIM’s GitHub repository. Please visit this link to find the notebook of this code.