Importance Of Hypothesis Testing In Data Science

Data Science has two parts: “Data” and “Science”. Each word has its own meaning, but when the two are combined, “Data” gains power. How does data gain power? Data alone is not interesting; it is the interpretation of the data and the insights drawn from it that make it valuable. How to get to those insights is the next question, and I would say statistics is the answer.

Data Science is one of the most sought-after careers of the current era. Even college students want to build their careers in it. On a similar note, it is fitting to quote the title of Thomas H. Davenport and D.J. Patil's Harvard Business Review article: “Data Scientist: The Sexiest Job of the 21st Century”.

With evolving technologies, automation tools and algorithms are now in place that make the creation of machine learning models fairly simple. Still, some fundamental concepts remain confusing, and Hypothesis Testing is one of them.

In this article, I am trying to clarify the concept of Hypothesis Testing and its importance in the world of Data Science.

Ronald Coase said “Torture the data, and it will confess to anything”. To get that confession in a disciplined way, Hypothesis Testing is used to interpret sample data and draw conclusions about the population. A Hypothesis Test helps in deciding which of two mutually exclusive statements about the population is better supported by the sample data.

Let’s dive into the terminology used in Hypothesis Testing.

Null Hypothesis (H0) – It is a statement that is commonly accepted or is considered to be the status quo. Under it, any observed effect is assumed to be due to chance. It is denoted by H0. If it is a test of means then we say that H0: µ1 = µ2, which states that there is no difference between the two population means.

Alternate Hypothesis (H1 or Ha) – The Null Hypothesis and the Alternate Hypothesis are mutually exclusive statements. So if the Null Hypothesis is the commonly accepted position, the Alternate Hypothesis is the claim that the sample data is being used to support. It is denoted by H1 or Ha. If it is a test of means then we say that H1: µ1 ≠ µ2, which states that there is a difference between the two population means.

  • Critical Region – The critical region is the set of values of the test statistic that leads to the rejection of the null hypothesis at a given level of significance.
  • One-Tailed Test – A one-tailed test is a statistical hypothesis test in which the critical region lies in only one tail of the distribution: either above a certain value or below it, but not both. For this test the alternate hypothesis is formulated as H1: µ1 > µ2 or H1: µ1 < µ2.
  • Two-Tailed Test – A two-tailed test is a statistical hypothesis test in which the critical region lies in both tails of the distribution. In a test of means, it tests whether the two population means are unequal, in either direction. For this test the alternate hypothesis is formulated as H1: µ1 ≠ µ2.

In either of the above two tests, if the test statistic computed from the sample falls in the critical region, then the null hypothesis is rejected and the alternate hypothesis is accepted in its place. The alternate hypothesis is then taken as the conclusion about the population, based on the sample data.
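The decision rule above can be sketched in a few lines of Python. This is only an illustration, assuming the test statistic follows a standard normal distribution (a z-test); the function names `critical_values` and `in_critical_region` are made up for this example and do not come from any library.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def critical_values(alpha, tail="two"):
    """Return the z value(s) that mark the boundary of the critical region."""
    if tail == "two":
        z = STD_NORMAL.inv_cdf(1 - alpha / 2)  # alpha is split across both tails
        return (-z, z)
    if tail == "right":  # H1: mu1 > mu2
        return (STD_NORMAL.inv_cdf(1 - alpha),)
    if tail == "left":  # H1: mu1 < mu2
        return (STD_NORMAL.inv_cdf(alpha),)
    raise ValueError("tail must be 'two', 'right' or 'left'")


def in_critical_region(z_stat, alpha=0.05, tail="two"):
    """True when the test statistic falls in the critical region (reject H0)."""
    bounds = critical_values(alpha, tail)
    if tail == "two":
        return z_stat <= bounds[0] or z_stat >= bounds[1]
    if tail == "right":
        return z_stat >= bounds[0]
    return z_stat <= bounds[0]


print(critical_values(0.05, "two"))
print(in_critical_region(1.7, tail="two"))
print(in_critical_region(1.7, tail="right"))
```

With α = 0.05 the two-tailed boundaries are roughly ±1.96 and the right-tailed boundary is roughly 1.645, so a statistic of 1.7 is rejected by the right-tailed test but not by the two-tailed one.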

Types of Test Statistics

A test statistic measures how far the sample result is from what the null hypothesis predicts. Its value differs from one random sample to another. The test statistic summarises the sample data in a single number that helps in deciding whether to reject the null hypothesis or not.

There are different probability models for different types of populations. The hypothesis test, and with it the test statistic (for example z, t, chi-square or F), is selected based on the probability distribution.
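As one concrete case, the sketch below computes a test statistic for a test of two means. It assumes Welch's form of the two-sample t statistic, which is one common choice and not necessarily the one a given problem calls for; the function name and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t_statistic(sample_1, sample_2):
    """Welch's t statistic for H0: mu1 = mu2 (unequal variances allowed)."""
    n1, n2 = len(sample_1), len(sample_2)
    # Standard error of the difference between the two sample means.
    standard_error = sqrt(variance(sample_1) / n1 + variance(sample_2) / n2)
    return (mean(sample_1) - mean(sample_2)) / standard_error


# Made-up numbers, for illustration only.
group_a = [12.1, 11.8, 12.5, 12.0, 11.9]
group_b = [11.2, 11.5, 11.0, 11.4, 11.3]
print(two_sample_t_statistic(group_a, group_b))
```

A value near zero means the sample agrees with the null hypothesis; the further the value is from zero, the stronger the evidence against it.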

Sample data is like a mirror image of the population. The sample data must provide sufficient evidence to reject the null hypothesis and conclude that the effect exists in the population. If it does not, we fail to reject the null hypothesis. Note that this does not prove that the effect is absent from the population; it only means the sample did not give enough evidence for it.

Nothing is perfect, and data insights are no exception. If a result looks perfect, with 100% accuracy, then it is probably overfitted or manipulated. Natural, robust results come with a cost attached in the form of error. Therefore, we have error terms here. Let’s get to know them.

Error terms in hypothesis testing

Type-I error – This error occurs when insights drawn from sample data lead to rejection of the null hypothesis even though it is true (a false positive). This error can be controlled because its probability is set directly by the level of significance.

Type-II error – This error occurs when insights from sample data result in failing to reject the null hypothesis although it is false (a false negative). Its probability is denoted by β.

Level of Significance – It is the maximum probability of making a type I error that we are willing to accept, and it is denoted by α. By convention, a 95% confidence level corresponds to α = 0.05. This means that there is a 5% probability of rejecting the null hypothesis when it is actually true.
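The link between α and the two error types can be checked with a small simulation. This is a rough sketch, assuming a two-tailed z-test on normally distributed data with a known standard deviation; `rejection_rate` and all of its default values are hypothetical choices made for this example.

```python
import random
from math import sqrt
from statistics import NormalDist


def rejection_rate(true_mean, null_mean=0.0, sigma=1.0, n=30,
                   alpha=0.05, trials=2000, seed=42):
    """Share of simulated samples in which a two-tailed z-test rejects H0."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z_stat = (sample_mean - null_mean) / (sigma / sqrt(n))
        if abs(z_stat) > z_crit:
            rejections += 1
    return rejections / trials


# H0 is true here: every rejection is a type I error, so the rate should sit near alpha.
type_1_rate = rejection_rate(true_mean=0.0)
# H0 is false here: every failure to reject is a type II error.
type_2_rate = 1 - rejection_rate(true_mean=0.5)
print(type_1_rate, type_2_rate)
```

Lowering α in this sketch reduces the type I rate but raises the type II rate for the same sample size, which is the trade-off between the two errors.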

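p-value – It is a statistical concept that is used everywhere from hypothesis testing to regression to tree models and much more. It is an integral part of data science. The p-value is the probability of observing a result at least as extreme as the one in the sample, assuming that the null hypothesis is true. It is not the probability that the null hypothesis is true. If the p-value is high, the sample data is consistent with the null hypothesis and we fail to reject it; if the p-value is low, the data would be unlikely under the null hypothesis and we reject it. The decision is made by comparing the calculated p-value against alpha: reject the null hypothesis when the p-value is less than alpha.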

High P-Values: Your data are likely with a true null

Low P-Values: Your data are unlikely with a true null
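To make this concrete, the sketch below turns a test statistic into a p-value. It assumes the statistic follows a standard normal distribution under the null hypothesis (a z-test); the function name `p_value` is chosen for this example only.

```python
from statistics import NormalDist


def p_value(z_stat, tail="two"):
    """Probability, under H0, of a statistic at least as extreme as z_stat."""
    cdf = NormalDist().cdf  # standard normal cumulative distribution function
    if tail == "two":
        return 2 * (1 - cdf(abs(z_stat)))  # both tails count as extreme
    if tail == "right":
        return 1 - cdf(z_stat)
    if tail == "left":
        return cdf(z_stat)
    raise ValueError("tail must be 'two', 'right' or 'left'")


alpha = 0.05
for z in (0.5, 2.5):
    p = p_value(z)
    print(z, p, "reject H0" if p < alpha else "fail to reject H0")
```

A statistic close to zero gives a high p-value, and a statistic far out in the tail gives a low one.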

Now that we are acquainted with the basic terminology used in hypothesis testing, let’s see how to use it in making business decisions. The steps are as follows:

  1. Formulate the Null and Alternate Hypothesis
  2. Based on the data and its probability distribution, select the hypothesis test to be performed
  3. Based on the business area and problem statement, select the level of significance if 0.05 (standard alpha) is not acceptable
  4. Calculate the test statistic on the sample data collected
  5. Calculate the p-value
  6. Based on the p-value, reject or fail to reject the null hypothesis
  7. Draw your business conclusion
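The steps above can be walked through end to end in a short sketch. It assumes a two-tailed one-sample z-test with a known population standard deviation, which is a simplification chosen for illustration; the function `one_sample_z_test`, the null value of 50, the sigma of 4 and the sample values are all made up.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, null_mean, sigma, alpha=0.05):
    """Two-tailed z-test of H0: mu = null_mean, with sigma assumed known."""
    n = len(sample)
    z_stat = (mean(sample) - null_mean) / (sigma / sqrt(n))  # step 4
    p = 2 * (1 - NormalDist().cdf(abs(z_stat)))              # step 5
    return {"z": z_stat, "p_value": p, "reject_null": p < alpha}  # step 6


# Step 1: H0: mu = 50 and H1: mu != 50.
# Step 2: a z-test, because sigma is assumed known in this example.
# Step 3: alpha = 0.05.
sample = [52, 55, 49, 58, 54, 56, 53, 57, 51, 55]  # made-up measurements
result = one_sample_z_test(sample, null_mean=50, sigma=4, alpha=0.05)
print(result)

# Step 7: translate the statistical decision into a conclusion.
if result["reject_null"]:
    print("The data gives evidence that the mean differs from 50.")
else:
    print("The data does not give enough evidence that the mean differs from 50.")
```

When sigma is not known or the sample is small, a t-test would be the more usual choice in step 2; the remaining steps stay the same.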

If the above steps are performed in the said sequence, business decisions can be made quickly and on the basis of evidence. This method is widely used in testing pharmaceutical drugs and medical procedures.

For example, in the IT industry, hypothesis testing can help decide whether extra work from a client can be taken on with the current resource capability. This helps the business decide whether to take up an additional portfolio from the client and, if so, to settle the contract amount and terms so that the company gets the maximum profit.

I hope this article has given you a kick-start in understanding hypothesis testing and its importance. I would like to conclude by quoting Daniel Keys Moran: “You can have data without information, but you cannot have information without data”.


This article is presented by AIM Expert Network (AEN), an invite-only thought leadership platform for tech experts.

Netali Agrawal
Netali Agrawal is a part of the AIM Writers Programme. She is a Business Analyst who loves to explore new ideas in different industries through machine learning and artificial intelligence. She holds a bachelor's degree in engineering along with a post-graduation certification in business analytics and business intelligence. She is working with an MNC as a business analyst and leading a project for machine learning and artificial intelligence. Netali loves to write about analytics, machine learning and artificial intelligence. She loves to explore data and mould it in the best possible shape to get all possible insights from the data. She resides in Hyderabad, India. LinkedIn Bio: www.linkedin.com/in/netali-agrawal-31192a71
