What Is P-Hacking & How To Avoid It?

P-hacking is one of the most common ways in which data analysis is misused to find patterns that appear statistically significant but are not.

Data collection and analysis practices in industry and academia may not amount to outright fraud, but malpractice does occur. A result is only as good as the process that produced it, and results in data science depend heavily on how the analysis is carried out. Data dredging, or p-hacking, is one of the most common ways in which data analysis is misused to find patterns that appear statistically significant but are not.

Data dredging is difficult to spot and harms a study in several ways. P-hacking is often the unintentional cherry-picking of promising, noteworthy data, which produces an excess of apparently significant and desirable results. Its consequences can be severe: an inflated number of false positives that may lead to a study's retraction, misleading downstream work, increased bias, and a gross waste of resources.
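To see how easily false positives arise, consider a hypothetical sketch: if we test 100 features that are pure noise against a noise outcome, roughly 5% of them will still look "significant" at p < 0.05 by chance alone. (The variable names and seed here are illustrative, not from the article.)

```python
import numpy as np
from scipy import stats

# Illustrative demo: 100 random "features" vs. a random outcome.
# There is no real signal, yet some correlations pass p < 0.05 anyway.
rng = np.random.default_rng(42)
n_samples, n_features = 200, 100

X = rng.normal(size=(n_samples, n_features))  # pure-noise features
y = rng.normal(size=n_samples)                # pure-noise outcome

p_values = np.array(
    [stats.pearsonr(X[:, j], y)[1] for j in range(n_features)]
)
false_positives = int((p_values < 0.05).sum())
print(f"'Significant' correlations found by chance: {false_positives}/100")
```

Reporting only the handful of "hits" from such a sweep, without mentioning the other ~95 tests, is exactly the cherry-picking the article describes.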

P-hacking may never be eliminated entirely, but safeguards can reduce instances of data dredging and help analysts avoid the trap.


Preregistration of the study

The best way to avoid p-hacking is preregistration, which prevents selections or tweaks being made to the data after it has been seen. It requires preparing a detailed test plan, including the statistical tools and analysis techniques to be applied, and registering that plan along with the data in an online registry.

After the plan's registration, the test is carried out exactly as specified, without tweaking the data, and the results are reported in the registry whatever they turn out to be. This enhances confidence in the study, since anyone can check the plan online.


Avoid peeking at data and continuous observation

A data scientist's curiosity about a test's performance, and the resulting temptation to check the data mid-test, can substantially increase the number of false positives and distort the p-value. A test must therefore be allowed to run its course; it should not be peeked into or stopped early, even if the desired p-value appears to have been reached.
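The effect of peeking can be simulated. In this illustrative sketch (the batch sizes, seed, and simulation counts are assumptions, not from the article), an A/B test with no real difference is checked after every batch and stopped as soon as p < 0.05, then compared against looking only once at the end:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sims, batches, batch_size = 0.05, 500, 20, 50

def peeking_test():
    """Simulate a null A/B test, checking p after every batch."""
    a, b = [], []
    for _ in range(batches):
        a.extend(rng.normal(size=batch_size))
        b.extend(rng.normal(size=batch_size))  # same distribution: null is true
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True  # stopped early and declared "significance"
    return False

peeking_fpr = sum(peeking_test() for _ in range(n_sims)) / n_sims

# Fixed-horizon test: collect all data, then look exactly once.
fixed_fpr = sum(
    stats.ttest_ind(rng.normal(size=1000), rng.normal(size=1000)).pvalue < alpha
    for _ in range(n_sims)
) / n_sims

print(f"False-positive rate with peeking:  {peeking_fpr:.2f}")
print(f"False-positive rate, single look:  {fixed_fpr:.2f}")
```

The single-look test holds its nominal 5% error rate, while repeated peeking inflates it severalfold, which is why the test must be allowed to finish.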

Bonferroni correction to address the problem

As the number of hypothesis tests performed increases, so does the number of false positives, making it important to control them. The Bonferroni correction compensates by testing each individual hypothesis at a stricter significance level: the overall level divided by the number of tests. The method becomes very conservative when the number of tests is large, however, so some true positives may be missed even when real effects exist. There has to be a balance between preserving the power of the test and controlling false positives.
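A minimal sketch of the correction, using made-up p-values: with m hypotheses and a family-wise error target of alpha, each test is judged against alpha / m.

```python
import numpy as np

# Bonferroni correction: per-test threshold = alpha / number of tests.
alpha, m = 0.05, 12
p_values = np.array([0.001, 0.004, 0.019, 0.03, 0.04, 0.05,
                     0.12, 0.27, 0.33, 0.48, 0.62, 0.91])  # illustrative

threshold = alpha / m                 # 0.05 / 12 ≈ 0.00417
rejected = p_values < threshold      # only the strongest results survive

print(f"Per-test threshold: {threshold:.5f}")
print(f"Hypotheses rejected: {int(rejected.sum())} of {m}")
```

Note how p = 0.019 and p = 0.03, which would pass an uncorrected 0.05 cutoff, no longer count as significant — this is exactly the conservatism (and the loss of power) described above.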

Steps during the test to control false positives:

  • Decide the statistical parameters (variances) before starting the test. If a variance comes up that could change a parameter, note it in the study along with the rationale.
  • Decide beforehand the number of replications and tests to be performed, and the level at which a sample will be excluded. This helps prevent terminating the test early just because a desirable result has appeared.
  • When investigating multiple outcomes or comparisons, make sure the statistics reflect that. If something unusual turns up, re-testing the hypothesis can be a way to obtain a genuine p-value.

Pointers for data analysis

To begin with, the collected data should be of high quality to reduce pure error. Increasing the sample size can reduce the risks, but not always: with a larger sample, the risk of picking up spurious effects can also grow. A better, more complex model that accounts for co-variation should be used for the analysis.
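The value of accounting for co-variation can be sketched with a toy example (the variables, scales, and seed are assumptions for illustration): a shared covariate z drives both x and y, creating a strong but spurious correlation that shrinks toward zero once z enters the model.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
z = rng.normal(size=n)                 # shared covariate (confounder)
x = z + rng.normal(scale=0.5, size=n)  # x depends on z, not on y
y = z + rng.normal(scale=0.5, size=n)  # y depends on z, not on x

naive_corr = np.corrcoef(x, y)[0, 1]   # looks like a strong "effect"

# Regress y on both x and z via least squares; the coefficient on x
# collapses once the covariate is included in the model.
X = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"Naive correlation(x, y): {naive_corr:.2f}")
print(f"Coefficient on x after adjusting for z: {coef[1]:.2f}")
```

A simple correlation here would hand an analyst a very tempting, and entirely false, finding; the richer model makes it disappear.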

No set of steps can guarantee the complete elimination of data dredging, but it can certainly be reduced to the point where it becomes insignificant.


Meenal Sharma
I am a journalism undergrad who loves playing basketball and writing about finance and technology. I believe in the power of words.
