What Is P-Hacking & How To Avoid It?

P-hacking is one of the most common ways in which data analysis is misused to find patterns that appear statistically significant but are not.

Data collection and analysis practices in industry and academia may not amount to outright fraud, but malpractice does exist. A product is only as good as the process that made it, and results in data science likewise depend heavily on the analysis process. Data dredging, or p-hacking, is one of the most common ways in which data analysis is misused to find patterns that appear statistically significant but are not.

Data dredging is very difficult to spot and undermines the study it affects. P-hacking is the cherry-picking, often unintentional, of promising, noteworthy data, which produces an excess of significant and desirable results. It can have severe implications: an increase in false positives that may lead to the study’s retraction, misleading follow-up work, increased bias, and a gross waste of resources.

P-hacking is hard to rule out entirely, but safeguards can reduce instances of data dredging and help avoid the trap.

Preregistration of the study

The best way to avoid p-hacking is preregistration. It prevents selections or tweaks to the analysis after the data have been seen. It does, however, require preparing a detailed test plan in advance, including the statistical tools and analysis techniques to be applied to the data. This plan can then be registered in an online registry.

After the plan is registered, one carries out the test according to the plan, without tweaking the data, and reports the results in the registry, whatever they are. This increases other analysts’ confidence in the study, since they can check the plan online.

Avoid peeking at data and continuous observation

A data scientist’s curiosity about how a test is performing, and the resulting habit of checking the data mid-test, increases the number of false positives and can substantially distort the p-value. A test must therefore be allowed to run its course: it should not be peeked into, or stopped early, even if the desired p-value has been reached.
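
The effect of peeking can be checked with a small simulation. The sketch below is illustrative only and is not taken from any particular study. It assumes data with no true effect (mean 0, known standard deviation 1), a two-sided z-test, and a hypothetical analyst who looks after every 10 observations, up to 100, and stops at the first p-value below 0.05. The function names and parameters are made up for this example.

```python
import random
from statistics import NormalDist


def z_test_p_value(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation of 1."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_experiments=2000, n_max=100, look_every=10, alpha=0.05, seed=0):
    """Compare false positive rates: one test at the end vs. stopping at the first 'hit'."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_experiments):
        # The null hypothesis is true by construction: there is no effect to find.
        data = [rng.gauss(0, 1) for _ in range(n_max)]

        # Strategy 1: a single test once all the data is in.
        if z_test_p_value(data) < alpha:
            fixed_hits += 1

        # Strategy 2: test at every interim look and count any p < alpha as a result.
        looks = range(look_every, n_max + 1, look_every)
        if any(z_test_p_value(data[:n]) < alpha for n in looks):
            peeking_hits += 1
    return fixed_hits / n_experiments, peeking_hits / n_experiments


fixed_rate, peeking_rate = simulate()
print(f"False positive rate, single test at the end: {fixed_rate:.3f}")
print(f"False positive rate, peeking every 10 observations: {peeking_rate:.3f}")
```

Under these assumptions, the single test at the end stays close to the nominal 5% false positive rate, while the stop-at-first-success strategy reports a "significant" result considerably more often, even though no effect exists.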

Bonferroni correction to address the problem

As the number of hypothesis tests increases, so does the number of false positives, and it becomes important to control this. The Bonferroni correction compensates for the increased chance of observing a rare event by testing each individual hypothesis at a significance level of α/m, where α is the desired overall significance level and m is the number of tests. The method becomes very strict, however, when the number of tests is very large. As a result, some true effects may be missed because their p-values fall above the corrected threshold. So there has to be a balance between increasing the power of the test and controlling false positives.
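
As a minimal sketch of the idea: with m tests and an overall significance level α, the Bonferroni correction tests each hypothesis at α/m. The p-values below are hypothetical numbers chosen for illustration, the function names are made up for this example, and the family-wise error formula assumes the tests are independent.

```python
def family_wise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Return True for each hypothesis rejected at the corrected level alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five tests (illustrative numbers, not real data).
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]

uncorrected = [p <= 0.05 for p in p_values]
corrected = bonferroni_reject(p_values, alpha=0.05)

print("Rejected without correction:", uncorrected)
print("Rejected with Bonferroni:   ", corrected)
print(f"Chance of a false positive in 20 uncorrected tests: "
      f"{family_wise_error_rate(0.05, 20):.2f}")
```

Without correction, four of the five hypothetical results count as significant. With the corrected threshold of 0.05 / 5 = 0.01, only the first does. This shows both what the correction protects against and how strict it can be.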

Steps during the test to control false positives:

  • Decide the statistical parameters (such as variances) before starting the test. If something comes up that could change a parameter, note the change in the study, along with the rationale behind it.
  • Decide beforehand the number of replications and tests to be performed, and the criteria for excluding samples. This helps prevent ending the test early, or extending it, based on the results seen so far.
  • When investigating multiple outcomes or comparisons, make sure the statistics reflect that. If something unusual turns up, testing the hypothesis again on new data is a way to get a trustworthy p-value.
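
One common way to fix the sample size before the test starts, in the spirit of the steps above, is a power calculation. The sketch below is an assumption-laden example rather than a general recipe: it covers only a two-sided, one-sample z-test with a known standard deviation, takes the effect size in standard-deviation units, and defaults to a 5% significance level and 80% power. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(effect_size, alpha=0.05, power=0.80):
    """Sample size for a two-sided one-sample z-test.

    effect_size is the smallest difference worth detecting, in standard deviations.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile matching the target power
    return math.ceil(((z_alpha + z_power) / effect_size) ** 2)


print(required_sample_size(0.5))  # medium-sized effect
print(required_sample_size(0.2))  # small effect needs far more data
```

Writing this number down before collecting data removes the temptation to stop as soon as the p-value looks good.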

Pointers for data analysis

To begin with, the data collected should be of high quality to reduce pure error. A larger sample size may reduce the risk, but not always: with very large samples, even trivially small effects can reach statistical significance. A better model, one that is more complex and accounts for covariation, should be used for data analysis.
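
One reason a larger sample does not automatically reduce the risk is that a negligible effect becomes "statistically significant" once the sample is big enough. The sketch below illustrates this under simplifying assumptions: a two-sided, one-sample z-test with a known standard deviation, and an observed mean difference of exactly 0.01 standard deviations. The function name and the numbers are illustrative.

```python
from statistics import NormalDist


def p_value_for_effect(effect_in_sd, n):
    """Two-sided z-test p-value for an observed mean shift, given in SD units."""
    z = effect_in_sd * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same tiny effect, evaluated at increasing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: p = {p_value_for_effect(0.01, n):.4f}")
```

The effect is equally small in all three cases. Only the sample size changes, which is why the p-value should be read alongside the effect size.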

No set of steps can guarantee the complete elimination of data dredging, but it can certainly be reduced to a level where it becomes insignificant.


Meenal Sharma

I am a journalism undergrad who loves playing basketball and writing about finance and technology. I believe in the power of words.