Data collection and analysis practices in industry and academia may not amount to outright fraud, but malpractice undeniably exists. A product is only as good as the process that creates it, and results in data science are likewise highly dependent on the analysis process. Data dredging, or p-hacking, is one of the most common misuses of data analysis: searching for patterns that appear statistically significant but are not.
Data dredging is difficult to spot. P-hacking is often unintentional: cherry-picking promising, noteworthy-looking data leads to an excess of significant, desirable results. Its implications can nonetheless be severe, including an inflated number of false positives leading to a study's retraction, misled downstream work, increased bias, and a gross waste of resources.
P-hacking can never be ruled out entirely, but safeguards can reduce instances of data dredging and help avoid the trap.
Preregistration of the study
The best way to avoid p-hacking is preregistration, which prevents selecting or tweaking data after seeing it. It requires preparing a detailed test plan, including the statistical tools and analysis techniques to be applied to the data, and registering that plan on an online registry (the Open Science Framework is a common choice).
After registration, the test is carried out according to the plan, without tweaking the data, and the results, whatever they are, are reported in the registry. This increases confidence in the study, since anyone can check the plan online.
Avoid peeking at data and continuous monitoring
A data scientist's curiosity about a test's performance, and the resulting habit of checking the data mid-test, inflates the number of false positives and distorts the p-value. A test must therefore be allowed to run its planned course: do not peek at it or stop it early, even if the desired p-value has already been reached.
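A minimal simulation (not from the article; all function names here are illustrative) shows why peeking is dangerous. Both groups of simulated studies have no real effect, yet stopping at the first "significant" peek rejects the null far more often than the nominal 5%:

```python
# Simulate the effect of peeking under the null hypothesis (true mean = 0):
# an analyst who checks every 10 observations and stops at the first p < 0.05
# sees many more false positives than one who tests once at the planned end.
import math
import random

def p_value(sample_mean, n):
    """Two-sided p-value for H0: mean = 0, with known sd = 1 (z-test)."""
    z = abs(sample_mean) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def run_trial(rng, n_max=1000, peek_every=10):
    """One simulated study; returns (rejected-while-peeking, rejected-at-end)."""
    total, rejected_by_peeking = 0.0, False
    for i in range(1, n_max + 1):
        total += rng.gauss(0, 1)  # the data truly has mean 0
        if i % peek_every == 0 and p_value(total / i, i) < 0.05:
            rejected_by_peeking = True  # analyst would have stopped here
    return rejected_by_peeking, p_value(total / n_max, n_max) < 0.05

rng = random.Random(42)
trials = 1000
peek_fp = fixed_fp = 0
for _ in range(trials):
    peeked, fixed = run_trial(rng)
    peek_fp += peeked
    fixed_fp += fixed

print(f"false-positive rate with peeking:   {peek_fp / trials:.3f}")
print(f"false-positive rate, single test:   {fixed_fp / trials:.3f}")
```

The single planned test rejects at roughly the nominal 5% rate, while the peeking strategy rejects several times more often, even though nothing real is there to find.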
Bonferroni correction for multiple tests
As the number of hypothesis tests performed grows, so does the number of false positives, and controlling them becomes important. The Bonferroni correction compensates for the increased chance of observing a rare event by testing each individual hypothesis at a significance level of α/m, where α is the desired overall (family-wise) level and m is the number of tests. The method becomes very conservative when m is large, however, so some true effects may be missed: the tests lose power. There has to be a balance between preserving the power of the test and controlling false positives.
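The correction itself is simple enough to sketch in a few lines (the function name below is illustrative): with m tests at overall level α, only p-values below α/m count as significant.

```python
# Bonferroni correction: test each of the m hypotheses at alpha / m so the
# family-wise error rate (chance of at least one false positive) stays <= alpha.
def bonferroni_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where H0 is rejected after correction."""
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values]

# Five tests: only p-values below 0.05 / 5 = 0.01 survive the correction.
p_values = [0.003, 0.020, 0.045, 0.008, 0.600]
print(bonferroni_reject(p_values))  # -> [True, False, False, True, False]
```

Note that 0.020 and 0.045 would each look significant at the usual 0.05 level, but neither survives once the five tests are accounted for, which is exactly the conservatism described above.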
Steps during the test to control false positives:
- Decide the statistical parameters (e.g., significance level, variances) before starting the test. If a deviation arises that would change a parameter, note it in the study along with the rationale.
- Decide beforehand the number of replications and tests to be performed, and the criteria under which a sample will be excluded. This helps prevent stopping or extending the test opportunistically until the desired result appears.
- When investigating multiple outcomes or comparisons, make sure the statistics reflect that. If something unusual turns up, re-testing the hypothesis on fresh data is a way to obtain a trustworthy p-value.
Pointers for data analysis
To begin with, the data collected should be of high quality, to reduce pure error. Increasing the sample size might reduce the risks, but not always: with more data, more comparisons become possible and the risks can grow as well. For the analysis itself, a better model, typically a more complex one that accounts for covariation, should be used.
No set of steps can guarantee the complete elimination of data dredging, but it can certainly be reduced to the point of insignificance.