The longevity of any scientific domain relies on the falsifiability of its claims. Machine learning (ML), a relative latecomer to the scientific community, appears to lack the culture of replicability found in other scientific fields. Today, merely verifying a claim in a paper is a daunting task, given that thousands of ML papers are published every week.
To establish an ecosystem that encourages ML researchers to volunteer to reproduce claimed results, the organizers of NeurIPS 2019 introduced new policies into their paper submission guidelines.
A report on the results of deploying these components was recently published, in which the authors – researchers from leading institutes and organizations – discuss their findings in detail.
Facilitating Reproducibility In ML
This renewed interest in replicability was kickstarted at last year’s NeurIPS, the premier international conference for ML research, where a reproducibility program was introduced to raise standards across the community and improve how ML research is evaluated.
The new program added three components to the paper submission process:
- a code submission policy,
- a community-wide reproducibility challenge, and
- a Machine Learning Reproducibility checklist
According to the authors, the results of this reproducibility experiment at NeurIPS 2019 could be summarized as follows:
- Indicating the success of the code submission policy, NeurIPS saw the share of authors willingly submitting code rise from less than 50% a year earlier to nearly 75%.
- The authors note that participation in the reproducibility challenge continues to grow, suggesting broadening support for the movement.
The increase in code submissions can also be attributed to the NeurIPS 2019 policy, which states that it “expects code only for accepted papers”. Code submission is thus not mandatory, and the code is not expected to be used during the review process to judge the soundness of the work.
Challenges To Reproducibility
Reproducibility is essential for the widespread adoption of any scientific method. In ML, however, the process is far from straightforward, and the black-box nature of ML models does not help. There is also overwhelming hype around AI, which can nudge researchers into inflating their results for personal gain. Overclaiming is a major headache, and one of the reasons so many announced breakthroughs never materialize beyond the breaking news.
Take the example of neural ODEs, a paper that garnered accolades for its breakthrough results: one of its authors, David Duvenaud, later exposed its flaws himself. The paper, which won a best paper award at NeurIPS 2018, was ripped apart by one of its main authors a year later, coincidentally at NeurIPS 2019!
While the neural ODE paper led to other breakthrough work, Duvenaud admitted that it contained many inaccuracies. To the dismay of his audience, he even explained how the authors chose a ‘cool-sounding name’ for the paper to attract more eyeballs.
This clearly shows the perils of hype in a nascent field. Fortunately for the community, Duvenaud came clean, setting a precedent for honesty that has so far been hit and miss across the field.
That said, there are a few immediate challenges to reproducibility, and these can be summarized as follows:
- Same training data might not be accessible
- Misspecified training procedures in the paper
- No code or erroneous code
- Being lenient with the metrics
- Improper statistical testing, such as using the wrong statistical tests
- Overclaiming of results
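Two of the pitfalls above – lenient metrics and improper statistical testing – can be sketched in a few lines. The scores and model names below are hypothetical, and Welch's t-statistic is implemented by hand purely for illustration; in practice a vetted library routine would be used:

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variance."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(va / len(a) + vb / len(b))

# Hypothetical accuracy scores from 5 runs of each model (different seeds).
baseline = [0.712, 0.705, 0.718, 0.709, 0.714]
proposed = [0.721, 0.702, 0.730, 0.698, 0.725]

# Lenient reporting: comparing only the best runs overstates the gap.
print(f"best-vs-best gap: {max(proposed) - max(baseline):.3f}")

# Sounder reporting: means, spread, and a test statistic over all runs.
print(f"baseline: {mean(baseline):.3f} +/- {stdev(baseline):.3f}")
print(f"proposed: {mean(proposed):.3f} +/- {stdev(proposed):.3f}")
print(f"Welch t = {welch_t(proposed, baseline):.2f}")
```

The point is not the particular test but the discipline: aggregating over all runs, reporting variability, and applying an appropriate statistical test before claiming an improvement.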
When it comes to code submission, the report found the following common objections:
- Dataset confidentiality
- Proprietary software
- Computation infrastructure
In an interview published by Nature, Joelle Pineau – also one of the report’s authors – drew the whole ML community’s attention to reproducibility.
In reinforcement learning, said Pineau, two runs of the same algorithm with different initial random seeds can yield very different results. And if you do many runs, you can choose to report only the best ones.
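Pineau's point can be illustrated with a toy experiment. The `train_toy_agent` function below is a hypothetical stand-in for a real RL training run, with the reward numbers invented for illustration; the takeaway is that the seed alone produces a wide spread, so reporting only the best run misrepresents typical performance:

```python
import random
from statistics import mean, stdev

def train_toy_agent(seed: int) -> float:
    """Stand-in for an RL training run (hypothetical): the final reward
    depends heavily on the random initialization."""
    rng = random.Random(seed)
    # Noisy outcome: a base score plus seed-dependent variation.
    return 100 + rng.gauss(0, 15)

rewards = [train_toy_agent(seed) for seed in range(10)]

# What cherry-picking reports vs. what readers actually need.
print(f"best run : {max(rewards):.1f}")
print(f"all runs : {mean(rewards):.1f} +/- {stdev(rewards):.1f}")
```

Reporting the mean and spread over every seed, rather than the single best run, is exactly the kind of practice the reproducibility checklist is meant to encourage.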
Pineau and her peers have provided much-needed impetus to a movement that has already shown promising results and that, hopefully, will translate into widespread fairness and transparency across the community.