If you have ever tried to replicate a state-of-the-art (or even a reasonably good) machine learning paper, you have probably run into package and library issues, version conflicts, hardware constraints and many other challenges, all of which suggest that reproducibility in ML is a serious problem.
A reproducibility program was even introduced at NeurIPS 2019, which required researchers to consider the following:
- a code submission policy,
- a community-wide reproducibility challenge, and
- a Machine Learning Reproducibility checklist
Recently, the computer scientist Grigori Fursin posted about the checklists researchers should keep in mind if they care about reproducibility.
In a recent talk, Fursin shared his experience of reproducing 150+ systems and ML papers during artifact evaluation at ASPLOS, MLSys, CGO, PPoPP and Supercomputing. “Our long-term goal is to help researchers share their new ML techniques as production-ready packages along with published papers and participate in collaborative and reproducible benchmarking, co-design and comparison of efficient ML/software/hardware stacks,” he said.
To make the reviewers' job easy, authors should state clearly how their artifacts can be accessed:
- Whether to clone the repository from GitHub, GitLab, BitBucket or a similar service
- Whether to download the package from a public or private website
- Whether access is provided via a private machine with pre-installed software, when rare hardware is required or proprietary software is used
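The first item can be made unambiguous by stating, in the README, the exact command reviewers should run. A minimal sketch in Python; the repository URL and release tag here are hypothetical placeholders:

```python
def clone_command(url: str, tag: str) -> str:
    """Assemble the exact git command reviewers should run.

    Pinning a release tag (rather than a branch head) ensures every
    reviewer evaluates the same revision of the artifact.
    """
    return f"git clone --depth 1 --branch {tag} {url}"

# Hypothetical artifact repository and tag, for illustration only.
print(clone_command("https://github.com/example/paper-artifact.git", "v1.0"))
```

Printing the full command removes any ambiguity about which service, repository and revision the reviewer is expected to fetch.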
Fursin also advises stating the approximate disk space required after unpacking the artifact, so that reviewers can avoid bloating their VM images with unnecessary software packages.
Changing Anything Changes Everything, or the CACE principle, is one of the heuristics for managing ML software products: every change made anywhere in a pipeline can affect everything downstream. In that spirit, researchers should describe any specific hardware and specific features required to evaluate the artifacts: vendor, CPU/GPU/FPGA, number of processors/cores, interconnect, memory, hardware counters, OS and software packages.
“This is particularly important if you share your source code and it must be compiled or if you rely on some proprietary software that you can not include to your package. In such a case, we strongly suggest you describe how to obtain and to install all third-party software, data sets and models,” wrote Fursin.
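Much of this hardware and software detail can be captured programmatically rather than typed by hand. A minimal sketch using only Python's standard library; fuller detail (GPU models, interconnect, hardware counters) would need vendor-specific tools:

```python
import platform
import sys

def environment_report() -> dict:
    """Collect basic hardware/OS/software details to report alongside an artifact."""
    return {
        "machine": platform.machine(),      # e.g. x86_64, arm64
        "processor": platform.processor(),  # may be empty on some OSes
        "os": platform.platform(),          # OS name, release, build
        "python": sys.version.split()[0],   # interpreter version
    }

print(environment_report())
```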
Datasets, Models And Installation
If the datasets are large or proprietary, it is advisable to add details on how to download them; if a dataset is proprietary, reviewers should be provided with a public alternative subset for evaluation. The same goes for models: if third-party models are not included in the package (for example, because they are very large or proprietary), provide details on how to download and install them, and describe the setup procedures for the artifacts.
Experiment Workflow And Evaluation
Describe the experimental workflow and how it is implemented, invoked and customised (if needed), e.g. via OS scripts, an IPython/Jupyter notebook, a portable CK workflow, etc. Also describe all the steps necessary to evaluate the artifacts using this workflow. Finally, state the expected results and the maximum allowable variation of empirical results (particularly important for performance numbers and speed-ups).
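The "maximum allowable variation" can be made executable so reviewers do not have to eyeball numbers. A minimal sketch; the 5% tolerance is an assumed figure, not taken from the checklist:

```python
def within_tolerance(measured: float, expected: float,
                     max_variation: float = 0.05) -> bool:
    """Check an empirical result (e.g. a speed-up) against the expected
    value, allowing a stated maximum relative variation (assumed 5% here)."""
    return abs(measured - expected) <= max_variation * expected

# Example: the paper reports a 2.0x speed-up; a measured 1.93x passes.
print(within_tolerance(1.93, 2.0))
```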
Customisation is optional but often valuable. If possible, describe how to customise the workflow, e.g. whether it can be run with different data sets, benchmarks, real applications, predictive models, software environments (compilers, libraries, run-time systems) or hardware. Also describe whether the workflow can be parameterised, wherever applicable: changing the number of threads, applying different optimisations, CPU/GPU frequency, autotuning scenarios, model topology, etc.
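Parameterisation of this kind is commonly exposed as command-line flags on the workflow's entry point. A sketch of what that might look like; the flag names and defaults are illustrative, not from any particular artifact:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Expose the tunable knobs of a (hypothetical) workflow as CLI flags."""
    p = argparse.ArgumentParser(description="reproducible experiment workflow")
    p.add_argument("--dataset", default="public-subset",
                   help="which data set to evaluate on")
    p.add_argument("--threads", type=int, default=1,
                   help="number of threads to run with")
    p.add_argument("--model", default="baseline",
                   help="predictive model to run")
    return p

# Example invocation: rerun the workflow with 4 threads.
args = build_parser().parse_args(["--threads", "4"])
print(args)
```

Surfacing every knob as a flag (rather than a hard-coded constant) is what lets reviewers re-run the same workflow under different data sets, thread counts or models without editing the code.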