Recently, Data Version Control (DVC) 1.0 was released, claiming to bring agility, reproducibility and collaboration to the existing data science workflow. DVC is an open-source tool that works as version control for data science and machine learning projects.
After three long years of finalising the requirements for DVC 1.0 and stabilising the commands (API) and DVC file formats, Dmitry Petrov, Co-Founder and CEO at Iterative.ai, released this version control tool, also known as Git for data projects. In technical terms, it codifies data and machine learning pipelines as text metafiles, with pointers to the actual data in S3/GCP/Azure/SSH, while using Git for the actual versioning.
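As an illustration of the metafile idea, a large data file tracked by DVC is represented in Git by a small text file (a `.dvc` metafile) that records a content hash; the actual data lives in the cache or remote storage and is addressed by that hash. A minimal sketch, with a hypothetical file name and hash value:

```yaml
# data.xml.dvc -- a small text metafile committed to Git in place of the data.
# The actual data.xml lives in the DVC cache / remote storage (e.g. S3),
# addressed by its content hash. Hash and path below are illustrative.
outs:
- md5: a304afb96060aad90176268345e10355
  path: data.xml
```

Because only this pointer file is committed, standard Git workflows (branching, diffing, code review) continue to work even when the underlying data is many gigabytes.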
In a blog post, Dmitry stated that the new DVC 1.0 is inspired by discussions and contributions from the community of data scientists, ML engineers, developers and software engineers.
How It Works
DVC is a data and ML experiment management tool that takes advantage of the existing engineering toolset you are already familiar with (Git, CI/CD, etc.).
According to Dmitry, he learned five important lessons while building the open-source ML tool; they are reflected in the changes mentioned below:
- Multi-stage DVC files: In DVC 1.0, the DVC file format changed in three big ways. First, instead of multiple DVC stage files (*.dvc), each project has a single pipeline file, dvc.yaml. Second, there is a clear connection between the dvc run command, where pipeline stages are defined, and how stages appear in dvc.yaml. Third, data hash values are no longer stored in the pipeline metafile; they live in a separate lock file (dvc.lock).
- Run Cache: Pipeline runs are no longer directly connected to Git commits; the new DVC can store all runs in the run-cache, even runs that were never committed to Git.
- Plots: This function is designed not only for visualising the current state of the project but also for comparing plots across the Git history.
- Data Transfer Optimisations: In DataOps, data transfer speed is hugely important, and DVC can now choose the optimal strategy for traversing remote data.
- Hyperparameter Tracking: DVC 1.0 adds support for parameter configuration files, an important step towards tracking hyperparameters across ML experiments.
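To make the single-file pipeline format concrete, here is a minimal dvc.yaml sketch in the DVC 1.0 style. The stage names, commands, file paths and parameter names are hypothetical; only the overall structure follows the format described above:

```yaml
# dvc.yaml -- one pipeline file per project (DVC 1.0 format).
# Note: no data hashes appear here; those are kept out of the pipeline metafile.
stages:
  prepare:
    cmd: python prepare.py data.xml      # command that reproduces this stage
    deps:
      - prepare.py
      - data.xml
    outs:
      - data/prepared
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/prepared
    params:                              # hyperparameters tracked from a params file
      - train.epochs
      - train.lr
    outs:
      - model.pkl
    plots:
      - logs.csv                         # plot data, comparable across Git history
```

Each `dvc run` invocation adds or updates a named stage in this file, which is what makes the connection between the command and the resulting pipeline definition explicit.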
Features of DVC
This version control tool brings a number of features to the table. They are mentioned below:
- Git-Compatible: DVC runs on top of any Git repository and is compatible with any standard Git server or provider (GitHub, GitLab, etc.). The tool offers all the advantages of a distributed version control system, including lock-free operation, local branching, and versioning.
- Storage Agnostic: A user can store data on Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, or local disk.
- Reproducible: DVC guarantees reproducibility by consistently maintaining the combination of input data, configuration, and code that was initially used to run an experiment.
- Low Friction Branching: DVC fully supports instantaneous Git branching, even with large files.
- Metric Tracking: DVC includes a command to list all branches along with their metric values, making it easy to track progress or pick the best version.
- ML Pipeline Framework: DVC has a built-in way to connect ML steps into a DAG and run the full pipeline end-to-end.
- Language & Framework-Agnostic: DVC is language- and framework-agnostic; any programming language or library, including Python, R, Julia, Scala/Spark, custom binaries, notebooks, TensorFlow, and PyTorch, is fully supported.
- HDFS, Hive & Apache Spark: DVC can include Spark and Hive jobs in its data versioning cycle, alongside local ML modelling steps, and manage those jobs end-to-end.
- Track Failures: DVC is built to track everything, including failed experiments, in a reproducible and easily accessible way.
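As a sketch of the storage-agnostic setup, pointing a project at a remote is a one-line configuration change recorded in the repository's `.dvc/config` file; the remote name and bucket URL below are hypothetical:

```ini
; .dvc/config -- written by a command such as:
;   dvc remote add -d myremote s3://mybucket/dvcstore
; Swapping the url (e.g. to gs://, azure://, ssh://) is all that changes
; when moving to a different storage backend.
[core]
    remote = myremote
['remote "myremote"']
    url = s3://mybucket/dvcstore
```

Because this file is itself committed to Git, collaborators who clone the repository inherit the same storage configuration and can pull the data with `dvc pull`.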
On May 4 this year, DVC celebrated its third anniversary, and the pre-release of the open-source ML tool was announced. Currently, DVC 1.0 has 100+ code contributors, 100+ documentation contributors, and thousands of users. It is available through all the standard installation methods, including pip, conda, brew, choco, and system-specific packages: deb, rpm, msi, pkg.