Version control is the part of software configuration management that keeps track of changes to documents, computer programs, websites, and similar artifacts.
For example, version control keeps track of changes to source code. When mistakes slip into the code, which usually happens when more than one person works on the same project, it protects the codebase from the unintended consequences of human oversight.
While building a machine learning model, a developer has to be able to answer questions such as which dataset was used to train the model, which hyperparameters were set, which pipeline created the model, and which version of the model was last deployed. This calls for the application of version control in machine learning models.
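One way to make those questions answerable is to store a small record next to every trained model. The following is a minimal sketch, not a standard format: the `ModelVersion` class, its field names, and the placeholder commit string are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the exact bytes given."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical record of what went into one trained model."""
    dataset_hash: str        # which data the model was trained on
    hyperparameters: dict    # settings used for training
    pipeline_commit: str     # assumed: Git commit of the training code
    deployed: bool = False   # whether this version is currently in production

    def version_id(self) -> str:
        """Derive a short ID from everything that defines the model."""
        payload = asdict(self)
        payload.pop("deployed")  # deployment status does not change the model
        canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
        return fingerprint(canonical)[:12]


record = ModelVersion(
    dataset_hash=fingerprint(b"id,label\n1,cat\n2,dog\n"),
    hyperparameters={"learning_rate": 0.01, "epochs": 20},
    pipeline_commit="abc1234",  # placeholder, not a real commit
)
print(record.version_id())
```

Because the ID is derived from the dataset hash, the hyperparameters, and the code commit, any change to one of them yields a different ID, while flipping `deployed` does not.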
Version control frameworks allow developers to inspect the history, identify differences between versions, and merge changes wherever necessary. Versioning helps in monitoring applications and ensuring quality. It also makes it easy for new team members to download the current version and follow how it changes.
Why Version Control
- The accuracy of a model varies when you update and tinker with different parts of the model or its data. With versioning, developers can compare versions to find the best model and understand its tradeoffs.
- A machine learning model can fail for several reasons, for example after adding more data or incorporating performance improvement measures. In case of such failures, model versioning helps in quickly reverting to the previous working version.
- Machine learning models can be very complex. Factors such as datasets, training and testing procedures, and frameworks, among others, account for a model’s success. Version control helps keep track of these dependencies.
- Major updates to machine learning models are not usually rolled out at once. To ensure better performance and failure tolerance, the ML models are released in phases. Versioning allows the deployment of the right versions at the right time.
- Model versioning is an essential component of AI/ML governance for organisations to control access, implement policy, and track model activity.
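Two of the points above, reverting to the previous working version and releasing in phases, can be sketched in a few lines. This is an illustrative in-memory example only: `ModelRegistry` and `choose_version` are hypothetical names, and the "models" are plain functions standing in for real ones.

```python
import random


class ModelRegistry:
    """Hypothetical registry: named model versions plus a promotion history."""

    def __init__(self):
        self._models = {}    # version name -> model (any callable)
        self._history = []   # promoted versions in order; the last one is live

    def register(self, version, model):
        if version in self._models:
            raise ValueError(f"version {version!r} is already registered")
        self._models[version] = model

    def promote(self, version):
        if version not in self._models:
            raise KeyError(version)
        self._history.append(version)

    def rollback(self):
        """Make the previously promoted version live again."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def live(self):
        if not self._history:
            raise RuntimeError("no version has been promoted yet")
        return self._history[-1]

    def predict(self, x):
        return self._models[self.live](x)


def choose_version(stable, candidate, candidate_share, rng=random):
    """Phased rollout: route roughly `candidate_share` of requests to the candidate."""
    return candidate if rng.random() < candidate_share else stable


registry = ModelRegistry()
registry.register("v1", lambda x: x * 2)
registry.register("v2", lambda x: x * 3)
registry.promote("v1")
registry.promote("v2")
print(registry.live, registry.predict(10))  # v2 30
registry.rollback()
print(registry.live, registry.predict(10))  # v1 20
```

Keeping a promotion history, rather than only the latest version, is what makes the rollback a single cheap step. A real deployment would persist this state instead of holding it in memory.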
Git: Git is the de facto standard version control system, used across the board to track software development and deployment. Git tracks changes made to the code and helps in implementing, storing, and merging changes.
That said, Git also comes with a few drawbacks. It is a challenge to keep all the folders in sync in Git, because model checkpoints and data occupy the bulk of the space. As an alternative, many users store the datasets on cloud storage such as Amazon S3, keep reproducible code in Git, and generate models on the fly. But working with multiple datasets breeds confusion. Further, improper documentation of data changes and upgrades can result in the model losing its context.
DVC: Data Version Control (DVC) is an open-source tool that works alongside Git. It combines Git with ML-specific functionality for data management. DVC can run on top of any Git repository and is compatible with any Git server or provider. DVC also offers the advantages of a distributed version control system, such as lock-free operation, local branching, and versioning.
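The general idea of keeping large data outside Git while committing only a small pointer to it can be sketched as follows. This is a simplified illustration and not DVC's actual implementation or file format: the function names, the `.pointer.json` suffix, and the local `store` directory (standing in for cloud storage) are all assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def add_to_store(data_file: Path, store: Path) -> Path:
    """Copy a data file into a content-addressed store and write a small pointer file."""
    content = data_file.read_bytes()
    digest = hashlib.sha256(content).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    (store / digest).write_bytes(content)  # stands in for an upload to cloud storage
    pointer = data_file.parent / (data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": digest}))
    return pointer  # this small file is what would be committed to Git


def restore(pointer: Path, store: Path) -> bytes:
    """Fetch the exact data a pointer file refers to and verify its hash."""
    meta = json.loads(pointer.read_text())
    content = (store / meta["sha256"]).read_bytes()
    if hashlib.sha256(content).hexdigest() != meta["sha256"]:
        raise ValueError("stored data does not match the pointer")
    return content


with tempfile.TemporaryDirectory() as tmp:
    repo, store = Path(tmp) / "repo", Path(tmp) / "store"
    repo.mkdir()
    data = repo / "train.csv"

    data.write_bytes(b"id,label\n1,cat\n")
    pointer = add_to_store(data, store)
    old_pointer_text = pointer.read_text()  # what Git would hold at the older commit

    data.write_bytes(b"id,label\n1,cat\n2,dog\n")
    add_to_store(data, store)  # the pointer now refers to the new data

    pointer.write_text(old_pointer_text)  # simulates checking out the older pointer
    print(restore(pointer, store).decode())
```

Since each pointer names its data by hash, checking out an older commit of the code also identifies exactly which dataset belonged to it, which addresses the confusion that arises when several datasets live only in a storage bucket.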
Pachyderm: It delivers robust data versioning and data lineage to the machine learning loop. It also provides a flexible pipeline system that can use any tool or framework in the transformation steps. Pachyderm uses containers to execute different pipeline steps and solves data provenance issues by tracking data commits and optimising the pipeline.
ML Metadata (MLMD): It is a recently introduced library from the TensorFlow team that tracks the full lineage of an ML workflow. The complete lineage includes steps such as data ingestion, preprocessing, validation, training, and deployment. MLMD can be used to trace bad models back to the datasets they were trained on.
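Tracing a model back to its datasets amounts to walking a graph of recorded steps. The sketch below shows that idea in plain Python. It does not use the MLMD API: the `LineageGraph` class and the artifact names are hypothetical, and the recorded steps are assumed to form an acyclic graph.

```python
class LineageGraph:
    """Hypothetical lineage store: which step produced each artifact, and from what."""

    def __init__(self):
        self._parents = {}  # artifact -> (step name, tuple of input artifacts)

    def record(self, step, inputs, outputs):
        """Record that `step` consumed `inputs` and produced `outputs`."""
        for artifact in outputs:
            self._parents[artifact] = (step, tuple(inputs))

    def sources(self, artifact):
        """Return the original artifacts (those with no recorded producer) behind `artifact`."""
        if artifact not in self._parents:
            return {artifact}
        _, inputs = self._parents[artifact]
        roots = set()
        for parent in inputs:
            roots |= self.sources(parent)  # assumes no cycles in the recorded steps
        return roots


graph = LineageGraph()
graph.record("ingestion", ["raw_2021.csv", "raw_2022.csv"], ["examples"])
graph.record("preprocessing", ["examples"], ["features"])
graph.record("training", ["features"], ["model_v3"])

print(sorted(graph.sources("model_v3")))  # ['raw_2021.csv', 'raw_2022.csv']
```

If `model_v3` turns out to be bad, the same walk shows which raw files to inspect first.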