Version Control For ML Models, Explained

Version Control

Version control is the part of software configuration management that keeps track of changes to documents, computer programs, websites, and similar files.

For example, version control keeps track of changes to source code. When mistakes slip in, which usually happens when more than one person works on the same project, it protects the code from the unintended consequences of human oversight.

While building a machine learning model, a developer must be able to answer questions such as which dataset was used to train the model, which hyperparameters were set, which pipeline created the model, and which version was last deployed. This calls for applying version control to machine learning models.

Version control frameworks allow developers to look at the records, identify differences, and merge changes wherever necessary. Versioning helps in monitoring applications and ensuring quality. It also makes it easy for new team members to download the current version and keep track of it.

Why Version Control 

  • The accuracy of a model varies as you update the dataset and tinker with different parts of the model. With versioning, developers can compare versions, pick out the best model, and weigh its tradeoffs.
  • A machine learning model can fail for several reasons, for example after adding more data or incorporating performance improvement measures. In case of such failures, model versioning helps in quickly reverting to the previous working version.
  • Machine learning models can be very complex. Factors such as datasets, training and testing procedures, and frameworks, among others, account for a model’s success. Version control helps keep track of these dependencies.
  • Major updates to machine learning models are usually not rolled out all at once. To ensure better performance and fault tolerance, ML models are released in phases. Versioning allows the deployment of the right versions at the right time.
  • Model versioning is an essential component of AI/ML governance for organisations to control access, implement policy, and track model activity.
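
The points above can be made concrete with a small sketch. The following is a minimal, in-memory illustration, not any particular tool's API: the `ModelRegistry` and `ModelVersion` names, their fields, and the dataset identifiers are all hypothetical, chosen only to show how recording each version makes comparison and rollback straightforward.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    dataset_id: str          # which dataset trained this version (hypothetical ID)
    hyperparameters: dict
    accuracy: float          # validation metric recorded at training time


@dataclass
class ModelRegistry:
    versions: list = field(default_factory=list)
    deploy_history: list = field(default_factory=list)  # version numbers, oldest first

    def register(self, dataset_id, hyperparameters, accuracy):
        """Store a newly trained model under the next version number."""
        mv = ModelVersion(len(self.versions) + 1, dataset_id, hyperparameters, accuracy)
        self.versions.append(mv)
        return mv

    def best(self):
        """Return the version with the highest recorded accuracy."""
        return max(self.versions, key=lambda v: v.accuracy)

    def deploy(self, version):
        if not any(v.version == version for v in self.versions):
            raise ValueError(f"unknown version {version}")
        self.deploy_history.append(version)

    def rollback(self):
        """Drop the latest deployment and return the version now live."""
        if len(self.deploy_history) < 2:
            raise RuntimeError("no earlier deployment to roll back to")
        self.deploy_history.pop()
        return self.deploy_history[-1]

    @property
    def deployed(self):
        return self.deploy_history[-1] if self.deploy_history else None


registry = ModelRegistry()
registry.register("reviews-2023-01", {"lr": 0.01}, accuracy=0.91)
registry.register("reviews-2023-02", {"lr": 0.01}, accuracy=0.87)  # more data, worse result
registry.deploy(1)
registry.deploy(2)
registry.rollback()  # version 2 underperforms, so return to version 1
```

A production registry would persist this information and store the model files themselves; the sketch only shows why keeping a record per version makes reverting a one-line operation.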

Tools 

Git: Git is the de facto standard version control system, used across the board to track software development and deployment. Git tracks changes made to the code and helps in implementing, storing, and merging them.

That said, Git also comes with a few drawbacks for ML work. It is a challenge to keep all the folders in sync in Git, because model checkpoints and datasets occupy the bulk of the space. Many users instead store the datasets in cloud storage such as Amazon S3, keep reproducible code in Git, and generate models on the fly. But working with multiple datasets breeds confusion. Further, improper documentation of data changes and upgrades can result in the model losing its context.
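
One way to reduce that loss of context is to commit a small manifest to Git alongside the code while the data itself stays in cloud storage. The sketch below assumes a hypothetical manifest layout and a placeholder bucket URI (`s3://example-bucket/...`); it uses only the Python standard library and does not talk to S3.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(dataset_path, remote_uri, hyperparameters, manifest_path):
    """Record where the data lives, what it hashes to, and how it was used."""
    manifest = {
        "dataset_uri": remote_uri,  # placeholder URI, not a real bucket
        "dataset_sha256": fingerprint(dataset_path),
        "hyperparameters": hyperparameters,
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def dataset_matches(dataset_path, manifest_path):
    """Check whether a local copy of the data is the one the manifest describes."""
    manifest = json.loads(Path(manifest_path).read_text())
    return fingerprint(dataset_path) == manifest["dataset_sha256"]


# Demonstration with a throwaway file standing in for a real dataset.
workdir = Path(tempfile.mkdtemp())
data = workdir / "train.csv"
data.write_text("text,label\ngood,1\nbad,0\n")
manifest_file = workdir / "manifest.json"
write_manifest(data, "s3://example-bucket/train.csv", {"lr": 0.01}, manifest_file)

unchanged = dataset_matches(data, manifest_file)   # data is as recorded
data.write_text("text,label\ngood,1\nbad,0\nfine,1\n")
changed = dataset_matches(data, manifest_file)     # data was silently edited
```

Because the manifest is a small text file, it fits comfortably in Git, and every commit then states exactly which data and settings it expects.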

DVC: Data Version Control (DVC) works as an extension to Git. It combines Git with ML-specific functionality for data management. DVC can run on top of any Git repository and is compatible with any standard Git server or provider. DVC also offers the advantages of a distributed version control system, such as lock-free operation, local branching, and versioning.

Pachyderm: Pachyderm delivers robust data versioning and data lineage to the machine learning loop. It also provides a flexible pipeline system that can use any tool or framework in the transformation steps. Pachyderm uses containers to execute the different pipeline steps and solves data provenance issues by tracking data commits and optimising the pipeline.

Machine Learning Metadata (MLMD): MLMD is a recently introduced library from the TensorFlow team that tracks the full lineage of an entire ML workflow. The complete lineage includes steps such as data ingestion, preprocessing, validation, training, and deployment. MLMD can be used to trace bad models back to the datasets they were trained on.
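
The idea of tracing a bad model back to its datasets can be illustrated with a toy lineage graph. This is a conceptual sketch only and does not use or mirror the MLMD API: the `LineageGraph` class, its methods, and the artifact names are hypothetical.

```python
from collections import defaultdict


class LineageGraph:
    """Records which artifacts each pipeline step consumed and produced."""

    def __init__(self):
        self._inputs = defaultdict(set)  # artifact -> artifacts it was built from
        self._produced_by = {}           # artifact -> name of the step that made it

    def record(self, step, inputs, outputs):
        for out in outputs:
            self._inputs[out].update(inputs)
            self._produced_by[out] = step

    def upstream(self, artifact):
        """Return every artifact that `artifact` depends on, directly or indirectly."""
        seen = set()
        stack = [artifact]
        while stack:
            current = stack.pop()
            for parent in self._inputs.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = LineageGraph()
graph.record("ingestion", ["raw/reviews.csv"], ["dataset/v3"])
graph.record("preprocessing", ["dataset/v3"], ["features/v3"])
graph.record("training", ["features/v3", "config/lr-0.01"], ["model/v7"])

# If model/v7 misbehaves, these are the artifacts worth inspecting.
suspects = graph.upstream("model/v7")
```

Walking the graph backwards from the model yields the features, configuration, dataset, and raw file it was built from, which is the kind of question a lineage tracker is meant to answer.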

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.
