MLOps – “the Why” and “the What”

This article outlines the motivation behind MLOps, its relation to DevOps, and the different components that comprise an MLOps framework. The article is arranged as follows.

  1. MLOps motivation
  2. MLOps challenges similar to DevOps
  3. MLOps challenges different from DevOps
  4. MLOps components

1) MLOps Motivation

Machine Learning (ML) models built by data scientists represent only a small fraction of the components that comprise an enterprise production deployment workflow, as illustrated in Fig 1 [1] below. To operationalize ML models, data scientists must work closely with multiple other teams such as business, engineering and operations. This presents organizational challenges in terms of communication, collaboration and coordination. The goal of MLOps is to streamline such challenges with well-established practices. Additionally, MLOps brings the agility and speed that are a cornerstone of today's digital world.

Fig 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small box in the middle. The required surrounding infrastructure is vast and complex.

2) MLOps challenges similar to DevOps

The challenges of ML model operationalization have a lot in common with software productionisation, where DevOps has proven itself.

Therefore, adopting best practices from DevOps is a prudent approach to help data scientists overcome challenges common to software productionisation. For example, the agile methodology promoted by DevOps, in contrast to the waterfall methodology, is an efficiency boost. Additional DevOps practices used in MLOps are listed in Table 1.

Table 1: MLOps leveraging DevOps

Challenge 1: Continuous integration and continuous delivery (CI/CD). Set up a pipeline such that updates are continuously built and made production-ready accurately, securely and seamlessly.
Solution from DevOps: use a CI/CD framework to build, test and deploy software. It offers the benefits of reproducibility, security and code version control.

Challenge 2: Long development-to-deployment lifecycle. Data scientists develop models/algorithms and hand them over to operations to deploy into production. Lack of coordination and improper handoffs between the two parties lead to delays and errors.

Challenge 3: Ineffective communication between teams delays the final solution. The evaluation of an ML solution usually comes towards the end of the project lifecycle. With development teams working in silos, the solution becomes a black box to other stakeholders, and this is worsened by the lack of intermediate feedback. Together these pose significant challenges in terms of time, effort and resources.

Solution from DevOps (for challenges 2 and 3): agile methodology solves the coordination problem by enforcing end-to-end pipeline setup at the initial stage itself. Agile divides the project into sprints; in each sprint, developers deliver incremental features that are ready for deployment. The output of each sprint (built using the pipelines) is visible to every team member from a very early stage of the project. The risk of last-minute surprises therefore shrinks, and early feedback becomes common practice. In industry parlance, this "shifts left" the coordination issues.

3) MLOps challenges different from DevOps

In industry parlance, MLOps is DevOps for ML. While that is true to an extent, there are challenges specific to ML that MLOps platforms must address.

An example of such a challenge is the role of data. In traditional software engineering (i.e. software 1.0), developers write logic/rules (as code) that is well defined in the program space, as illustrated in Fig 2 [2]. In machine learning (i.e. software 2.0), by contrast, data scientists write code that defines how to use parameters to solve a business problem.

The parameter values are found using data (with techniques such as gradient descent). These values may change with different versions of the data, thereby changing the code's behavior. In other words, data plays an equally important role as the written code in defining the output, and the two can change independently of each other. This adds data, alongside the model code, as an intrinsic part of the software that needs to be defined and tracked.

Fig 2: Software 1.0 vs Software 2.0
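The point above can be made concrete with a toy sketch (hypothetical code, not from the article): the same fitting code produces different parameter values when run on different versions of the data, so a data change alone changes the model's behavior even though the source code is untouched.

```python
def fit_line(xs, ys, lr=0.01, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error.
    The code defines *how* to find parameters; the data decides their values."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Identical code, two "versions" of the data -> two different models.
w1, b1 = fit_line([0, 1, 2, 3], [0, 2, 4, 6])   # data v1: y = 2x
w2, b2 = fit_line([0, 1, 2, 3], [1, 2, 3, 4])   # data v2: y = x + 1
```

This is why MLOps must track data versions with the same rigor that DevOps tracks code versions.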

The various ML-specific challenges that an MLOps platform needs to address are listed in Table 2.

Table 2: ML specific challenges

1) Data and hyper-parameter versioning: In traditional software applications, code versioning tools are used to track changes. Version control is a prerequisite for any continuous integration (CI) solution, as it enables reproducibility in a fully automated fashion: any change in source code triggers the CI/CD pipeline to build, test and deliver production-ready code. In machine learning, the output model can change if the algorithm code, the hyper-parameters or the data change. While code and hyper-parameters are controlled by developers, changes in data may not be. This warrants versioning of data and hyper-parameters in addition to algorithm code. Note that data versioning is a challenge for unstructured data such as images and audio, and MLOps platforms adopt unique approaches to it.

2) Iterative development and experimentation: ML algorithm and model development is iterative and experimental, requiring a lot of parameter tuning and feature engineering. ML pipelines work with data versions, algorithm code versions and/or hyper-parameters. Any independent change in these artefacts triggers new deployable model versions that warrant experimentation and metric calculations. MLOps platforms track the complete lineage of these artefacts.

3) Testing: Machine learning requires data and model testing to detect problems as early in the ML pipeline as possible.
a) Data validation: check that the data is clean, with no anomalies, and that new data conforms to the prior distribution.
b) Data preprocessing: check that data is preprocessed efficiently and in a scalable manner, and avoid any training-serving skew [3].
c) Algorithm validation: track classification/regression metrics appropriate to the business problem, and ensure algorithm fairness.

4) Security: ML models in production are often part of a larger system where their output is consumed by applications that may or may not be known. This exposes multiple security risks. MLOps needs to provide security and access control so that the outputs of ML models are used by known users only.

5) Production monitoring: Models in production require continuous monitoring to make sure they perform as expected as they process new data. There are multiple dimensions of monitoring, such as covariate shift and prior shift, among others.

6) Infrastructure requirements: ML applications need scale and compute power that translate into complex infrastructure. For example, GPUs may be necessary during experimentation, and production may need to scale dynamically.
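The versioning challenge in item 1 can be sketched in a few lines (a hypothetical illustration; real tools for this purpose are more sophisticated): content-addressing the data together with the hyper-parameters yields a fingerprint that changes whenever either artefact changes, which is exactly the trigger an ML-aware CI/CD pipeline needs.

```python
import hashlib
import json

def artifact_fingerprint(data_bytes: bytes, hyperparams: dict) -> str:
    """Derive a version id from the exact data content and hyper-parameters.
    Any change in either yields a new id, so a new model version is warranted."""
    h = hashlib.sha256()
    h.update(data_bytes)
    # Canonical JSON so that key order does not affect the hash.
    h.update(json.dumps(hyperparams, sort_keys=True).encode())
    return h.hexdigest()[:12]

v1 = artifact_fingerprint(b"col1,col2\n1,2\n", {"lr": 0.01, "depth": 3})
v2 = artifact_fingerprint(b"col1,col2\n1,2\n", {"lr": 0.02, "depth": 3})  # hyper-parameter changed
```

Here `v1 != v2` even though the code is identical, mirroring how a data or hyper-parameter change alone should produce a new tracked model version.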

4) MLOps Components

With this background on MLOps and its similarities to, as well as differences from, DevOps, the following describes the different components that comprise an MLOps framework, as shown in Fig 3. The underlying theme of the workflow is the agile methodology, as indicated in Section 2.

Fig 3: MLOps Framework

1. Use case discovery: This stage involves collaboration between business and data scientists to define a business problem and translate it into a problem statement and objectives solvable by ML, with associated relevant KPIs (Key Performance Indicators).

2. Data engineering: This stage involves collaboration between data engineers and data scientists to acquire data from various sources and prepare the data (processing/validation) for modeling.

3. Machine learning pipeline: This stage involves designing and deploying a pipeline integrated with CI/CD. Data scientists use pipelines for repeated experimentation and testing. The platform keeps track of data and model lineage, and the associated KPIs, across experiments.
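A minimal sketch of the lineage tracking described above (the class and field names are illustrative, not a real platform API): each experiment run records the data version, code version and hyper-parameters that produced its metrics, so any result can be traced back to, and reproduced from, its artefacts.

```python
class ExperimentTracker:
    """Toy lineage log: each run records the artefact versions behind a model,
    so experiments are reproducible and comparable."""

    def __init__(self):
        self.runs = []

    def log_run(self, data_version, code_version, hyperparams, metrics):
        self.runs.append({
            "data_version": data_version,
            "code_version": code_version,
            "hyperparams": hyperparams,
            "metrics": metrics,
        })

    def best_run(self, metric):
        # Select the run with the highest value of the given KPI.
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run("data-v1", "algo-abc123", {"lr": 0.01}, {"auc": 0.81})
tracker.log_run("data-v2", "algo-abc123", {"lr": 0.05}, {"auc": 0.86})
best = tracker.best_run("auc")   # full lineage of the winning model
```

Production MLOps platforms provide this bookkeeping as a service, but the principle is the same: no metric without its lineage.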

4. Production deployment: This stage accounts for secure and seamless deployment onto the production platform of choice, be it public cloud, on-premises or hybrid.

5. Production monitoring: This stage includes both model and infrastructure monitoring. Models are continuously monitored against configured KPIs, such as changes in input data distribution or degradation in model performance. Triggers are set to launch new experiments with new algorithms, data and/or hyper-parameters, generating a new version of the ML pipeline. Infrastructure is monitored for memory and compute requirements and scaled as needed.
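One common way to detect a change in input data distribution (covariate shift) is a two-sample test between training-time and serving-time feature values. The sketch below uses the Kolmogorov–Smirnov statistic; the threshold and variable names are illustrative assumptions, not prescribed by any particular platform.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

training = [i / 100 for i in range(100)]        # feature values seen at training time
serving = [i / 100 + 0.5 for i in range(100)]   # same feature, shifted in production
drift = ks_statistic(training, serving)

ALERT_THRESHOLD = 0.2          # hypothetical; tuned per feature in practice
needs_retraining = drift > ALERT_THRESHOLD
```

When `needs_retraining` fires, the trigger described above would kick off new experimentation with fresh data, producing a new pipeline version.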



References

1) D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J. Crespo and D. Dennison, "Hidden technical debt in machine learning systems", in Advances in Neural Information Processing Systems 28 (NeurIPS 2015), Montreal, Quebec, Canada, 2015, pp. 2503–2511.

2) A. Karpathy, "Software 2.0", November 12, 2017.

3) E. Breck, M. Zinkevich, N. Polyzotis, S. Whang and S. Roy, "Data validation for machine learning", in Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019.


Arnab Bose
Dr. Arnab Bose is Chief Scientific Officer at Abzooba, a data analytics company and an adjunct faculty at the University of Chicago where he teaches Machine Learning and Predictive Analytics, Machine Learning Operations, Time Series Analysis and Forecasting, and Health Analytics in the Master of Science in Analytics program. He is a 20-year predictive analytics industry veteran who enjoys using unstructured and structured data to forecast and influence behavioral outcomes in healthcare, retail, finance and transportation. His current focus areas include health risk stratification and chronic disease management using machine learning, and production deployment and monitoring of machine learning models. Arnab has published book chapters and refereed papers in numerous Institute of Electrical and Electronics Engineers (IEEE) conferences & journals. He has received Best Presentation at American Control Conference and has given talks on data analytics at universities and companies in US, Australia, and India. Arnab holds MS and Ph.D. degrees in electrical engineering from the University of Southern California, and a B.Tech. in electrical engineering from the Indian Institute of Technology at Kharagpur, India.
