
Branded Content

The Indispensable MLOps Engineer in an ML Lifecycle

“If you don’t have MLOps, you don’t have proper automation which reduces the speed of any deployment or development.”

In a data-driven innovation landscape where the synergy between machine learning and operations (MLOps) has become a crucial element, the role of the MLOps engineer has continuously evolved. What was once a side task for data scientists and engineers, training and deploying machine learning models in production, has grown into a dedicated, specialised role. 

“With ML models now solving a number of complex queries, organisations are forced to use more such models in a production environment, which demanded the creation of specific skills or roles to manage, deploy and monitor these models, thus MLOps materialised,” said Steve George, MLOps Manager at Tredence. 

“At Tredence, we have a dedicated vertical specialised in MLOps that not only gathers client requirements but also helps clients understand the importance of the MLOps lifecycle in their workflow,” he said.  

Missing Piece in the Machine Learning Puzzle

In complete alignment with the quote that refers to MLOps as the missing piece in the machine learning puzzle, George believes that MLOps leverages the entire ML lifecycle. “If you don’t have MLOps, you don’t have proper automation which reduces the speed of any deployment or development.” 

“Furthermore, MLOps ensures that the deployed models are running smoothly in the production environment, by continuously monitoring them,” said George. 

Automation Is the Key

Automation is believed to be a “critical component of modern machine learning operations” that helps speed up the training and deployment process. Automated development brings a number of advantages to the table, such as consistency in model training, less time to market new models, cost savings, and governance. “With the help of automation, hyperparameter tuning (choosing the values MLOps engineers need to assign to each of the parameters while training the model) can be performed with ease, and developers can leverage this to select the best-fit model,” said George.  
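The hyperparameter tuning George describes can be sketched as an exhaustive grid search: try every parameter combination, score each one, and keep the best. This is a minimal, library-free illustration; the `train_fn` name and the toy scoring function are assumptions, standing in for a real training-and-validation run.

```python
from itertools import product

def grid_search(train_fn, param_grid):
    """Try every combination in param_grid and keep the best-scoring one.

    train_fn(params) is assumed to return a validation score
    (higher is better); in practice it would train and evaluate a model.
    """
    best_params, best_score = None, float("-inf")
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        score = train_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function standing in for a real training run: the best
# combination here is lr=0.01, depth=4 by construction.
best, score = grid_search(
    lambda p: -abs(p["lr"] - 0.01) - abs(p["depth"] - 4),
    {"lr": [0.001, 0.01, 0.1], "depth": [2, 4, 8]},
)
```

In production, frameworks typically replace this brute-force loop with smarter search strategies (random or Bayesian search), but the select-the-best-fit logic is the same.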

Furthermore, George emphasises that deploying retrained models is even more important than the retraining process itself. A rollback mechanism should be in place in case the retrained model performs poorly or runs into trouble, and deployment should not interrupt the current workflow. “Additionally, I prefer to have a manual intervention to allow the new model to be deployed,” he said. 
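The two safeguards mentioned, a rollback path and a manual approval gate, can be sketched as a single deployment decision. All names here (`evaluate`, `approve`, the tuple-shaped models) are illustrative assumptions, not any particular platform's API.

```python
def deploy_with_rollback(current_model, candidate_model, evaluate, approve):
    """Promote the candidate only if it beats the live model and a human
    approves; otherwise keep serving (i.e. roll back to) the current model."""
    if evaluate(candidate_model) < evaluate(current_model):
        return current_model          # rollback: retrained model underperforms
    if not approve(candidate_model):  # manual intervention gate
        return current_model
    return candidate_model

# Toy usage: models are (name, accuracy) tuples.
live = deploy_with_rollback(
    ("v1", 0.90), ("v2", 0.93),
    evaluate=lambda m: m[1],
    approve=lambda m: True,
)
```

Because the current model is returned untouched on either failure path, the existing workflow keeps serving traffic throughout, matching the requirement that deployment not interrupt it.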

Deep Learning and Statistical Model Monitoring

Model monitoring is one of the most important pillars of MLOps, and the approach to it has changed with time. “Earlier, when I started my journey in MLOps, we used to monitor only the infrastructure of the model to check if the compute of each deployed model is over-utilised (indicating performance issues) or under-utilised (cost is not managed). Now, we need to look at two aspects for model monitoring: model performance and data drift,” said George. 

For model performance, the accuracy of the models is checked against a threshold, and for data drift, deviation of the input data from the training data is checked. Both statistical and deep learning methods can facilitate this, and Tredence has an in-house product to accomplish the task. “We have our own in-house product called MLWorks, which is both platform and cloud agnostic, and helps capture data metrics and the model’s performance. This product works on statistical methods; however, based on client requirements, either of the methods is used. 
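The two monitoring checks described, accuracy against a threshold and input-data drift, can be sketched with the standard library. The drift test below is a deliberately simple mean-shift measured in units of the training standard deviation; it is a stand-in for the statistical tests a product like MLWorks might actually run (e.g. a Kolmogorov-Smirnov test), and every name and threshold is an illustrative assumption.

```python
import statistics

def check_model(accuracy, threshold, train_sample, live_sample, drift_limit=0.5):
    """Return a list of alerts for a deployed model.

    Flags "performance" if accuracy drops below the threshold, and
    "data_drift" if the live inputs' mean has shifted from the training
    data's mean by more than drift_limit training standard deviations.
    """
    alerts = []
    if accuracy < threshold:
        alerts.append("performance")
    shift = abs(statistics.mean(live_sample) - statistics.mean(train_sample))
    if shift / statistics.stdev(train_sample) > drift_limit:
        alerts.append("data_drift")
    return alerts
```

For example, `check_model(0.91, 0.85, [1, 2, 3, 4, 5], [4, 5, 6, 7, 8])` raises only a drift alert: accuracy is fine, but the live inputs have shifted well away from the training distribution.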

“Deep learning is always a kind of a black box thing. Statistical models are pure maths, where you can explain, but deep learning is a bit difficult to explain, though the outcome may be much better than the statistical method,” said George.  

Model Governance for Ethical AI Usage

Effective model governance helps organisations mitigate the risks associated with AI and machine learning. Model governance is done in two stages, first in pre-deployment and then in post-deployment. “In pre-deployment, there are a few scenarios to be considered – model explainability, bias and fairness, model reproducibility, and model versioning, where different iterations of model versions are created, which helps track changes and improvements made during training. During post-deployment, model accessibility, retraining workflows, and model monitoring are required,” said George.  
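The model-versioning piece of governance can be sketched as a tiny registry: each registration records the artifact along with the metadata needed for reproducibility and tracking. This is a hypothetical, minimal design, not the API of any particular model registry.

```python
class ModelRegistry:
    """Minimal model-versioning sketch: every registered model gets an
    auto-incremented version plus governance metadata, so iterations can
    be compared and changes tracked across training runs."""

    def __init__(self):
        self._versions = []

    def register(self, artifact, training_data_hash, metrics):
        version = len(self._versions) + 1
        self._versions.append({
            "version": version,
            "artifact": artifact,
            "training_data_hash": training_data_hash,  # ties model to exact data
            "metrics": metrics,
        })
        return version

    def get(self, version):
        return self._versions[version - 1]

# Toy usage: register two training iterations of the same model.
registry = ModelRegistry()
registry.register("churn_model.pkl", "sha256:ab12", {"accuracy": 0.90})
registry.register("churn_model.pkl", "sha256:cd34", {"accuracy": 0.92})
```

Storing the training-data hash alongside each version is what makes the reproducibility check possible: any deployed version can be traced back to the exact data it was trained on.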

ML Engineers and Data Scientists Collaboration at Tredence

George believes that effective collaboration between data scientists, ML engineers, and MLOps engineers is required for the successful completion of a project. “At Tredence, we take this seriously. We have communication channels such as Teams, Slack, etc., where all participants can share ideas and address any issues. For project management, we follow an agile methodology where team members discuss and identify the best way to execute any project.” 

“At Tredence, everyone likes to learn. Being a young company, everyone wants to learn and improve their cross-functional and domain knowledge, which makes collaboration easier and more effective. We even conduct sessions where we encourage people to attend, so that they understand the pace of MLOps.” 

The Path That Lies Ahead 

With continuous development simplifying processes, the next phase of any workflow is to operationalise it. “Considering how we have large language models and auto ML now, LLM Ops and Auto MLOps is the next thing,” said George. However, LLMs come with their own challenges. 

Four main challenges in MLOps can be broadly classified as data quality and versioning, model versioning and management, infrastructure scalability and optimisation, and security and compliance. Inconsistent and unclean data necessitates data pipelines for quality control and versioning to track changes. Frequent model updates require versioning to compare models and ensure the correct ones are deployed. Infrastructure setup has to be balanced between model performance and cost, with auto-scaling as an effective tool. Finally, protecting data, models, and infrastructure, through collaboration with administrators and adherence to security practices, is critical.

“I am currently working on an LLM ops project, and I am trying to identify the best infrastructure that will support this particular workflow. For LLM you need a really good compute, which is something we are trying to figure out. So, it’s all about learning,” concluded George. 

Contributed as part of AIM Branded Content.

This article is contributed by

Vandana Nair

With a rare blend of engineering, MBA, and journalism degrees, Vandana Nair brings a unique combination of technical know-how, business acumen, and storytelling skills to the table. Her insatiable curiosity for all things startups, business, and AI technologies ensures that there's always a fresh and insightful perspective to her reporting.
