Top 10 Papers to Learn About MLOps

Your new favourite go-to resource about MLOps!

The past few years have witnessed remarkable advances in machine learning. Machine learning operations (MLOps) has therefore become integral to implementing data science projects. Through this practice, companies can generate long-term value and reduce the risk associated with AI/ML.

MLOps refers to a set of practices and tools for deploying and maintaining ML models in production. Here are 10 papers to serve as your new favourite go-to resources on MLOps.
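For readers new to the area, a minimal, hypothetical sketch of the core pattern MLOps tooling automates may help: train a model, persist it as an artifact, and serve it behind an HTTP endpoint. The scikit-learn, joblib, and FastAPI choices below are illustrative stand-ins for whatever stack a team actually uses, not recommendations drawn from the papers that follow.

```python
# Toy "train, package, serve" loop -- the core pattern that MLOps tooling automates.
# Assumes scikit-learn, joblib, and FastAPI are installed; all names are illustrative.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from fastapi import FastAPI

# 1. Train and persist a model artifact (in practice: tracked, versioned, tested).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "model.joblib")

# 2. Serve the artifact (in practice: containerised, monitored, rolled out gradually).
app = FastAPI()
model = joblib.load("model.joblib")

@app.post("/predict")
def predict(features: list[float]) -> dict:
    # Wrap the single sample in a batch of one for scikit-learn's predict().
    return {"prediction": int(model.predict([features])[0])}
```

In a real setting, each of these steps would sit behind automated pipelines, with experiment tracking, CI/CD for models, and production monitoring; that gap between the toy loop and a reliable system is exactly what the papers below address.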

Let’s dive in!

  1. Machine Learning: The High-Interest Credit Card of Technical Debt 

Author(s): D. Sculley et al.

Machine learning offers a powerful toolkit for building complex systems quickly. However, this paper argues that these quick wins do not come for free. Using the framework of technical debt, the researchers note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying ML.

This paper aims to highlight ML-specific risk factors and patterns to avoid. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, and a variety of system-level anti-patterns. 

Read the full paper here

  2. Machine Learning Operations (MLOps): Overview, Definition, and Architecture 

Author(s): Dominik Kreuzberger et al.

MLOps is still a vaguely defined term, and its consequences for researchers and practitioners are ambiguous. To address this gap, the authors conducted mixed-method research to provide an aggregated overview of the necessary principles, components and roles, along with the associated architecture and workflows.

The paper guides ML researchers and practitioners who want to automate and operate ML products with a set of technologies.

Read the full paper here

  3. Operationalizing Machine Learning: An Interview Study

Author(s): Shreya Shankar et al. 

Organisations rely on machine learning engineers (MLEs) to deploy and maintain ML pipelines in production. In semi-structured, ethnographic interviews with 18 MLEs working across many applications, the researchers try to understand the unaddressed challenges and the implications for tool builders. 

The researchers summarised common practices for successful ML experimentation, deployment, and sustaining production performance. Furthermore, they discuss interviewees’ pain points and anti-patterns, with implications for tool design.

Read the full paper here

  4. How to avoid machine learning pitfalls: a guide for academic researchers

Author(s): Michael A. Lones

The paper provides a concise outline of some common errors that occur in the use of ML techniques and ways in which they can be avoided. It is intended primarily as a guide for research students. It focuses on issues of particular concern within academic research, such as the need to make rigorous comparisons and reach valid conclusions.

Read the full paper here

  5. Quality issues in Machine Learning Software Systems

Author(s): Pierre-Olivier Côté, Amin Nikanjam, Rached Bouchoucha, Foutse Khomh

Machine learning models are implemented as software components and deployed in Machine Learning Software Systems (MLSSs). Quality assurance of these MLSSs is therefore essential, because poor-quality decisions can lead to the malfunction of other systems and significant financial losses. 

This paper investigates the characteristics of real quality issues in MLSSs from the practitioner’s viewpoint. Through interviews with ML practitioners, the paper identifies a list of bad practices related to poor quality in MLSSs. 

Read the full paper here

  6. Training Transformers Together

Author(s): Alexander Borzunov et al.

Training state-of-the-art models is often so expensive that only large corporations and institutions can afford it.

In this demonstration, the researchers collaboratively trained a text-to-image transformer similar to OpenAI’s DALL-E, pooling hardware from multiple independent parties. They showed that the resulting model generates images of reasonable quality on a number of prompts.

Read the full paper here

  7. A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts

Author(s): Konstantin Grotov, Sergey Titov et al.

In this work, the researchers compare Python code written in Jupyter Notebooks with code written in traditional Python scripts. The objective is to identify notebook-specific problems that should be addressed by dedicated notebook tooling, and to provide insights useful for building such tools.

Read the full paper here

  8. Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training

Author(s): Mark Zhao, Niket Agarwal, Aarti Basant et al.

This paper presents Meta’s end-to-end data storage and ingestion (DSI) pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that eliminates data stalls. 

The researchers characterise how multiple models are collaboratively trained across data centres via continuous training. They quantify the substantial network, memory, and compute resources required by each training job to pre-process samples during training. The paper’s key takeaways include the following:

  • Identifying hardware bottlenecks.
  • Discussing opportunities for DSI hardware.
  • Sharing lessons learned from deploying and optimising DSI infrastructure.

Read the full paper here
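As a side note, here is a small, hypothetical sketch, not taken from the paper, of the "data stall" pattern such a pre-processing service addresses: if samples are transformed inside the training loop itself, the accelerator idles while the CPU works; preparing batches in a background thread hides that cost. The sleep() calls stand in for real transform and training-step times.

```python
# Illustrative only: overlap data pre-processing with training to avoid "data stalls".
import queue
import threading
import time

def preprocess(sample):
    time.sleep(0.01)          # pretend this is decoding / feature transforms on CPU
    return sample * 2

def producer(samples, q):
    for s in samples:
        q.put(preprocess(s))  # prepares batches ahead of the training loop
    q.put(None)               # sentinel: no more batches

def train(q):
    while (batch := q.get()) is not None:
        time.sleep(0.01)      # pretend this is one optimisation step on the accelerator

q = threading and queue.Queue(maxsize=8)   # bounded buffer of ready-to-train batches
threading.Thread(target=producer, args=(range(100), q), daemon=True).start()
train(q)
```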

  9. The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design

Author(s): Jeffrey Dean

This paper discusses machine learning advancements and their implications for the kinds of computational devices we need to build, especially in the post-Moore’s Law era. It also discusses how machine learning may help with aspects of the circuit design process. 

It also sketches at least one promising direction: large-scale multi-task models that are sparsely activated and employ more dynamic, example- and task-based routing than today’s machine learning models.

Read the full paper here

  10. Asset Management in Machine Learning: A Survey

Author(s): Samuel Idowu, Daniel Strüber, Thorsten Berger

The paper presents a survey of 17 tools with ML asset-management support, identified through a systematic search. The authors give an overview of these tools’ features for managing the different types of assets used when engineering ML-based systems and running experiments. 

The survey concludes that most asset-management support relies on traditional version control systems, and that only a few tools offer an asset granularity that differentiates between essential ML assets such as datasets and models.

Read the full paper here


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.
