The past few years have witnessed remarkable advancements in machine learning. Machine learning operations (MLOps) is therefore becoming integral to data science project implementation. Through this discipline, companies can generate long-term value and reduce the risks associated with AI/ML.
MLOps refers to a set of practices and tools for deploying ML models in production. Here are 10 papers that can serve as go-to resources on MLOps.
Let’s dive in!
- Machine Learning: The High-Interest Credit Card of Technical Debt
Author(s): D. Sculley et al.
Machine learning offers a powerful toolkit for building complex systems quickly. However, this paper argues that these quick wins don’t come for free. Using the framework of technical debt, the researchers note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying ML.
This paper aims to highlight ML-specific risk factors and patterns to avoid. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, and a variety of system-level anti-patterns.
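As a toy illustration of one such anti-pattern, the sketch below shows a hidden feedback loop (all names and thresholds here are hypothetical, not from the paper): a threshold model decides which examples get labelled, so retraining only ever sees data the model itself selected, and the initial bias can never be corrected.

```python
import random

random.seed(0)

def true_label(x):
    # Ground truth: an example is positive when x > 0.5.
    return x > 0.5

def retrain(samples):
    # Fit the threshold to whatever labelled samples we observed.
    positives = [x for x, y in samples if y]
    return min(positives) if positives else 1.0

# A simple "model": predict positive when x exceeds a learned threshold.
threshold = 0.6  # initial, slightly miscalibrated

# Hidden feedback loop: only examples the current model flags as positive
# are ever sent for labelling, so retraining only sees the model's own picks.
for _ in range(5):
    batch = [random.random() for _ in range(100)]
    flagged = [x for x in batch if x > threshold]      # model filters the data
    labelled = [(x, true_label(x)) for x in flagged]   # only flagged items labelled
    threshold = retrain(labelled)

# The threshold can never fall back toward the true boundary of 0.5,
# because examples in (0.5, 0.6] are never labelled: the bias reinforces itself.
print(round(threshold, 3))
```

Breaking such loops usually requires labelling a random holdout slice that bypasses the model entirely.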
Read the full paper here.
- Machine Learning Operations (MLOps): Overview, Definition, and Architecture
Author(s): Dominik Kreuzberger et al.
MLOps remains a vaguely defined term, and its implications for researchers and practitioners are ambiguous. To address this gap, the authors conducted mixed-method research to provide an aggregated overview of the necessary principles, components, and roles, along with the associated architecture and workflows.
The paper guides ML researchers and practitioners who want to automate and operate ML products with a set of technologies.
Read the full paper here.
- Operationalizing Machine Learning: An Interview Study
Author(s): Shreya Shankar et al.
Organisations rely on machine learning engineers (MLEs) to deploy and maintain ML pipelines in production. Through semi-structured, ethnographic interviews with 18 MLEs working across many applications, the researchers sought to understand the unaddressed challenges and their implications for tool builders.
The researchers summarised common practices for successful ML experimentation, deployment, and sustaining production performance. Furthermore, they discuss interviewees’ pain points and anti-patterns, with implications for tool design.
Read the full paper here.
- How to avoid machine learning pitfalls: a guide for academic researchers
Author(s): Michael A. Lones
The paper provides a concise outline of common errors that occur when applying ML techniques and ways to avoid them. It is intended primarily as a guide for research students and focuses on issues of particular concern within academic research, such as the need to make rigorous comparisons and reach valid conclusions.
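One pitfall such guides commonly warn against is letting test data leak into the training process. A minimal sketch with made-up data (the variable names are illustrative, not from the paper) shows the difference between normalising before and after the train/test split:

```python
import random

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(100)]
train, test = data[:80], data[80:]

# Pitfall: normalising with statistics computed over ALL data lets
# information about the test set leak into training.
full_mean = sum(data) / len(data)
leaky_train = [x - full_mean for x in train]

# Safer: compute normalisation statistics on the training split only,
# then reuse those same statistics for the test split.
train_mean = sum(train) / len(train)
clean_train = [x - train_mean for x in train]
clean_test = [x - train_mean for x in test]

print(full_mean != train_mean)  # True: the two pipelines differ
```

The gap between the two means is small here, but with skewed real-world data the leak can meaningfully inflate reported test performance.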
Read the full paper here.
- Quality issues in Machine Learning Software Systems
Author(s): Pierre-Olivier Côté, Amin Nikanjam, Rached Bouchoucha, Foutse Khomh
Machine learning models are implemented as software components and deployed in Machine Learning Software Systems (MLSSs). Quality assurance of these MLSSs is therefore crucial, because poor model decisions can cause other systems to malfunction and lead to significant financial losses.
This paper investigates the characteristics of real quality issues in MLSSs from the practitioner’s viewpoint. Through interviews with ML practitioners, the paper identifies a list of bad practices related to poor quality in MLSSs.
Read the full paper here.
- Training Transformers Together
Author(s): Alexander Borzunov et al.
Training state-of-the-art models is often expensive and only affordable for large corporations and institutions.
In this demonstration, the researchers collaboratively trained a text-to-image transformer similar to OpenAI’s DALL-E and showed that the resulting model generates images of reasonable quality for a variety of prompts.
Read the full paper here.
- A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts
Author(s): Konstantin Grotov, Sergey Titov et al.
In this work, the researchers compare Python code written in Jupyter Notebooks with code in traditional Python scripts. The objective is to lay the groundwork for studying notebook-specific problems that should be addressed by dedicated tooling, and to provide insights useful in that direction.
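The kind of structural comparison the study performs can be sketched, in highly simplified form, with Python’s standard `ast` module. The metrics below are illustrative stand-ins, not the paper’s actual measures:

```python
import ast

def structure_stats(code: str) -> dict:
    # Toy stand-ins for structural metrics one might compare:
    # function definitions and import statements per snippet.
    tree = ast.parse(code)
    return {
        "functions": sum(isinstance(n, ast.FunctionDef) for n in ast.walk(tree)),
        "imports": sum(isinstance(n, (ast.Import, ast.ImportFrom)) for n in ast.walk(tree)),
    }

# Notebook cells often hold flat, exploratory statements...
notebook_cell = "import pandas as pd\ndf = pd.DataFrame({'a': [1, 2]})\ndf.describe()"
# ...while scripts tend to wrap logic in functions and entry points.
script = "import sys\n\ndef main():\n    print(sys.argv)\n\nif __name__ == '__main__':\n    main()"

print(structure_stats(notebook_cell))  # {'functions': 0, 'imports': 1}
print(structure_stats(script))         # {'functions': 1, 'imports': 1}
```

Because `ast.parse` only parses the source, the snippets’ dependencies (e.g. pandas) never need to be installed to compute such metrics at scale.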
Read the full paper here.
- Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
Author(s): Mark Zhao, Niket Agarwal, Aarti Basant et al.
This paper presents Meta’s end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that eliminates data stalls.
The researchers characterise how multiple models are collaboratively trained across data centres via continuous training, and measure the substantial network, memory, and compute resources each training job requires to pre-process samples during training. The paper’s key takeaways include:
- Identifying hardware bottlenecks.
- Discussing opportunities for DSI hardware.
- Deploying lessons learned in optimising DSI infrastructure.
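The core idea behind eliminating data stalls is to overlap sample pre-processing with training so the trainer never waits on input data. This can be sketched with a bounded producer/consumer queue; the function names below are hypothetical stand-ins, not Meta’s actual service API:

```python
import queue
import threading
import time

def preprocess(sample):
    # Stand-in for CPU-heavy feature extraction.
    time.sleep(0.01)
    return sample * 2

def train_step(batch):
    # Stand-in for a GPU training step.
    time.sleep(0.01)
    return sum(batch)

def producer(samples, out_q, batch_size=4):
    # Pre-process samples ahead of the trainer and buffer ready batches.
    batch = []
    for s in samples:
        batch.append(preprocess(s))
        if len(batch) == batch_size:
            out_q.put(batch)
            batch = []
    out_q.put(None)  # sentinel: no more batches

samples = list(range(16))
ready_batches = queue.Queue(maxsize=2)  # bounded buffer of ready batches
t = threading.Thread(target=producer, args=(samples, ready_batches))
t.start()

# While one batch trains, the producer thread is already preparing the next,
# so pre-processing latency is hidden instead of stalling the trainer.
losses = []
while True:
    batch = ready_batches.get()
    if batch is None:
        break
    losses.append(train_step(batch))
t.join()
print(len(losses))  # 4 batches trained
```

In a real DSI pipeline the producer side is a fleet of pre-processing workers feeding many trainers over the network, but the buffering principle is the same.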
Read the full paper here.
- The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
Author(s): Jeffrey Dean
This paper discusses machine learning advancements and their implications on the kinds of computational devices we need to build, especially in the post-Moore’s Law era. It also discusses how machine learning may help with aspects of the circuit design process.
It also outlines one promising direction: large multi-task models that are sparsely activated and employ more dynamic example- and task-based routing than today’s machine learning models.
Read the full paper here.
- Asset Management in Machine Learning: A Survey
Author(s): Samuel Idowu, Daniel Strüber, Thorsten Berger
The paper presents a survey of 17 tools with ML asset management support, identified through a systematic search. The authors review these tools’ features for managing the different types of assets used in engineering ML-based systems and running experiments.
The survey concludes that most asset management support builds on traditional version control systems, and that only a few tools offer a level of asset granularity that distinguishes between key ML assets, such as datasets and models.
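A minimal sketch of what finer-grained asset management might look like: content-addressed version identifiers computed per asset rather than per repository snapshot. The `fingerprint` helper and registry layout are illustrative, not any surveyed tool’s API:

```python
import hashlib
import json

def fingerprint(obj) -> str:
    # Hash a JSON-serialisable asset (dataset rows, hyperparameters, ...)
    # so any change to its content yields a new version identifier.
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

# Track assets at the granularity the survey calls for: datasets and
# hyperparameters versioned individually, not as one opaque snapshot.
registry = {}
dataset = [{"x": 1.0, "y": 0}, {"x": 2.0, "y": 1}]
params = {"lr": 0.01, "epochs": 10}

registry["dataset@" + fingerprint(dataset)] = dataset
registry["params@" + fingerprint(params)] = params

# Editing the dataset produces a distinct version key, while the
# unchanged hyperparameters keep theirs.
dataset2 = dataset + [{"x": 3.0, "y": 1}]
print(fingerprint(dataset) != fingerprint(dataset2))  # True
print(len(registry))  # 2 registered asset versions
```

Deterministic serialisation (`sort_keys=True`) matters here: the same logical asset must always hash to the same identifier.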
Read the full paper here.