Machine learning has become an important cog in the wheel for the functioning of all major companies. Many companies are now building their own machine learning platform. These platforms leverage open source technologies; however, a few functions need customised solutions. For that, companies are investing in building in-house components for their machine learning platform. In this article, we take a look at a few of them.
Introduced in 2017, Uber’s Michelangelo had been in the works for about two years. The goal behind building a proprietary ML-as-a-service platform was to make scaling AI as easy as booking a ride. In Q1 2020, the ride-hailing service completed 1,658 million trips, which meant that the company was sitting on a treasure trove of rich data.
Initially, Uber relied mainly on separate predictive models or smaller systems for individual problems. However, these were short-term solutions that could not keep pace with the speed at which AI systems at Uber were scaling up. Hence, Michelangelo came into being. It is now deployed across several Uber data centres and serves predictions for the company’s most heavily loaded online services.
The platform is built on several open-source systems, including HDFS, Spark, Cassandra, MLlib, XGBoost, Samza, and TensorFlow. Apart from these, Uber has also developed a few of Michelangelo’s components in-house.
Horovod: It is an open-source distributed training framework for TensorFlow, PyTorch, and MXNet. Its job is to make distributed deep learning fast and easy. It uses the ring-allreduce algorithm to combine gradients across workers and requires minimal modifications to the user code. With Horovod, a training script can be scaled up to run on hundreds of GPUs using just a few lines of Python code. Horovod can be installed on-premise or in the cloud; furthermore, it can run on top of Apache Spark, making it possible to unify data processing and model training in a single pipeline. Once configured, the same infrastructure can be used to train models with any of the supported frameworks and to switch between TensorFlow, PyTorch, and MXNet.
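To make the ring-allreduce idea concrete, here is a minimal single-process simulation in plain Python. It is a sketch of the general algorithm, not Horovod's implementation: the function name `ring_allreduce` and the list-of-lists representation of workers are assumptions made for illustration, and real systems exchange the chunks over a network in parallel.

```python
from typing import List


def ring_allreduce(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum ring-allreduce over n workers arranged in a ring.

    Each worker starts with its own gradient vector and ends with the
    element-wise sum of all vectors. (Hypothetical helper for illustration.)
    """
    n = len(worker_grads)
    size = len(worker_grads[0])
    assert size % n == 0, "sketch assumes the vector splits evenly into n chunks"
    chunk = size // n

    # data[w][c] is chunk c of the vector held by worker w.
    data = [[list(g[c * chunk:(c + 1) * chunk]) for c in range(n)]
            for g in worker_grads]

    # Phase 1 (scatter-reduce): in step s, worker w sends chunk (w - s) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step are simultaneous.
        outgoing = [(w, (w - s) % n, list(data[w][(w - s) % n]))
                    for w in range(n)]
        for w, c, payload in outgoing:
            dst = (w + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]

    # Now worker w holds the fully reduced chunk (w + 1) % n.
    # Phase 2 (allgather): circulate the reduced chunks around the ring.
    for s in range(n - 1):
        outgoing = [(w, (w + 1 - s) % n, list(data[w][(w + 1 - s) % n]))
                    for w in range(n)]
        for w, c, payload in outgoing:
            data[(w + 1) % n][c] = payload

    # Flatten each worker's chunks back into a single vector.
    return [[x for c in chunks for x in c] for chunks in data]


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40],
             [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
    print(ring_allreduce(grads)[0])
```

Each worker only ever talks to its neighbour and sends 2(n - 1) chunks in total, so the data sent per worker stays roughly constant as workers are added. Dividing the result by the number of workers would give the averaged gradient.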
Ludwig: Also open source, Ludwig is a deep-learning toolbox from Uber built on top of TensorFlow. It allows users to train and test deep learning models without writing code. It is an AutoML platform that provides a set of model architectures, which can be combined to create an end-to-end model for a given use case. It supports tasks such as text classification, sentiment analysis, image classification, machine translation, and image captioning, among others.
Netflix started as a DVD rental platform in 1997 and has since transformed into a major over-the-top (OTT) giant with over 209 million subscribers. At the height of the pandemic last year, Netflix US added up to 500 new shows. One of the major catalysts of its growth has been its recommendation system, considered among the best in the business. The personalised recommendation algorithm, which plays a major role in customer retention, has helped Netflix save up to $1 billion annually. Furthermore, more than 80 per cent of the shows that people watch on Netflix are discovered via its recommendation system.
Some of the in-house components developed by the Netflix machine learning teams are:
Metaflow: Developed over a period of four years, Metaflow is a full-stack framework for data science. Netflix open-sourced Metaflow in 2019. It allows the OTT company to define machine learning workflows, test them, scale in the cloud, and ultimately deploy them to production. It is a user-friendly Python/R library for scientists and engineers to build and manage real-life data science projects on. Metaflow offers data scientists the capability to choose the right modelling approach, handle data, and construct workflows easily, all this while ensuring that the resulting project executes robustly on the production infrastructure.
It was originally developed by Netflix to boost the productivity of data scientists working on a variety of projects — from classical statistics to state-of-the-art deep learning. It has been adopted by several companies outside of Netflix to power their machine learning in production.
Polynote: Netflix has a polyglot notebook with first-class Scala support called Polynote. It offers Apache Spark integration and multi-language interoperability across Scala, Python, SQL, and others. Polynote gives data scientists and machine learning engineers a notebook environment in which they can seamlessly integrate Netflix’s JVM-based ML platform with Python’s popular machine learning and visualisation libraries.
Until 2016, Airbnb struggled with ML models in production, which not only took a long time to develop but were also inconsistent. Besides, there were major discrepancies between offline and online data. In view of these challenges, Airbnb developed its own machine learning platform called Bighead. Built on Python and Spark, Bighead aims to tie together several open-source and in-house projects to remove incidental complexity from ML workflows. The production cycle, the training environment, and the data collection and transformation processes are standardised, so that every model is reproducible and iterable. Some of its components developed in-house at Airbnb include:
Zipline: It is Airbnb’s data management platform, built specifically for machine learning use cases. It helps in defining features, backfilling training sets, and sharing features across teams. Because the same feature definitions are used offline and online, it effectively resolves the offline-online dataset inconsistency problem. Airbnb has also managed to deploy better quality checks and monitoring using Zipline.
Redspot: It is a hosted, multi-tenant Jupyter notebook service in which each user’s environment is containerised with Docker. It allows users to customise their notebook environment without affecting other users.
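The offline-online consistency idea can be illustrated with a small sketch in which a feature is defined once and reused both to backfill a training set and to serve a live request. This is not Zipline's API: the `Event` type, the `bookings_last_7d` feature, and the `backfill` and `serve` helpers are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

DAY = 24 * 3600  # seconds in a day


@dataclass(frozen=True)
class Event:
    user_id: str
    ts: int  # event time in seconds


def bookings_last_7d(events: List[Event], user_id: str, as_of: int) -> int:
    """Single feature definition: bookings in the 7 days before `as_of`.

    Only events strictly before `as_of` are counted, so a backfilled
    value never leaks information from the future.
    """
    start = as_of - 7 * DAY
    return sum(1 for e in events
               if e.user_id == user_id and start <= e.ts < as_of)


def backfill(events: List[Event],
             label_rows: List[Tuple[str, int]]) -> List[Dict]:
    """Offline path: compute the feature as of each label's timestamp."""
    return [{"user_id": u, "ts": t,
             "bookings_last_7d": bookings_last_7d(events, u, t)}
            for u, t in label_rows]


def serve(events: List[Event], user_id: str, now: int) -> Dict:
    """Online path: the same definition, evaluated at request time."""
    return {"user_id": user_id, "ts": now,
            "bookings_last_7d": bookings_last_7d(events, user_id, now)}


if __name__ == "__main__":
    log = [Event("a", 1 * DAY), Event("a", 3 * DAY), Event("a", 10 * DAY)]
    print(backfill(log, [("a", 5 * DAY), ("a", 11 * DAY)]))
    print(serve(log, "a", 11 * DAY))
```

Because both paths call the same function, a training row and a live request made at the same moment see the same feature value, which is the property the paragraph above describes.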
Deep Thought: It is a shared REST API service for online inference that supports all frameworks integrated into the machine learning platform. It provides standardised logging, alerting, and dashboarding for monitoring and analysis of model performance.
Launched in 2008, Spotify has quickly emerged as the world’s music catalogue. Moving away from rudimentary recommendation features, Spotify has advanced over the years to include features like unique personalised playlists and Discover Weekly. All these features are possible because a machine learning platform standardises best practices and provides tools that bridge the gaps between data, machine learning, and the backend. Some of the components Spotify has built in-house include:
Scio: It is a Scala API for Apache Beam’s Java SDK built by Spotify. A company blog says that it is heavily inspired by Scalding and Spark. It offers a good balance between productivity and performance, access to a larger ecosystem of infrastructure in Java, and functional, type-safe code.
Zoltar: It is a common library for serving TensorFlow and XGBoost models in production. It helps load predictive machine learning models in a JVM and offers several key abstractions. Zoltar can be used to load a serialised model, featurise input data, and serve model predictions.
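The three steps named above (load a serialised model, featurise input, serve a prediction) can be sketched in a few lines. Zoltar itself is a JVM library, so this Python example does not use or mirror its API; the JSON model format, the `duration_s` and `explicit` fields, and the function names are assumptions made purely for illustration.

```python
import json
from typing import Dict, List


def load_model(serialised: str) -> Dict:
    """Deserialise a toy linear model of the form
    {"weights": [...], "bias": b} from a JSON string."""
    model = json.loads(serialised)
    if "weights" not in model or "bias" not in model:
        raise ValueError("model must contain 'weights' and 'bias'")
    return model


def featurise(raw: Dict) -> List[float]:
    """Turn a raw record into the numeric vector the model expects:
    track length in minutes and an explicit-content flag."""
    return [float(raw["duration_s"]) / 60.0,
            1.0 if raw["explicit"] else 0.0]


def predict(model: Dict, features: List[float]) -> float:
    """Score the feature vector with the linear model."""
    if len(features) != len(model["weights"]):
        raise ValueError("feature vector and weights differ in length")
    return sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]


if __name__ == "__main__":
    model = load_model('{"weights": [0.5, -1.0], "bias": 0.25}')
    score = predict(model, featurise({"duration_s": 180, "explicit": True}))
    print(score)
```

Keeping loading, featurisation, and prediction as separate functions means the same featurisation code can be reused at training time, which is one reason serving libraries tend to expose them as distinct abstractions.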
Apollo: It is a set of Java libraries used for writing microservices. It includes features like an HTTP server and a URI routing system that make it trivial to implement RESTful services. It has three main parts: apollo-api, apollo-core, and apollo-http-service.