“The applications and hardware that OpenAI runs with Kubernetes are quite different from what one may encounter at a typical company.”
Microsoft-backed OpenAI has delivered back-to-back blockbusters, GPT-3, CLIP, and DALL·E, in a span of six months. While GPT-3 allowed OpenAI to venture into the commercial API space, CLIP and DALL·E rang in a new era of multimodal models. However, these models are large: GPT-3 was trained on a vast swath of the internet, at a cost of a few million dollars. “Scaling OpenAI’s infrastructure is unlike what a typical startup does. So, even though they use familiar services like Kubernetes, the practices are unique to OpenAI. Today, many software service providers deploy Kubernetes for ease of operation, but OpenAI claims to do it differently,” said OpenAI researchers.
Image credits: Google Cloud
Conventionally, applications with different functionalities are packed into a single deployable artifact, and monoliths remain an acceptable way to build applications even today. But they have their drawbacks. For example, deployments are time-consuming since everything has to be rolled out together, and if different parts of the monolith are managed by different teams, rollout prep can run into additional complexity. The same goes for scaling: teams have to throw resources at the whole application, even if the bottleneck lies in a single component. To address this, developers came up with microservices.
Each piece of functionality is split into a smaller individual artifact. If there’s an update, only that exact service has to be replaced. The microservice model also has scaling benefits: individual services can be scaled to match their traffic, so it’s easier to avoid bottlenecks without over-provisioning. So far, so good. But dedicating one machine to each service would require a lot of resources and a whole bunch of machines. This is where containers come in handy. With containers, teams can pack their services neatly: the application, its dependencies, and any necessary configuration get delivered together, so the services will run the same way no matter where they are run. And Kubernetes is all about managing these containers.
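As a minimal sketch of that idea, a containerised service is described declaratively so Kubernetes can run it anywhere; the names, image, and values below are invented for illustration, not any real deployment:

```yaml
# Hypothetical Pod spec: the application, its image (with all
# dependencies baked in), and its configuration travel together.
apiVersion: v1
kind: Pod
metadata:
  name: payments-service                        # hypothetical service name
spec:
  containers:
  - name: payments
    image: registry.example.com/payments:1.4.2  # image bundles app + dependencies
    env:
    - name: DB_HOST                             # configuration delivered alongside the code
      value: "db.internal"
    resources:
      requests:                                 # resources sized per service, not per monolith
        cpu: "250m"
        memory: "256Mi"
```

Because everything the service needs is in the manifest and the image, the same definition behaves identically on a laptop, on-premises, or in the cloud.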
Kubernetes, or K8s, is an open-source system for grouping the containers that make up an application into logical units for easy management and discovery. Kubernetes automates the whole process of scaling and management. It came out of Google, building on nearly 15 years of Google’s experience running production workloads at scale. The key attributes of Kubernetes include:
- Kubernetes can do a lot of heavy lifting and make sure applications are always running the way they are intended to run.
- Kubernetes allows developers to focus on applications and not worry about the underlying environment.
- Kubernetes continuously runs health checks against the services, restarting containers that fail or have stalled.
- Kubernetes allows for ease of application modernization and lets developers build new apps faster.
- Kubernetes allows applications to be deployed on-premises, in public clouds, or in hybrid setups spanning both.
- Kubernetes enables users to automatically scale applications, up and down, based on the demand and run them efficiently.
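Two of the attributes above, continuous health checks and demand-based scaling, can be sketched in plain Kubernetes objects; all names and thresholds here are invented for the example:

```yaml
# Hypothetical Deployment: Kubernetes restarts containers whose
# liveness probe fails, keeping the app running as intended.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
      - name: api
        image: registry.example.com/api:2.0   # hypothetical image
        livenessProbe:                        # continuous health check
          httpGet: { path: /healthz, port: 8080 }
          periodSeconds: 10
---
# Hypothetical autoscaler: scales the Deployment up and down on demand.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target: { type: Utilization, averageUtilization: 70 }
```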
“OpenAI has scaled Kubernetes clusters to 7,500 nodes, producing a scalable infrastructure for large models like GPT-3, CLIP, and DALL·E.” (OpenAI)
How OpenAI Does It
At OpenAI, workloads are constantly changing. The team builds production-level applications even though they are short-lived; the nature of AI research at OpenAI demands high-quality infrastructure even for applications that will never see the light of day. On its largest clusters, OpenAI said, approximately 200,000 IP addresses could be in use at any one time.
Kubernetes is good at managing large clusters. At OpenAI, Kubernetes API Servers and etcd are critical components to a healthy working cluster. etcd is a consistent and highly-available key value store used as Kubernetes’ backing store for all cluster data.
“Upside of scaling Kubernetes is a simple infrastructure that allows our machine learning research teams to move faster and scale up without changing their code.” (OpenAI)
API servers are memory-intensive, and their memory usage tends to scale linearly with the number of nodes in the cluster. When the team at OpenAI scaled to 7,500 nodes, each API server used up to 70GB of heap.
Typically, API servers are run within the cluster itself, but OpenAI prefers running them outside it. Both etcd and the API servers at OpenAI run on their own dedicated nodes. The research lab’s largest clusters run 5 API servers and 5 etcd nodes to spread the load and minimise the impact if one were to go down.
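For components kept inside a cluster, one generic Kubernetes way to reserve dedicated nodes is a taint on those nodes plus a matching toleration on the pod. This is an illustrative sketch only; OpenAI’s actual topology (API servers outside the cluster) differs, and every name here is hypothetical:

```yaml
# Illustrative Pod pinned to dedicated nodes via nodeSelector + toleration.
# Assumes the nodes were tainted with, e.g.:
#   kubectl taint nodes <node> dedicated=control-plane:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: etcd-0                     # hypothetical name
spec:
  nodeSelector:
    role: control-plane-dedicated  # hypothetical label on the dedicated nodes
  tolerations:
  - key: dedicated                 # matches the hypothetical taint above
    operator: Equal
    value: control-plane
    effect: NoSchedule
  containers:
  - name: etcd
    image: quay.io/coreos/etcd:v3.5.0   # hypothetical image tag
```

The taint keeps ordinary workloads off those nodes, so a noisy neighbour cannot starve the critical component.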
And as their clusters have grown, the team at OpenAI faced autoscaling challenges. Researchers found it difficult to get all of their allocated capacity. Traditional job scheduling systems, stated OpenAI, have a lot of different features for fairly running work between competing teams, which Kubernetes does not have. The team took inspiration from job scheduling systems and built several of those capabilities in a Kubernetes-native way.
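One Kubernetes-native primitive that such fair-sharing capabilities can build on is pod priority and preemption. The classes below are an invented sketch of the general mechanism, not OpenAI’s scheduler:

```yaml
# Hypothetical priority classes: higher-priority jobs can preempt
# lower-priority ones when the cluster is full.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: team-high          # hypothetical class name
value: 1000
globalDefault: false
description: "Jobs that may preempt best-effort work."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: team-low           # hypothetical class name
value: 100
preemptionPolicy: Never    # opts out of preempting others
description: "Best-effort experiments."
```

A pod references a class with `priorityClassName: team-high` in its spec, and the scheduler uses the class values to decide what runs, and what gets evicted, under contention.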
“We’ve found Kubernetes to be an exceptionally flexible platform for our research needs. It has the ability to scale up to meet the most demanding workloads we’ve put on it,” concluded OpenAI.
OpenAI Infra By The Numbers
- OpenAI’s largest clusters run 5 API servers and 5 etcd nodes to spread the load and minimise impact.
- OpenAI’s cluster with 7,500 nodes requires 70GB of heap per API Server.
- Alias-based IP addressing is used on the largest clusters, where approximately 200,000 IP addresses can be in use at any one time.
Know more about OpenAI’s scaling challenges here.