“Earlier in this year, we started load testing our power and cooling infrastructure, and we were able to push it over two megawatts before we tripped our substation and got a call from the city,” said Rajiv Kurian, Principal Engineer at Tesla, during a presentation on Tesla’s Dojo Supercomputer. Kurian’s statement is a testament to the sheer scale of the system: using a custom cooling distribution unit in its cabinets, the team ran a 2.3 MW system test that tripped a San Jose substation.
At the recently concluded Tesla AI Day 2022, the announcements around Dojo and the ExaPod were among the highlights of the event. First introduced at Tesla AI Day 2021, Dojo is the custom-built hardware stack the company is now putting together. Tesla chief Elon Musk said that with Dojo, the company hopes to move beyond the label of being a car company and become the leader in building AI hardware and software.
In 2021, Andrej Karpathy, then director of artificial intelligence and Autopilot Vision at Tesla, spoke about the strategy Tesla was employing to develop fully self-driving cars. At the time, he detailed the specifications of its largest cluster for training and testing neural networks: 720 nodes of eight NVIDIA A100 GPUs each, totalling 1.8 exaflops of compute, which would have ranked it as the fifth-fastest supercomputer in the world.
The company then decided that it no longer needed or wanted to depend on other companies’ chips. Beyond that, the problem with NVIDIA’s GPUs was that they were not designed specifically for machine learning training. This led Tesla to build its own chips and, eventually, a supercomputer; this is how ‘Project Dojo’ was born. With Dojo, Tesla wants to achieve the best AI training performance and enable larger, more complex neural network models.
D1 chip and ExaPod
The D1 chip was announced at the Tesla AI Day 2021 event; the company said at the time that the chip was designed specifically for machine learning and for removing bandwidth-related bottlenecks. Each of the D1 chip’s 354 nodes has one teraflop of compute; the entire chip can perform up to 363 teraflops.
The company had also said that alongside the D1 chip, it was developing training tiles, each consisting of 25 D1 chips in a multi-chip module. One tile provides nine petaflops of compute. At the same event, Dojo project lead Ganesh Venkataramanan announced that the company would install two trays of six tiles in a single cabinet, for roughly 100 petaflops of compute per cabinet. Ten connected cabinets form the ‘ExaPod’, capable of 1.1 exaflops of AI compute. The entire system, in turn, would have 120 tiles with 3,000 D1 chips and more than a million nodes.
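The quoted figures round off at each level of the hierarchy, but they can be checked with simple back-of-the-envelope arithmetic. A minimal sketch, using only the numbers stated above (small mismatches against the quoted ~100 PF and 1.1 EF come from Tesla's rounding):

```python
# Back-of-the-envelope check of the Dojo compute hierarchy as quoted
# at Tesla AI Day 2021. All inputs are the article's figures.

NODES_PER_CHIP = 354          # D1 training nodes, ~1 teraflop each
CHIP_TFLOPS = 363             # quoted per-chip peak
CHIPS_PER_TILE = 25
TILES_PER_CABINET = 12        # two trays of six tiles
CABINETS_PER_EXAPOD = 10

tile_pflops = CHIPS_PER_TILE * CHIP_TFLOPS / 1000           # 9.075 PF (quoted: 9)
cabinet_pflops = TILES_PER_CABINET * tile_pflops            # ~108.9 PF (quoted: ~100)
exapod_eflops = CABINETS_PER_EXAPOD * cabinet_pflops / 1000 # ~1.089 EF (quoted: 1.1)

total_tiles = TILES_PER_CABINET * CABINETS_PER_EXAPOD       # 120 tiles
total_chips = total_tiles * CHIPS_PER_TILE                  # 3,000 D1 chips
total_nodes = total_chips * NODES_PER_CHIP                  # 1,062,000 nodes (>1 million)

print(f"tile: {tile_pflops:.1f} PF, cabinet: {cabinet_pflops:.0f} PF, "
      f"ExaPod: {exapod_eflops:.2f} EF, nodes: {total_nodes:,}")
```

The totals line up: 120 tiles, 3,000 chips, and just over a million nodes, consistent with the figures Tesla quoted.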
Cut to Tesla AI Day 2022: the company announced that its team had spent the past year working to deploy the functional training tile at scale. The team successfully connected uniform nodes across the fully integrated training tile and then joined them seamlessly across cabinet boundaries to form the Dojo accelerator. It can now house two full accelerators in an ExaPod for one exaflop of machine learning compute.
The Tesla engineers explained that a stack of 25 Dojo dies on a tile can replace six off-the-shelf GPU boxes. A system tray of six tiles, with 640 GB of DRAM split across 20 cards, is capable of 54 petaflops of compute, or 54 quadrillion floating-point operations per second. These trays are 75 mm in height and weigh 135 kg. Two of these trays are placed in an ExaPod, which has the power supplies to feed them.
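The per-tray figures also check out against the per-tile numbers from 2021. A quick sketch (the 32 GB of DRAM per card is derived from the stated totals, not a figure Tesla quoted):

```python
# Per-tray figures from AI Day 2022, using the article's numbers.
TILES_PER_TRAY = 6
TILE_PFLOPS = 9              # petaflops per training tile
TRAY_DRAM_GB = 640
CARDS_PER_TRAY = 20          # the tray's DRAM is split across 20 cards

tray_pflops = TILES_PER_TRAY * TILE_PFLOPS        # 54 PF, matching the quoted figure
dram_per_card_gb = TRAY_DRAM_GB / CARDS_PER_TRAY  # 32 GB per card (derived)

print(f"tray: {tray_pflops} PF, {dram_per_card_gb:.0f} GB DRAM per card")
```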
Performance and future
In terms of performance, Tesla’s team demonstrated that Dojo outperformed GPUs on both auto-labelling networks and occupancy networks. Dojo also proved superior on time and cost: it takes less than a week to train a network that GPUs need over a month for, while costing much less.
This is the first generation of these devices; Tesla hopes to build the initial ExaPod by Q1 2023, and the next generation is expected to be ten times better.
As reports suggest, Tesla would use Dojo to auto-label training videos from its fleet and to train the neural networks that will ultimately power its self-driving systems. While this is the short-term goal, Tesla may further utilise Dojo for other artificial intelligence programmes.
Responding to a question from the audience, Musk said that the company would not be selling the custom cabinets as a business. However, it may explore selling compute time on Dojo, much like Amazon does with AWS. “Just have it be a service that you can use that’s available online and where you can train your models way faster and for less money,” he said.