At Tesla’s AI Day, the Autopilot team revealed the improvements and major upgrades in their software. The Full Self-Driving (FSD) beta has received 35 software updates to date. Ashok Elluswamy, the Autopilot Director, announced that around 160,000 customers globally are now running the FSD beta software, a leap from 2,000 customers last year.
The Autopilot team explained how the FSD system is trained and how it operates, covering the neural networks, training data and planning, along with the training infrastructure, AI compiler, inference stages, and more.

Occupancy Network
The Occupancy Network is a multi-camera neural network that predicts the occupancy of the environment surrounding the car from its camera images. The prediction runs entirely onboard the vehicle and is not reliant on a server, and the network also predicts the future movement and position of the surrounding objects.
The Occupancy Network uses all eight cameras on the vehicle, capturing 12-bit images, to detect objects around the car and create a single, unified volumetric 3D occupancy vector space. Since it is based on video inputs, it can detect changes in the environment, such as crossing pedestrians, debris, or accelerating cars, in under 10 milliseconds and adjust the car’s speed and position to account for the uncertainty.
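Conceptually, such a network fuses per-camera image features into a single voxel grid of occupancy probabilities. The following is a minimal PyTorch sketch of that idea; the module names, shapes, and simple mean-fusion step are illustrative assumptions, not Tesla’s actual architecture.

```python
import torch
import torch.nn as nn

class ToyOccupancyNetwork(nn.Module):
    """Illustrative sketch: fuse features from 8 cameras into a 3D occupancy grid.
    Layer choices, shapes, and the fusion scheme are assumptions, not Tesla's design."""

    def __init__(self, grid=(16, 128, 128), feat_dim=64):
        super().__init__()
        self.grid = grid
        # Shared per-camera image encoder (stand-in for the real backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Decode the fused camera features into occupancy logits over a voxel grid.
        z, y, x = grid
        self.decoder = nn.Linear(feat_dim, z * y * x)

    def forward(self, images):                     # images: (batch, 8 cameras, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.encoder(images.view(b * n, c, h, w)).flatten(1)   # (b*n, feat_dim)
        fused = feats.view(b, n, -1).mean(dim=1)                       # naive multi-camera fusion
        logits = self.decoder(fused).view(b, *self.grid)               # (b, Z, Y, X)
        return torch.sigmoid(logits)               # per-voxel occupancy probability


if __name__ == "__main__":
    net = ToyOccupancyNetwork()
    cams = torch.randn(1, 8, 3, 128, 256)          # one frame from each of the 8 cameras
    occ = net(cams)
    print(occ.shape)                               # torch.Size([1, 16, 128, 128])
```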
Additionally, the team is developing Neural Radiance Field (NeRF) networks, treating the output vectors from the Occupancy Network as inputs to NeRF. Using images from the vehicles’ cameras, NeRF can reconstruct dense 3D meshes through volumetric rendering.
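At the heart of NeRF-style reconstruction is volumetric rendering: accumulating predicted colours and densities along each camera ray. Below is a small, self-contained sketch of that rendering step; the variable names and sampling scheme are assumptions for illustration, not Tesla’s implementation.

```python
import torch

def render_ray(densities, colors, deltas):
    """Classic NeRF-style volumetric rendering along one ray.
    densities: (N,) non-negative density sigma at each sample
    colors:    (N, 3) RGB predicted at each sample
    deltas:    (N,) distance between consecutive samples
    """
    alpha = 1.0 - torch.exp(-densities * deltas)                 # opacity of each sample
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
    )                                                            # transmittance up to each sample
    weights = alpha * trans
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=0)            # expected colour of the ray
    depth = (weights * deltas.cumsum(dim=0)).sum()               # expected depth along the ray
    return rgb, depth, weights


if __name__ == "__main__":
    n = 64
    sigma = torch.rand(n)              # fake densities along one ray
    rgb_samples = torch.rand(n, 3)     # fake colours along one ray
    deltas = torch.full((n,), 0.05)    # uniform spacing between samples
    rgb, depth, _ = render_ray(sigma, rgb_samples, deltas)
    print(rgb, depth)
```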

The network is trained on a large auto-labelled dataset without any human annotation. The team built three in-house supercomputers comprising 14,000 GPUs for training and auto-labelling. The training videos are stored in a 30-petabyte storage cache, with half a million videos flowing in and out of the system daily.
Language of Lanes
In its previous lane-detection method, Tesla used 2D pixel-wise instance segmentation, which could only detect the ego lane and the adjacent lanes. This worked efficiently only on well-designed, structured roads like highways; on city roads, intersections and lane topology are far more complex.
Tesla introduced the ‘FSD Lanes Neural Network’, which comprises three components: a Vision Component, a Map Component, and a Language Component.
The ‘Vision Component’ consists of a set of convolutional layers, attention layers, and other neural network layers that produce a visual representation from the videos of the eight cameras on the vehicle. This visual representation is then enhanced with the ‘Map Component’, which provides the road-level navigation map and is called the ‘Lane Guidance Module’.

The Lane Guidance Module consists of neural network layers that provide information about intersections, the number of lanes, and various other road features that the vehicle’s cameras might not be able to identify easily in real time. Together, the first two components produce a 3D Dense World Tensor.
This Dense World Tensor is treated as an input image for the third component, which decodes it using Tesla’s own language for encoding lanes and lane topology, called the ‘Language of Lanes’. It works like a large language model in which the words and tokens are lane positions in the space.
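In spirit, this works like autoregressive text generation: given the lane tokens emitted so far and the world tensor as context, the model predicts the next lane token. The toy PyTorch decoder below illustrates that pattern; the vocabulary size, dimensions, and layer choices are assumptions, not Tesla’s model.

```python
import torch
import torch.nn as nn

class ToyLaneDecoder(nn.Module):
    """Illustrative 'language of lanes' decoder: predicts lane tokens autoregressively
    while cross-attending to a flattened dense world tensor. All sizes are assumptions."""

    def __init__(self, vocab_size=1000, d_model=128, n_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.world_proj = nn.Linear(256, d_model)          # project world-tensor features
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)          # next lane-token logits

    def forward(self, lane_tokens, world_tensor):
        # lane_tokens: (B, T) token ids already emitted; world_tensor: (B, H*W, 256)
        t = lane_tokens.size(1)
        causal_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        tgt = self.token_emb(lane_tokens)
        memory = self.world_proj(world_tensor)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)  # causal self- + cross-attention
        return self.head(out)                                  # (B, T, vocab_size)


if __name__ == "__main__":
    model = ToyLaneDecoder()
    tokens = torch.randint(0, 1000, (1, 12))        # partially decoded lane "sentence"
    world = torch.randn(1, 64 * 64, 256)             # flattened dense world tensor
    logits = model(tokens, world)
    next_token = logits[:, -1].argmax(dim=-1)        # greedy pick of the next lane token
    print(next_token.shape)
```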
An embedded tweet from Whole Mars Catalog (@WholeMarsBlog, October 2, 2022) shared a visual representation of the Tesla Autopilot/FSD neural networks that drive the car, noting that every dot represents a mathematical operation running in the car.
Training Data
Labelling the training data of half a million videos that pass through the supercomputers every day is a mammoth task. The team built an auto-labelling machine for the Lanes Network which, using video footage from the vehicle’s cameras, reconstructs 3D vector spaces by combining the Occupancy Network with the newly developed Language of Lanes. Creating one vector mesh from a single trip takes the system only around 30 minutes.
Then, using ‘Multi-Trip Reconstruction’, footage from different cars is combined and matched. This creates a map in even less time and requires human intervention only at the end, to finalise the label of the output.

To fix labels where the automated labelling system struggled, such as with parked vehicles, trucks, vehicles on curvy roads, or parking lots, the team manually corrected 13,900 video labels to optimise the whole data engine.
Thanks to its accelerated video library built on PyTorch, the team noted a 30% increase in training speed. Using the data generated from the Occupancy Network, the Language of Lanes, and the NeRF 3D reconstruction models, the team created a simulation. In this 3D world, the team introduces new challenges, environments, and objects to train the system on changing situations such as road designs, biomes, weather conditions, and more.
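Tesla’s accelerated video library is custom and was not detailed at the event, but the generic PyTorch pattern it speeds up, decoding fixed-length clips in a Dataset and streaming them through a multi-worker DataLoader, can be sketched as follows; the file name and parameters are hypothetical.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_video

class ClipDataset(Dataset):
    """Toy video-clip dataset sketch showing the standard PyTorch loading pattern;
    it is not Tesla's accelerated video library."""

    def __init__(self, paths, clip_len=8):
        self.paths = paths
        self.clip_len = clip_len

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        frames, _, _ = read_video(self.paths[idx], pts_unit="sec")   # (T, H, W, C) uint8
        clip = frames[: self.clip_len].permute(0, 3, 1, 2).float() / 255.0
        return clip                                                   # (clip_len, C, H, W)

# Hypothetical usage: many workers decode clips in parallel and feed the GPU.
# loader = DataLoader(ClipDataset(["trip_0001.mp4"]), batch_size=4,
#                     num_workers=8, pin_memory=True)
```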

Elon Musk said that the FSD beta would be available worldwide by the end of this year. “But, for a lot of countries, we need regulatory approval. So, we are somewhat gated by the regulatory approval in other countries,” explained Musk. “From a technical standpoint, it will be ready to go to a worldwide beta by the end of this year.”