Almost a year ago, Elon Musk unveiled the prototype of Tesla’s humanoid robot, Optimus. The prototype was still in the early stages of development, and the company revealed little about how it actually functioned. Now, Tesla has announced major improvements to its humanoid robot, and it looks like it is moving closer to what Musk has envisioned for Optimus.
Last year, Optimus could do little more than wave on stage. Now it can pick up and sort objects, hold yoga poses, and navigate its surroundings. Moreover, unlike robots from companies such as Boston Dynamics that rely on rule-based systems, Optimus runs on neural networks.
Reverse engineering Optimus’ motion
During last year’s unveiling, Optimus’ bipedal movement suggested that it might be balancing using a technique called Zero-Moment Point. That may still be true, but the humanoid has clearly made a lot of improvements since then.
Following an in-depth analysis, Jim Fan, senior AI scientist at NVIDIA, has come forward with insights on how exactly Optimus achieves this level of performance. The impressively smooth hand movements owe their dexterity to a distinctive approach to learning. It is highly likely that these fluid motions were cultivated through a process known as imitation learning, often referred to as “behaviour cloning”. Essentially, this means that the robot learns by mimicking human operators, similar to how animations are recorded for characters within games.

The alternative method, which involves reinforcement learning in a simulated environment, typically results in jerky movements and unnatural hand poses.
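To make the behaviour-cloning idea concrete, here is a minimal sketch in PyTorch (an illustration with made-up dimensions, not Tesla’s code): a small policy network is trained by plain supervised regression to reproduce the actions a human demonstrator took for each observation.

```python
import torch
import torch.nn as nn

# Minimal behaviour-cloning sketch (hypothetical shapes, not Tesla's code):
# the policy maps an observation vector to a continuous action vector and is
# trained by supervised regression against human-demonstrated actions.
class Policy(nn.Module):
    def __init__(self, obs_dim=256, act_dim=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

policy = Policy()
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-4)

# demo_obs / demo_act stand in for logged human demonstrations.
demo_obs = torch.randn(1024, 256)   # e.g. encoded camera frames
demo_act = torch.randn(1024, 20)    # e.g. joint / finger targets

for epoch in range(10):
    pred = policy(demo_obs)
    loss = nn.functional.mse_loss(pred, demo_act)   # imitate the demonstrator
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```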
Motion Capture (MoCap): Some of Optimus’s hand movements might be recorded using motion capture technology, similar to what’s used in Hollywood movies. By wearing a device like the CyberGlove, a demonstrator can capture real-time hand motion signals and haptic feedback, which can then be applied to Optimus.
Custom Control System: This method likely involves a specially designed teleoperation system, allowing human operators to precisely control the robot’s movements. A noteworthy example of this approach is ALOHA, developed by Stanford AI Labs. ALOHA enables intricate, dexterous motions, such as handling tiny objects like AAA batteries or manipulating contact lenses.
Computer Vision MoCap: In contrast to wearing markers or gloves, Optimus might use computer vision for motion capture. Technologies like DexPilot from NVIDIA enable marker-less and glove-free data collection. Human operators can simply use their bare hands to perform tasks while cameras and GPUs translate their motions into data for robot learning.
VR Headset: Another intriguing method involves turning the training process into a virtual reality game. Operators can “role play” as Optimus using VR controllers or CyberGloves. This approach offers the advantage of scalability, as annotators from around the world can contribute without needing to be physically present.
It’s worth noting that Optimus could employ a combination of these methods, each with its own set of advantages and disadvantages; a sketch of how such demonstrations might be logged as training data follows below.
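Whichever capture method is used, the output looks roughly the same: a stream of paired observations and human actions. The sketch below shows, with placeholder capture functions rather than any real device API, how such demonstrations might be logged for later imitation learning.

```python
import time
import numpy as np

# Hypothetical sketch of demonstration logging: whether the signal comes from
# a CyberGlove, a teleoperation rig, vision-based hand tracking, or a VR
# controller, the end product is a stream of (observation, action) pairs.
# The capture functions below are placeholders, not a real device API.
def read_camera_frame():
    return np.zeros((224, 224, 3), dtype=np.uint8)   # stand-in RGB frame

def read_hand_pose():
    return np.zeros(20, dtype=np.float32)            # stand-in joint angles

def record_demonstration(seconds=10, hz=30):
    observations, actions = [], []
    for _ in range(seconds * hz):
        observations.append(read_camera_frame())     # what the robot saw
        actions.append(read_hand_pose())              # what the human did
        time.sleep(1.0 / hz)
    np.savez("demo.npz", obs=np.stack(observations), act=np.stack(actions))

record_demonstration(seconds=1)
```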
Neural Architecture of Optimus
Optimus’s ability to learn from human demonstrations and exhibit precise hand movements is facilitated by a sophisticated neural architecture. The robot is trained in an end-to-end manner, which means it takes in videos as input and produces actions as output.
Image Processing: Optimus analyses images to understand its surroundings. This could involve using efficient Vision Transformers (ViT) or more conventional backbone models like ResNet or EfficientNet.
Video Analysis: Videos can be processed in two ways—treating each frame as an individual image or considering the video as a whole. Different techniques, such as SlowFast Network or RubiksNet, are used to efficiently handle video data.
Language Integration: While it’s not entirely clear whether Optimus responds to language prompts, if it does, there is a mechanism for integrating language with visual perception. Techniques like Feature-wise Linear Modulation (FiLM) may be employed for this purpose, allowing language embeddings to influence the image processing pathway; a sketch of this idea follows after the list.
Action Tokenisation: To translate continuous motion signals into discrete actions that the robot can understand, Optimus might use various methods, such as categorising the movements or employing VQVAE for compression.
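To illustrate the FiLM-style language conditioning mentioned above, here is a minimal sketch (dimensions and module names are assumptions, not Tesla’s implementation): a language embedding predicts per-channel scale and shift values that modulate the visual feature maps.

```python
import torch
import torch.nn as nn

# Minimal FiLM sketch: a language embedding produces per-channel scale (gamma)
# and shift (beta) values that modulate visual feature maps. Dimensions are
# illustrative only.
class FiLM(nn.Module):
    def __init__(self, lang_dim=512, channels=256):
        super().__init__()
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * channels)

    def forward(self, visual_feats, lang_emb):
        # visual_feats: (B, C, H, W), lang_emb: (B, lang_dim)
        gamma, beta = self.to_gamma_beta(lang_emb).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return gamma * visual_feats + beta

film = FiLM()
feats = torch.randn(2, 256, 14, 14)     # image features from a ViT/CNN backbone
instruction = torch.randn(2, 512)       # embedding of e.g. "sort the blocks by colour"
conditioned = film(feats, instruction)  # language-modulated visual features
```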
All these components work together within a Transformer-based controller. This controller takes in video tokens (possibly modulated by language) and produces action tokens step by step. The robot continually refines its actions by observing the consequences of its previous moves, demonstrating the self-corrective behaviour seen in the demos.
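Putting the pieces together, the toy sketch below shows one way such a controller could be wired up: continuous joint commands are discretised into action tokens (simple uniform binning here; a VQ-VAE codebook would be an alternative), and a small Transformer predicts the next action token from the history of video and action tokens. Everything in it, from the bin count to the layer sizes, is an assumption for illustration, not Optimus’ actual architecture.

```python
import torch
import torch.nn as nn

NUM_BINS = 256

def to_action_token(value, low=-1.0, high=1.0):
    """Map a continuous command in [low, high] to one of NUM_BINS discrete tokens."""
    value = max(min(value, high), low)
    return int((value - low) / (high - low) * (NUM_BINS - 1))

class Controller(nn.Module):
    def __init__(self, d_model=256, vocab=NUM_BINS):
        super().__init__()
        self.action_emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, video_tokens, action_tokens):
        # video_tokens: (B, T_v, d_model) from the vision (and language) pathway
        # action_tokens: (B, T_a) previously executed action tokens
        seq = torch.cat([video_tokens, self.action_emb(action_tokens)], dim=1)
        out = self.backbone(seq)
        return self.head(out[:, -1])            # logits for the next action token

controller = Controller()
video_tokens = torch.randn(1, 32, 256)          # encoded camera stream
actions = torch.tensor([[to_action_token(0.1)]])  # actions taken so far
next_logits = controller(video_tokens, actions)
next_action = next_logits.argmax(dim=-1)        # next discrete action token
```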
How many Optimuses does the world need?
Musk has confidently claimed that, in the future, everyone will want at least one Optimus of their own.
Tesla has moved away from being just a car company to positioning itself as an AI company. It is hiring for various AI roles, specifically for building Optimus. Moreover, activity from companies such as RoboFab, Figure.ai, Boston Dynamics, and the Chinese firm Fourier Intelligence, along with OpenAI’s investment in the robotics company 1X, hints that the coming year could be the year of humanoid robots.
Coexistence of humans and robots might arrive sooner than we think. And then the future will never be the same.