Google DeepMind released Genie, an AI model that transforms text descriptions, sketches, and photographs into interactive virtual environments. The model has 11 billion parameters and was trained on 200,000 hours of unlabelled Internet videos, learning to understand and reproduce environmental dynamics without any manual data labelling.
Tim Rocktäschel, the team lead for Genie, wrote on X, “Rather than adding inductive biases, we focus on scale. In an unsupervised way, Genie learns diverse latent actions that control characters in a consistent manner.”
This scale-first approach allowed Genie to learn a consistent, diverse repertoire of character motion and control. As a result, “our model can convert any image into a playable 2D world,” explained Rocktäschel.
Genie combines three components. A spatiotemporal video tokenizer breaks video frames down into discrete tokens, capturing movement and change over time. An autoregressive dynamics model then predicts what will happen next in the virtual environment based on those tokens. Finally, a scalable latent action model infers the possible actions within the virtual world, including ones that were never directly shown during training; it essentially ‘imagines’ them and scales to accommodate a wide range of potential interactions.
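To make that pipeline concrete, here is a minimal sketch in Python of how a tokenize, act, predict loop of this kind could be wired together. The function names, token vocabulary size and latent action count are assumptions for illustration, not DeepMind's actual implementation.

```python
# Illustrative sketch only: names, shapes and sizes here are assumptions,
# not DeepMind's actual Genie implementation.
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 1024            # assumed size of the discrete token vocabulary
TOKENS_PER_FRAME = 16 * 16   # assumed spatial token grid per frame
NUM_LATENT_ACTIONS = 8       # assumed small discrete latent action space


def tokenize_frame(frame: np.ndarray) -> np.ndarray:
    """Stand-in for the spatiotemporal video tokenizer:
    maps raw pixels to a grid of discrete tokens."""
    # Placeholder: collapse pixel blocks into token ids.
    flat = frame.reshape(TOKENS_PER_FRAME, -1).sum(axis=1)
    return flat.astype(np.int64) % VOCAB_SIZE


def predict_next_tokens(history: list, action: int) -> np.ndarray:
    """Stand-in for the autoregressive dynamics model:
    predicts the next frame's tokens from past tokens and a latent action."""
    # Placeholder: perturb the latest tokens in an action-dependent way.
    return (history[-1] + action + 1) % VOCAB_SIZE


def decode_tokens(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the tokenizer's decoder: tokens back to an image."""
    side = int(np.sqrt(TOKENS_PER_FRAME))
    return (tokens.reshape(side, side) / VOCAB_SIZE * 255).astype(np.uint8)


# Interactive loop: start from a single prompt image and step the world
# forward one latent action at a time, as the article describes.
frame = rng.integers(0, 256, size=(16, 16, 3), dtype=np.uint8)
token_history = [tokenize_frame(frame)]

for step in range(3):
    action = int(rng.integers(NUM_LATENT_ACTIONS))  # player-chosen latent action
    next_tokens = predict_next_tokens(token_history, action)
    token_history.append(next_tokens)
    frame = decode_tokens(next_tokens)
    print(f"step {step}: action={action}, frame mean={frame.mean():.1f}")
```

In the real system each of these placeholder functions would be a large model trained on video; the point here is only the flow of information from prompt image to latent action to predicted next frame.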
Genie is still a research project, and it is unclear whether it will become an actual product. If it does, its applications extend beyond entertainment into virtual reality, training simulations, architectural design, and urban planning.
Building on DeepMind’s earlier AI work, Genie extends into the visual domain, enabling creative expression and interactive experiences. Previously, Google DeepMind released DreamerV2 and DreamerV3, which focus on learning from interactions within environments to foster planning and goal-oriented behavior.
Unlike Genie, which learns by observing video data alone, the Dreamer models require interaction data to learn, which makes Genie distinct in how it comes to understand and create virtual worlds.
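As a rough illustration of that distinction, the sketch below contrasts the two training signals. The function names and environment interface are assumptions for illustration, not the actual Genie or Dreamer APIs.

```python
# Illustrative contrast only: function names and the environment interface
# are assumptions, not the actual Genie or Dreamer APIs.

def genie_style_update(prev_frame, next_frame):
    """Learning signal from unlabelled video alone: the action is inferred
    from the change between consecutive frames, so no controller input or
    environment access is required."""
    latent_action = hash((tuple(prev_frame), tuple(next_frame))) % 8  # placeholder inference
    return latent_action  # in practice: update the world model with (prev_frame, latent_action, next_frame)


def dreamer_style_update(env, policy, obs):
    """Learning signal from interaction: the agent must act in a live
    environment and record the transition its action caused."""
    action = policy(obs)
    next_obs, reward, done = env.step(action)
    return next_obs  # in practice: update the world model with (obs, action, next_obs)
```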