
Meta’s V-JEPA Video Model Learns by Watching

Meta released V-JEPA, a new AI model, advancing towards human-like machine intelligence by analyzing video interactions.


Alongside OpenAI’s Sora, Meta released a new AI model called Video Joint Embedding Predictive Architecture (V-JEPA) yesterday. V-JEPA improves machines’ understanding of the world by analysing interactions between objects in videos. The model advances the vision of Yann LeCun, Meta’s VP & Chief AI Scientist, for creating machine intelligence that learns similarly to humans.

V-JEPA builds on I-JEPA, released in the middle of last year, which learned by comparing abstract representations of images rather than the pixels themselves. Extending this predictive approach from images to videos introduces the complexity of temporal (time-based) dynamics on top of spatial information.

V-JEPA predicts missing parts of videos without needing to recreate every detail. It learns from unlabeled videos, which means it doesn’t require data that’s been categorised by humans to start learning. 

This method makes V-JEPA more efficient, requiring fewer resources to train. The model is particularly good at learning from a small amount of information, making it faster and less resource-intensive compared to older models.
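The core of this idea can be sketched in a few lines. This is an illustrative toy, not Meta's code: the encoder here is a fixed linear map, and the loss choice is an assumption, but it shows the key difference from pixel reconstruction, namely that the training target is an embedding of the hidden region, so fine, unpredictable detail never has to be modelled.

```python
import numpy as np

rng = np.random.default_rng(1)

def embed(x, W):
    # Stand-in encoder: a fixed linear map plus a nonlinearity.
    # In the real model this would be a learned video transformer.
    return np.tanh(x @ W)

D_in, D_lat = 64, 8
W_enc = rng.standard_normal((D_in, D_lat)) * 0.1

visible = rng.standard_normal(D_in)  # context the predictor can see
masked = rng.standard_normal(D_in)   # region hidden from the predictor

target = embed(masked, W_enc)     # target is an embedding, not pixels
predicted = embed(visible, W_enc) # toy stand-in for the predictor's output

# The training signal is a distance in representation space (L1 here,
# as an illustrative choice), not a pixel-reconstruction error.
loss = np.abs(predicted - target).mean()
print(loss)
```

Because the loss lives in the low-dimensional latent space, the model is free to discard detail that is irrelevant to predicting what happens next, which is one reason the approach trains efficiently.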

The model’s development involved masking large sections of videos. This approach forces V-JEPA to make guesses based on limited context, helping it understand complex scenarios without needing detailed data. V-JEPA focuses on the general idea of what’s happening in a video rather than specific details, like the movement of individual leaves on a tree.
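The masking step above can be sketched as follows. The patch size and the 90% mask ratio are illustrative assumptions, not Meta's published settings: the video is cut into spatiotemporal patches ("tubelets"), most are hidden, and the model must reason from the small visible remainder.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_video_patches(video, patch=(2, 16, 16), mask_ratio=0.9):
    """Split a video (T, H, W, C) into tubelets and hide a random subset.

    Returns all flattened patches, a boolean mask (True = hidden),
    and the visible patches the predictor would condition on.
    """
    T, H, W, C = video.shape
    t, h, w = patch
    # Reshape into (num_patches, patch_voxels) spatiotemporal tubelets.
    patches = (video
               .reshape(T // t, t, H // h, h, W // w, w, C)
               .transpose(0, 2, 4, 1, 3, 5, 6)
               .reshape(-1, t * h * w * C))
    n = patches.shape[0]
    hidden = np.zeros(n, dtype=bool)
    hidden[rng.choice(n, size=int(n * mask_ratio), replace=False)] = True
    return patches, hidden, patches[~hidden]

# A dummy 16-frame, 224x224 RGB clip.
video = rng.standard_normal((16, 224, 224, 3)).astype(np.float32)
patches, hidden, visible = mask_video_patches(video)
print(patches.shape, int(hidden.sum()), visible.shape)
```

With these assumed settings, only about a tenth of the clip's 1,568 tubelets remain visible, which is what forces the model to infer the gist of the scene rather than copy local detail.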

V-JEPA has shown promising results in tests, where it outperformed other video analysis models using a fraction of the data typically required. This efficiency is seen as a step forward in AI, making it possible to use the model for various tasks without extensive retraining.

Looking ahead, Meta plans to expand V-JEPA’s capabilities, including adding sound analysis and improving its ability to understand longer videos. 

This work supports Meta’s broader goal of advancing machine intelligence to perform complex tasks more like humans. V-JEPA is available under a Creative Commons NonCommercial licence, allowing researchers worldwide to explore and build upon this technology.


K L Krithika

K L Krithika is a tech journalist at AIM. Apart from writing tech news, she enjoys reading sci-fi and pondering impossible technologies, trying not to confuse them with reality.