Recently, researchers from Columbia University developed an AI framework that learns what is predictable about the future. Notably, the model is trained entirely on unlabelled video data.
According to the researchers, predicting the future is a core computer vision problem with applications in robotics, security and health. The critical question, however, has always been what to predict. Researchers have explored a spectrum of options to tackle this problem, such as generating future pixels or motion and forecasting future activities. The Columbia team has now come up with a “hedge the bet” solution: when the model cannot predict a specific outcome, it can at least predict an abstraction of it.
How Does The Model Work
The researchers proposed an AI framework by using unlabelled video data for learning what is predictable. They stated, “Instead of committing upfront to a level of abstraction to predict, our approach learns from data which features are predictable.” They added, “Motivated by how people organise action hierarchically, we propose a hierarchical predictive representation. Our approach jointly learns a hierarchy of actions while also learning to anticipate at the right level of abstraction.”
The paper spotlights the method’s ability to predict a hierarchical representation of the future. It is important to note that the framework operates in hyperbolic rather than Euclidean space, hinging on the observation that hyperbolic geometry naturally and compactly encodes hierarchical structure; a key experiment in the paper compares hyperbolic representations against Euclidean ones.
This means that when the model is confident, it predicts at a concrete level of the hierarchy; when it is not, it learns to automatically select a higher, more abstract level.
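To see why hyperbolic geometry suits hierarchies, here is a minimal sketch of the Poincaré ball model: distances blow up near the boundary of the ball, so many fine-grained, concrete concepts fit near the edge while abstract parent concepts sit near the origin. The `abstraction_level` rule and its thresholds are hypothetical illustrations, not the paper’s actual mechanism.

```python
import math

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball.
    Distances grow rapidly as points approach the boundary (norm -> 1)."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = sum(a * a for a in u)  # squared norm of u
    nv = sum(b * b for b in v)  # squared norm of v
    arg = 1 + 2 * diff_sq / ((1 - nu) * (1 - nv) + eps)
    return math.acosh(arg)

def abstraction_level(embedding, thresholds=(0.3, 0.7)):
    """Hypothetical rule: map the norm of a predicted embedding to a
    hierarchy level. Near the origin = abstract (uncertain prediction),
    near the boundary = concrete (confident prediction)."""
    norm = math.sqrt(sum(x * x for x in embedding))
    if norm < thresholds[0]:
        return "abstract"      # hedge upward in the hierarchy
    if norm < thresholds[1]:
        return "mid-level"
    return "concrete"          # commit to a specific action

# Boundary points are far apart even when they are Euclidean-close,
# which leaves room to separate many concrete leaf concepts:
far = poincare_distance((0.90, 0.0), (0.0, 0.90))
near = poincare_distance((0.09, 0.0), (0.0, 0.09))
print(far > near)                        # True
print(abstraction_level((0.05, 0.05)))   # abstract
print(abstraction_level((0.7, 0.5)))     # concrete
```

The norm-as-confidence reading is the intuition: an uncertain prediction lands near the origin, which in a hyperbolic hierarchy corresponds to a generic ancestor of many possible futures.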
Under The Hood
After learning a self-supervised representation from a large collection of unlabelled videos, the researchers transferred it to a target domain with a smaller, labelled dataset. There, they fine-tuned with the same objective before fitting a supervised linear classifier on a small number of labelled examples.
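The final evaluation step, a linear classifier on top of frozen features, can be sketched as follows. The `frozen_encoder` here is a stand-in that produces random class-dependent vectors; in the actual work those features would come from the pretrained video encoder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def frozen_encoder(n, label, dim=64):
    """Stand-in for a pretrained, frozen video encoder: emits noisy
    feature vectors clustered around a class-dependent centre."""
    center = np.zeros(dim)
    center[label] = 3.0
    return rng.normal(loc=center, scale=1.0, size=(n, dim))

# A "small number of labelled examples" per class, as in the paper's
# linear-probe protocol (20 train / 5 test per class here, 3 classes).
X_train = np.vstack([frozen_encoder(20, c) for c in range(3)])
y_train = np.repeat(np.arange(3), 20)
X_test = np.vstack([frozen_encoder(5, c) for c in range(3)])
y_test = np.repeat(np.arange(3), 5)

# Supervised linear classifier fitted on the frozen features.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear-probe accuracy:", round(clf.score(X_test, y_test), 2))
```

The point of a linear probe is that the classifier itself is too simple to do the heavy lifting, so its accuracy measures how much structure the self-supervised representation already contains.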
The evaluation was performed using two different video datasets:
- Sports Videos: The researchers learned the self-supervised representation on two sports video datasets, Kinetics-600 and FineGym. Kinetics-600 covers 600 human action classes across roughly 500,000 videos with rich, diverse human actions. FineGym is a dataset of gymnastics videos whose clips carry three-level hierarchical action labels, ranging from specific exercise names at the lowest level to generic gymnastics routines at the highest.
- Movies: Here, the researchers learned the self-supervised representation on MovieNet, then fine-tuned and evaluated on the Hollywood2 dataset. MovieNet contains 1,100 movies and 758,000 key frames.
The future is uncertain, and it is often impossible to anticipate the next event with certainty. The good news, however, is that parts of it are predictable. Through this research, the team has introduced a hyperbolic model for video prediction that represents uncertainty hierarchically.
The researchers stated, “After learning from unlabelled video, experiments and visualisations show that a hierarchy automatically emerges in the representation, encoding the predictability of the future.”
Read the paper here