Now Facebook’s AI Model Can Anticipate Your Future Actions

AVT would be a strong candidate for tasks beyond anticipation, such as self-supervised learning, general action recognition in tasks that require modelling temporal ordering, and even for discovering action schemas and boundaries.

Anticipating what happens next, and doing so accurately, is as exciting as it is difficult. It may be easy to guess whether the next ball in a game of cricket will be hit for a six or a four, and a wrong guess costs nothing. Now consider an autonomous vehicle waiting at a stop sign: it must predict whether a pedestrian will cross the road, and the cost of getting that wrong is very different. Anticipating future activity is hard for AI because it requires both modelling the progression of past actions and predicting the multimodal distribution of future actions.

To address this challenge, Rohit Girdhar of Facebook AI Research and Kristen Grauman of the University of Texas at Austin proposed the Anticipative Video Transformer (AVT).

The science behind AVT

For AVT, the researchers leveraged recent advances in transformer architectures from natural language processing and image modelling. The result is an end-to-end, attention-based video modelling architecture that attends to the previously observed video in order to anticipate future actions.

Given a video clip as input, the model produces predictions for future actions. To do so, it uses a two-stage architecture consisting of:

  • A backbone network, referred to as AVT-b, that operates on individual frames or short clips. It adopts the recently proposed Vision Transformer (ViT) architecture, which has previously shown impressive results on static image classification.
  • A head architecture, referred to as AVT-h, that operates on the frame/clip-level features to predict future features and actions. It predicts the future feature for each input frame using a Causal Transformer Decoder (a rough sketch of how the two stages compose follows this list).
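
Taken together, the two stages compose into a single model: the backbone turns each frame into a feature vector, and the head attends causally over those features to predict the ones that follow. The snippet below is a minimal, illustrative PyTorch-style outline of that composition; the module choices, dimensions and the convolutional stand-in for the ViT backbone are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class AVTSketch(nn.Module):
    """Illustrative two-stage outline: a per-frame backbone (AVT-b)
    followed by a causally masked transformer head (AVT-h)."""

    def __init__(self, feat_dim=768, num_actions=100):
        super().__init__()
        # Stand-in for AVT-b: any frame-level feature extractor mapping an
        # RGB frame to a feat_dim vector (the paper uses a Vision Transformer).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=16, stride=16),  # patchify-style stem
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Stand-in for AVT-h: a causally masked transformer over frame features.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.head = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(feat_dim, num_actions)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)   # observed frame features
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(frames.device)
        future_feats = self.head(feats, mask=causal)                  # predicted next-frame features
        action_logits = self.classifier(future_feats)                 # per-frame action predictions
        return future_feats, action_logits


# Example: two clips of 8 frames at 224x224 resolution.
model = AVTSketch()
feats, logits = model(torch.randn(2, 8, 3, 224, 224))
```

In the real AVT, the backbone is a full ViT and the head is a GPT-style decoder, but the flow of observed frame features through a causally masked transformer is the same idea.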

In addition, AVT employs causal attention modelling—predicting the future actions based only on the frames observed so far—and is trained using objectives inspired by self-supervised learning. The AVT model architecture is shown below:

Image Source: Paper
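
To make the causal constraint concrete: the attention mask in such a decoder is lower-triangular, so a given frame can attend to itself and to earlier frames only. A small, generic illustration for four observed frames (this is standard autoregressive masking, not code from the paper):

```python
import torch

# Causal (autoregressive) attention mask for 4 observed frames: entries of
# -inf block attention, so frame i never "sees" frames that come after it.
num_frames = 4
mask = torch.triu(torch.full((num_frames, num_frames), float("-inf")), diagonal=1)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```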

In addition, the researchers train the model to predict future actions and features using three losses (sketched in code after the list below):

  • First, the features of the last frame of the video clip are classified to predict the labelled future action.
  • Second, the model regresses each intermediate frame's feature onto the feature of the succeeding frame, which trains it to anticipate what comes next.
  • Third, the model is trained to classify the intermediate actions it has already observed.
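
Put together, a training step combining these three objectives might look roughly like the sketch below. The exact formulation, the loss weighting and the use of mean-squared error for the feature-regression term are assumptions made only to keep the example concrete; they are not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def avt_style_losses(future_feats, action_logits, frame_feats,
                     future_action_label, intermediate_labels):
    """Sketch of the three anticipative objectives for one clip.

    future_feats:        (T, D) features the head predicts for each frame
    action_logits:       (T, num_actions) per-frame action predictions
    frame_feats:         (T, D) features actually observed by the backbone
    future_action_label: scalar label of the (unobserved) next action
    intermediate_labels: (T,) per-frame action labels, -1 where unknown
    """
    # 1) Classify the last frame's predicted feature as the future action.
    loss_next = F.cross_entropy(action_logits[-1:], future_action_label.view(1))

    # 2) Regress each intermediate predicted feature onto the feature of the
    #    *following* observed frame (MSE is a stand-in for the paper's loss).
    loss_feat = F.mse_loss(future_feats[:-1], frame_feats[1:])

    # 3) Classify the intermediate actions wherever labels are available.
    valid = intermediate_labels >= 0
    loss_inter = F.cross_entropy(action_logits[valid], intermediate_labels[valid])

    return loss_next + loss_feat + loss_inter


# Toy shapes: an 8-frame clip, 768-d features, 100 possible actions.
T, D, A = 8, 768, 100
loss = avt_style_losses(torch.randn(T, D), torch.randn(T, A), torch.randn(T, D),
                        torch.tensor(3), torch.randint(0, A, (T,)))
```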

“Through extensive experimentation on four popular benchmarks, we show its applicability in anticipating future actions, obtaining state-of-the-art results and demonstrating the importance of its anticipative training objectives,” the paper notes.

Talking about some of its future applications, the researchers believe AVT would be a strong candidate for tasks beyond anticipation, such as self-supervised learning, general action recognition in tasks requiring modelling temporal ordering, and even discovering action schemas and boundaries.

Recent Facebook AI advances

  • Facebook AI recently introduced a new language model based solely on audio, the Generative Spoken Language Model (GSLM), which can be considered the first high-performance NLP model that is independent of text. GSLM works directly from raw audio signals, without labels or text, going from speech input to speech output, and expands the frontiers of textless NLP to diverse oral languages.
  • Last month, the Facebook team introduced the Instance-Conditioned GAN (IC-GAN), a new image generation model. It produces high-quality, diverse images whether or not the input images come from the training set and, in contrast to previous approaches, can generate realistic, unseen combinations of images.
  • Facebook also recently released Opacus, a free, open-source library for training deep learning models with differential privacy. The tool is intended to be simple, flexible, and fast, with a user-friendly API that lets ML practitioners make a training pipeline private with just a couple of lines of code, as sketched below.
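
As an illustration of that API, wrapping an existing model, optimiser and data loader for differentially private training with a recent Opacus release looks roughly as follows; the toy model and the hyperparameter values are placeholders rather than recommendations.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# A toy model, optimiser and data loader standing in for a real pipeline.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))), batch_size=8
)

# The core of Opacus: create the engine and wrap the existing components.
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,   # scale of Gaussian noise added to gradients (placeholder)
    max_grad_norm=1.0,      # per-sample gradient clipping bound (placeholder)
)
# Training then proceeds with the usual PyTorch loop over train_loader.
```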

Facebook AI's advancements in AI and ML have come a long way. Time and again, the organisation's researchers have pushed the field of artificial intelligence forward with strong, result-oriented work.
