“TimeSformer is the first video architecture that’s based purely on Transformers.” – Facebook AI
Recently, researchers from Facebook AI have introduced a new video architecture known as TimeSformer. TimeSformer (Time-Space Transformer) is built exclusively on self-attention architectures, i.e. Transformers. In a paper called “Is Space-Time Attention All You Need for Video Understanding?” the researchers claimed that TimeSformer is faster to train and has higher test-time efficiency than competing architectures.
Self-attention architectures like Transformers are outstanding at capturing long-range dependencies among words, and they scale well during training. They have been widely used for machine translation, question answering, general language understanding, and auto-regressive word generation, among other tasks.
Transformers’ recent success in the domain of natural language processing (NLP) has motivated researchers to implement this model in computer vision applications and tasks.
The Tech Behind
As mentioned above, TimeSformer is built purely on the self-attention mechanism used in Transformer models. According to the researchers, to apply Transformers to video, the model interprets the input video as a time-space sequence of image patches extracted from the individual frames. The model then captures each patch’s semantics by explicitly comparing it with the other patches in the video. This allows TimeSformer to capture both short-term dependencies between neighbouring patches and long-range correlations between distant patches.
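The decomposition step above can be sketched as a small calculation. This is an illustrative sketch, not the official implementation; the sizes used (224×224 frames, 16×16 patches, 8 frames) are the paper's reported defaults and are assumptions here.

```python
def patch_sequence_length(height, width, patch, frames):
    """Number of non-overlapping patches a clip is decomposed into
    when each frame is split into a grid of patch x patch squares."""
    patches_per_frame = (height // patch) * (width // patch)
    return patches_per_frame * frames

# An 8-frame 224x224 clip with 16x16 patches gives 196 patches per
# frame, so the Transformer sees a sequence of 1568 patch tokens.
seq_len = patch_sequence_length(224, 224, 16, 8)
```

Each patch in this sequence is then linearly embedded and tagged with its space-time position before self-attention compares it against the others.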
Why It Matters
Compared to convolutional neural networks (CNNs), Transformers impose less restrictive inductive biases, and that’s one of the main reasons behind using transformer-based video architectures. This, in turn, broadens the family of functions they can represent and makes them better suited to modern big-data regimes, where strong inductive priors are less necessary.
The computational costs of traditional 3D CNNs are prohibitively high. Meanwhile, TimeSformer maintains a low computational cost by decomposing the video into a small set of non-overlapping patches and then applying a form of self-attention that avoids exhaustive comparison between all pairs of patches.
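The saving from avoiding exhaustive pairwise comparison can be made concrete with a back-of-the-envelope count. The sketch below contrasts joint space-time attention (every patch attends to every patch in the clip) with a divided scheme in which each patch attends along time and along space separately; the patch and frame counts are assumed defaults, and the formulas are a simplification for illustration.

```python
def joint_attention_comparisons(s, t):
    # Joint space-time attention: all S*T patches attend to all
    # S*T patches, so the number of pairwise comparisons is (S*T)^2.
    n = s * t
    return n * n

def divided_attention_comparisons(s, t):
    # Divided attention: each of the S*T patches first attends to the
    # T patches at the same spatial location (temporal step), then to
    # the S patches within its own frame (spatial step).
    return s * t * (t + s)

s, t = 196, 8  # patches per frame and frames (assumed clip settings)
joint = joint_attention_comparisons(s, t)      # 2,458,624 comparisons
divided = divided_attention_comparisons(s, t)  # 319,872 comparisons
```

Even at this modest clip size, the divided scheme needs roughly an order of magnitude fewer comparisons, and the gap widens as clips get longer or higher-resolution.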
Also, while convolutional kernels are specifically designed to capture short-range spatiotemporal information, the self-attention mechanism can be applied to capture both local and global long-range dependencies by directly comparing feature activations at all space-time locations.
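To make the "directly comparing feature activations" point concrete, here is a minimal scaled dot-product attention in plain Python. It is a generic sketch of the mechanism, not TimeSformer's code: every query is compared against every key regardless of how far apart the positions are, which is what lets attention capture global dependencies in one step.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors.
    Each query is scored against ALL keys, near or distant, and the
    output is the softmax-weighted average of the value vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

A convolutional kernel, by contrast, would only ever mix values inside its fixed local window; reaching a distant location requires stacking many layers.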
Here are some of the reasons why this Transformer matters:
- Compared with modern 3D convolutional neural networks (CNNs), TimeSformer proved roughly three times faster to train and requires less than one-tenth the compute for inference. This is a step toward supporting applications that require real-time or on-demand processing of videos.
- The scalability of this video Transformer enables the training of much larger models on longer video clips. According to the researchers, this opens the door to AI systems that can understand complex human actions and behaviours in videos.
The researchers stated, “The low inference cost of TimeSformer is an important step toward supporting future real-time video processing applications, such as AR/VR, or intelligent assistants that provide services based on video taken from wearable cameras.”
TimeSformer adapts the standard Transformer architecture to video, which means the new architecture could break new ground in video modelling. The model achieved state-of-the-art results on major action recognition benchmarks, including the Kinetics-400 and Kinetics-600 datasets.