TensorFlow recently released Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can operate on streaming video for online inference. They address the main drawbacks of 3D CNNs. This article explores the challenges of 3D CNNs, how MoViNets overcome them, and the architecture behind the models.
In machine learning, video classification is the task of taking video frames as input and predicting a single class from a larger set of classes as output. A video action recognition model therefore needs to consider the content of each frame, as well as the spatial relationships between adjacent frames, to understand the actions in the video.
3D convolutional neural networks extend 2D CNNs with a temporal dimension and are used to learn spatiotemporal information from sequences of video frames. While they can learn the correlation of temporal changes between adjacent frames without additional temporal learning methods, 3D CNNs have a major inherent disadvantage: high computational complexity and excessive memory usage. Furthermore, they do not support online inference, which makes them difficult to run on mobile devices. Even the recent X3D networks, which improve efficiency, fall short in one way or another: they either require extensive memory on large temporal windows, which incurs high costs, or operate on small temporal windows, which reduces accuracy. Hence, there is a large gap between accurate models and efficient models for video action recognition. 2D MobileNet CNNs, by contrast, are fast and can operate on streaming video in real time, but their predictions are prone to be noisy and inaccurate.
TensorFlow’s MoViNets propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs. MoViNets are a family of CNNs that efficiently process video streams and output accurate predictions with a fraction of the latency of conventional CNN video classifiers.
The model has demonstrated state-of-the-art accuracy and efficiency on several large-scale video action recognition datasets. It does so in three essential steps:
- Design a video network search space and use neural architecture search to generate efficient and diverse 3D CNN architectures.
- Apply the Stream Buffer technique to decouple memory from video clip duration, so CNNs can embed arbitrary-length streaming video with less memory usage.
- Use an ensembling technique that improves accuracy without sacrificing efficiency.
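To make the third step concrete, here is a minimal sketch of a two-model ensemble. The article does not describe how the predictions are combined, so averaging the logits before the softmax is an assumption made for illustration, and the function names and logit values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(logits_a, logits_b):
    """Average two models' logits (assumed combination rule), then softmax."""
    mean_logits = [(a + b) / 2.0 for a, b in zip(logits_a, logits_b)]
    probs = softmax(mean_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Hypothetical logits from two independently trained models over 3 classes.
logits_a = [2.0, 1.0, 0.1]
logits_b = [0.5, 2.5, 0.2]
best, probs = ensemble_predict(logits_a, logits_b)
print(best, probs)
```

Taken alone, the first model prefers class 0 and the second prefers class 1. The ensemble picks the class with the highest averaged logit.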
MoViNets are trained on the Kinetics-600 dataset. Kinetics is a large-scale, high-quality collection of URL links to up to 650,000 human-annotated video clips, released in versions that cover 400, 600, or 700 human action classes, including human-object and human-human interactions. Through this training, MoViNets can identify 600 human actions such as playing the trumpet, robot dancing, or bowling. They can also classify video streams captured on a modern smartphone in real time.
MoViNets let users enjoy the benefits of both 2D frame-based classifiers and 3D video classifiers while mitigating their disadvantages. They do so with a hybrid approach that replaces 3D convolutions with causal convolutions.
Causal convolutions are a form of convolution used for temporal data. They ensure that the output at a given time step depends only on current and past inputs, so the model cannot violate the temporal ordering of the data. This allows intermediate activations to be cached across frames with a Stream Buffer. The Stream Buffer copies the input activations of all 3D operations, which are output by the model and then fed back into it on the next clip input. As a result, MoViNets can receive one frame at a time, which reduces peak memory usage with no loss of accuracy. By contrast, a standard 3D CNN processes all frames in a video clip simultaneously, which takes up significant memory.
MoViNets also search for efficient model configurations using Neural Architecture Search (NAS), a widely used technique for automating the design of artificial neural networks. The search runs on video datasets across network width, depth, and resolution. The result is a set of action classifiers that output temporally stable predictions that transition smoothly based on frame content. This overcomes the sub-optimal performance of 2D frame-based classifiers, whose predictions lack temporal reasoning.
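The interplay of causal convolutions and a stream buffer can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not MoViNets' actual implementation: each "frame" is reduced to a single number, the convolution is one-dimensional over time, and the names `causal_conv_full` and `StreamingCausalConv` are hypothetical.

```python
from collections import deque

def causal_conv_full(frames, kernel):
    """Causal temporal convolution over a whole clip at once.

    Output t depends only on frames t-(k-1)..t; missing past frames are zeros.
    The kernel is ordered from oldest to newest input.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(w * x for w, x in zip(kernel, padded[t:t + k]))
        for t in range(len(frames))
    ]

class StreamingCausalConv:
    """The same convolution, fed one frame at a time with a fixed-size buffer."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        size = len(self.kernel) - 1
        # The buffer caches the last k-1 inputs; it starts as zeros (no past yet).
        self.buffer = deque([0.0] * size, maxlen=size)

    def step(self, frame):
        window = list(self.buffer) + [frame]
        out = sum(w * x for w, x in zip(self.kernel, window))
        self.buffer.append(frame)  # the oldest cached input is dropped automatically
        return out

frames = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy scalar "frames"
kernel = [0.25, 0.5, 1.0]            # hypothetical temporal kernel of size 3

full = causal_conv_full(frames, kernel)
conv = StreamingCausalConv(kernel)
streamed = [conv.step(f) for f in frames]
assert streamed == full  # streaming matches whole-clip processing
```

The streaming version never holds more than `k - 1` past inputs, regardless of clip length, while producing the same outputs as the whole-clip version. That is the sense in which memory is decoupled from video duration.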
“These three progressive techniques allow MoViNets to achieve state-of-the-art accuracy and efficiency on the Kinetics, Moments in Time, and Charades video action recognition datasets,” said the team. MoViNet-A5-Stream matches the accuracy of X3D-XL on Kinetics 600 while requiring 80% fewer FLOPs and 65% less memory.
Several model variants are already available, with more to come.
MoViNet-A0-Stream, MoViNet-A1-Stream, and MoViNet-A2-Stream are the smaller models that run in real time. Architectural modifications such as replacing the hard swish activation with ReLU6 and removing the Squeeze-and-Excitation layers allowed the team to quantize MoViNet without an accuracy drop. Further, the models were converted to TensorFlow Lite with integer-based post-training quantization to reduce model size and speed up inference on mobile CPUs. As a result, the quantized models can still provide accurate predictions.
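For readers unfamiliar with the two activations, the sketch below defines ReLU6 and hard swish and runs a toy 8-bit affine quantization round trip. The quantizer is a simplified illustration written for this article, not TensorFlow Lite's implementation. The remark that ReLU6's bounded output range makes it easier to quantize is a commonly cited rationale and an assumption here, not a claim from the team.

```python
def relu6(x):
    """ReLU capped at 6: output always lies in [0, 6]."""
    return min(max(x, 0.0), 6.0)

def hard_swish(x):
    """Hard swish: x * relu6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0

def quantize(values, num_bits=8):
    """Toy affine quantization: real is approximately scale * (q - zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = min(min(values), 0.0)   # make sure 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - lo / scale)
    q = [min(max(round(v / scale) + zero_point, qmin), qmax) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [scale * (qi - zero_point) for qi in q]

# Hypothetical pre-activation values passed through ReLU6, then quantized.
activations = [relu6(x) for x in [-2.0, 0.0, 1.5, 3.0, 6.0, 8.0]]
q, scale, zero_point = quantize(activations)
restored = dequantize(q, scale, zero_point)
print(q, restored)
```

Because ReLU6 clips its output to the range 0 to 6, the quantization range is known in advance and the round-trip error stays within about one quantization step.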