Space-time video super-resolution (STVSR) is a computer vision task that aims to increase video resolution in both time and space: it creates a high-resolution, slow-motion video from a low-frame-rate, low-resolution input. The task can be divided into two sub-tasks: video frame interpolation (VFI), which generates intermediate video frames, and video super-resolution (VSR). Existing state-of-the-art two-stage methods use large frame-synthesis modules to predict high-resolution frames, which leads to high computational complexity and can be time-consuming. Furthermore, the frame interpolation and spatial super-resolution sub-tasks are intra-related and carry coupled information that can help speed up both.
To overcome these problems, Xiaoyu Xiang, Yapeng Tian, and their co-authors proposed a one-stage approach for space-time video super-resolution: Zooming Slow-Mo. Zooming Slow-Mo directly synthesizes a high-resolution slow-motion video from a low-frame-rate, low-resolution video. It temporally interpolates the features of missing low-resolution video frames, exploiting local temporal contexts via a feature temporal interpolation network. A deep reconstruction network then generates the high-resolution slow-motion video frames.
Architecture & Approach
The Zooming Slow-Mo framework consists of four main parts: a feature extractor, a frame feature temporal interpolation module, a deformable ConvLSTM, and a deep frame reconstruction network.
The feature extraction module consists of one convolution layer followed by five residual blocks. It extracts feature maps {F^L_{2t−1}}, t = 1, …, n+1, from the input video frames. The frame feature temporal interpolation network then uses these feature maps to generate the low-resolution feature maps of the non-existent intermediate frames.
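A minimal NumPy sketch of this extractor design (one convolution followed by stacked residual blocks). The naive `conv2d` helper and the weight shapes are illustrative stand-ins for the learned PyTorch layers, not the repository's actual implementation.

```python
import numpy as np

def conv2d(x, w):
    """Naive 3x3 same-padding convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]            # (C_in, 3, 3) receptive field
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

def residual_block(x, w1, w2):
    """conv -> ReLU -> conv, plus an identity skip connection."""
    return x + conv2d(np.maximum(conv2d(x, w1), 0), w2)

def extract_features(frame, first_w, block_ws):
    """One convolution layer followed by residual blocks (five in the paper)."""
    feat = conv2d(frame, first_w)
    for w1, w2 in block_ws:
        feat = residual_block(feat, w1, w2)
    return feat
```

With all residual-block weights at zero, each block reduces to the identity, which makes the skip-connection behaviour easy to verify.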
Frame Feature Temporal Interpolation
Given two feature maps F^L_1 and F^L_3 from the low-resolution input video frames I^L_1 and I^L_3, the aim is to synthesize the feature map F^L_2 of the missing intermediate frame I^L_2. Existing frame interpolation networks perform temporal interpolation on pixel-wise video frames, which leads to a two-stage STVSR design. In contrast, Zooming Slow-Mo learns a feature temporal interpolation function f(·) that directly synthesizes the intermediate feature map. This interpolation function can be formulated as:
F^L_2 = H(T_1(F^L_1, Φ_1), T_3(F^L_3, Φ_3))

Here T_1(·) and T_3(·) are two sampling functions, Φ_1 and Φ_3 are the corresponding sampling parameters, and H(·) is a blending function for aggregating the sampled features. To generate accurate intermediate feature maps, T_1(·) needs to capture the forward motion between F^L_1 and F^L_2, and T_3(·) needs to capture the backward motion between F^L_3 and F^L_2. However, F^L_2 does not exist, so the information flow between F^L_1 and F^L_3 is used to approximate the forward and backward motion. A linear blending function is used to combine the two sampled feature maps:
H(T_1(F^L_1, Φ_1), T_3(F^L_3, Φ_3)) = α ∗ T_1(F^L_1, Φ_1) + β ∗ T_3(F^L_3, Φ_3)

Here α and β are learnable 1 × 1 convolution kernels and ∗ is the convolution operator.
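The linear blending step can be sketched as follows, assuming the two sampled feature maps T_1(F^L_1, Φ_1) and T_3(F^L_3, Φ_3) have already been computed. A 1 × 1 convolution reduces to a per-pixel channel mix, so α and β become plain (C_out, C_in) matrices in this sketch.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution = per-pixel channel mixing. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def blend(t1_feat, t3_feat, alpha, beta):
    """H(.) from above: linearly blend the two sampled feature maps."""
    return conv1x1(t1_feat, alpha) + conv1x1(t3_feat, beta)
```

For example, with α as the identity matrix and β all zeros, the blend simply passes the forward-sampled features through unchanged.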
Deformable ConvLSTM
Temporal information is essential in video restoration tasks. Therefore, instead of reconstructing high-resolution frames from individual feature maps, Zooming Slow-Mo aggregates temporal contexts from neighbouring frames. It employs ConvLSTM, a popular 2D sequence modelling method, to perform this temporal aggregation. However, ConvLSTM can only capture motion between the previous states and the current input feature map within its small convolution receptive fields, which greatly limits its ability to handle large motions.
When working with videos containing large motions, this leads to a severe temporal mismatch between the previous states and the current feature map F^L_t. As a result, the reconstructed high-resolution frame I^H_t suffers from artifacts. To overcome this problem and make better use of global temporal contexts, a state-updating cell with deformable alignment is embedded into the ConvLSTM:
Δp^h_t = g^h([h_{t−1}, F^L_t])
Δp^c_t = g^c([c_{t−1}, F^L_t])
h^a_{t−1} = DConv(h_{t−1}, Δp^h_t)
c^a_{t−1} = DConv(c_{t−1}, Δp^c_t)
h_t, c_t = ConvLSTM(h^a_{t−1}, c^a_{t−1}, F^L_t)

Here g^h and g^c denote general functions of several convolution layers, Δp^h_t and Δp^c_t are the learned offsets, DConv is a deformable convolution, and h^a_{t−1} and c^a_{t−1} are the hidden and cell states aligned with the current feature map F^L_t. In contrast to a vanilla ConvLSTM, the deformable ConvLSTM enforces the hidden and cell states to align with the current feature map. In addition, the deformable ConvLSTM is applied in a bidirectional manner to maximize the utilization of temporal information.
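The state-update order (offset prediction, state alignment, then the ConvLSTM gates) can be sketched as below. This is a heavily simplified illustration: the deformable convolution is replaced by an identity placeholder (the real DConv resamples the state at the learned offsets), the learned convolution layers are reduced to 1 × 1 channel mixes, and all parameter names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, w):
    """Stand-in for the learned conv layers. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def dconv(state, offsets):
    """Placeholder for deformable convolution: the real DConv resamples `state`
    at the learned offsets; here it is the identity to keep the sketch short."""
    return state

def deformable_convlstm_step(h_prev, c_prev, f_t, params):
    # 1) predict alignment offsets from [h_{t-1}, F^L_t] and [c_{t-1}, F^L_t]
    dp_h = conv1x1(np.concatenate([h_prev, f_t]), params['g_h'])
    dp_c = conv1x1(np.concatenate([c_prev, f_t]), params['g_c'])
    # 2) align the previous hidden/cell states to the current feature map
    h_a = dconv(h_prev, dp_h)
    c_a = dconv(c_prev, dp_c)
    # 3) standard ConvLSTM gate equations on the aligned states
    z = np.concatenate([f_t, h_a])
    i = sigmoid(conv1x1(z, params['w_i']))   # input gate
    f = sigmoid(conv1x1(z, params['w_f']))   # forget gate
    o = sigmoid(conv1x1(z, params['w_o']))   # output gate
    g = np.tanh(conv1x1(z, params['w_g']))   # candidate cell update
    c_t = f * c_a + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

The key difference from a vanilla ConvLSTM step is that the gates consume the aligned states h^a and c^a rather than the raw previous states.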
Frame Reconstruction
A temporally shared synthesis network is used for frame reconstruction; it synthesizes the high-resolution frames from the individual hidden states h_t. The reconstruction network uses 40 stacked residual blocks to learn deep features and PixelShuffle layers for sub-pixel upscaling. A reconstruction loss is used to optimize the network:
l_rec = √(‖I^H_t − I^GT_t‖² + ε²)

Here I^GT_t denotes the t-th ground-truth high-resolution video frame, I^H_t the reconstructed frame, and ε is a constant set to 1 × 10⁻³ (a Charbonnier penalty, which is more robust to outliers than a plain L2 loss).
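A NumPy sketch of the two concrete operations named above: sub-pixel upscaling (PixelShuffle) and the Charbonnier-style reconstruction penalty. Averaging the penalty over pixels is an implementation assumption, not something the formula above specifies.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel upscaling: (C*r^2, H, W) -> (C, H*r, W*r),
    matching the channel layout of PyTorch's nn.PixelShuffle."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)    # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

def charbonnier_loss(pred, gt, eps=1e-3):
    """sqrt((I^H - I^GT)^2 + eps^2), averaged per pixel."""
    return np.mean(np.sqrt((pred - gt) ** 2 + eps ** 2))
```

Note that when prediction and ground truth match exactly, the loss bottoms out at ε rather than zero; the small constant keeps the gradient well-behaved near zero error.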
Space-Time Video Super-Resolution using Zooming Slow-Mo
- Clone the Zooming Slow-Mo GitHub repository and install its dependencies.
git clone --recursive https://github.com/Mukosame/Zooming-Slow-Mo-CVPR-2020.git
cd Zooming-Slow-Mo-CVPR-2020
pip install -r requirements.txt
- Compile the deformable convolutional network V2 (DCNv2).
cd $ZOOMING_ROOT/codes/models/modules/DCNv2
bash make.sh
- Perform space-time video super-resolution on an input video using the video_to_zsm.py script.
python codes/video_to_zsm.py --model experiments/pretrained_models/xiang2020zooming.pth --video low-res-vid.mp4 --output slow-mo-vid.mp4 --N_out 3
Last Epoch
This article introduced Zooming Slow-Mo, a one-stage framework for space-time video super-resolution that directly synthesizes high-frame-rate, high-resolution videos without generating intermediate low-resolution frames. It introduces a deformable feature interpolation network that enables feature-level temporal interpolation, and it uses a modified, deformable ConvLSTM to aggregate temporal information and handle large motions. Thanks to its one-stage design, Zooming Slow-Mo can exploit the intra-relatedness between temporal interpolation and spatial super-resolution, and it outperforms existing state-of-the-art two-stage approaches in both effectiveness and efficiency.
For a more in-depth understanding of Zooming Slow-Mo, refer to the following resources: