Space-time video super-resolution (STVSR) is a computer vision task that aims to increase video resolution in both time and space: it creates a high-resolution, slow-motion video from a low-frame-rate, low-resolution input. The task can be divided into two sub-tasks: video frame interpolation (VFI), which generates intermediate video frames, and video super-resolution (VSR). Existing state-of-the-art two-stage methods use large frame-synthesis modules to predict high-resolution frames, which leads to high computational complexity and can be time-consuming. Furthermore, the frame interpolation and spatial super-resolution sub-tasks are intra-related and carry coupled information that can help speed up both.
To overcome these problems, Xiaoyu Xiang, Yapeng Tian, and their co-authors proposed a one-stage approach for space-time video super-resolution: Zooming Slow-Mo. Zooming Slow-Mo directly synthesizes a high-resolution slow-motion video from a low-frame-rate, low-resolution video. It temporally interpolates the features of missing low-resolution video frames, exploiting local temporal contexts via a feature temporal interpolation network. A deep reconstruction network then generates the high-resolution slow-motion video frames.
Architecture & Approach
The Zooming Slow-Mo framework consists of four main parts: a feature extractor, a frame feature temporal interpolation module, a deformable ConvLSTM, and a deep frame reconstruction network.
The feature extraction module consists of one convolution layer followed by five residual blocks. It extracts feature maps {F^L_{2t−1}}, t = 1, …, n+1, from the input video frames. The frame feature temporal interpolation network then uses these feature maps to generate the low-resolution feature maps of the non-existent intermediate frames.
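A minimal NumPy sketch of this extractor design (one convolution followed by stacked residual blocks). The naive `conv2d` helper and the weight shapes are illustrative stand-ins for the learned PyTorch layers, not the repository's actual implementation.

```python
import numpy as np

def conv2d(x, w):
    """Naive 3x3 same-padding convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]            # (C_in, 3, 3) receptive field
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

def residual_block(x, w1, w2):
    """conv -> ReLU -> conv, plus an identity skip connection."""
    return x + conv2d(np.maximum(conv2d(x, w1), 0), w2)

def extract_features(frame, first_w, block_ws):
    """One convolution layer followed by residual blocks (five in the paper)."""
    feat = conv2d(frame, first_w)
    for w1, w2 in block_ws:
        feat = residual_block(feat, w1, w2)
    return feat
```

With all residual-block weights at zero, each block reduces to the identity, which makes the skip-connection behaviour easy to verify.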
Frame Feature Temporal Interpolation
Given two feature maps F^L_1 and F^L_3 from the low-resolution input video frames I^L_1 and I^L_3, the aim is to synthesize the feature map F^L_2 of the missing intermediate frame I^L_2. Existing frame interpolation networks perform temporal interpolation on pixel-wise video frames, which leads to a two-stage STVSR design. In contrast, Zooming Slow-Mo learns a feature temporal interpolation function f(·) that directly synthesizes the intermediate feature map. This interpolation function can be formulated as:
F^L_2 = H(T_1(F^L_1, Φ_1), T_3(F^L_3, Φ_3))

Here T_1(·) and T_3(·) are two sampling functions, Φ_1 and Φ_3 are the corresponding sampling parameters, and H(·) is a blending function for aggregating the sampled features. To generate accurate intermediate feature maps, T_1(·) needs to capture the forward motion between F^L_1 and F^L_2, and T_3(·) needs to capture the backward motion between F^L_3 and F^L_2. However, F^L_2 does not exist, so the information flow between F^L_1 and F^L_3 is used to approximate the forward and backward motion. A linear blending function is used to combine the two sampled feature maps:
H(T_1(F^L_1, Φ_1), T_3(F^L_3, Φ_3)) = α ∗ T_1(F^L_1, Φ_1) + β ∗ T_3(F^L_3, Φ_3)

Here α and β are learnable 1 × 1 convolution kernels and ∗ is the convolution operator.
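The linear blending step can be sketched as follows, assuming the two sampled feature maps T_1(F^L_1, Φ_1) and T_3(F^L_3, Φ_3) have already been computed. A 1 × 1 convolution reduces to a per-pixel channel mix, so α and β become plain (C_out, C_in) matrices in this sketch.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution = per-pixel channel mixing. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def blend(t1_feat, t3_feat, alpha, beta):
    """H(.) from above: linearly blend the two sampled feature maps."""
    return conv1x1(t1_feat, alpha) + conv1x1(t3_feat, beta)
```

For example, with α as the identity matrix and β all zeros, the blend simply passes the forward-sampled features through unchanged.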
Deformable ConvLSTM
Temporal information is essential in video restoration tasks. Therefore, instead of reconstructing high-resolution frames from individual feature maps, Zooming Slow-Mo aggregates temporal contexts from neighbouring frames. It employs ConvLSTM, a popular 2D sequence modelling method, to perform this temporal aggregation. However, ConvLSTM can only capture motion between the previous states and the current input feature map within its small convolution receptive fields, which greatly limits its ability to handle large motions.
When working with videos containing large motions, this leads to a severe temporal mismatch between the previous states and the current feature map F^L_t. As a result, the reconstructed high-resolution frame I^H_t suffers from artifacts. To overcome this problem and make better use of global temporal contexts, a state-updating cell with deformable alignment is embedded into the ConvLSTM:
Δp^h_t = g^h([h_{t−1}, F^L_t])
Δp^c_t = g^c([c_{t−1}, F^L_t])
h^a_{t−1} = DConv(h_{t−1}, Δp^h_t)
c^a_{t−1} = DConv(c_{t−1}, Δp^c_t)
h_t, c_t = ConvLSTM(h^a_{t−1}, c^a_{t−1}, F^L_t)

Here g^h and g^c denote general functions of several convolution layers, Δp^h_t and Δp^c_t are the learned offsets, DConv is a deformable convolution, and h^a_{t−1} and c^a_{t−1} are the hidden and cell states aligned with the current feature map F^L_t. In contrast to a vanilla ConvLSTM, the deformable ConvLSTM enforces the hidden and cell states to align with the current feature map. In addition, the deformable ConvLSTM is applied in a bidirectional manner to maximize the utilization of temporal information.
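The state-update order (offset prediction, state alignment, then the ConvLSTM gates) can be sketched as below. This is a heavily simplified illustration: the deformable convolution is replaced by an identity placeholder (the real DConv resamples the state at the learned offsets), the learned convolution layers are reduced to 1 × 1 channel mixes, and all parameter names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, w):
    """Stand-in for the learned conv layers. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def dconv(state, offsets):
    """Placeholder for deformable convolution: the real DConv resamples `state`
    at the learned offsets; here it is the identity to keep the sketch short."""
    return state

def deformable_convlstm_step(h_prev, c_prev, f_t, params):
    # 1) predict alignment offsets from [h_{t-1}, F^L_t] and [c_{t-1}, F^L_t]
    dp_h = conv1x1(np.concatenate([h_prev, f_t]), params['g_h'])
    dp_c = conv1x1(np.concatenate([c_prev, f_t]), params['g_c'])
    # 2) align the previous hidden/cell states to the current feature map
    h_a = dconv(h_prev, dp_h)
    c_a = dconv(c_prev, dp_c)
    # 3) standard ConvLSTM gate equations on the aligned states
    z = np.concatenate([f_t, h_a])
    i = sigmoid(conv1x1(z, params['w_i']))   # input gate
    f = sigmoid(conv1x1(z, params['w_f']))   # forget gate
    o = sigmoid(conv1x1(z, params['w_o']))   # output gate
    g = np.tanh(conv1x1(z, params['w_g']))   # candidate cell update
    c_t = f * c_a + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

The key difference from a vanilla ConvLSTM step is that the gates consume the aligned states h^a and c^a rather than the raw previous states.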
Frame Reconstruction
A temporally shared synthesis network is used for frame reconstruction; it synthesizes the high-resolution frames from the individual hidden states h_t. The reconstruction network uses 40 stacked residual blocks to learn deep features and PixelShuffle layers for sub-pixel upscaling. A reconstruction loss is used to optimize the network:
l_rec = √(‖I^H_t − I^GT_t‖² + ε²)

Here I^GT_t denotes the t-th ground-truth high-resolution video frame, I^H_t the reconstructed frame, and ε is a constant set to 1 × 10⁻³ (a Charbonnier penalty, which is more robust to outliers than a plain L2 loss).
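A NumPy sketch of the two concrete operations named above: sub-pixel upscaling (PixelShuffle) and the Charbonnier-style reconstruction penalty. Averaging the penalty over pixels is an implementation assumption, not something the formula above specifies.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel upscaling: (C*r^2, H, W) -> (C, H*r, W*r),
    matching the channel layout of PyTorch's nn.PixelShuffle."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)    # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

def charbonnier_loss(pred, gt, eps=1e-3):
    """sqrt((I^H - I^GT)^2 + eps^2), averaged per pixel."""
    return np.mean(np.sqrt((pred - gt) ** 2 + eps ** 2))
```

Note that when prediction and ground truth match exactly, the loss bottoms out at ε rather than zero; the small constant keeps the gradient well-behaved near zero error.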
Space-Time Video Super-Resolution using Zooming Slow-Mo
- Clone the Zooming Slow-Mo GitHub repository and install its dependencies.
git clone --recursive https://github.com/Mukosame/Zooming-Slow-Mo-CVPR-2020.git
cd Zooming-Slow-Mo-CVPR-2020
pip install -r requirements.txt
- Compile the deformable convolutional network V2 (DCNv2).
cd $ZOOMING_ROOT/codes/models/modules/DCNv2
bash make.sh
- Perform space-time video super-resolution on an input video using the video_to_zsm.py script.
python codes/video_to_zsm.py --model experiments/pretrained_models/xiang2020zooming.pth --video low-res-vid.mp4 --output slow-mo-vid.mp4 --N_out 3
Last Epoch
This article introduced Zooming Slow-Mo, a one-stage framework for space-time video super-resolution that directly synthesizes high-frame-rate, high-resolution videos without generating intermediate low-resolution frames. It introduces a deformable feature interpolation network that enables feature-level temporal interpolation, and it uses a modified, deformable ConvLSTM to aggregate temporal information and handle large motions. Thanks to its one-stage design, Zooming Slow-Mo can exploit the intra-relatedness between temporal interpolation and spatial super-resolution, and it outperforms existing state-of-the-art two-stage approaches in both effectiveness and efficiency.
For a more in-depth understanding of Zooming Slow-Mo, refer to the following resources: