In a recent paper, IBM describes an AI system called Navsynth that generates videos resembling those seen during training as well as entirely unseen ones. Video synthesis itself is not a novel idea; it is an area of deep interest for companies such as DeepMind. According to the researchers, their approach produces higher-quality videos than existing methods and could be used to synthesise videos on which other AI systems train, supplementing real-world data sets that are incomplete or marred by corrupted samples.
Explaining further, the researchers stated, “the bulk of work in the video synthesis domain leverages GANs, or two-part neural networks consisting of generators that produce samples and discriminators that attempt to distinguish between the generated samples and real-world samples. They’re highly capable but suffer from a phenomenon called mode collapse, where the generator generates a limited diversity of samples (or even the same sample) regardless of the input.”
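Mode collapse is easy to demonstrate in miniature. In the hedged sketch below (all names and dimensions are illustrative, not from IBM's paper), a "collapsed" generator has learned to ignore its noise input and emit the same sample every time, so the diversity of its outputs drops to zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# A healthy generator maps different noise vectors to different samples.
def healthy_generator(z, W):
    return np.tanh(z @ W)

# A collapsed generator ignores its input: whatever noise it receives,
# it emits the one sample it found that fools the discriminator.
def collapsed_generator(z, W):
    fixed_mode = np.ones(W.shape[1])  # the single mode it collapsed to
    return np.tanh(np.broadcast_to(fixed_mode, (z.shape[0], W.shape[1])))

W = rng.normal(size=(8, 4))
z = rng.normal(size=(5, 8))  # five distinct noise vectors

healthy = healthy_generator(z, W)
collapsed = collapsed_generator(z, W)

# Diversity measured as mean pairwise distance between generated samples.
def diversity(samples):
    diffs = samples[:, None, :] - samples[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(-1)).mean())

print(diversity(healthy))    # > 0: varied samples
print(diversity(collapsed))  # 0: same sample regardless of input
```

A real GAN collapses gradually during adversarial training rather than by construction, but the symptom is the same: distinct inputs yield near-identical outputs.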
In contrast, IBM’s system consists of a variable representing the video’s content features, a frame-specific transient variable, a generator, and a recurrent machine learning model. Together, these break a video down into a static constituent, which captures the portion of the video common to all frames, and a transient constituent, which represents the temporal dynamics between frames. The system learns the static and transient constituents jointly and uses them to generate videos at inference time.
In this paper, IBM’s researchers proposed a novel non-adversarial framework to generate videos in a controllable manner without any reference frame. Specifically, they proposed to synthesise videos from two optimised latent spaces, one providing control over the static portion of the video (the static latent space) and the other over the transient portion (the transient latent space). The researchers proposed to jointly optimise these two spaces while optimising the network weights (a generative and a recurrent network) with the help of a regression-based reconstruction loss and a triplet loss.
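The two loss terms can be sketched as follows. This is a hedged approximation of what the article describes, with a standard mean-squared-error reconstruction term and a standard Euclidean triplet term; the margin and weighting are hypothetical choices, not values from the paper:

```python
import numpy as np

def reconstruction_loss(generated, target):
    # Regression-based reconstruction: mean squared error over pixels.
    return float(((generated - target) ** 2).mean())

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Pull a latent code toward a sample from the same video (positive)
    # and push it away from a sample from another video (negative).
    d_pos = np.sqrt(((anchor - positive) ** 2).sum())
    d_neg = np.sqrt(((anchor - negative) ** 2).sum())
    return float(max(d_pos - d_neg + margin, 0.0))

# Joint objective: both latent spaces and the network weights would be
# updated by gradient descent on a weighted sum of the two terms.
# (The 0.5 weight is illustrative.)
def total_loss(generated, target, anchor, positive, negative):
    return reconstruction_loss(generated, target) + \
        0.5 * triplet_loss(anchor, positive, negative)
```

In a non-adversarial setup like this there is no discriminator to fool, so there is no adversarial game to collapse; the latents are simply regressed toward reproducing the training videos while the triplet term keeps codes from different videos apart.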
To evaluate the approach, the research team trained, validated, and tested the system on three publicly available data sets: Chair-CAD, which consists of 1,393 3D models of chairs (of which 820 were chosen, using the first 16 frames); Weizmann Human Action, which provides ten different actions performed by nine people, amounting to 90 videos; and the Golf scene data set, which contains 20,268 golf videos (of which 500 were chosen).
Compared with the videos generated by several baseline models, the proposed method produces visually sharper and consistently better results using the non-adversarial training protocol. Moreover, it reportedly demonstrated a knack for frame interpolation, a form of video processing in which intermediate frames are generated between existing ones to make motion appear more fluid.
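In a latent-variable model like this, frame interpolation typically amounts to interpolating between the latent codes of two existing frames and decoding each in-between code into a new frame. A minimal sketch of the interpolation step, under that assumption (the function name and dimensions are illustrative):

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps):
    """Linearly interpolate between the latent codes of two existing
    frames; decoding each interpolated code would yield an in-between
    frame, smoothing the motion from frame A to frame B."""
    alphas = np.linspace(0.0, 1.0, steps + 2)[1:-1]  # exclude endpoints
    return np.stack([(1 - a) * z_a + a * z_b for a in alphas])

z_a = np.zeros(8)  # latent code of an existing frame A
z_b = np.ones(8)   # latent code of an existing frame B
mids = interpolate_latents(z_a, z_b, steps=3)
print(mids.shape)  # (3, 8): three intermediate codes
print(mids[1])     # the midpoint code: all 0.5
```

Whether the in-between frames look plausible depends on how smooth the learned latent space is, which is one reason sharper, more consistent generations tend to go hand in hand with better interpolation.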