Mattes are a crucial part of image and video editing. They help combine a foreground image, such as actors on a set, with a background image, such as a massive city. Recent computer vision techniques can produce high-quality mattes for videos and images. However, the scene effects a subject generates, including reflections, smoke, and shadows, have so far been ignored.
To fill this gap, Google has introduced a novel method that uses layered neural rendering to partition a video into layers known as omnimattes. An omnimatte captures both the subject and all effects associated with that subject in the scene. Like traditional mattes, omnimattes are RGBA images that can be edited with readily available image or video editing software and used wherever conventional mattes are used, for example to insert text into a video beneath a smoke trail.
The work is presented in the paper titled ‘Omnimatte: Associating Objects and Their Effects in Video’. To generate omnimattes, the researchers split an input video into a set of layers:
- One layer for each moving subject
- An additional layer for stationary background objects
Take, for example, a boy walking his dog on a road. The two subjects, the boy and the dog, each get a separate layer, and the background around the road gets a layer of its own. These layers are then merged using conventional alpha blending, reproducing the input video. To produce their results, the researchers used:
- Mask R-CNN to segment the input objects.
- STM, a video object segmenter that is trained on the DAVIS dataset, to track objects across frames.
- RAFT to compute the optical flow between consecutive frames.
- Panoptic segmentation to segment dynamic background elements such as tree branches, which are then treated as additional objects.
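The alpha-blending step that merges the layers back into the input video can be illustrated with a minimal sketch. This is not the paper's code: the function name `composite_layers`, the toy 2x2 frame, and the choice of straight (non-premultiplied) alpha with values in the range 0 to 1 are all assumptions made for illustration.

```python
import numpy as np

def composite_layers(background_rgb, layers_rgba):
    """Blend RGBA layers over an RGB background, back to front.

    Assumes straight (non-premultiplied) alpha and float values in [0, 1].
    """
    out = background_rgb.astype(np.float64)
    for layer in layers_rgba:
        rgb = layer[..., :3]
        alpha = layer[..., 3:4]  # keep the last axis so it broadcasts over RGB
        out = alpha * rgb + (1.0 - alpha) * out
    return out

# Toy 2x2 frame: a mid-grey background and one hypothetical "dog" layer.
background = np.full((2, 2, 3), 0.5)
dog = np.zeros((2, 2, 4))
dog[0, 0] = [1.0, 0.0, 0.0, 1.0]  # fully opaque subject pixel
dog[1, 0] = [0.0, 0.0, 0.0, 0.4]  # semi-transparent black, standing in for a shadow

frame = composite_layers(background, [dog])
```

In this toy frame the opaque pixel replaces the background outright, while the shadow pixel only darkens it. That partial transparency is what lets a layer carry soft effects such as shadows or smoke alongside the subject.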
The paper reports the following results:
- The method successfully associates subjects with the scene effects related to them.
- The method can help remove a dynamic object from a video. This can be done either by binarising the omnimatte and using it as input to a separate video-completion method such as FGVC, or by simply excluding the object’s omnimatte layer from the reconstruction.
- The model outperformed ISD, the best existing shadow detection method.
- The model captures deformations, reflections, and shadows from a generic and much simpler input.
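The two object-removal routes described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the helper names `binarise_alpha` and `reconstruct_without`, the 0.5 threshold, and the toy layers are hypothetical. The call to a video-completion method such as FGVC is left out, because the binary mask would simply be handed to that separate tool.

```python
import numpy as np

def binarise_alpha(layer_rgba, threshold=0.5):
    """Return a boolean mask that is True where the layer's alpha exceeds threshold.

    The threshold value is an assumption; the article does not specify one.
    """
    return layer_rgba[..., 3] > threshold

def reconstruct_without(background_rgb, layers_rgba, skip_index):
    """Alpha-blend the layers back to front, leaving out the one at skip_index."""
    out = background_rgb.astype(np.float64)
    for i, layer in enumerate(layers_rgba):
        if i == skip_index:
            continue  # dropping the layer removes the object and its effects
        alpha = layer[..., 3:4]
        out = alpha * layer[..., :3] + (1.0 - alpha) * out
    return out

# Toy 1x2 frame with two hypothetical subject layers over a grey background.
background = np.full((1, 2, 3), 0.5)
boy = np.zeros((1, 2, 4))
boy[0, 0] = [0.0, 0.0, 1.0, 1.0]  # opaque blue pixel
dog = np.zeros((1, 2, 4))
dog[0, 1] = [0.0, 0.0, 0.0, 0.8]  # mostly opaque dark pixel

dog_mask = binarise_alpha(dog)  # route 1: mask for a video-completion tool
no_dog = reconstruct_without(background, [boy, dog], skip_index=1)  # route 2
```

Dropping the layer needs no extra tooling, but it relies on the background layer already containing plausible content behind the object. The mask route delegates that fill-in to the completion method.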
However, the model is unable to separate objects or effects that remain entirely stationary relative to the background throughout the video. “These issues could be addressed by building a background representation that explicitly models the 3D structure of the scene,” the paper concluded.
Tracing AI in Video Editing
In 2016, IBM used its Watson supercomputer to curate footage and create a trailer for the horror-thriller Morgan, one of the first applications of AI in video editing. Watson used machine learning to study prior trailers, then applied what it learnt to select the parts of the movie it judged appropriate for the trailer. The AI finished the job in a fraction of the time; a human would have needed hours or days to watch the entire footage and produce the final video.
In the same year, Adobe introduced Adobe Sensei, its in-house AI and ML platform, which offers several useful capabilities across its products. For example, Sensei can be used to quickly adjust and correct flaws in pictures, videos, and other media in Adobe Creative Cloud products, including Photoshop, Premiere, and Illustrator, and to improve search in Adobe Stock and Lightroom. Other AI-based tools include QuikStories from American technology firm GoPro, the end-to-end video marketing tool Magisto, the online video editing tool Rawshorts, and Lumen5, among many others.
This ability of AI to interpret videos opens up the possibility of using it in almost any type of editing tool, from colour correction and object removal to image stabilisation and visual effects. However, “deep fakes”, such as videos of politicians uttering words that they never said, remain a concern. Ethical and legal frameworks therefore need to be put in place to address these issues.