In recent times, image and video editing has seen many advances, such as background replacement, efficient foreground detection, filters, and image-quality enhancement. From regular mobile cameras to social media, these features have become very popular.
Image (or digital) matting is a popular computer-vision task: accurately extracting the foreground from images and video. The matte defines which pixels belong to the foreground and which to the background; for pixels along the boundary, in blended or semi-transparent regions, it specifies the mixture of foreground and background at each pixel. These are called partial or mixed pixels. Matting is one of the key techniques used in film production for creating visual effects and has been under constant research. It differs from image segmentation, which assigns every pixel entirely to either the foreground or the background, whereas matting estimates a fractional opacity for each pixel.
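This mixture is usually expressed with the compositing equation I = αF + (1 − α)B, where α is the matte, F the foreground and B the background. Below is a minimal sketch of compositing a predicted matte over a new background; the file names are placeholders, and, like the demo code later in this article, it uses the original image as an approximation of the foreground.
import cv2
import numpy as np

# Hypothetical file names, for illustration only.
image = cv2.imread('portrait.jpg').astype(np.float32)       # original image, used as F
alpha = cv2.imread('matte.png', cv2.IMREAD_GRAYSCALE)       # predicted alpha matte
alpha = (alpha.astype(np.float32) / 255.0)[:, :, None]      # scale to [0, 1], add channel axis
background = np.full_like(image, 255.0)                     # plain white background B

# Compositing equation: output = alpha * F + (1 - alpha) * B
composite = alpha * image + (1 - alpha) * background
cv2.imwrite('composite.png', composite.astype(np.uint8))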
For portrait matting, some existing methods use a green screen to obtain alpha values in blended regions. When a green screen is unavailable, a trimap is used instead, but trimaps are costly and hard to produce. Therefore, semantic estimation is needed to locate the background and predict the alpha matte. The proposed trimap-free method keeps the foreground focus limited to humans only.
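For context, a trimap is a rough three-level map marking definite foreground, definite background, and an unknown band in between where the alpha values must be estimated. A minimal sketch of how one is commonly generated from a binary person mask by erosion and dilation (the mask file name and band width are assumptions for illustration):
import cv2
import numpy as np

# Hypothetical input: a binary person mask (255 = foreground, 0 = background).
mask = cv2.imread('person_mask.png', cv2.IMREAD_GRAYSCALE)

kernel = np.ones((15, 15), np.uint8)                # band width is an arbitrary choice
sure_fg = cv2.erode(mask, kernel)                   # shrink mask -> definite foreground
grown = cv2.dilate(mask, kernel)                    # grow mask -> outside is definite background

trimap = np.full(mask.shape, 128, dtype=np.uint8)   # 128 marks the unknown / blended band
trimap[sure_fg == 255] = 255                        # definite foreground
trimap[grown == 0] = 0                              # definite background
cv2.imwrite('trimap.png', trimap)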
In a recent paper, “Is a Green Screen Really Necessary for Real-Time Portrait Matting?” (29 Nov 2020), Zhanghan Ke, Kaican Li, Yurou Zhou, Qiuhua Wu, Xiangyu Mao, Qiong Yan, and Rynson W. H. Lau propose MODNet, a matting objective decomposition network that works from a single input image in real time under varying scene changes. The neural network is trained together with a self-supervised sub-objective consistency (SOC) strategy, and a one-frame delay (OFD) trick is used to smooth the portrait sequence. Being trimap-free, it is easy to use, and it runs at 63 fps. It predicts the alpha matte from a single RGB image with one model, preferably run on a GPU.
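Roughly speaking, SOC exploits the fact that the network's sub-objective predictions should agree with each other on unlabeled real-world data. The sketch below is a heavily simplified illustration of that consistency idea, not the paper's exact formulation; the forward-call signature and loss weighting here are assumptions.
import torch
import torch.nn.functional as F

def soc_consistency_loss(modnet, unlabeled_image):
    """Simplified self-supervised consistency between MODNet-style sub-objectives.

    Assumes the model returns (semantic, detail, matte) predictions, with the
    semantic output at a lower resolution than the matte; the official repository
    interface may differ.
    """
    semantic, detail, matte = modnet(unlabeled_image, False)

    # The matte, downsampled to the semantic resolution, should agree with the
    # coarse semantic prediction.
    matte_lr = F.interpolate(matte, size=semantic.shape[2:], mode='bilinear', align_corners=False)
    semantic_consistency = torch.mean(torch.abs(matte_lr - semantic))

    # Around boundaries, the matte should agree with the detail prediction.
    detail_consistency = torch.mean(torch.abs(matte - detail))

    return semantic_consistency + detail_consistency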
The workflow of the framework is given below:
(a) Supervised training on labelled data containing nearly 3k annotated foregrounds.
(b) The model is then adapted by performing SOC on about 400 unlabeled video clips, split into nearly 50,000 frames, which were downloaded from the internet.
(c) OFD is applied to smooth the output frames over time (a sketch of this rule follows the list below).
This amount of labelled data is quite limited for a trimap-free method, which makes the task more challenging.
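The OFD trick in item (c) can be summarised as follows: if a pixel's alpha value in frames t−1 and t+1 agree with each other but disagree with frame t, the frame-t value is treated as flicker and replaced by the average of its neighbours. A minimal sketch of that rule (the threshold value is an assumption):
import numpy as np

def ofd_smooth(alpha_prev, alpha_curr, alpha_next, threshold=0.1):
    """One-frame-delay smoothing of an alpha matte sequence (simplified).

    Where the previous and next frames agree but the current frame deviates
    from both, the current value is assumed to be flicker and is replaced by
    the neighbours' average.
    """
    neighbours_agree = np.abs(alpha_prev - alpha_next) < threshold
    curr_deviates = (np.abs(alpha_curr - alpha_prev) > threshold) & \
                    (np.abs(alpha_curr - alpha_next) > threshold)
    flicker = neighbours_agree & curr_deviates

    smoothed = alpha_curr.copy()
    smoothed[flicker] = (alpha_prev[flicker] + alpha_next[flicker]) / 2.0
    return smoothed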
- Paper: link
- GitHub: Repo Link
- Try offline demo: Gradio app
- Try Realtime demo: Colab notebook
- Video Demo: https://youtu.be/PqJ3BRHX3Lc
MODNet Architecture
Starting with an input image I, MODNet predicts human semantics (sp), boundary details (dp), and the final alpha matte (αp) through three interdependent branches: S (a low-resolution branch), D (a high-resolution branch), and F (a fusion branch), respectively. Each branch is constrained by supervision derived from the ground-truth matte (αg). The decomposed sub-objectives are interrelated and strengthen each other, and MODNet can be optimised end-to-end.
The low-resolution branch S performs semantic estimation to locate the human, using a MobileNetV2 backbone chosen for real-time efficiency.
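As a rough mental model of this decomposition (this is illustrative only, not the repository's actual implementation; the sub-modules and shapes are placeholders), the three branches can be pictured as:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeBranchMattingSketch(nn.Module):
    """Schematic of MODNet-style objective decomposition (illustrative only)."""

    def __init__(self):
        super().__init__()
        # Placeholder layers standing in for the real backbone and branches.
        self.encoder = nn.Conv2d(3, 16, 3, stride=2, padding=1)    # shared features
        self.semantic_branch = nn.Conv2d(16, 1, 3, padding=1)      # S: low-resolution semantics
        self.detail_branch = nn.Conv2d(16 + 1, 1, 3, padding=1)    # D: high-resolution boundary detail
        self.fusion_branch = nn.Conv2d(2, 1, 3, padding=1)         # F: fuse semantics + detail

    def forward(self, image):
        features = self.encoder(image)

        # S predicts coarse human semantics at low resolution.
        semantic = torch.sigmoid(self.semantic_branch(F.avg_pool2d(features, 2)))

        # D refines boundaries at higher resolution, guided by the semantics.
        semantic_up = F.interpolate(semantic, size=features.shape[2:], mode='bilinear', align_corners=False)
        detail = torch.sigmoid(self.detail_branch(torch.cat([features, semantic_up], dim=1)))

        # F fuses both predictions into the final alpha matte at input resolution.
        fused = self.fusion_branch(torch.cat([semantic_up, detail], dim=1))
        matte = torch.sigmoid(F.interpolate(fused, size=image.shape[2:], mode='bilinear', align_corners=False))
        return semantic, detail, matte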
Benchmark Results
The paper reports comparison results against both trimap-based and trimap-free methods, validated on benchmark portrait and video matting datasets.
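Such comparisons are typically reported in terms of the mean squared error (MSE) and mean absolute difference (MAD) between the predicted and ground-truth alpha mattes. A small sketch of how these metrics can be computed (the array names are placeholders):
import numpy as np

def matte_errors(pred_alpha, gt_alpha):
    """MSE and MAD between predicted and ground-truth alpha mattes in [0, 1]."""
    pred = pred_alpha.astype(np.float64)
    gt = gt_alpha.astype(np.float64)
    mse = np.mean((pred - gt) ** 2)
    mad = np.mean(np.abs(pred - gt))
    return mse, mad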
How to run the model on your system:
- To run the model in the cloud: Colab
- To run it locally on your system: instructions
run.py – source code from the GitHub repository
Importing libraries
import numpy as np
import cv2
from PIL import Image
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from src.models.modnet import MODNet  # MODNet model definition from the repository
Normalisation function
torch_transforms = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ]
)
Load the pretrained model
print('Load pre-trained MODNet...')
pretrained_ckpt = './pretrained/modnet_webcam_portrait_matting.ckpt'
modnet = MODNet(backbone_pretrained=False)
modnet = nn.DataParallel(modnet)
Check if a GPU is present, otherwise run on the CPU
GPU = True if torch.cuda.device_count() > 0 else False
if GPU:
    print('Use GPU...')
    modnet = modnet.cuda()
    modnet.load_state_dict(torch.load(pretrained_ckpt))
else:
    print('Use CPU...')
    modnet.load_state_dict(torch.load(pretrained_ckpt, map_location=torch.device('cpu')))
Set the model to evaluation (inference) mode
modnet.eval()
Initialize the webcam and set parameters
print('Init WebCam...')
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
The webcam feed is read frame by frame; each frame is converted to RGB, resized, cropped and mirrored
print('Start matting...')
while(True):
    _, frame_np = cap.read()
    frame_np = cv2.cvtColor(frame_np, cv2.COLOR_BGR2RGB)
    frame_np = cv2.resize(frame_np, (910, 512), cv2.INTER_AREA)
    frame_np = frame_np[:, 120:792, :]
    frame_np = cv2.flip(frame_np, 1)
Matting then runs on each frame (the following code continues inside the loop above)
    frame_PIL = Image.fromarray(frame_np)
    frame_tensor = torch_transforms(frame_PIL)
    frame_tensor = frame_tensor[None, :, :, :]
    if GPU:
        frame_tensor = frame_tensor.cuda()

    with torch.no_grad():
        _, _, matte_tensor = modnet(frame_tensor, True)  # run MODNet; only the matte is needed at inference

    matte_tensor = matte_tensor.repeat(1, 3, 1, 1)
    matte_np = matte_tensor[0].data.cpu().numpy().transpose(1, 2, 0)
    fg_np = matte_np * frame_np + (1 - matte_np) * np.full(frame_np.shape, 255.0)
    view_np = np.uint8(np.concatenate((frame_np, fg_np), axis=1))
    view_np = cv2.cvtColor(view_np, cv2.COLOR_RGB2BGR)
    cv2.imshow('MODNet - WebCam [Press \'Q\' To Exit]', view_np)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
End Notes
The training code is expected to be released soon, in January. Being fast, efficient and optimised, MODNet can run well on mobile devices. Compared to other traditional methods, it suffers less from the domain-shift problem. It has shown high performance on the purpose-designed PPM-100 benchmark, which covers a diverse set of real-world data. Its limitations are that it cannot handle unusual costumes or strong motion blur well. In future releases, we can expect these problems to be addressed.