MODNet: A Trimap-Free, Green-Screen-Free Solution for Real-Time Portrait Matting

MODNet (Matting Objective Decomposition Network) is a lightweight neural network that predicts a portrait alpha matte from a single input image in real time, even as the scene changes. It is paired with a self-supervised sub-objectives consistency (SOC) strategy and a one-frame delay (OFD) trick that smooths the predicted portrait sequence. Because it is a trimap-free method, it is easy to use, and it runs at 63 fps.

In recent times, image and video editing has seen many advances, such as background replacement, efficient foreground detection, filters, and image quality enhancement. From regular mobile cameras to social media, these features have become very popular.

Image matting (also called digital matting) is a popular use case in computer vision. It refers to accurately extracting the foreground from images and video. The matte defines which pixels belong to the foreground and which belong to the background. For pixels along the boundary, or in blended or semi-transparent regions, the matte defines the mixture of foreground and background at each pixel; these are called partial or mixed pixels. Matting is one of the key techniques used in film production for creating visual effects and remains an active research topic. It differs from image segmentation, which assigns each pixel entirely to one class: matting estimates an opacity for every pixel separately, so a pixel can be partly foreground and partly background.
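The role of the matte can be made concrete with the standard compositing equation, I = αF + (1 − α)B, where α is the per-pixel matte, F the foreground, and B the background. The following is a minimal NumPy sketch; the function name composite and the toy arrays are illustrative and not taken from the paper.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend foreground over background with a per-pixel alpha matte.

    foreground, background: float arrays of shape (H, W, 3)
    alpha: float array of shape (H, W) with values in [0, 1]
    """
    a = alpha[..., None]  # add a channel axis so alpha broadcasts over RGB
    return a * foreground + (1.0 - a) * background

# Toy 1x3 image: pure foreground, a mixed boundary pixel, pure background.
fg = np.full((1, 3, 3), 200.0)
bg = np.full((1, 3, 3), 100.0)
alpha = np.array([[1.0, 0.5, 0.0]])
out = composite(fg, bg, alpha)
print(out[0, :, 0])  # red channel: 200, 150 and 100 for the three pixels
```

The middle pixel is the "mixed" case: with α = 0.5 it takes half its colour from each layer, which a hard segmentation mask cannot express.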

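For portrait matting, some existing methods use a green screen to obtain alpha values for blended regions. When a green screen is unavailable, a trimap is used instead: a rough annotation that marks each pixel as definite foreground, definite background, or unknown. But trimaps are costly to annotate and hard to obtain in practice. Without one, a method needs semantic estimation to separate the subject from the background before it can predict the alpha matte. Limiting the foreground to humans, as portrait matting does, is what makes a trimap-free method practical.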
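To illustrate what a trimap encodes, here is a rough sketch that derives one from a binary mask by marking a band around the mask edge as unknown. The 3x3 neighbourhood, the band width, and the label values 0, 128, and 255 are assumptions made for illustration; in practice trimaps are usually annotated by hand or produced with morphological operations from an image library.

```python
import numpy as np

def _dilate3(mask):
    """Binary dilation with a 3x3 window, using edge padding."""
    p = np.pad(mask, 1, mode="edge")
    h, w = mask.shape
    shifted = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.max(shifted, axis=0)

def make_trimap(mask, band=1):
    """Return a uint8 trimap: 255 foreground, 0 background, 128 unknown."""
    mask = mask.astype(bool)
    dilated, eroded = mask.copy(), mask.copy()
    for _ in range(band):
        dilated = _dilate3(dilated)
        eroded = ~_dilate3(~eroded)  # erosion is dilation of the complement
    trimap = np.full(mask.shape, 128, dtype=np.uint8)  # unknown by default
    trimap[eroded] = 255    # pixels well inside the mask
    trimap[~dilated] = 0    # pixels well outside the mask
    return trimap

# Toy example: a 5x5 square of foreground inside a 9x9 image.
toy_mask = np.zeros((9, 9), dtype=bool)
toy_mask[2:7, 2:7] = True
toy_trimap = make_trimap(toy_mask, band=1)
```

A trimap-based matting method only has to resolve alpha inside the 128 band, which is why it needs this extra input and why removing it makes the problem harder.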

MODNet was introduced in the paper “Is a Green Screen Really Necessary for Real-Time Portrait Matting?” (29 Nov 2020) by Zhanghan Ke, Kaican Li, Yurou Zhou, Qiuhua Wu, Xiangyu Mao, Qiong Yan, and Rynson W. H. Lau. It predicts the alpha matte from only one RGB image with a single trained model, preferably running on a GPU.

The workflow is as follows:

(a) MODNet is first trained on labelled data containing nearly 3k labelled foregrounds.

(b) The model is then adapted with SOC on 400 unlabelled video clips, divided into nearly 50,000 frames, that were downloaded from the internet.

(c) OFD is applied to smooth the predicted sequence.

This amount of data is small for a trimap-free method, which makes the task more challenging.
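The OFD step in (c) can be sketched as follows. The idea is to treat a pixel as flicker when its alpha values in the previous and next frames agree with each other while the current frame disagrees with both, and to replace such pixels with the average of the two neighbours. The function name and the threshold eps are assumptions for illustration, not values from the paper.

```python
import numpy as np

def one_frame_delay(prev_alpha, cur_alpha, next_alpha, eps=0.1):
    """Smooth cur_alpha using the mattes of its two neighbouring frames.

    All inputs are float arrays of the same shape with values in [0, 1].
    eps is a hypothetical closeness threshold.
    """
    neighbours_agree = np.abs(prev_alpha - next_alpha) <= eps
    cur_differs = (np.abs(cur_alpha - prev_alpha) > eps) & (
        np.abs(cur_alpha - next_alpha) > eps
    )
    flicker = neighbours_agree & cur_differs
    smoothed = cur_alpha.copy()
    smoothed[flicker] = 0.5 * (prev_alpha[flicker] + next_alpha[flicker])
    return smoothed

# Pixel 0 flickers (1.0 -> 0.3 -> 0.95); pixel 1 is stable and is left alone.
prev_a = np.array([1.0, 0.20])
cur_a = np.array([0.3, 0.25])
next_a = np.array([0.95, 0.20])
smoothed_a = one_frame_delay(prev_a, cur_a, next_a)
```

Because frame t can only be corrected once frame t+1 is available, the output lags the input by one frame, which is where the name comes from.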

MODNet Architecture

Given an input image I, MODNet predicts human semantics (sp), boundary details (dp), and the final alpha matte (αp) through three interdependent branches: S (the low-resolution branch), D (the high-resolution branch), and F (the fusion branch), respectively. Each branch is constrained by explicit supervision derived from the ground-truth matte (αg). The decomposed sub-objectives are interrelated and strengthen each other, and MODNet can be optimised end-to-end.

The low-resolution branch (S) locates the human through semantic estimation. It uses the MobileNetV2 architecture as its backbone, which keeps the model light enough for real-time use.
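One way to picture how three branches can share a single ground-truth matte is a weighted sum of three loss terms: a coarse term on a downsampled matte for sp, a boundary-restricted term for dp, and a full-image term for αp. The sketch below is only an illustration of that structure. The block-mean downsampling, the choice of norms, the boundary mask being passed in by the caller, and the default weights are all assumptions; the paper should be consulted for the actual formulation.

```python
import numpy as np

def downsample_mean(x, factor):
    """Block-mean downsampling of a 2D array (stand-in for a coarse target)."""
    h2, w2 = x.shape[0] // factor, x.shape[1] // factor
    x = x[:h2 * factor, :w2 * factor]
    return x.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

def decomposed_loss(s_p, d_p, alpha_p, alpha_g, boundary_mask,
                    factor=16, w_s=1.0, w_d=1.0, w_a=1.0):
    """Weighted sum of semantic, detail and fusion terms (illustrative)."""
    s_g = downsample_mean(alpha_g, factor)            # coarse semantic target
    loss_s = 0.5 * np.mean((s_p - s_g) ** 2)          # low-resolution branch
    n_boundary = max(float(boundary_mask.sum()), 1.0)
    loss_d = np.sum(boundary_mask * np.abs(d_p - alpha_g)) / n_boundary
    loss_a = np.mean(np.abs(alpha_p - alpha_g))       # fused matte
    total = w_s * loss_s + w_d * loss_d + w_a * loss_a
    return total, (loss_s, loss_d, loss_a)

# Toy check: predictions equal to their targets give zero loss.
alpha_gt = np.zeros((32, 32))
alpha_gt[8:24, 8:24] = 1.0
coarse_gt = downsample_mean(alpha_gt, 16)
everywhere = np.ones_like(alpha_gt)
total_loss, parts = decomposed_loss(coarse_gt, alpha_gt, alpha_gt,
                                    alpha_gt, everywhere)
```

Because all three terms are computed from the same αg, the branches can be trained together in one backward pass, which is what end-to-end optimisation means here.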

Benchmark Results

The paper reports comparisons against different trimap-based and trimap-free methods, validated on benchmark matting datasets.

How to run the model on your system:

  1. To run the model in the cloud: Colab
  2. To run it locally: follow the instructions and source code in the GitHub repository

Importing libraries

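import numpy as np
import cv2
from PIL import Image
import torch
import torch.nn as nn
import torchvision.transforms as transforms

# Assumes the script is run from the root of the MODNet repository
from src.models.modnet import MODNet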

Normalisation function

torch_transforms = transforms.Compose(
    [
        transforms.ToTensor(),  # PIL image -> float tensor in [0, 1]
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # -> [-1, 1]
    ]
)

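To see what this normalisation does: assuming the image has first been scaled to [0, 1] (which is what torchvision's ToTensor does for 8-bit images), Normalize with mean 0.5 and standard deviation 0.5 computes (x − 0.5) / 0.5 per channel, mapping pixel values to [−1, 1]. The NumPy sketch below reproduces that arithmetic without torchvision; the helper name is hypothetical.

```python
import numpy as np

def normalize_like_torchvision(pixels_uint8, mean=0.5, std=0.5):
    """Scale 8-bit pixels to [0, 1], then apply (x - mean) / std."""
    x = pixels_uint8.astype(np.float32) / 255.0
    return (x - mean) / std

# Black and white pixels land at the two ends of the output range.
extremes = normalize_like_torchvision(np.array([0, 255], dtype=np.uint8))
print(extremes)  # approximately -1 and 1
```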
Load the pretrained model

print('Load pre-trained MODNet...')
pretrained_ckpt = './pretrained/modnet_webcam_portrait_matting.ckpt'
modnet = MODNet(backbone_pretrained=False)
modnet = nn.DataParallel(modnet)

Check if GPU is present otherwise run on CPU

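GPU = torch.cuda.device_count() > 0
if GPU:
    print('Use GPU...')
    modnet = modnet.cuda()
    modnet.load_state_dict(torch.load(pretrained_ckpt))
else:
    print('Use CPU...')
    # map_location lets a GPU-saved checkpoint load on a CPU-only machine
    modnet.load_state_dict(torch.load(pretrained_ckpt, map_location=torch.device('cpu')))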

Model evaluation

modnet.eval()  # switch to inference mode

Initialize the webcam and set parameters

print('Init WebCam...')
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

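The webcam feed is read frame by frame

print('Start matting...')
while True:
    ret, frame_np = cap.read()
    if not ret:  # stop if no frame could be read
        break
    frame_np = cv2.cvtColor(frame_np, cv2.COLOR_BGR2RGB)
    frame_np = cv2.resize(frame_np, (910, 512), interpolation=cv2.INTER_AREA)
    frame_np = frame_np[:, 120:792, :]  # keep a 672-pixel-wide window
    frame_np = cv2.flip(frame_np, 1)  # mirror horizontally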
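The numbers in the resize and crop steps above can be checked with a little arithmetic. Scaling a 1280x720 frame to a height of 512 gives a width of about 910, and the slice 120:792 keeps a roughly centred 672-pixel-wide window, so the network sees a 672x512 image whose sides are both multiples of 32. That the multiple-of-32 property is the motivation for these values is an assumption here; the helper below is hypothetical and only reproduces the arithmetic.

```python
def webcam_geometry(src_w=1280, src_h=720, target_h=512, crop=(120, 792)):
    """Return resized width, crop width, and the margins cut on each side."""
    scale = target_h / src_h
    resized_w = round(src_w * scale)   # 1280 * 512 / 720 is about 910.2
    crop_w = crop[1] - crop[0]
    left_margin = crop[0]
    right_margin = resized_w - crop[1]
    return resized_w, crop_w, left_margin, right_margin

resized_w, crop_w, left, right = webcam_geometry()
print(resized_w, crop_w, left, right)  # 910 672 120 118
print(crop_w % 32, 512 % 32)           # 0 0
```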

Starting the matting process

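    frame_PIL = Image.fromarray(frame_np)
    frame_tensor = torch_transforms(frame_PIL)
    frame_tensor = frame_tensor[None, :, :, :]  # add a batch dimension
    if GPU:
        frame_tensor = frame_tensor.cuda()
    with torch.no_grad():
        # Call as in the MODNet repository's webcam demo; the third output is the matte
        _, _, matte_tensor = modnet(frame_tensor, True)
    matte_tensor = matte_tensor.repeat(1, 3, 1, 1)  # 1 channel -> 3 channels
    matte_np = matte_tensor[0].data.cpu().numpy().transpose(1, 2, 0)
    # Composite the foreground over a white background
    fg_np = matte_np * frame_np + (1 - matte_np) * np.full(frame_np.shape, 255.0)
    view_np = np.uint8(np.concatenate((frame_np, fg_np), axis=1))
    view_np = cv2.cvtColor(view_np, cv2.COLOR_RGB2BGR)
    cv2.imshow('MODNet - WebCam [Press \'Q\' To Exit]', view_np)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release the camera and close the preview window
cap.release()
cv2.destroyAllWindows()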

End Notes

The training code is due to be released in January. Because MODNet is fast and lightweight, it can run well on mobile devices. Compared with other methods, it suffers less from the domain shift problem. It has shown strong performance on the specially designed PPM-100 benchmark and on a diverse set of real-world data. Its limitations are that it cannot handle strange costumes or strong motion blur. Future releases may address these problems.
