Background replacement is a prominent task in video production, special effects, and live streaming, and with the increased prevalence of remote work and virtual classes it has also become a staple of video conferencing tools like Zoom and Google Meet. While the commercial video production and editing industry uses it mainly for entertainment and aesthetic value, in video conferencing background replacement also serves to enhance privacy.
The major challenge in this video conferencing setting is the lack of green screens and other physical props that aid background replacement in conventional video production environments. Although background replacement tools are abundant, most of them produce artifacts in areas containing fine details like hair strands or glasses, and the higher-resolution image matting models are computationally intensive and frequently require manual input.
Shanchuan Lin, Andrey Ryabtsev, et al. of the University of Washington introduced a new method for background replacement in their paper “Real-Time High-Resolution Background Matting”. Their method is the first real-time, high-resolution background replacement technique that produces state-of-the-art results at 4K 30fps and HD 60fps. It uses an additional captured background image to compute the alpha matte and the foreground layer.
Architecture & Approach
Matting is the process of separating the foreground and background of an image; the extracted foreground can then be composited onto a new background. To model the matting problem, each pixel of an image C is represented as a combination of the foreground F and the background B, weighted by the transparency (alpha) value α:

C = αF + (1 − α)B
For every pixel in a given image C, the objective is to solve for the foreground F, background B, and transparency α. All of these calculations have to be done for the three RGB channels, which incurs a high computation cost and memory consumption. The previous state-of-the-art architecture operates at 512×512 resolution and runs at only 8fps.
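To make the equation concrete, here is a small NumPy sketch for a single pixel with made-up foreground, background, and alpha values (purely illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical per-pixel values (RGB in [0, 1]), chosen only for illustration.
F = np.array([0.80, 0.20, 0.10])   # foreground color
B = np.array([0.10, 0.90, 0.30])   # background color
alpha = 0.6                        # foreground transparency

# The matting equation: the observed pixel is a blend of foreground and background.
C = alpha * F + (1 - alpha) * B
print(C)  # [0.52 0.48 0.18]

# Going the other way is ill-posed: each pixel provides 3 equations (R, G, B)
# but has 7 unknowns (3 for F, 3 for B, 1 for alpha), which is why the method
# relies on an extra captured background image as a strong prior.
```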
The method proposed in the paper takes an input image I and the captured background image B and generates the alpha matte α and foreground F, which can be composited over a new background by I’ = αF + (1−α)B’, where B’ is the new background image. Conventional background matting methods solve directly for F; the new method instead solves for the foreground residual FR = F − I. The foreground is then recovered by adding FR to the input image with suitable clamping: F = max(min(FR + I, 1), 0). The authors found that this formulation improved learning, and it also allows a low-resolution foreground residual to be upsampled and applied to a high-resolution input image.
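As a minimal sketch, assuming the images are float arrays scaled to [0, 1], the residual recovery and final composite could look like this (the function name and shapes are my own, not the paper’s code):

```python
import numpy as np

def composite_with_residual(I, FR, alpha, B_new):
    """Recover the foreground from the residual and composite it over a new background.

    I:      input image, shape (H, W, 3), values in [0, 1]
    FR:     predicted foreground residual FR = F - I, same shape as I
    alpha:  predicted alpha matte, shape (H, W, 1), values in [0, 1]
    B_new:  new background image, shape (H, W, 3), values in [0, 1]
    """
    # F = max(min(FR + I, 1), 0): add the residual back and clamp to the valid range.
    F = np.clip(FR + I, 0.0, 1.0)
    # I' = alpha * F + (1 - alpha) * B': blend the foreground over the new background.
    return alpha * F + (1.0 - alpha) * B_new
```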

The architecture consists of two networks: a base network Gbase that operates on downsampled input images, and a refinement network Grefine that operates only on selected regions with a higher likelihood of error, as predicted by the base network.
The input image I and the captured background image B are first downsampled by a factor of c to Ic and Bc. The base network Gbase takes these downsampled images as input and generates a coarse-grained alpha matte αc, a foreground residual FRc, an error prediction map Ec, and hidden features Hc. The refinement network Grefine then uses Hc, I, and B to refine the alpha matte αc and foreground residual FRc only in regions where the predicted error Ec is large. It outputs the final alpha matte α and foreground residual FR at the original resolution, which can be used to composite the foreground over new background images.
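The efficiency gain comes from refining only where it matters. The following self-contained sketch illustrates the error-guided selection idea on a toy error map; it is my own simplification, not the authors’ implementation, which feeds the selected patches to a convolutional refinement network:

```python
import numpy as np

def select_refinement_patches(error_map, patch_size=4, num_patches=4):
    """Pick the patch locations with the highest predicted error.

    error_map:   2D array of per-pixel error predictions from the base network
    patch_size:  side length of the square patches scored for refinement
    num_patches: how many patches the refinement stage will process
    Returns a list of (row, col) top-left corners, highest error first.
    """
    h, w = error_map.shape
    scores = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            mean_err = error_map[y:y + patch_size, x:x + patch_size].mean()
            scores.append((mean_err, y, x))
    scores.sort(reverse=True)  # highest mean error first
    return [(y, x) for _, y, x in scores[:num_patches]]

# Toy error map: errors concentrate around a (simulated) hair/edge region.
rng = np.random.default_rng(0)
error_map = rng.random((16, 16)) * 0.1
error_map[4:8, 8:12] += 0.9  # a region the base network is unsure about
print(select_refinement_patches(error_map))
```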
Datasets
To train the model, the authors used multiple publicly available matting datasets such as the Adobe Image Matting (AIM) dataset and even created two of their own – VideoMatte240K and PhotoMatte13K/85.
- VideoMatte240K: 484 green screen videos were used to generate a total of 240,709 unique frames of alpha mattes and foregrounds. 384 of the videos are in 4K; the rest are in HD. The videos are split 479:5 to form the train and validation sets. VideoMatte240K is the largest publicly available matting dataset, and it is the first video matting dataset that contains continuous sequences of frames instead of still images.
- PhotoMatte13K/85: 13,665 images shot with studio-quality lighting and cameras; these are split 13,165:500 to form the train and validation sets.
Replacing the Background of a Video
The following code has been taken from the official example notebook available here.
- Install gdown for downloading the example video and background images, or use your own video and image as src.mp4 and bgr.png.

!pip install gdown -q
!gdown https://drive.google.com/uc?id=1tCEk8FE3WGrr49cdL8qMCqHptMCAtHRU -O /content/src.mp4 -q
!gdown https://drive.google.com/uc?id=1wAR3JjnTO60B_DUr7ruIJj0Z2pcIGkyP -O /content/bgr.png -q
- Download one of the pre-trained models available here, clone the BackgroundMattingV2 GitHub repository, and navigate into the newly created BackgroundMattingV2 directory.

!gdown https://drive.google.com/uc?id=1ErIAsB_miVhYL9GDlYUmfbqlV293mSYf -O model.pth -q
!git clone -q https://github.com/PeterL1n/BackgroundMattingV2
%cd BackgroundMattingV2
- Perform background separation using the inference_video.py script.
!python inference_video.py \
  --model-type mattingrefine \
  --model-backbone resnet50 \
  --model-backbone-scale 0.25 \
  --model-refine-mode sampling \
  --model-refine-sample-pixels 80000 \
  --model-checkpoint "/content/model.pth" \
  --video-src "/content/src.mp4" \
  --video-bgr "/content/bgr.png" \
  --output-dir "/content/output/" \
  --output-type com fgr err ref
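Assuming you have saved a foreground frame, its alpha matte, and a new background as image files (the paths below are hypothetical), the same I’ = αF + (1 − α)B’ formula from earlier can be applied directly with OpenCV:

```python
import cv2
import numpy as np

# Hypothetical file paths: a foreground frame, its alpha matte, and a new background.
fgr = cv2.imread("output/fgr_0001.png").astype(np.float32) / 255.0
pha = cv2.imread("output/pha_0001.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
bgr_new = cv2.imread("new_background.jpg").astype(np.float32) / 255.0

# Match the background size to the frame and give alpha a channel dimension.
bgr_new = cv2.resize(bgr_new, (fgr.shape[1], fgr.shape[0]))
alpha = pha[..., None]

# I' = alpha * F + (1 - alpha) * B'
com = alpha * fgr + (1.0 - alpha) * bgr_new
cv2.imwrite("composited_0001.png", (com * 255.0).astype(np.uint8))
```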
Last Epoch
This article discussed a new technique for background replacement that can operate at 4K 30fps or HD (1080p) 60fps. In addition to the input video, the method requires an image of the background, which is readily available in most applications. Instead of processing the high-resolution image directly with a single neural network, the proposed architecture works on a downsampled version, marks regions with higher predicted error, and then refines only those regions at full resolution. This reduces redundant computation and enables real-time background replacement. The authors have also created a webcam plugin application that works on Linux systems and can be used in Zoom meetings. You can find the script here.