Introducing Real-Time High-Resolution Background Replacement

Background replacement is a prominent task in video production, special effects, and live streaming. With the increased prevalence of remote work and virtual classes, it has also become a staple feature of video conferencing tools like Zoom and Google Meet. While the commercial video production and editing industry uses it largely for entertainment and aesthetic value, in video conferencing background replacement primarily serves to enhance privacy.

The major challenge in this new video conferencing setting is the absence of green screens and the other physical props that aid background replacement in conventional video production environments. Although background replacement tools are abundant, most of them produce artifacts in areas containing fine details like hair strands or glasses, and higher-resolution image matting models are computationally intensive and frequently require manual input.

Shanchuan Lin, Andrey Ryabtsev, et al. of the University of Washington introduced a new method for background replacement in their paper “Real-Time High-Resolution Background Matting”. Theirs is the first real-time high-resolution background replacement technique, producing state-of-the-art results at 4K 30fps and HD 60fps. It uses an additional captured background image to compute the alpha matte and the foreground layer.

Architecture & Approach

Matting is the process of separating the foreground and background of an image; the extracted foreground can then be composited onto a new background. To model the matting problem, each pixel of the image is represented as a combination of the background B and foreground F.

The matting equation: C = αF + (1 − α)B

For every pixel in a given image C, the objective is to solve for the foreground F, background B, and transparency alpha α. These calculations have to be done for each of the three RGB channels, which incurs a high computational cost and memory consumption. The previous state-of-the-art architecture operates at 512×512 resolution and runs at only 8fps.
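Written out per channel, the matting equation makes the difficulty explicit: each pixel yields three equations but seven unknowns, so the problem is underdetermined without additional information such as a captured background image:

```latex
C_k = \alpha F_k + (1 - \alpha) B_k, \qquad k \in \{R, G, B\}
```

Here the unknowns per pixel are $F_R, F_G, F_B$, $B_R, B_G, B_B$, and $\alpha$ — seven values constrained by only three observed channel intensities $C_R, C_G, C_B$.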

The method proposed in the paper takes an input image I and the background image B and generates the alpha matte α and foreground F, which can be composited over a new background by I’ = αF + (1−α)B’, where B’ is the new background image. Conventional background matting methods solve directly for F; the new method instead solves for the foreground residual FR = F − I. The foreground is then recovered by adding FR to the input image with suitable clamping: F = max(min(FR + I, 1), 0). The authors found that this formulation improved learning. Furthermore, it enabled them to predict a lower-resolution foreground residual for high-resolution input images and upsample it.
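The residual formulation and compositing step can be sketched in a few lines of NumPy. The arrays below are toy values standing in for the network's predictions, not real outputs:

```python
import numpy as np

# Toy 2x2 RGB input image I, values in [0, 1].
I = np.array([[[0.8, 0.2, 0.1], [0.5, 0.5, 0.5]],
              [[0.1, 0.9, 0.3], [0.0, 0.0, 0.0]]])

# Per-pixel alpha matte (1 = foreground, 0 = background), shape (2, 2, 1).
alpha = np.array([[1.0, 0.5],
                  [0.0, 1.0]])[..., None]

# Stand-in for the predicted foreground residual FR = F - I.
FR = np.full_like(I, -0.05)

# Recover the foreground with clamping: F = max(min(FR + I, 1), 0)
F = np.clip(FR + I, 0.0, 1.0)

# Composite over a new solid-green background: I' = alpha*F + (1 - alpha)*B'
B_new = np.ones_like(I) * np.array([0.0, 1.0, 0.0])
I_new = alpha * F + (1 - alpha) * B_new
```

Pixels with alpha = 0 take the new background color, pixels with alpha = 1 keep the recovered foreground, and fractional alphas (hair strands, glass edges) blend the two.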

Architecture of the new background replacement technique

The architecture consists of two networks: a base network Gbase that operates on downsampled input images, and a refinement network Grefine that operates only on selected regions with a higher likelihood of errors, as predicted by the base network.

The input image I and the captured background image B are first downsampled by a factor of c to Ic and Bc. The base network Gbase takes these downsampled images as input and generates a coarse-grained alpha matte αc, a foreground residual FRc, an error prediction map Ec, and hidden features Hc. The refinement network Grefine then uses Hc, I, and B to refine the alpha matte αc and foreground residual FRc only in regions of the image where the predicted error Ec is large. It outputs the final alpha matte α and foreground residual FR at the original resolution, which can be used to composite the foreground onto new background images.
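The two-stage flow can be illustrated with a short PyTorch sketch. The function below is only a placeholder for Gbase (the real networks are deep CNNs), and the downsampling factor, hidden-channel count, and patch budget k are illustrative assumptions, not the paper's exact values:

```python
import torch
import torch.nn.functional as F

def g_base(I_c, B_c):
    """Placeholder for Gbase: takes downsampled input and background,
    returns coarse alpha, foreground residual, error map, hidden features."""
    b, _, h, w = I_c.shape
    x = torch.cat([I_c, B_c], dim=1)                 # 6-channel input
    alpha_c = x.mean(dim=1, keepdim=True).clamp(0, 1)
    fgr_res_c = torch.zeros(b, 3, h, w)
    err_c = alpha_c * (1 - alpha_c)                  # "error" peaks near alpha ~ 0.5
    hidden = torch.zeros(b, 32, h, w)
    return alpha_c, fgr_res_c, err_c, hidden

I = torch.rand(1, 3, 512, 512)   # full-resolution input frame (toy size)
B = torch.rand(1, 3, 512, 512)   # captured background image
c = 4                            # downsampling factor (assumed)

# 1) Base pass on downsampled inputs.
I_c = F.interpolate(I, scale_factor=1 / c, mode="bilinear", align_corners=False)
B_c = F.interpolate(B, scale_factor=1 / c, mode="bilinear", align_corners=False)
alpha_c, fgr_res_c, err_c, hidden = g_base(I_c, B_c)

# 2) Select the k coarse locations with the largest predicted error; only the
#    full-resolution patches around these would be passed through Grefine.
k = 100
top_err, top_idx = err_c.flatten().topk(k)

# 3) Everywhere else, the coarse outputs are simply upsampled to full size.
alpha = F.interpolate(alpha_c, size=I.shape[-2:], mode="bilinear", align_corners=False)
print(alpha.shape)  # torch.Size([1, 1, 512, 512])
```

The design choice this illustrates: refinement cost scales with the number of error-prone patches (mostly object boundaries), not with the full frame resolution, which is what makes 4K real-time operation feasible.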


To train the model, the authors used multiple publicly available matting datasets such as the Adobe Image Matting (AIM) dataset and even created two of their own – VideoMatte240K and PhotoMatte13K/85.

  1. VideoMatte240K: 484 green-screen videos were used to generate a total of 240,709 unique frames of alpha mattes and foregrounds. 384 of the videos are in 4K; the rest are in HD. The videos are split 479:5 to form the train and validation sets. VideoMatte240K is the largest publicly available matting dataset, and the first video matting dataset that contains continuous sequences of frames instead of still images.
  2. PhotoMatte13K/85: 13,655 images shot with studio-quality lighting and cameras, split 13,165:500 to form the train and validation sets.

Replacing the Background of a Video

The following code has been taken from the official example notebook available here.

  1. Install gdown for downloading the example video and background image, or use your own video and image as src.mp4 and bgr.png.
 !pip install gdown -q
 !gdown -O /content/src.mp4 -q
 !gdown -O /content/bgr.png -q 
The input video and background capture image
  2. Download one of the pre-trained models available here, clone the BackgroundMattingV2 GitHub repository, and navigate into the newly created BackgroundMattingV2 directory.
 !gdown -O model.pth -q
 !git clone -q
 %cd BackgroundMattingV2 
  3. Perform background separation using the script.
 !python \
         --model-type mattingrefine \
         --model-backbone resnet50 \
         --model-backbone-scale 0.25 \
         --model-refine-mode sampling \
         --model-refine-sample-pixels 80000 \
         --model-checkpoint "/content/model.pth" \
         --video-src "/content/src.mp4" \
         --video-bgr "/content/bgr.png" \
         --output-dir "/content/output/" \
         --output-type com fgr err ref 
Output of the new background replacement technique

Last Epoch

This article discussed a new technique for background replacement that can operate at 4K 30fps or 1080p 60fps. In addition to the input video, the method requires an image of the background, which is readily available in most applications. Instead of directly processing the high-resolution image with a neural network, the proposed architecture downsamples the input, marks regions with a higher likelihood of error, and refines only those regions at high resolution. This reduces redundant computation and enables the new architecture to facilitate real-time background replacement. The authors have also created a webcam plugin application that works on Linux systems and can be used in Zoom meetings. You can find the script here.


Aditya Singh
A machine learning enthusiast with a knack for finding patterns. In my free time, I like to delve into the world of non-fiction books and video essays.
