Top 10 Papers Presented At CVPR 2021

At CVPR 2021, the annual computer vision conference held virtually this year, students, academics and researchers from across the globe came together to celebrate advances in artificial intelligence, machine learning and computer vision.

A total of 7,093 papers were submitted to this year’s CVPR. Of these, 7,039 were assigned to reviewers; 4,312 were rejected, 1,047 were withdrawn, and 19 were desk rejected. Only about 1,660 papers made it to the poster and oral presentations (acceptance rate: 0.236).
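The quoted acceptance rate can be reproduced with simple arithmetic. The sketch below assumes the rate is measured against papers assigned to reviewers rather than all submissions (the article does not say which); the variable and function names are illustrative.

```python
# Figures quoted in the article; names are illustrative, not from any CVPR source.
submitted = 7093
assigned_to_reviewers = 7039
accepted = 1660  # approximate, per the article


def acceptance_rate(accepted_count: int, pool_size: int) -> float:
    """Return the fraction of papers in the pool that were accepted."""
    return accepted_count / pool_size


# Assumption: the denominator is the number of papers assigned to reviewers.
rate = acceptance_rate(accepted, assigned_to_reviewers)
print(f"{rate:.3f}")  # prints 0.236
```

Dividing by all 7,093 submissions instead gives roughly 0.234, so the reviewed-paper denominator is the one consistent with the figure quoted above.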


Further, most of the authors (of submitted papers) came from China (8,203), followed by the US (4,628), Korea (1,062), UK (655), Germany (574), Canada (517), Australia (462), and India (429).

Michael Niemeyer’s work ‘GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields’ won the best paper award at CVPR 2021. ‘Task Programming: Learning Data-Efficient Behavior Representations,’ co-authored by researchers at Caltech and Northwestern University, won the best student paper award. 

Best paper honourable mention:

Best student paper honourable mention: 

We have curated the top papers presented at CVPR 2021. Here’s the list: 

Meta Pseudo Labels

Meta Pseudo Labels is a semi-supervised learning technique developed by researchers at Google Brain. The method achieved a new state-of-the-art top-1 accuracy of 90.2 percent on ImageNet, 1.6 percentage points higher than the previous state of the art. The source code is available on GitHub.

Animating Pictures With Eulerian Motion Fields

The paper, presented by researchers at the University of Washington, demonstrated a fully automatic method for converting a still image into a realistic animated looping video. The researchers have used an image-to-image translation network to encode motion priors of natural scenes collated from online videos. In this paper, they have demonstrated the effectiveness and robustness of the method by applying it to an extensive collection of examples, including waterfalls, beaches, flowing rivers, etc. 

Taming Transformers for High-Resolution Image Synthesis

Researchers from Heidelberg Collaboratory for Image Processing, IWR, and Heidelberg University, Germany, have combined the effectiveness of the inductive bias of CNNs with the expressivity of transformers to synthesise high-resolution images. The paper shows how to use CNNs to learn a context-rich vocabulary of image constituents. The source code is available on GitHub.

Real-Time High-Resolution Background Matting 

In this paper, the researchers from the University of Washington have shown a real-time, high-resolution background replacement technique that operates at 30fps in 4K resolution and 60fps for HD on a modern GPU. 

The researchers have used two neural networks: the base network computes a low-resolution result, which is refined by a second neural network operating at high resolution on selective patches. They have also introduced two large-scale video and image matting datasets, namely VideoMatte240K and PhotoMatte13K/85. The researchers found their approach yielded higher quality results than the previous SOTA in background matting while providing a dramatic boost in speed and resolution.
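The coarse-then-refine idea can be pictured with a toy example. This is a sketch under stated assumptions, not the authors' code: it presumes the base network also produces a per-patch error estimate, it reduces each patch to a single number, and `select_patches`, `coarse_to_fine` and `refine_patch` are hypothetical names.

```python
from typing import Callable, List, Tuple

Patch = Tuple[int, int]  # (row, col) index of a patch in the coarse grid
Grid = List[List[float]]


def select_patches(error_map: Grid, k: int) -> List[Patch]:
    """Pick the k patches with the largest predicted error."""
    scored = [
        (err, (r, c))
        for r, row in enumerate(error_map)
        for c, err in enumerate(row)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pos for _, pos in scored[:k]]


def coarse_to_fine(
    coarse_alpha: Grid,
    error_map: Grid,
    k: int,
    refine_patch: Callable[[int, int, float], float],
) -> Grid:
    """Copy the coarse result and overwrite only the selected patches."""
    result = [row[:] for row in coarse_alpha]
    for r, c in select_patches(error_map, k):
        # Hypothetical stand-in for the second, high-resolution network.
        result[r][c] = refine_patch(r, c, coarse_alpha[r][c])
    return result
```

Restricting the expensive pass to a few patches is what would keep such a scheme cheap at high resolution, since most of the image is left at the coarse estimate.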

RepVGG: Making VGG-Style ConvNets Great Again

In this paper, the researchers presented a simple yet powerful CNN architecture with a VGG-like inference-time body composed of a stack of 3×3 convolutions and ReLU, while the training-time model has a multi-branch topology.

On ImageNet, RepVGG reached over 80% top-1 accuracy, a first for a plain model. On an NVIDIA 1080 Ti GPU, RepVGG models ran 83% faster than ResNet-50 and 101% faster than ResNet-101 with higher accuracy, and showed a favourable accuracy-speed trade-off compared with state-of-the-art models like RegNet and EfficientNet. The trained models and source code are available on GitHub.
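The split between a multi-branch training-time model and a plain inference-time stack works because parallel linear branches can be folded into a single kernel. The following is a minimal single-channel sketch, assuming 3×3, 1×1 and identity branches (taken from the RepVGG paper rather than this article) and ignoring batch normalisation; the function names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_same(image: Matrix, kernel: Matrix) -> Matrix:
    """Single-channel 3x3 cross-correlation with zero padding."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        total += image[y][x] * kernel[di + 1][dj + 1]
            out[i][j] = total
    return out


def multi_branch(image: Matrix, k3x3: Matrix, k1x1: float) -> Matrix:
    """Training-time form: 3x3 branch + 1x1 branch + identity, summed."""
    y3 = conv2d_same(image, k3x3)
    return [
        [y3[i][j] + k1x1 * image[i][j] + image[i][j] for j in range(len(image[0]))]
        for i in range(len(image))
    ]


def merge_branches(k3x3: Matrix, k1x1: float) -> Matrix:
    """Fold the 1x1 and identity branches into one 3x3 kernel."""
    merged = [row[:] for row in k3x3]
    # Both extra branches act only on the centre tap of the 3x3 window.
    merged[1][1] += k1x1 + 1.0
    return merged
```

Running `conv2d_same(image, merge_branches(k3x3, k1x1))` should give the same output as `multi_branch(image, k3x3, k1x1)`, which is why the deployed model can drop the extra branches.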

Natural Adversarial Examples

The researchers introduced two challenging datasets (ImageNet-A and ImageNet-O) that reliably cause machine learning model performance to degrade substantially. The datasets are collected with a simple adversarial filtration technique to create datasets with limited spurious cues. 

The researchers found that existing data augmentation techniques hardly improved performance, and that using other public training datasets provided limited improvements. However, upon further analysis, they found that modifications to computer vision architectures offered a promising path towards robust models. The code for the two datasets is available on GitHub.
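Adversarial filtration, as described above, can be sketched in a few lines. This simplified version assumes the filter just keeps examples that a fixed reference classifier gets wrong; the paper's full collection pipeline involves more than this, and `adversarial_filter` and `toy_classifier` are hypothetical names.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input identifier, true label)


def adversarial_filter(
    examples: List[Example],
    classifier: Callable[[str], str],
) -> List[Example]:
    """Keep only the examples the fixed classifier misclassifies."""
    return [(x, label) for x, label in examples if classifier(x) != label]


def toy_classifier(x: str) -> str:
    """Hypothetical stand-in for a trained model, for illustration only."""
    return "cat" if "cat" in x else "wolf"


candidates = [
    ("cat_photo", "cat"),
    ("dog_in_snow", "dog"),
    ("cat_on_sofa", "cat"),
]
hard_examples = adversarial_filter(candidates, toy_classifier)
print(hard_examples)  # prints [('dog_in_snow', 'dog')]
```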

VirTex: Learning Visual Representations From Textual Annotations

In this paper, researchers from the University of Michigan showed that high-quality visual representations can be learned from fewer images. The researchers revisited supervised pre-training and sought data-efficient alternatives to classification-based pre-training, developing VirTex, a pre-training approach that uses semantically dense captions to learn visual representations.

The researchers have trained convolutional networks from scratch on COCO captions and transferred them to downstream recognition tasks, including object detection, image classification, and instance segmentation. As a result, VirTex provided features that match or exceed those learned on ImageNet (supervised or unsupervised) despite using up to ten times fewer images. The code and pretrained models are available on GitHub.

One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing

In this paper, NVIDIA researchers have proposed a neural talking-head video synthesis model and demonstrated its application in video conferencing. The model learns to synthesise a talking-head video using a source image containing the target person’s appearance and a driving video that dictates the motion in the output. The video versions of the paper figures and additional results are available on GitHub.

Learning Continuous Image Representation With Local Implicit Image Function

Researchers from NVIDIA and UC San Diego showcased a continuous representation for images using a Local Implicit Image Function (LIIF), which takes an image coordinate and the 2D deep features around that coordinate as inputs and predicts the RGB value at the coordinate as output.

The researchers have trained an encoder with the LIIF representation via a self-supervised super-resolution task to generate the continuous representation for images. The source code is available on GitHub.
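The coordinate-plus-local-feature query can be illustrated with a toy lookup. This is a simplified sketch, not the released implementation: it assumes coordinates in [-1, 1], uses only the single nearest feature vector, and treats `decoder` as a hypothetical stand-in for the learned decoding function.

```python
from typing import Callable, List, Sequence, Tuple

Coord = Tuple[float, float]          # (y, x), each in [-1, 1]
RGB = Tuple[float, float, float]
FeatureMap = List[List[Sequence[float]]]


def cell_center(index: int, size: int) -> float:
    """Centre of cell `index` when [-1, 1] is split into `size` equal cells."""
    return -1.0 + (2.0 * index + 1.0) / size


def nearest_index(coord: float, size: int) -> int:
    """Index of the cell containing `coord`, clamped to the grid."""
    idx = int((coord + 1.0) / 2.0 * size)
    return min(max(idx, 0), size - 1)


def query_rgb(
    features: FeatureMap,
    coord: Coord,
    decoder: Callable[[Sequence[float], Coord], RGB],
) -> RGB:
    """Decode a colour from the nearest feature and the offset to its centre."""
    h, w = len(features), len(features[0])
    y, x = coord
    i, j = nearest_index(y, h), nearest_index(x, w)
    relative = (y - cell_center(i, h), x - cell_center(j, w))
    return decoder(features[i][j], relative)
```

Because `coord` is continuous, the same feature map can be queried at any resolution, which is the sense in which the representation is continuous.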

Im2Vec: Synthesizing Vector Graphics Without Vector Supervision

Researchers from University College London and Adobe Research have proposed a new neural network that can generate complex vector graphics with varying topologies and only requires indirect supervision from readily available raster training images. The researchers have used a differentiable rasterization pipeline that renders the generated vector shapes and composites them together onto a raster canvas. The experiment was conducted on a range of datasets and compared with SOTA SVG-VAE and DeepSVG, both of which require explicit vector graphic supervision. In addition, the researchers have demonstrated their approach on the MNIST dataset. The source code is available on GitHub.

The Open Access versions of all the papers reviewed at CVPR 2021 are available here.

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
