Top 10 Papers Presented At CVPR 2021

At CVPR 2021, this year's edition of the annual computer vision conference, held virtually, students, academics and researchers from across the globe came together to celebrate advancements in the fields of artificial intelligence, machine learning and computer vision.

In this year’s CVPR event, 7,093 papers were submitted. Of these, 7,039 were assigned to reviewers; 4,312 were rejected, 1,047 were withdrawn, and 19 were desk rejected. In total, only about 1,660 papers made it to poster and oral presentations, an acceptance rate of roughly 23.6 percent.

Further, the largest share of authors of submitted papers came from China (8,203), followed by the US (4,628), Korea (1,062), the UK (655), Germany (574), Canada (517), Australia (462), and India (429).

Michael Niemeyer’s work ‘GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields’ won the best paper award at CVPR 2021. ‘Task Programming: Learning Data-Efficient Behavior Representations,’ co-authored by researchers at Caltech and Northwestern University, won the best student paper award. 

Best paper honourable mention:

Best student paper honourable mention: 

We have curated the top papers presented at CVPR 2021. Here’s the list: 

Meta Pseudo Labels

Meta Pseudo Labels is a semi-supervised learning technique developed by researchers at Google Brain. The model achieved a new state-of-the-art top-1 accuracy of 90.2 percent on ImageNet, 1.6 percent better than the previous best. The source code is available on GitHub.
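
The core of the method is a teacher-student feedback cycle: the teacher generates pseudo labels for unlabeled images, the student learns from them, and the teacher is then updated according to how much the student improved on real labeled data. Below is a minimal sketch of that loop, assuming PyTorch models, optimizers and batches are already constructed; the official implementation derives the teacher update as a meta-gradient and adds UDA-style losses on top of this basic cycle.

```python
# A minimal, illustrative sketch of the Meta Pseudo Labels feedback loop.
# Models, optimizers and batches are assumed to exist.
import torch
import torch.nn.functional as F

def mpl_step(teacher, student, opt_t, opt_s, x_unlab, x_lab, y_lab):
    # 1) The teacher produces pseudo labels for the unlabeled batch.
    pseudo = teacher(x_unlab).argmax(dim=1)

    # Student's loss on real labels *before* learning from the pseudo labels.
    with torch.no_grad():
        loss_before = F.cross_entropy(student(x_lab), y_lab).item()

    # 2) The student takes a gradient step on the pseudo-labeled batch.
    loss_s = F.cross_entropy(student(x_unlab), pseudo)
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()

    # Student's loss on real labels *after* the update.
    with torch.no_grad():
        loss_after = F.cross_entropy(student(x_lab), y_lab).item()

    # 3) The teacher is rewarded when the student improved on real labels:
    # a first-order stand-in for the paper's meta-gradient, scaling the
    # teacher's pseudo-label loss by the change in student loss.
    h = loss_before - loss_after
    loss_t = h * F.cross_entropy(teacher(x_unlab), pseudo)
    opt_t.zero_grad(); loss_t.backward(); opt_t.step()
```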

Animating Pictures With Eulerian Motion Fields

The paper, presented by researchers at the University of Washington, demonstrates a fully automatic method for converting a still image into a realistic animated looping video. The researchers used an image-to-image translation network to encode motion priors of natural scenes collated from online videos. They demonstrated the effectiveness and robustness of the method on an extensive collection of examples, including waterfalls, beaches and flowing rivers.
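
The key idea is an Eulerian motion description: a single static velocity field in which each pixel's motion depends only on its current position, so frames can be generated by repeatedly integrating the field. Here is a minimal NumPy sketch of that integration, assuming the motion field has already been predicted; the paper splats deep features rather than raw pixels and blends two passes to close the loop.

```python
# Euler integration of pixels through a static (Eulerian) motion field.
import numpy as np

def animate(image, flow, n_frames):
    """image: (H, W, 3) floats; flow: (H, W, 2) per-pixel velocity."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    pos = np.stack([xs, ys], axis=-1)          # each pixel starts at itself
    frames = []
    for _ in range(n_frames):
        frame = np.zeros_like(image)
        xi = np.clip(pos[..., 0].round().astype(int), 0, w - 1)
        yi = np.clip(pos[..., 1].round().astype(int), 0, h - 1)
        frame[yi, xi] = image                  # forward-splat source pixels
        frames.append(frame)
        # Eulerian: velocity depends only on the *current* position.
        v = flow[np.clip(pos[..., 1].astype(int), 0, h - 1),
                 np.clip(pos[..., 0].astype(int), 0, w - 1)]
        pos = pos + v                          # one Euler integration step
    return frames
```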

Taming Transformers for High-Resolution Image Synthesis

Researchers from the Heidelberg Collaboratory for Image Processing, IWR, Heidelberg University, Germany, combined the inductive bias of CNNs with the expressivity of transformers to model and thereby synthesise high-resolution images. The paper shows how to use CNNs to learn a context-rich vocabulary of image constituents, whose composition is then modelled with a transformer. The source code is available on GitHub.
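
Concretely, the CNN encoder's features are vector-quantized against a learned codebook, and the transformer then models the resulting sequence of discrete codes autoregressively. Below is a hedged sketch of just the quantization step, with illustrative shapes and names; the real model adds a straight-through gradient estimator and adversarial training.

```python
# Snapping CNN features to their nearest codebook entry: the "vocabulary
# of image constituents" that the transformer later composes.
import torch

def quantize(features, codebook):
    """features: (B, C, H, W) CNN output; codebook: (K, C) learned entries."""
    b, c, h, w = features.shape
    flat = features.permute(0, 2, 3, 1).reshape(-1, c)   # (B*H*W, C)
    dists = torch.cdist(flat, codebook)                  # distance to every code
    idx = dists.argmin(dim=1)                            # nearest code per location
    quantized = codebook[idx].view(b, h, w, c).permute(0, 3, 1, 2)
    return quantized, idx.view(b, h * w)   # codes + token sequence for the transformer
```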

Real-Time High-Resolution Background Matting 

In this paper, researchers from the University of Washington presented a real-time, high-resolution background replacement technique that operates at 30fps at 4K resolution and 60fps for HD on a modern GPU.

The researchers used two neural networks: a base network computes a low-resolution result, which is then refined by a second network operating at high resolution on selective patches. They also introduced two large-scale video and image matting datasets, VideoMatte240K and PhotoMatte13K/85. The approach yielded higher-quality results than the previous state of the art in background matting while providing a dramatic boost in speed and resolution.
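
A sketch of how the two networks might divide the work is shown below, assuming `base_net` and `refine_net` are placeholders for the paper's learned components: the base pass also predicts a per-pixel error map, and only the most error-prone patches are re-run at full resolution.

```python
# Illustrative two-stage matting: cheap low-res pass, selective high-res refinement.
import torch
import torch.nn.functional as F

def matte(image_hr, background_hr, base_net, refine_net, k=1000, patch=8):
    # Base pass on a downsampled pair gives a coarse alpha and an error map.
    image_lr = F.interpolate(image_hr, scale_factor=0.25)
    bg_lr = F.interpolate(background_hr, scale_factor=0.25)
    alpha_lr, error_map = base_net(image_lr, bg_lr)

    # Upsample the coarse alpha, then pick the k patches with the largest error.
    alpha = F.interpolate(alpha_lr, size=image_hr.shape[-2:])
    err_patches = F.avg_pool2d(error_map, patch)          # error per patch
    topk = err_patches.flatten(1).topk(k, dim=1).indices  # worst offenders

    # Only those patches go through the high-resolution refinement network;
    # everywhere else the cheap upsampled result is kept.
    alpha = refine_net(image_hr, background_hr, alpha, topk)
    return alpha
```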

RepVGG: Making VGG-Style ConvNets Great Again

In this paper, the researchers presented a simple yet powerful CNN architecture with a VGG-like inference-time body composed solely of a stack of 3×3 convolutions and ReLU, while the training-time model has a multi-branch topology.

On ImageNet, RepVGG reached over 80 percent top-1 accuracy, a first for a plain model. On an NVIDIA 1080 Ti GPU, RepVGG models ran 83 percent faster than ResNet-50 and 101 percent faster than ResNet-101 with higher accuracy, showing a more favourable accuracy-speed trade-off than state-of-the-art models such as RegNet and EfficientNet. The trained models and source code are available on GitHub.
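
The trick that makes this possible is structural re-parameterization: because convolution and BatchNorm are linear at inference time, the training-time 3×3, 1×1 and identity branches can be folded into a single 3×3 kernel. A simplified sketch follows, with illustrative attribute names for the block's branches.

```python
# Folding RepVGG's three training-time branches into one inference-time 3x3 conv.
import torch
import torch.nn.functional as F

def fuse_conv_bn(kernel, bn):
    """Fold a BatchNorm into the preceding conv's kernel and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    t = (bn.weight / std).reshape(-1, 1, 1, 1)
    return kernel * t, bn.bias - bn.running_mean * bn.weight / std

def reparameterize(block):
    k3, b3 = fuse_conv_bn(block.conv3x3.weight, block.bn3x3)
    k1, b1 = fuse_conv_bn(block.conv1x1.weight, block.bn1x1)
    k1 = F.pad(k1, [1, 1, 1, 1])             # a 1x1 kernel is a zero-padded 3x3

    # The identity branch (valid when in/out channels match) is a 3x3 conv
    # whose kernel is 1 at the center of its own channel.
    c = block.conv3x3.weight.shape[0]
    idx = torch.arange(c)
    kid = torch.zeros_like(k3)
    kid[idx, idx, 1, 1] = 1.0
    kid, bid = fuse_conv_bn(kid, block.bn_identity)

    # Convolution is linear, so the three branches sum into one kernel.
    return k3 + k1 + kid, b3 + b1 + bid
```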

Natural Adversarial Examples

The researchers introduced two challenging datasets, ImageNet-A and ImageNet-O, that reliably cause machine learning model performance to degrade substantially. The datasets were collected with a simple adversarial filtration technique that limits spurious cues.

The researchers found that existing data augmentation techniques hardly improved performance and that using other public training datasets provided only limited gains. However, upon further analysis, they found that modifications to computer vision architectures offer a promising path towards robust models. The code and datasets are available on GitHub.
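
The filtration idea itself is simple: candidate images are kept only if a fixed classifier misclassifies them. A minimal sketch, assuming the candidates are already preprocessed tensors (the paper used a ResNet-50 as the filtering classifier):

```python
# Adversarial filtration: keep only the naturally hard examples.
import torch
import torchvision.models as models

@torch.no_grad()
def filter_hard_examples(candidates, labels):
    classifier = models.resnet50(pretrained=True).eval()
    keep = []
    for x, y in zip(candidates, labels):
        pred = classifier(x.unsqueeze(0)).argmax(dim=1).item()
        if pred != y:          # misclassified -> a "natural adversarial" example
            keep.append((x, y))
    return keep
```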

VirTex: Learning Visual Representations From Textual Annotations

In this paper, researchers from the University of Michigan showed that high-quality visual representations can be learned from far fewer images. Revisiting supervised pre-training, they sought data-efficient alternatives to classification-based pre-training and developed VirTex, a pre-training approach that uses semantically dense captions to learn visual representations.

The researchers trained convolutional networks from scratch on COCO captions and transferred them to downstream recognition tasks, including object detection, image classification, and instance segmentation. VirTex yielded features that match or exceed those learned on ImageNet, whether supervised or unsupervised, despite using up to ten times fewer images. The code and pretrained models are available on GitHub.
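
At a high level, VirTex trains a visual backbone through a captioning head and then throws the head away. An illustrative sketch with placeholder modules:

```python
# VirTex-style pretraining: the backbone learns through captioning,
# and only the backbone is transferred downstream.
import torch.nn as nn

class VirTexStyle(nn.Module):
    def __init__(self, backbone, caption_head):
        super().__init__()
        self.backbone = backbone           # e.g. a ResNet-50, trained from scratch
        self.caption_head = caption_head   # transformer predicting caption tokens

    def forward(self, images, captions):
        features = self.backbone(images)              # dense visual features
        return self.caption_head(features, captions)  # per-token caption loss

# After pretraining on COCO captions, discard the head and transfer, e.g.:
# detector = FasterRCNN(backbone=pretrained_model.backbone, ...)
```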

One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing

In this paper, NVIDIA researchers proposed a neural talking-head video synthesis model and demonstrated its application to video conferencing. The model learns to synthesise a talking-head video from a single source image containing the target person’s appearance and a driving video that dictates the motion in the output. Video versions of the paper’s figures and additional results are available on GitHub.
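
Because the model works from compact 3D keypoints rather than full frames, the driving motion can be manipulated before synthesis, which is what enables the "free-view" redirection of head pose. A hedged sketch of that manipulation, with all learned components left out as placeholders:

```python
# Rotating the detected 3D keypoints to redirect head pose before synthesis.
import numpy as np

def yaw_matrix(deg):
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), 0, np.sin(a)],
                     [0, 1, 0],
                     [-np.sin(a), 0, np.cos(a)]])

def redirect_pose(keypoints_3d, yaw_deg):
    """keypoints_3d: (K, 3) keypoints detected from the driving frame."""
    return keypoints_3d @ yaw_matrix(yaw_deg).T   # rotate head pose in 3D
```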

Learning Continuous Image Representation With Local Implicit Image Function

Researchers from NVIDIA and UC San Diego showcased a continuous representation for images based on a Local Implicit Image Function (LIIF), which takes an image coordinate and the 2D deep features around that coordinate as inputs and predicts the RGB value at the coordinate as output.

The researchers trained an encoder with the LIIF representation via a self-supervised super-resolution task to generate the continuous representation for images. The source code is available on GitHub.
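
A minimal sketch of the LIIF decoding step: an MLP maps the nearest latent code and the query point's relative offset to an RGB value, so the image can be sampled at arbitrary resolution. This is simplified with illustrative names; the paper additionally feeds a cell size and ensembles nearby codes.

```python
# Querying a continuous image: (latent code, relative coordinate) -> RGB.
import torch
import torch.nn as nn

class LIIFDecoder(nn.Module):
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))                        # RGB output

    def forward(self, feat_map, coords):
        """feat_map: (C, H, W) encoder output; coords: (N, 2) in [-1, 1]."""
        c, h, w = feat_map.shape
        # Nearest latent code for each query coordinate.
        ix = ((coords[:, 0] + 1) / 2 * (w - 1)).round().long().clamp(0, w - 1)
        iy = ((coords[:, 1] + 1) / 2 * (h - 1)).round().long().clamp(0, h - 1)
        z = feat_map[:, iy, ix].t()                      # (N, C)
        # Relative offset from the code's position to the query point.
        centers = torch.stack([ix / (w - 1) * 2 - 1,
                               iy / (h - 1) * 2 - 1], dim=1)
        rel = coords - centers
        return self.mlp(torch.cat([z, rel], dim=1))      # (N, 3) RGB values
```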

Im2Vec: Synthesizing Vector Graphics Without Vector Supervision

Researchers from University College London and Adobe Research proposed a new neural network that can generate complex vector graphics with varying topologies, requiring only indirect supervision from readily available raster training images. They used a differentiable rasterization pipeline that renders the generated vector shapes and composites them onto a raster canvas. The method was evaluated on a range of datasets, including MNIST, and compared with the state-of-the-art SVG-VAE and DeepSVG, both of which require explicit vector-graphics supervision. The source code is available on GitHub.
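
The training loop that makes indirect supervision possible is short: predict vector control points, rasterize them differentiably, and compare against the ordinary raster target. A hedged sketch, where `rasterize` stands in for a differentiable rasterizer such as diffvg:

```python
# Learning vector graphics from raster images only: the rasterizer is
# differentiable, so the pixel loss flows back to the control points.
import torch

def train_step(encoder, decoder, rasterize, raster_image, optimizer):
    z = encoder(raster_image)             # latent code from the raster input
    control_points = decoder(z)           # Bezier control points per path
    rendered = rasterize(control_points)  # differentiable vector -> pixels
    loss = torch.nn.functional.mse_loss(rendered, raster_image)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```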

The open-access versions of all papers presented at CVPR 2021 are available here.

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
