
Top 9 Papers Presented by Google at CVPR 2023

CVPR saw an influx of 9,155 entries, out of which only 2,360 (25.78%) were accepted



The 2023 edition of CVPR, the prestigious annual conference on computer vision and pattern recognition, is taking place from June 19 to 22 in Vancouver, Canada. Google Research is one of the major sponsors and is presenting 90 papers on topics such as image recognition, 3D vision, and machine learning. Besides Google, several other premier institutions, including MIT and UCLA, are also participating this time. CVPR saw an influx of 9,155 entries, out of which only 2,360 (25.78%) were accepted. Let’s take a look at the top papers presented this time.

MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures

Authored by a team of researchers from Google Research, Simon Fraser University and the University of Toronto, the paper introduces a new NeRF representation based on textured polygons for efficient image synthesis. The polygons are drawn with the traditional rasterization pipeline; the per-pixel features fetched through the z-buffer are then processed by a lightweight view-dependent MLP, resulting in fast rendering on diverse platforms, including mobile phones.
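The core idea can be pictured with a rough sketch: rasterization yields per-pixel feature vectors, and a tiny view-dependent MLP (small enough to run in a fragment shader) maps each feature plus a viewing direction to a color. The snippet below is an illustrative approximation, not the authors' code; the feature dimension and network sizes are assumptions.

```python
# Minimal sketch of MobileNeRF-style deferred shading (illustrative only).
# Rasterization is assumed to have already produced per-pixel features via
# the z-buffer; the small MLP below turns features + view direction into RGB.
import torch
import torch.nn as nn

class ViewDependentMLP(nn.Module):
    def __init__(self, feat_dim=8, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, features, view_dirs):
        return self.net(torch.cat([features, view_dirs], dim=-1))

# Hypothetical per-pixel inputs for a 4-pixel example.
features = torch.rand(4, 8)                                    # rasterized texture features
view_dirs = torch.nn.functional.normalize(torch.randn(4, 3), dim=-1)
rgb = ViewDependentMLP()(features, view_dirs)                  # shape (4, 3)
```

Because this last step is the only learned component evaluated per pixel, it maps naturally onto standard GPU shaders, which is what makes on-device rendering practical.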

DynIBaR: Neural Dynamic Image-Based Rendering

The paper presents a new method for generating realistic views from monocular videos of dynamic scenes. Existing techniques based on dynamic Neural Radiance Fields (NeRFs) struggle with long videos and complex camera movements, resulting in blurry or inaccurate outputs. Developed by Cornell Tech and Google Research, the new approach overcomes these limitations by using a volumetric image-based rendering framework that incorporates nearby views and motion information. The system achieves superior results on dynamic scene datasets and excels in real-world scenarios with challenging camera and object motion where previous methods fall short.
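As a hedged illustration of what image-based rendering means here, the sketch below aggregates features reprojected from nearby source frames for each sample along a target ray; the helper callables are hypothetical placeholders, not the paper's implementation.

```python
# Very rough sketch of volumetric image-based rendering in the spirit of
# DynIBaR (an assumed illustration, not the authors' code): the value of each
# sample along a target ray is aggregated from features reprojected out of
# nearby source frames, rather than read from a single global NeRF.
import torch

def render_ray(sample_points, nearby_frames, project_and_sample, aggregate, composite):
    per_point = []
    for p in sample_points:                    # 3D samples along the target ray
        # Reproject the point into each nearby frame and sample its features.
        feats = torch.stack([project_and_sample(frame, p) for frame in nearby_frames])
        per_point.append(aggregate(feats))     # fuse features across source views
    return composite(torch.stack(per_point))   # volume-render along the ray
```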


DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

Large text-to-image models have limitations in mimicking subjects from a reference set and generating diverse renditions. To address this, Google Research and Boston University present a personalised approach. By fine-tuning the model with a few subject images, it learns to associate a unique identifier with the subject, enabling the synthesis of photorealistic images in different contexts. The technique preserves key features while exploring tasks like recontextualization, view synthesis, and artistic rendering. A new dataset and evaluation protocol are provided for subject-driven generation. Check out their GitHub repository.
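A minimal sketch of the training objective, assuming a generic noise-prediction diffusion model (the "sks" identifier, the prompt wording and the simplified noising schedule are illustrative assumptions, not the released code): the few subject images are paired with a rare identifier token, while a prior-preservation term on ordinary class images keeps the model from forgetting what the broader class looks like.

```python
# Conceptual sketch of a DreamBooth-style objective (not the released code).
# `model` and `encode_prompt` are hypothetical placeholders.
import torch
import torch.nn.functional as F

def dreambooth_step(model, encode_prompt, subject_imgs, class_imgs, lam=1.0):
    subject_cond = encode_prompt("a photo of sks dog")   # rare identifier bound to the subject
    class_cond = encode_prompt("a photo of a dog")       # class-prior prompt

    def denoise_loss(images, cond):
        # Simplified noising schedule, just to make the objective concrete.
        noise = torch.randn_like(images)
        t = torch.rand(images.shape[0]).view(-1, 1, 1, 1)
        noisy = (1 - t) * images + t * noise
        pred = model(noisy, t, cond)                      # model predicts the noise
        return F.mse_loss(pred, noise)

    # Subject reconstruction term plus weighted prior-preservation term.
    return denoise_loss(subject_imgs, subject_cond) + lam * denoise_loss(class_imgs, class_cond)
```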

MaskSketch: Unpaired Structure-Guided Masked Image Generation

Adding to the list of exciting innovations proposed at the event is MaskSketch, a new image generation method that allows spatial conditioning of the generated result using a guiding sketch as an extra signal. MaskSketch leverages a pre-trained masked generative transformer and works with sketches of different abstraction levels.

By leveraging intermediate self-attention maps, MaskSketch encodes important structural information and enables structure-guided generation. The method achieves high image realism and fidelity, surpassing state-of-the-art sketch-to-image translation and unpaired image-to-image translation approaches on benchmark datasets.
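One way to picture the structure guidance, as a rough sketch rather than the released implementation: candidate samples from the masked generative transformer are scored by how closely their intermediate self-attention maps match those of the guiding sketch, and the best-matching candidate is kept. The `generate` and `attention_maps` callables below are hypothetical stand-ins.

```python
# Illustrative sketch of structure-guided sampling (assumed, not MaskSketch's
# actual code): self-attention maps act as a structure signature, and samples
# whose signature is closest to the sketch's are preferred.
import torch

def structure_guided_sample(generate, attention_maps, sketch, n_candidates=8):
    sketch_attn = attention_maps(sketch)                      # structure signature of the sketch
    best, best_dist = None, float("inf")
    for _ in range(n_candidates):
        candidate = generate()                                # sample from the masked transformer
        dist = torch.norm(attention_maps(candidate) - sketch_attn)
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best
```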

MAGVIT: Masked Generative Video Transformer

Carnegie Mellon University, Google Research, and Georgia Institute of Technology introduced MAGVIT, a single model designed to handle various video synthesis tasks. It uses a 3D tokenizer to convert videos into spatial-temporal visual tokens and employs masked video token modeling for efficient multi-task learning. Results demonstrate that MAGVIT outperforms state-of-the-art approaches, achieving the best published FVD on three video generation benchmarks, including Kinetics-600. It also surpasses existing methods in inference time by a significant margin and supports ten diverse generation tasks while generalizing across different visual domains.
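The masked video token modeling step can be sketched as follows (an assumed, simplified view, not the authors' implementation): the 3D tokenizer turns a clip into discrete tokens, a random subset is replaced with a MASK token, and the transformer is trained to recover the originals; different synthesis tasks mainly differ in which tokens stay visible.

```python
# Schematic of masked video-token modeling in the spirit of MAGVIT.
# `tokenizer` and `transformer` are hypothetical placeholders.
import torch
import torch.nn.functional as F

def masked_token_step(tokenizer, transformer, video, mask_token_id, mask_ratio=0.6):
    tokens = tokenizer(video)                        # (B, N) discrete token ids
    mask = torch.rand(tokens.shape) < mask_ratio     # choose which tokens to hide
    inputs = tokens.clone()
    inputs[mask] = mask_token_id
    logits = transformer(inputs)                     # (B, N, vocab_size)
    # Only the masked positions contribute to the loss.
    return F.cross_entropy(logits[mask], tokens[mask])
```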

Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting

Google presented Imagen Editor, a cascaded diffusion model that addresses the challenge of text-guided image editing. By fine-tuning Imagen on text-guided image inpainting and using object detectors to propose inpainting masks during training, it ensures edits align with text prompts. It also maintains fine details by conditioning on the high-resolution image.

Evaluation using EditBench, a benchmark for text-guided image inpainting, shows that object masking during training improves text-image alignment. Imagen Editor outperforms DALL-E 2 and Stable Diffusion, and it handles object rendering and material/color/size attributes better than count/shape attributes.
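A loose sketch of the object-masking idea during training, with the detector and inpainting model as hypothetical placeholders: a detector proposes a box, that region is blanked out, and the model learns to repaint it from the text prompt plus the surrounding high-resolution image.

```python
# Hypothetical sketch of object-masked inpainting training as described above
# (the detector and inpainter interfaces are assumptions, not Imagen Editor's).
import torch

def object_masked_example(image, detector, inpainter, prompt):
    boxes = detector(image)                        # list of (x0, y0, x1, y1) proposals
    mask = torch.zeros(image.shape[-2:], dtype=torch.bool)
    for x0, y0, x1, y1 in boxes[:1]:               # mask one proposed object region
        mask[y0:y1, x0:x1] = True
    masked = image * (~mask)                       # zero out the masked region
    return inpainter(masked, mask, prompt)         # repaint the region from the text prompt
```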

RUST: Latent Neural Scene Representations from Unposed Imagery

Another paper presented by the Google team introduces RUST (Really Unposed Scene Representation Transformer), a pose-free approach using RGB images only. By training a Pose Encoder and Decoder, RUST enables novel view synthesis with meaningful camera transformations and accurate pose readouts. Surprisingly, RUST achieves similar quality to methods with perfect camera pose, allowing large-scale training of neural scene representations.
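In rough pseudocode, and only as an assumed reading of the description above: a pose encoder infers a latent pose from the target view instead of taking a ground-truth camera pose, and the decoder renders the latent scene representation from it. All modules below are hypothetical placeholders.

```python
# Loose sketch of a RUST-style training step (not the authors' code): no
# ground-truth camera poses are used anywhere; the "pose" is a learned latent.
import torch
import torch.nn.functional as F

def rust_step(scene_encoder, pose_encoder, decoder, input_views, target_view):
    scene = scene_encoder(input_views)               # latent scene representation
    latent_pose = pose_encoder(target_view, scene)   # inferred latent pose of the target view
    rendered = decoder(scene, latent_pose)           # render the scene from that latent pose
    return F.mse_loss(rendered, target_view)         # reconstruction objective
```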

REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

The paper presents REVEAL, an end-to-end Retrieval-Augmented Visual Language Model. REVEAL encodes world knowledge into a large-scale memory and retrieves from it to answer knowledge-intensive queries. It consists of a memory, encoder, retriever, and generator. The memory encodes various multimodal knowledge sources, and the retriever finds relevant entries. 

The generator combines the retrieved knowledge with input queries to generate outputs. REVEAL achieves state-of-the-art performance in visual question answering and image captioning, utilizing diverse multimodal knowledge sources. The paper comes from researchers at the University of California, Los Angeles, and Google Research.
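A hedged sketch of that retrieve-then-generate flow (the encoder, memory tensors and generator below are stand-ins, not REVEAL's actual interfaces): the query is embedded, the top-k most similar memory entries are fetched, and the generator fuses them with the query.

```python
# Schematic retrieval-augmented generation step (illustrative only).
import torch

def answer(query, encoder, memory_keys, memory_values, generator, k=5):
    q = encoder(query)                    # (D,) query embedding
    scores = memory_keys @ q              # similarity of the query to each memory entry
    topk = torch.topk(scores, k).indices
    retrieved = memory_values[topk]       # (k, D) retrieved knowledge entries
    # The generator attends over the query together with the retrieved knowledge.
    return generator(q, retrieved)
```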

On Distillation of Guided Diffusion Models

Classifier-free guided diffusion models, widely used in image generation, suffer from computational inefficiency. Google, Stability AI and LMU Munich propose distilling these models into faster sampling models. The distilled model matches the output of combined conditional and unconditional models, achieving comparable image quality with fewer sampling steps. The approach is up to 256 times faster for pixel-space models and at least 10 times faster for latent-space models. It also proves effective in text-guided image editing and inpainting, requiring only 2-4 denoising steps for high-quality results.
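The distillation objective can be sketched roughly as follows (an assumed simplification, not the paper's code): classifier-free guidance normally needs two teacher passes per step, so the student is trained, over a range of guidance weights w, to reproduce that guided output in a single call.

```python
# Rough sketch of distilling classifier-free guidance into one student call.
# `teacher` and `student` are hypothetical noise-prediction models.
import torch
import torch.nn.functional as F

def distill_step(teacher, student, x_t, t, cond, uncond):
    # Sample a guidance weight so a single student covers a range of strengths.
    w = torch.rand(x_t.shape[0]).view(-1, 1, 1, 1) * 4.0
    with torch.no_grad():
        eps_c = teacher(x_t, t, cond)           # conditional teacher prediction
        eps_u = teacher(x_t, t, uncond)         # unconditional teacher prediction
        target = (1 + w) * eps_c - w * eps_u    # classifier-free-guided output
    pred = student(x_t, t, cond, w)             # one guidance-aware student call
    return F.mse_loss(pred, target)
```

Further step-reduction distillation then brings the sampling down to the handful of denoising steps mentioned above.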



Shritama Saha

Shritama (she/her) is a technology journalist at AIM who is passionate about exploring the influence of AI on different domains, including fashion, healthcare and banking.