The 2023 edition of CVPR, the prestigious annual conference for computer vision and pattern recognition, is taking place from June 19 to 22 in Vancouver, Canada. Google Research is one of the major sponsors, presenting 90 papers on topics such as image recognition, 3D vision, and machine learning. Besides Google, several other premier institutes like MIT and UCLA are also participating this time. CVPR received 9,155 submissions, of which only 2,360 (25.78%) were accepted. Let’s take a look at the top papers presented this time.
MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures
Authored by a team of researchers from Google Research, Simon Fraser University and the University of Toronto, the paper introduces a new NeRF representation based on textured polygons for efficient image synthesis. Polygon features are rasterized with the traditional rendering pipeline through a z-buffer and then processed by a view-dependent MLP, resulting in fast rendering on diverse platforms, including mobile phones.
DynIBaR: Neural Dynamic Image-Based Rendering
The paper presents a new method for generating realistic views from monocular videos of dynamic scenes. Existing techniques based on dynamic Neural Radiance Fields (NeRFs) struggle with long videos and complex camera movements, resulting in blurry or inaccurate outputs. Developed by Cornell Tech and Google Research, the new approach overcomes these limitations by using a volumetric image-based rendering framework that incorporates nearby views and motion information. The system achieves superior results on dynamic scene datasets and excels in real-world scenarios with challenging camera and object motion where previous methods fall short.
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Large text-to-image models have limitations in mimicking subjects from a reference set and generating diverse renditions. To address this, Google Research and Boston University present a personalised approach. By fine-tuning the model with a few subject images, it learns to associate a unique identifier with the subject, enabling the synthesis of photorealistic images in different contexts. The technique preserves key features while exploring tasks like recontextualization, view synthesis, and artistic rendering. A new dataset and evaluation protocol are provided for subject-driven generation. Check out their GitHub repository here.
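The core idea can be sketched in a few lines (a minimal illustration; the identifier token "sks", the class noun, and the helper below are illustrative choices, not from the paper): every fine-tuning prompt pairs a rare identifier with the subject's class noun, while identifier-free class prompts regularize the model so it keeps its prior over ordinary members of that class.

```python
# Hedged sketch of DreamBooth-style prompt construction. "sks" is a
# commonly used illustrative rare token, not mandated by the paper.

def build_prompts(identifier: str, subject_class: str, contexts: list[str]):
    """Return (instance_prompts, prior_prompts) for fine-tuning."""
    # Instance prompts bind the identifier to the subject's appearance.
    instance = [f"a photo of {identifier} {subject_class} {c}".strip()
                for c in contexts]
    # Prior-preservation prompts omit the identifier; they keep the model
    # from forgetting what a generic member of the class looks like.
    prior = [f"a photo of {subject_class} {c}".strip() for c in contexts]
    return instance, prior

inst, prior = build_prompts("sks", "dog", ["", "on the beach", "in a bucket"])
print(inst[1])   # a photo of sks dog on the beach
print(prior[0])  # a photo of dog
```

After fine-tuning on such pairs, prompting with the identifier in a new context ("a photo of sks dog in the snow") yields the recontextualized subject.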
MaskSketch: Unpaired Structure-Guided Masked Image Generation
Adding to the list of exciting innovations is MaskSketch, a new image generation method proposed at the event that allows spatial conditioning of the generated result using a guiding sketch as an extra signal. MaskSketch leverages a pre-trained masked generative transformer and works with sketches of different abstraction levels.
By leveraging intermediate self-attention maps, MaskSketch encodes important structural information and enables structure-guided generation. The method achieves high image realism and fidelity, surpassing state-of-the-art methods for sketch-to-image translation and unpaired image-to-image translation approaches on benchmark datasets.
MAGVIT: Masked Generative Video Transformer
Carnegie Mellon University, Google Research, and Georgia Institute of Technology introduced MAGVIT, a single model designed to handle various video synthesis tasks. It uses a 3D tokenizer to convert videos into spatial-temporal visual tokens and employs masked video token modeling for efficient multi-task learning. Results demonstrate that MAGVIT outperforms state-of-the-art approaches, achieving the best-published FVD on three video generation benchmarks, including Kinetics-600. It also surpasses existing methods in inference time by a significant margin and supports ten diverse generation tasks while generalizing across different visual domains.
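The masked video token modeling objective at the heart of MAGVIT can be illustrated with a toy example (the grid shape, vocabulary size, and mask ratio below are made up for illustration): the 3D tokenizer turns a video into a grid of discrete tokens, a random subset is replaced by a special MASK id, and the transformer is trained to recover the original tokens at the masked positions.

```python
import numpy as np

# Illustrative sketch of masked video token modeling. A video becomes a
# (time, height, width) grid of discrete token ids; some are hidden with
# a MASK id and the model must predict the originals at those positions.
rng = np.random.default_rng(0)
T, H, W = 4, 8, 8            # toy token grid from the 3D tokenizer
VOCAB, MASK_ID = 1024, 1024  # MASK_ID sits just outside the codebook

tokens = rng.integers(0, VOCAB, size=(T, H, W))
mask = rng.random((T, H, W)) < 0.5        # hide roughly half the tokens
corrupted = np.where(mask, MASK_ID, tokens)

# The training targets are the original token ids at masked positions.
targets = tokens[mask]
print(corrupted.shape)  # (4, 8, 8)
```

Different synthesis tasks (frame prediction, inpainting, interpolation) then reduce to choosing different masking patterns over the same token grid, which is what lets one model serve ten generation tasks.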
Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
Google presented Imagen Editor, a cascaded diffusion model that addresses the challenge of text-guided image editing. By fine-tuning Imagen on text-guided image inpainting and using object detectors to propose inpainting masks during training, it ensures edits align with text prompts. It also preserves fine details by conditioning on the high-resolution input image.
Evaluation using EditBench, a benchmark for text-guided image inpainting, shows that object-masking during training improves text-image alignment. Imagen Editor outperforms DALL-E 2 and Stable Diffusion, and performs better on object rendering and material/color/size attributes than on count/shape attributes.
RUST: Latent Neural Scene Representations from Unposed Imagery
Another paper presented by the Google team introduces RUST (Really Unposed Scene Representation Transformer), a pose-free approach using RGB images only. By training a Pose Encoder and Decoder, RUST enables novel view synthesis with meaningful camera transformations and accurate pose readouts. Surprisingly, RUST achieves similar quality to methods with perfect camera pose, allowing large-scale training of neural scene representations.
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
Submitted by researchers from the University of California, Los Angeles and Google Research, the paper presents REVEAL, an end-to-end Retrieval-Augmented Visual Language Model. REVEAL encodes world knowledge into a large-scale memory and retrieves from it to answer knowledge-intensive queries. It consists of four components: a memory, an encoder, a retriever, and a generator. The memory encodes diverse multimodal knowledge sources, the retriever finds relevant entries, and the generator combines the retrieved knowledge with the input query to produce the output. REVEAL achieves state-of-the-art performance in visual question answering and image captioning.
On Distillation of Guided Diffusion Models
Classifier-free guided diffusion models, widely used in image generation, suffer from computational inefficiency. Google, Stability AI and LMU Munich propose distilling these models into faster sampling models. The distilled model matches the output of combined conditional and unconditional models, achieving comparable image quality with fewer sampling steps. The approach is up to 256 times faster for pixel-space models and at least 10 times faster for latent-space models. It also proves effective in text-guided image editing and inpainting, requiring only 2-4 denoising steps for high-quality results.
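The distillation target can be made concrete with a small sketch (the toy predictions below are stand-ins, not the paper's networks): classifier-free guidance combines a conditional and an unconditional noise prediction at every sampling step, so the teacher costs two network evaluations per step, and the student is trained so a single evaluation reproduces the combined output.

```python
import numpy as np

# In classifier-free guidance, each step forms a guided prediction
#     eps_guided = (1 + w) * eps_cond - w * eps_uncond
# from two network calls. Distillation trains a student whose single
# call matches eps_guided. The linear "predictions" here are toy
# stand-ins chosen only to make the arithmetic visible.
rng = np.random.default_rng(1)
x = rng.standard_normal(4)   # a noisy sample (toy dimensionality)
w = 2.0                      # guidance weight

eps_cond = 0.9 * x           # stand-in conditional prediction
eps_uncond = 0.5 * x         # stand-in unconditional prediction
eps_guided = (1 + w) * eps_cond - w * eps_uncond  # teacher target

# The distillation objective drives a student output toward the
# teacher target, e.g. via mean squared error:
def distillation_loss(student_out, target):
    return float(np.mean((student_out - target) ** 2))

print(distillation_loss(eps_guided, eps_guided))  # 0.0 when matched
```

Because the student no longer needs the two-call guidance combination, it can also be distilled progressively to sample in far fewer steps, which is where the reported speedups come from.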