The 16th edition of the prestigious International Conference on Computer Vision (ICCV) is scheduled for October 2 to 6 in Paris, France. The event is expected to draw over 2,000 participants from around the world and will showcase cutting-edge research in computer vision through oral and poster presentations, spanning diverse topics such as image and video processing, object detection, scene understanding, motion estimation, 3D vision, machine learning, and applications in robotics and healthcare. Meta, one of the pioneers of the field, is also participating in the event with five of its recent research papers.
This paper explores text-guided human motion generation, a field with broad applications in animation and robotics. While previous efforts using diffusion models have improved motion quality, they are constrained by small-scale motion-capture data, resulting in sub-optimal performance in many real-world scenarios.
The authors propose Make-An-Animation (MAA), a novel text-conditioned human motion generation model. MAA stands out by learning from large-scale image-text datasets, allowing it to cover a much wider range of poses and prompts. The model is trained in two stages: initially on a sizable dataset of (text, static pseudo-pose) pairs extracted from image-text datasets, and subsequently fine-tuned on motion capture data, with additional layers introduced for temporal modeling. In contrast to conventional diffusion models, MAA employs a U-Net architecture akin to recent text-to-video generation models. In human evaluation, the model demonstrates state-of-the-art performance in text-to-motion generation, both in motion realism and in alignment with the input text.
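MAA builds on diffusion-based generation, in which a pose (or motion sequence) is progressively noised during training and the network learns to reverse the process conditioned on text. A minimal NumPy sketch of the standard forward noising step is below; the schedule, dimensions, and function names are illustrative and not taken from the paper:

```python
import numpy as np

def q_sample(x0, t, alphas_cumprod, rng):
    """DDPM forward process: x_t = sqrt(a_bar_t)*x0 + sqrt(1 - a_bar_t)*eps.

    Here x0 stands in for a flattened pose/motion vector; in a model like
    MAA, a network conditioned on text would learn to predict eps."""
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

# Illustrative linear beta schedule (hypothetical values, not the paper's).
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
pose = rng.standard_normal(63)  # e.g. 21 joints x 3 rotation parameters
noisy_pose, eps = q_sample(pose, t=500, alphas_cumprod=alphas_cumprod, rng=rng)
```

The same forward process applies unchanged whether the sample is a single pseudo-pose (stage one) or a full motion clip (stage two), which is what makes the two-stage recipe possible.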
This study, in collaboration with Berkeley AI Research and Kitware, introduces Scale-MAE, a novel pre-training method for large models commonly fine-tuned with augmented imagery. These models often fail to account for scale-specific details, especially in domains like remote sensing. Scale-MAE addresses this issue by explicitly learning relationships between data at different scales during pre-training. It masks input images at known scales, determining the ViT positional encoding scale from the area of the Earth covered by the image rather than from the image resolution.
The masked images are encoded using a standard ViT backbone and then decoded through a bandpass filter, reconstructing low/high-frequency images at lower/higher scales. Tasking the network with reconstructing both frequencies results in robust multiscale representations for remote sensing imagery, outperforming current state-of-the-art models.
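The low/high-frequency targets can be pictured with a simple two-band pyramid: average-pool the image to get the low band and keep the residual as the high band. This is only a fixed-filter analogy for intuition; Scale-MAE's actual decoder is learned and operates on ViT tokens:

```python
import numpy as np

def split_frequencies(img, factor=2):
    """Split an image into a coarse low-frequency band and a full-resolution
    high-frequency residual. Loosely mirrors Scale-MAE's idea of
    reconstructing low/high-frequency targets at lower/higher scales
    (illustrative sketch only, not the paper's learned bandpass decoder)."""
    h, w = img.shape
    # Low band: average-pool by `factor`...
    low_small = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    # ...then nearest-neighbour upsample back to full resolution.
    low = np.repeat(np.repeat(low_small, factor, axis=0), factor, axis=1)
    high = img - low           # high-frequency residual at full resolution
    return low_small, high     # low band lives at the coarser scale

img = np.arange(16, dtype=float).reshape(4, 4)
low_small, high = split_frequencies(img)
# The two bands together reconstruct the original exactly:
recon = np.repeat(np.repeat(low_small, 2, axis=0), 2, axis=1) + high
```

Asking the network to produce both bands forces it to represent content at more than one scale, which is the intuition behind the multiscale representations described above.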
NeRF-Det is a new approach to indoor 3D detection using RGB images. Unlike existing methods, it leverages NeRF to explicitly estimate 3D geometry, improving detection performance. To overcome NeRF’s optimisation latency, the researchers incorporated geometry priors for better generalisation. By linking detection and NeRF via a shared MLP, they efficiently adapt NeRF for detection, yielding geometry-aware volumetric representations.
The method surpasses state-of-the-art results on the ScanNet and ARKitScenes benchmarks. Thanks to joint training, NeRF-Det generalises to unseen scenes for object detection, view synthesis, and depth estimation, eliminating the need for per-scene optimisation.
This paper addresses ethical concerns associated with generative image modeling by proposing an active strategy that integrates image watermarking and Latent Diffusion Models (LDM). The objective is to embed an invisible watermark in all generated images for future detection or identification.
The method quickly fine-tunes the latent decoder of the image generator, conditioned on a binary signature. A pre-trained watermark extractor recovers the hidden signature, and a statistical test determines whether the image originates from the generative model. The study evaluates the effectiveness and durability of the watermarks across various generation tasks, demonstrating the Stable Signature's resilience even after image modifications. By seamlessly integrating watermarking into the generation process of LDMs without requiring architectural changes, the approach aims to mitigate risks associated with the authenticity of AI-generated images, especially issues like deep fakes and copyright misuse. The method proves compatible with various LDM-based generative methods, providing a practical solution for responsible deployment and detection of generated images.
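The statistical test can be illustrated with a simple binomial tail: under the null hypothesis that an image carries no watermark, each extracted bit matches the signature with probability 1/2, so an implausibly high match count exposes a watermarked image. This is a sketch of the kind of test described; the paper's exact thresholds and extractor are not reproduced here:

```python
from math import comb

def watermark_pvalue(matches: int, n_bits: int) -> float:
    """P(X >= matches) for X ~ Binomial(n_bits, 1/2): how likely a random,
    unwatermarked image is to match the signature at least this well.
    A very small p-value flags the image as coming from the watermarked model."""
    tail = sum(comb(n_bits, k) for k in range(matches, n_bits + 1))
    return tail / 2 ** n_bits

# Hypothetical scenario: 45 of 48 signature bits survive a crop + compression.
p = watermark_pvalue(45, 48)  # vanishingly small, so the image is flagged
```

Because the test only needs the extracted bits, it keeps working even when image edits corrupt a few of them, which is why robustness is measured in surviving bits rather than exact recovery.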
Denoising diffusion models prompt a fresh look at the long-held belief that generative modeling is a powerful route to visual understanding. While pre-training directly with these models falls short, a modified approach, in which diffusion models are conditioned on masked input and framed as Masked Autoencoders (DiffMAE), proves effective.
This method serves as a strong initialisation for downstream tasks, excels at image inpainting, and extends naturally to video, achieving top-tier classification accuracy. The study compares design choices, asks whether generative pre-training can genuinely compete with other self-supervised methods on recognition tasks, and establishes connections between masked autoencoders and diffusion models, offering insight into the effectiveness of generative pre-training for visual understanding.
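The key twist of conditioning the diffusion model on visible patches while denoising the masked ones can be sketched as follows; patch shapes, the mask ratio, and function names are illustrative, not the paper's code:

```python
import numpy as np

def diffmae_input(patches, mask_ratio=0.75, rng=None):
    """Build a DiffMAE-style training pair: visible patches stay clean and
    act as conditioning; masked patches are replaced with Gaussian noise
    that the model must denoise back to the original content."""
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    masked = rng.permutation(n)[: int(n * mask_ratio)]
    mask = np.zeros(n, dtype=bool)
    mask[masked] = True
    noisy = patches.copy()
    noisy[mask] = rng.standard_normal(noisy[mask].shape)
    return noisy, mask

# Hypothetical ViT-style input: 14x14 = 196 patches with 768-dim embeddings.
patches = np.random.default_rng(0).standard_normal((196, 768))
noisy, mask = diffmae_input(patches, rng=np.random.default_rng(1))
```

Keeping the visible patches unnoised is what distinguishes this setup from plain diffusion pre-training: the model always has clean context to condition on, as in a masked autoencoder.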