Earlier this month, NVIDIA announced the beta release of Omniverse, a platform where developers and creators can build metaverse applications. The company has aligned its future with the metaverse vision, with the new platform allowing users to create “digital twins” that simulate the real world.
One step towards realising that vision is Magic3D, which helps users render a high-resolution 3D model from a 2D image input or a textual prompt. Recently released by NVIDIA researchers, Magic3D is a text-to-3D synthesis model that creates high-quality 3D mesh models.
The model is a response to Google’s DreamFusion, in which the team used a pre-trained text-to-image diffusion model to optimise Neural Radiance Fields (NeRF), circumventing the lack of large-scale labelled 3D datasets. Magic3D addresses two limitations of DreamFusion: extremely slow optimisation of NeRF, and low-resolution image-space supervision of NeRF.
The model is based on a coarse-to-fine strategy that uses both low- and high-resolution diffusion priors to learn the 3D representation of the target prompt. As a result, the method can generate a high-quality 3D mesh model in 40 minutes, on average two times faster than DreamFusion, while at the same time obtaining eight times higher-resolution supervision.
NVIDIA uses a two-stage optimisation framework to achieve fast, high-quality 3D output from the text prompt.
The first stage obtains a coarse model using a low-resolution diffusion prior, optimising neural field representations (colour, density, and normal fields). In the second stage, a textured 3D mesh is differentiably extracted from the density and colour fields of the coarse model.
The output is then fine-tuned using a high-resolution latent diffusion model, which, after optimisation, generates high-quality 3D meshes with detailed textures.
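The two-stage pipeline can be caricatured as coarse-to-fine optimisation: solve cheaply at low resolution, lift the result to high resolution, then refine. The sketch below is a deliberately simplified illustration of that schedule using a plain L2 loss on a 2D grid; the actual Magic3D method optimises a NeRF and then a textured mesh against diffusion priors, and the `optimise` function, targets, and resolutions here are all hypothetical stand-ins.

```python
import numpy as np

def optimise(field, target, lr=0.5, steps=100):
    """Gradient descent on an L2 loss; a toy stand-in for score-distillation-style updates."""
    for _ in range(steps):
        grad = 2.0 * (field - target)  # d/dfield of ||field - target||^2
        field = field - lr * grad
    return field

rng = np.random.default_rng(0)

# Stage 1: fit a coarse model at low resolution (stands in for the NeRF stage).
coarse_target = rng.random((8, 8))
coarse = optimise(np.zeros((8, 8)), coarse_target)

# "Mesh extraction": lift the coarse result to high resolution by upsampling.
fine_init = np.kron(coarse, np.ones((8, 8)))  # 64x64 initialisation

# Stage 2: refine at high resolution (stands in for latent-diffusion fine-tuning).
fine_target = np.kron(coarse_target, np.ones((8, 8)))
fine = optimise(fine_init, fine_target)

print(fine.shape, float(np.abs(fine - fine_target).max()))
```

Because stage 2 starts from the upsampled coarse solution rather than from scratch, far fewer high-resolution steps are needed, which is the intuition behind Magic3D's speed-up over optimising at full resolution throughout.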
The model also allows for prompt-based editing. That is, given a coarse model generated from a base text prompt, parts of the text can be modified, and the NeRF and 3D mesh models fine-tuned, to obtain an edited high-resolution 3D mesh model.
Additionally, Magic3D supports subject-driven editing: given an input image, fine-tuning the diffusion model with DreamBooth and optimising the 3D models with the given prompts ensures that the subject in the rendered 3D output stays maximally faithful to the subject of the input image.
Using the style-transfer capabilities of eDiffi, NVIDIA’s text-to-image diffusion model, the style of an input image can also be transferred to the output 3D model.
NVIDIA Corporation, known for its hardware prowess, has found a strong foothold in generative AI, even amidst relentless competition from large technology companies like Microsoft, Google, and Meta, which have been actively integrating cutting-edge AI models into their platforms.