DALL-E 2 was one of the hottest generative models of 2022, and OpenAI has now released a sibling to this highly capable diffusion model. In a paper submitted on 16th December, the OpenAI team described Point-E, a method for generating 3D point clouds from complex text prompts.
With this, AI enthusiasts can move beyond text-to-2D-image generation and synthesize 3D models from text. The project has been open-sourced on GitHub, along with model weights at several parameter counts.
The model is only one part of what makes the solution work. The crux of the paper lies in the proposed method for creating 3D objects through diffusion over point clouds. The algorithm was designed with virtual reality, gaming, and industrial design in mind, as it can generate 3D objects up to 600x faster than existing methods.
Text-to-3D models currently work in one of two ways. The first is to train generative models on paired 3D-object-and-text data; this struggles with complex prompts and is limited by the scarcity of large 3D datasets. The second is to leverage pretrained text-to-image models to optimize a 3D representation of the prompt.
Point-E combines these traditional approaches to text-to-3D synthesis. By pairing two separate models, Point-E cuts down the time needed to create a 3D object. The first is a text-to-image model (a fine-tuned version of GLIDE) that creates an image from the prompt given by the user. This image then serves as the conditioning input for the second model, which converts it into a 3D object.
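The two-stage pipeline can be sketched as follows. The function names and shapes here are illustrative assumptions, not the actual point-e API: each stage is stubbed out with random output so the structure of text → image → point cloud is visible.

```python
import numpy as np

# Hypothetical stand-ins for the two stages of Point-E's pipeline.
# These are NOT the real point_e library calls; both models are
# stubbed with random arrays so the data flow is clear.

def text_to_image(prompt: str) -> np.ndarray:
    # Stage 1: a text-conditioned image diffusion model (GLIDE in the
    # paper) would render a view of the object. Stubbed as random pixels.
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.random((64, 64, 3))

def image_to_point_cloud(image: np.ndarray, num_points: int = 1024) -> np.ndarray:
    # Stage 2: an image-conditioned point-cloud diffusion model would
    # produce num_points points, each carrying xyz coordinates plus RGB.
    rng = np.random.default_rng(0)
    return rng.random((num_points, 6))

def text_to_3d(prompt: str) -> np.ndarray:
    # Chain the two stages: the image conditions the point-cloud model.
    return image_to_point_cloud(text_to_image(prompt))

cloud = text_to_3d("a red traffic cone")
print(cloud.shape)  # (1024, 6)
```

The key design choice is that neither stage ever sees paired text-and-3D data: the text understanding lives entirely in the first model, which is why the second model can be trained on rendered images alone.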
The OpenAI team assembled a dataset of several million 3D models, which they rendered with Blender. These renders were then processed to extract point clouds, a representation that describes a 3D object as a set of points sampled from its surface. After further processing, such as removing flat objects and clustering by CLIP features, the dataset was ready to train the view-synthesis GLIDE model.
The researchers then devised a new method for point-cloud diffusion by representing each point cloud as a tensor in which every point carries its coordinates and colour. Starting from random noise, these tensors are progressively denoised into the shape of the required 3D object. The output of this diffusion model is run through a point-cloud upsampler that improves the quality of the final result. For compatibility with common 3D applications, the point clouds can then be converted into meshes.
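The progressive-denoising idea can be illustrated with a toy loop over a point-cloud tensor. In the real model a transformer predicts the denoising target at each step; here a known target array stands in for that network, so this is a sketch of the iteration structure only, not of Point-E's actual sampler.

```python
import numpy as np

# Toy sketch of progressive denoising on a point-cloud tensor.
# A real diffusion model would predict the denoised points with a
# learned network; here the known target stands in for that prediction.

def denoise(target: np.ndarray, steps: int = 64, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)        # start from pure noise
    for t in range(steps):
        predicted = target                       # stand-in for the model's output
        alpha = (t + 1) / steps                  # denoising schedule
        x = (1 - alpha) * x + alpha * predicted  # step toward the prediction
    return x

target = np.zeros((1024, 6))  # stand-in tensor for the desired 3D object
out = denoise(target)
print(float(np.abs(out - target).max()))  # 0.0 after the final step
```

The same loop runs twice in the pipeline described above: once at low resolution, and once in the upsampler, which conditions on the coarse cloud to add more points.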
These meshes can then be used in games, metaverse applications, or other 3D-intensive tasks like post-production for movies. While DALL-E has already revolutionized text-to-image generation, Point-E aims to do the same for 3D. Creating on-demand 3D objects and shapes quickly is an important step towards generating entire 3D landscapes with artificial intelligence.