A big chunk of the back-to-back releases in the large language model space this year has also landed in the large text-to-image generation niche. 2021 started with OpenAI releasing DALL·E, a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions using a dataset of text-image pairs. Just a few days back, the company released GLIDE (Guided Language to Image Diffusion for Generation and Editing), a new 3.5-billion-parameter text-to-image generation model that outperforms DALL·E. Around a month back, NVIDIA, which has been active in this area for quite some time, revealed GauGAN2, the sequel to its GauGAN model, which lets users create realistic landscape images. It can convert words into photographic-quality images that one can then alter.
With the battle between large text-to-image generation models heating up, let’s see what GLIDE and GauGAN2 bring to the table.
NVIDIA says GauGAN2 combines multiple modalities, such as text, semantic segmentation, sketch and style, within a single GAN framework. This allows an artist's vision to be turned into a high-quality AI-generated image.
With GauGAN2, users can enter a brief phrase to quickly generate an image's key features and theme. NVIDIA gives the example of a snow-capped mountain range, which can then be customised with sketches: making a specific mountain taller, adding a couple of trees or other details in the foreground, or adding clouds to the sky.
In the paper titled “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, the researchers at OpenAI said that for GLIDE, they trained a diffusion model that uses a text encoder to condition on natural language descriptions. They then compared CLIP guidance and classifier-free guidance as techniques for steering diffusion models towards text prompts, and found that classifier-free guidance yields higher-quality images under both human and automated evaluations.
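Classifier-free guidance works by combining two noise predictions from the same diffusion model, one conditioned on the text and one unconditioned, and extrapolating in the conditional direction. A minimal sketch of that combination step (toy numbers, not GLIDE's actual code):

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Combine conditional and unconditional noise predictions.

    eps_guided = eps_uncond + s * (eps_cond - eps_uncond); a scale s > 1
    pushes each denoising step further toward the text condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions for a 4-pixel "image"
eps_cond = np.array([0.5, -0.2, 0.1, 0.3])
eps_uncond = np.array([0.4, -0.1, 0.0, 0.2])

guided = classifier_free_guidance(eps_cond, eps_uncond, guidance_scale=3.0)
print(guided)  # [0.7, -0.4, 0.3, 0.5]: each entry moved 3x further toward the condition
```

The appeal of this scheme, as opposed to CLIP guidance, is that it needs no separate classifier or CLIP model at sampling time; the unconditional prediction is obtained simply by dropping the text during some training steps.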
They also fine-tuned the model to perform inpainting: random regions of training examples are erased, and the remaining portions are fed into the model along with a mask channel as additional conditioning information.
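Constructing such an inpainting training example amounts to zeroing out a random region and keeping a binary mask so the model knows which pixels to fill in. A hypothetical helper illustrating the idea (the function name and rectangle-shaped mask are assumptions, not OpenAI's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_inpainting_example(image):
    """Erase a random rectangle; return (masked image, mask).

    The mask (1 = known pixel, 0 = erased) is what gets fed to the
    model as an extra conditioning channel alongside the masked image.
    """
    h, w = image.shape[:2]
    y0 = rng.integers(0, h // 2)
    x0 = rng.integers(0, w // 2)
    y1, x1 = y0 + h // 4, x0 + w // 4          # erase a quarter-size rectangle
    mask = np.ones((h, w), dtype=np.float32)
    mask[y0:y1, x0:x1] = 0.0
    masked = image * mask[..., None]           # zero out the erased region
    return masked, mask

image = rng.random((8, 8, 3)).astype(np.float32)
masked, mask = make_inpainting_example(image)
```

At sampling time the same mechanism lets GLIDE edit an existing image: the user erases a region, and the model fills it in consistently with the prompt and the surrounding pixels.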
GauGAN2: 10 million high-quality landscape images
GauGAN2 was trained on 10 million high-quality landscape images using NVIDIA Selene, an NVIDIA DGX SuperPOD supercomputer, says NVIDIA. The team used a neural network that learns the connection between words and the visuals they correspond to, such as “winter”, “foggy” or “rainbow”.
GLIDE: 3.5 billion parameter text-conditional diffusion model
The researchers trained a 3.5-billion-parameter text-conditional diffusion model at 64 × 64 resolution, along with a 1.5-billion-parameter text-conditional upsampling diffusion model to increase the resolution to 256 × 256. For CLIP guidance, they trained a noise-aware 64 × 64 ViT-L CLIP model. To condition on text, they encoded it into a sequence of K tokens and fed these tokens into a Transformer model, whose output is used in two ways: the final token embedding is used in place of a class embedding in the ADM model, and the last layer of token embeddings is separately projected to each attention layer’s dimensionality throughout the ADM model and then concatenated to the attention context at each layer, as per the paper.
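The shape bookkeeping behind those two conditioning pathways can be sketched with toy dimensions (the sizes below are illustrative; GLIDE's are far larger, and the real model uses learned projections per layer):

```python
import numpy as np

rng = np.random.default_rng(0)

K, d_text, d_attn = 8, 16, 32  # toy sizes, not GLIDE's actual widths

# Stand-in for the text Transformer's output: one embedding per token
token_embeddings = rng.standard_normal((K, d_text))

# Pathway 1: the final token embedding plays the role of a class embedding
class_embedding = token_embeddings[-1]                # shape (d_text,)

# Pathway 2: project all token embeddings to one attention layer's width
# (in the real model there is a separate projection per layer)
W_proj = rng.standard_normal((d_text, d_attn))
projected = token_embeddings @ W_proj                 # shape (K, d_attn)

# Concatenate the projected text tokens to the image-side attention context,
# so image queries can attend to the text tokens as extra keys/values
attn_context = rng.standard_normal((64, d_attn))      # 64 image positions
extended_context = np.concatenate([attn_context, projected], axis=0)
print(extended_context.shape)  # (72, 32): image positions + K text tokens
```

The concatenation is the key design choice: rather than conditioning only through a single global vector, every attention layer in the diffusion UNet can look directly at the individual text tokens.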
The neural network behind GauGAN2 produces a larger variety of higher-quality images than state-of-the-art models built specifically for text-to-image or segmentation-map-to-image applications, said NVIDIA.
GLIDE was compared against DALL·E using a human evaluation protocol. Three sets of comparisons between DALL·E and GLIDE were run, said the paper:
- both models were compared with no CLIP reranking;
- CLIP reranking was used only for DALL·E;
- CLIP reranking was used for DALL·E, and GLIDE samples were additionally projected through the discrete VAE used by DALL·E.
The team said the evaluations were done using two temperatures for the DALL·E model. The results show that human evaluators preferred GLIDE in all settings.