Tech innovator OpenAI has closed out 2021 with a bang: the release of GLIDE (Guided Language to Image Diffusion for Generation and Editing), a new 3.5-billion-parameter text-to-image generation model that outperforms DALL-E. At the beginning of 2021, the company released DALL-E, a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions using a dataset of text-image pairs. For GLIDE, OpenAI has trained a smaller model on a filtered dataset and released the code and weights.
In the accompanying paper, the researchers report that samples generated with classifier-free guidance are both photorealistic and reflect a broad range of world knowledge. Human judges preferred these samples to DALL-E's 87% of the time when evaluated for photorealism, and 69% of the time when evaluated for caption similarity.
CLIP and classifier-free guidance
In the paper, titled “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, the researchers describe training a 3.5-billion-parameter diffusion model that uses a text encoder to condition on natural-language descriptions. They then compared CLIP guidance and classifier-free guidance as ways of steering diffusion models towards text prompts. CLIP (Contrastive Language-Image Pre-training) is a neural network that learns transferable visual models from natural-language supervision. The researchers found that classifier-free guidance yields higher-quality images under both human and automated evaluations.
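Classifier-free guidance mixes the model's conditional and unconditional noise predictions at sampling time. The sketch below shows only that mixing rule; the toy NumPy arrays stand in for real diffusion-model outputs and are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Combine conditional and unconditional noise predictions.

    eps_hat = eps_uncond + s * (eps_cond - eps_uncond)
    With s = 1 this reduces to the conditional prediction; s > 1
    pushes samples further toward the text condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions (stand-ins for the diffusion model's outputs).
eps_cond = np.array([0.5, -0.2])
eps_uncond = np.array([0.1, 0.0])

guided = classifier_free_guidance(eps_cond, eps_uncond, guidance_scale=3.0)
print(guided)  # the caption-favoured direction is amplified
```

Larger guidance scales trade sample diversity for fidelity to the caption, which is why the paper evaluates guidance strength with both human and automated metrics.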
The model also offers editing capabilities alongside zero-shot generation, allowing humans to iteratively refine model samples until they match more complex prompts. The team further fine-tuned the model to perform image inpainting; edits produced by the model match the style and lighting of the surrounding context, including convincing shadows and reflections.
How GLIDE was trained
The paper says the researchers trained a 3.5-billion-parameter text-conditional diffusion model at 64 × 64 resolution, plus a 1.5-billion-parameter text-conditional upsampling diffusion model to increase the resolution to 256 × 256. For CLIP guidance, they trained a noise-aware 64 × 64 ViT-L CLIP model. To condition on text, they encoded the caption into a sequence of K tokens and fed these tokens into a Transformer model. The Transformer's output is used in two ways: the final token embedding replaces the class embedding in the ADM model, and the last layer of token embeddings is separately projected to the dimensionality of each attention layer throughout the ADM model, then concatenated to the attention context at each layer, as per the paper.
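The two uses of the text Transformer's output can be sketched with plain NumPy. The dimensions, random projections, and the stand-in image context below are illustrative assumptions, not the paper's actual widths.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d_text, d_attn = 8, 16, 32  # token count and widths (illustrative only)
token_embeddings = rng.standard_normal((K, d_text))  # Transformer's last layer

# Use 1: the final token embedding stands in for a class embedding in ADM.
class_embedding = token_embeddings[-1]

# Use 2: project all K token embeddings to the attention layer's width
# and concatenate them to that layer's existing attention context.
proj = rng.standard_normal((d_text, d_attn))          # per-layer projection
text_context = token_embeddings @ proj                # (K, d_attn)
image_context = rng.standard_normal((64, d_attn))     # stand-in for image tokens
attn_context = np.concatenate([image_context, text_context], axis=0)

print(class_embedding.shape, attn_context.shape)  # (16,) (72, 32)
```

In the real model this projection and concatenation is repeated at every attention layer, so each layer can attend to the caption tokens directly.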
The model is trained on the same dataset as DALL-E. Its visual component is scaled to 512 channels, while the text-encoding Transformer uses 24 residual blocks of width 2048, accounting for roughly 1.2 billion parameters.
The researchers also fine-tuned the model to perform inpainting: random regions of training examples are erased, and the remaining portions are fed into the model, along with a mask channel, as additional conditioning information.
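The masking step described above can be sketched as follows. The rectangular-erasure scheme, image size, and helper name are illustrative assumptions; only the idea of concatenating the masked image with a mask channel comes from the paper.

```python
import numpy as np

def make_inpainting_input(image, rng):
    """Erase a random rectangular region of `image` and return the
    masked image concatenated with a mask channel (1 = kept, 0 = erased)."""
    h, w, c = image.shape
    mask = np.ones((h, w, 1), dtype=image.dtype)
    y0, x0 = rng.integers(0, h // 2), rng.integers(0, w // 2)
    mask[y0:y0 + h // 4, x0:x0 + w // 4] = 0.0  # erase a quarter-size patch
    masked = image * mask
    # Model input: masked image plus the mask as an extra channel.
    return np.concatenate([masked, mask], axis=-1)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
model_input = make_inpainting_input(image, rng)
print(model_input.shape)  # (64, 64, 4)
```

During fine-tuning the model learns to fill the zeroed region so that it agrees with the visible context.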
OpenAI said it “trained noise-aware CLIP models with an image encoder f(x_t, t) that receives noised images x_t and is otherwise trained with the same objective as the original CLIP model.”
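At sampling time, CLIP guidance perturbs the denoising step along the gradient of the image-caption similarity f(x_t, t) · g(caption). The sketch below uses a toy linear "encoder" so the gradient stays analytic; the real encoders are neural networks, and all names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_emb = 6, 4

# Toy linear stand-in for the noise-aware image encoder: f(x_t, t) = W @ x_t.
W = rng.standard_normal((d_emb, d_img))
caption_embedding = rng.standard_normal(d_emb)  # g(caption), assumed given

def clip_guidance_grad(x_t):
    # Gradient of f(x_t) . g with respect to x_t is W^T g for a linear encoder.
    return W.T @ caption_embedding

x_t = rng.standard_normal(d_img)
scale = 2.0
mu = x_t.copy()  # stand-in for the diffusion model's predicted mean
mu_guided = mu + scale * clip_guidance_grad(x_t)  # nudge toward the caption
```

Training the CLIP encoder on noised images is what makes this gradient meaningful partway through the diffusion process, when x_t is still noisy.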
GLIDE vs DALL-E
GLIDE was compared against DALL-E using the paper's human evaluation protocol. Three sets of comparisons were carried out, as per the paper:
- No CLIP reranking for either model
- CLIP reranking for DALL-E only
- CLIP reranking for DALL-E, with GLIDE samples also projected through the discrete VAE used by DALL-E
The evaluations were run at two sampling temperatures for the DALL-E model. In every setting, human evaluators preferred GLIDE.