OpenAI Brings Out GLIDE, Outperforms Its Own DALL-E

GLIDE (Guided Language to Image Diffusion for Generation and Editing) is a 3.5 billion parameter text-to-image generation model

Tech innovator OpenAI is closing out 2021 with a bang: the release of GLIDE (Guided Language to Image Diffusion for Generation and Editing), a new 3.5 billion parameter text-to-image generation model that outperforms DALL-E. At the beginning of 2021, OpenAI released DALL-E, a 12-billion parameter version of GPT-3 trained to generate images from text descriptions using a dataset of text-image pairs. For GLIDE, it has trained a smaller model on a filtered dataset and released the code and weights.

The paper released by OpenAI says that samples generated by the model with classifier-free guidance are both photorealistic and reflect a broad range of world knowledge. Human judges preferred GLIDE's samples to those from DALL-E 87% of the time when evaluated for photorealism and 69% of the time when evaluated for caption similarity.

CLIP and classifier-free guidance

In the paper, titled “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, the researchers describe training a 3.5 billion parameter diffusion model that uses a text encoder to condition on natural language descriptions. They then compared CLIP guidance and classifier-free guidance as techniques for steering diffusion models towards text prompts. CLIP (Contrastive Language-Image Pretraining) is a neural network trained to learn transferable visual models from natural language supervision. The researchers found that classifier-free guidance yields higher quality images under both human and automated evaluations.
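Classifier-free guidance mixes the diffusion model's noise prediction with and without the text prompt. A minimal sketch of the guidance rule, with toy arrays standing in for real noise predictions (the function and variable names here are illustrative, not from the GLIDE codebase):

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Combine conditional and unconditional noise predictions.

    scale > 1 pushes samples toward the text prompt; scale == 1
    recovers the plain conditional prediction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy noise predictions for a 2x2 "image".
eps_c = np.array([[0.5, -0.2], [0.1, 0.3]])
eps_u = np.array([[0.4, -0.1], [0.0, 0.2]])

guided = classifier_free_guidance(eps_c, eps_u, scale=3.0)
```

Raising the guidance scale trades sample diversity for fidelity to the prompt, which is why the paper evaluates both guidance strategies with human judges.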



They also said they provided the model with editing capabilities along with zero-shot generation. This allows humans to iteratively improve model samples until they match more complex prompts. The team also fine-tuned the model to perform image inpainting. Edits produced by the model match the style and lighting of the surrounding context, including convincing shadows and reflections.

Image: GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

How GLIDE was trained

The paper says that the researchers trained a 3.5 billion parameter text-conditional diffusion model at 64 × 64 resolution, along with a 1.5 billion parameter text-conditional upsampling diffusion model to increase the resolution to 256 × 256. For CLIP guidance, they trained a noise-aware 64 × 64 ViT-L CLIP model. For text conditioning, the text is encoded into a sequence of K tokens, which are fed into a Transformer model. The Transformer's output is used in two ways: the final token embedding is used in place of a class embedding in the ADM model, and the last layer of token embeddings is separately projected to the dimensionality of each attention layer throughout the ADM model and then concatenated to the attention context at each layer, as per the paper.

The model is trained on the same dataset as DALL-E and uses the ADM architecture scaled to 512 channels, while the Transformer used for text encoding has 24 residual blocks of width 2048, accounting for roughly 1.2 billion parameters.


The researchers also fine-tuned the model to perform inpainting during which random regions of training examples are erased, and the remaining portions are fed into the model along with a mask channel as additional conditioning information.
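The masking scheme described above can be sketched in a few lines; the rectangle-erasing logic and function name below are illustrative assumptions, not the paper's actual sampling procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_inpainting_input(image):
    """Erase a random rectangle and append a binary mask channel.

    image: (H, W, C) float array. Returns (H, W, C + 1): the masked
    image plus a channel that is 1 where pixels were kept, 0 where erased.
    """
    h, w, _ = image.shape
    y0, x0 = rng.integers(0, h // 2), rng.integers(0, w // 2)
    y1, x1 = y0 + h // 4, x0 + w // 4
    mask = np.ones((h, w, 1), dtype=image.dtype)
    mask[y0:y1, x0:x1] = 0.0
    return np.concatenate([image * mask, mask], axis=-1)

img = rng.random((64, 64, 3))
conditioned = make_inpainting_input(img)
```

The extra mask channel tells the model which pixels are trustworthy context and which it must fill in, which is what lets the edits match the surrounding style and lighting.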

OpenAI said it “trained noise-aware CLIP models with an image encoder fi(xt, t) that receives noised images xt and is otherwise trained with the same objective as the original CLIP model.”
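The noised images x_t that the noise-aware encoder receives come from the standard diffusion forward process. A minimal sketch, where the noise-schedule value is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_image(x0, alpha_bar_t):
    """Diffusion forward process: x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

x0 = rng.random((64, 64, 3))
x_t = noise_image(x0, alpha_bar_t=0.5)  # illustrative schedule value
# A noise-aware CLIP image encoder f(x_t, t) would be trained on such
# (x_t, t) pairs instead of on clean images.
```

Training CLIP on noised inputs matters because during guided sampling the encoder only ever sees partially denoised images, which ordinary CLIP was never trained on.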


GLIDE was compared against DALL-E using the researchers' human evaluation protocol. Three sets of comparisons between DALL-E and GLIDE were run, as per the paper:

  • Neither model uses CLIP reranking
  • CLIP reranking is used only for DALL-E
  • CLIP reranking is used for DALL-E, and GLIDE samples are additionally projected through the discrete VAE used by DALL-E
Image: GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

The team said the evaluations were run at two sampling temperatures for the DALL-E model. The results show that GLIDE is preferred by human evaluators in all settings.
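CLIP reranking, as used in these comparisons, amounts to scoring candidate samples by image-text similarity and keeping the best one. A minimal sketch, with random vectors standing in for real CLIP embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def rerank(image_embs, text_emb):
    """Return the index of the candidate whose (normalized) embedding
    has the highest cosine similarity with the caption embedding."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    return int(np.argmax(img @ txt))

# 5 candidate samples with 8-dim embeddings (stand-ins for CLIP features).
images = rng.normal(size=(5, 8))
caption = rng.normal(size=8)
best = rerank(images, caption)
```

Because reranking gives DALL-E extra selection power, running the comparison both with and without it makes the preference for GLIDE harder to attribute to sampling tricks alone.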

Sreejani Bhattacharyya is a technology journalist at AIM.
