OpenAI Brings Out GLIDE, Outperforms Its Own DALL-E

GLIDE (Guided Language to Image Diffusion for Generation and Editing) is a 3.5 billion parameter text-to-image generation model

Tech innovator OpenAI has closed out 2021 with a bang, releasing GLIDE (Guided Language to Image Diffusion for Generation and Editing), a new 3.5 billion parameter text-to-image generation model that outperforms DALL-E. At the beginning of 2021, OpenAI released DALL-E, a 12-billion parameter version of GPT-3 trained to generate images from text descriptions using a dataset of text-image pairs. Alongside GLIDE, it has trained a smaller model on a filtered dataset and released that model's code and weights.

In the accompanying paper, the researchers report that samples generated by the model with classifier-free guidance are both photorealistic and reflect a broad range of world knowledge. When evaluated by human judges, these samples were preferred to those from DALL-E 87% of the time for photorealism and 69% of the time for caption similarity.

CLIP and classifier-free guidance

In the paper, titled "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models", the researchers describe training a 3.5 billion parameter diffusion model that uses a text encoder to condition on natural language descriptions. They then compared two techniques for guiding diffusion models towards text prompts: CLIP guidance and classifier-free guidance. CLIP (Contrastive Language-Image Pre-training) is a neural network that learns transferable visual representations from natural language supervision. The researchers found that classifier-free guidance yields higher-quality images in both human and automated evaluations.
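Classifier-free guidance works by extrapolating the model's noise prediction away from an unconditional prediction and towards a caption-conditioned one. The following is a minimal sketch of that combination step; the function name and the use of NumPy arrays are illustrative assumptions, not taken from the GLIDE codebase:

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and caption-conditional noise predictions.

    eps_uncond, eps_cond: the model's outputs eps(x_t) and eps(x_t | caption)
    guidance_scale: s > 1 pushes samples further towards the caption;
        s = 1 recovers standard conditional sampling.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With a scale above 1, the prediction overshoots the conditional direction, trading some sample diversity for fidelity to the prompt.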

The researchers also gave the model editing capabilities alongside zero-shot generation, allowing users to iteratively refine samples until they match more complex prompts. The team additionally fine-tuned the model to perform image inpainting; edits produced by the model match the style and lighting of the surrounding context, including convincing shadows and reflections.

Image: GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models


How GLIDE was trained

The paper says the researchers trained a 3.5 billion parameter text-conditional diffusion model at 64 × 64 resolution, plus a 1.5 billion parameter text-conditional upsampling diffusion model to increase the resolution to 256 × 256. For CLIP guidance, they also trained a noise-aware 64 × 64 ViT-L CLIP model. For text conditioning, the caption is first encoded into a sequence of K tokens and fed into a Transformer. The Transformer's output is used in two ways: the final token embedding is used in place of a class embedding in the ADM model, and the last layer of token embeddings is separately projected to the dimensionality of each attention layer throughout the ADM model, then concatenated to the attention context at each layer.
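The two conditioning pathways above can be sketched roughly as follows; all shapes, the random projection matrix, and the variable names are invented for illustration and do not reflect the actual model dimensions:

```python
import numpy as np

# Hypothetical shapes: K caption tokens, text-transformer width d_txt,
# one attention layer with context length L and width d_attn.
K, d_txt, L, d_attn = 128, 2048, 64, 512

rng = np.random.default_rng(0)
token_embs = rng.standard_normal((K, d_txt))   # last-layer transformer outputs
final_emb = token_embs[-1]                     # used in place of a class embedding

# Pathway 2: project the full token sequence to the attention width of this
# layer, then concatenate it onto the layer's attention context.
W_proj = rng.standard_normal((d_txt, d_attn)) / np.sqrt(d_txt)
projected = token_embs @ W_proj                          # (K, d_attn)
attn_context = rng.standard_normal((L, d_attn))          # image-side context
attn_context = np.concatenate([attn_context, projected], axis=0)  # (L + K, d_attn)
```

The effect is that every attention layer can attend over the caption tokens as extra keys and values, while the final token embedding plays the role a class label would in a class-conditional diffusion model.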

The model is trained on the same dataset as DALL-E. Its visual backbone is the ADM (ablated diffusion model) architecture scaled up to 512 channels, while the text-encoding Transformer uses 24 residual blocks of width 2048, contributing roughly 1.2 billion parameters.

Inpainting

The researchers also fine-tuned the model to perform inpainting: during fine-tuning, random regions of training examples are erased, and the remaining portions are fed into the model, along with a mask channel, as additional conditioning information.
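Constructing that conditioning input can be sketched as below; the image size, rectangle-shaped erasure, and channel layout are assumptions made for illustration:

```python
import numpy as np

# Hypothetical inpainting conditioning for a single RGB training example.
H, W = 64, 64
rng = np.random.default_rng(0)
image = rng.random((H, W, 3))

# Erase a random rectangular region and record it in a binary mask channel
# (1 = kept pixel, 0 = erased pixel).
mask = np.ones((H, W, 1))
y0, x0 = rng.integers(0, H // 2), rng.integers(0, W // 2)
mask[y0:y0 + H // 4, x0:x0 + W // 4] = 0.0
masked_image = image * mask

# The diffusion model then receives the masked image plus the mask as extra
# conditioning channels alongside its noised input x_t.
conditioning = np.concatenate([masked_image, mask], axis=-1)  # (H, W, 4)
```

Because the model always sees which pixels are known, it learns to fill only the erased region while staying consistent with the surrounding context.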

OpenAI said it "trained noise-aware CLIP models with an image encoder f(x_t, t) that receives noised images x_t and is otherwise trained with the same objective as the original CLIP model."
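In CLIP guidance, the reverse-process mean is perturbed by the gradient of the CLIP score (the dot product of the noised-image embedding and the caption embedding) with respect to the noised image. A minimal sketch of that mean-shift step, with illustrative names that are not from the GLIDE codebase:

```python
import numpy as np

def clip_guided_mean(mu, sigma2, grad_clip_score, scale):
    """Shift the reverse-process mean towards a higher CLIP score.

    mu, sigma2: per-pixel mean and variance predicted by the diffusion model
    grad_clip_score: gradient of f(x_t, t) . g(caption) w.r.t. x_t, computed
        with the noise-aware CLIP image encoder f and text encoder g
    scale: guidance strength (0 recovers unguided sampling)
    """
    return mu + scale * sigma2 * grad_clip_score
```

This is why the CLIP model must be noise-aware: an off-the-shelf CLIP encoder gives unreliable gradients on the heavily noised images that appear early in sampling.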

GLIDE vs DALL-E

GLIDE was compared against DALL-E using the researchers' human evaluation protocol. As per the paper, three sets of comparisons were run:

  • Both models compared with no CLIP reranking
  • CLIP reranking used only for DALL-E
  • CLIP reranking used for DALL-E, with GLIDE samples additionally projected through the discrete VAE used by DALL-E

Image: GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

The team said that the evaluations were done using two temperatures for the DALL-E model. The results show that GLIDE is preferred by human evaluators in all settings.


Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at sreejani.bhattacharyya@analyticsindiamag.com

Our Upcoming Events

24th Mar, 2023 | Webinar
Women-in-Tech: Are you ready for the Techade

27-28th Apr, 2023 I Bangalore
Data Engineering Summit (DES) 2023

23 Jun, 2023 | Bangalore
MachineCon India 2023 [AI100 Awards]

21 Jul, 2023 | New York
MachineCon USA 2023 [AI100 Awards]

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox
MOST POPULAR

Council Post: Evolution of Data Science: Skillset, Toolset, and Mindset

In my opinion, there will be considerable disorder and disarray in the near future concerning the emerging fields of data and analytics. The proliferation of platforms such as ChatGPT or Bard has generated a lot of buzz. While some users are enthusiastic about the potential benefits of generative AI and its extensive use in business and daily life, others have raised concerns regarding the accuracy, ethics, and related issues.