
OpenAI Brings Out GLIDE, Outperforms Its Own DALL-E

GLIDE (Guided Language to Image Diffusion for Generation and Editing) is a 3.5 billion parameter text-to-image generation model

Tech innovator OpenAI is saying goodbye to 2021 with a bang: the release of GLIDE (Guided Language to Image Diffusion for Generation and Editing), a new 3.5 billion parameter text-to-image generation model that performs even better than DALL-E. At the beginning of 2021, the company released DALL-E, a 12-billion parameter version of GPT-3 trained to generate images from text descriptions using a dataset of text-image pairs. For GLIDE, OpenAI has also trained a smaller model on a filtered dataset and released its code and weights.

In the paper, the researchers report that samples generated by the model with classifier-free guidance are both photorealistic and reflect a diverse range of world knowledge. Human judges preferred GLIDE's samples to those from DALL-E 87% of the time when evaluated for photorealism and 69% of the time when evaluated for caption similarity.

CLIP and classifier-free guidance

In the paper, titled “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, the researchers describe training a 3.5 billion parameter diffusion model that uses a text encoder to condition on natural language descriptions. They then compared two techniques for guiding diffusion models towards text prompts: CLIP guidance and classifier-free guidance. CLIP (Contrastive Language-Image Pretraining) is a neural network that learns transferable visual models from natural language supervision. In both human and automated evaluations, the researchers found that classifier-free guidance yields higher quality images.
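Classifier-free guidance itself is a simple idea: run the diffusion model twice per step, once with the caption and once with an empty caption, and extrapolate from the unconditional prediction towards the conditional one. Below is a minimal sketch of that technique; the names `eps_model`, `text_emb` and `null_emb` are hypothetical stand-ins for the diffusion network and its caption/empty-caption embeddings, not OpenAI's actual code.

```python
def classifier_free_guided_eps(eps_model, x_t, t, text_emb, null_emb,
                               guidance_scale=3.0):
    """Classifier-free guidance for a diffusion model (sketch).

    eps_model(x_t, t, cond) is assumed to predict the noise in x_t
    at timestep t, conditioned on a caption embedding.
    """
    eps_cond = eps_model(x_t, t, text_emb)    # prediction with the caption
    eps_uncond = eps_model(x_t, t, null_emb)  # prediction with an empty caption
    # guidance_scale > 1 trades sample diversity for fidelity to the prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

The guided noise estimate then replaces the plain conditional estimate at every step of the reverse diffusion process; unlike CLIP guidance, no separate classifier or CLIP model is needed at sampling time.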

They also said the model has editing capabilities along with zero-shot generation, which allows humans to iteratively refine model samples until they match more complex prompts. The team also fine-tuned the model to perform image inpainting; edits produced by the model match the style and lighting of the surrounding context, including convincing shadows and reflections.

Image: GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

How GLIDE was trained

The paper said that the researchers trained a 3.5 billion parameter text-conditional diffusion model at 64 × 64 resolution, plus an additional 1.5 billion parameter text-conditional upsampling diffusion model to increase the resolution to 256 × 256. For CLIP guidance, they also trained a noise-aware 64 × 64 ViT-L CLIP model. To condition on text, the caption is encoded into a sequence of K tokens that are fed into a Transformer model, whose output is used in two ways: first, the final token embedding is used in place of a class embedding in the ADM model; second, the last layer of token embeddings is separately projected to the dimensionality of each attention layer throughout the ADM model and then concatenated to the attention context at each layer, as per the paper.
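The second mechanism can be pictured as each attention layer in the image model attending over the caption tokens as well as the image features. Here is a minimal PyTorch sketch of that conditioning scheme, assuming illustrative dimensions rather than the paper's exact values:

```python
import torch
import torch.nn as nn

class TextConditionedAttention(nn.Module):
    """Sketch: caption token embeddings are projected to the attention
    width of this layer and concatenated to the image attention context."""

    def __init__(self, img_dim, text_dim, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, img_dim)  # separate projection per layer
        self.attn = nn.MultiheadAttention(img_dim, num_heads, batch_first=True)

    def forward(self, img_tokens, text_tokens):
        # img_tokens: (B, H*W, img_dim), text_tokens: (B, K, text_dim)
        ctx = torch.cat([img_tokens, self.text_proj(text_tokens)], dim=1)
        # image positions attend over both image features and caption tokens
        out, _ = self.attn(img_tokens, ctx, ctx)
        return out
```

Because each ADM attention layer gets its own projection of the caption tokens, the text signal reaches every resolution of the network rather than only the input.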

The model is trained on the same dataset as DALL-E. The visual part uses the ADM architecture scaled to 512 channels, while the Transformer used for text encoding has 24 residual blocks of width 2048, amounting to roughly 1.2 billion parameters.

Inpainting

The researchers also fine-tuned the model to perform inpainting: random regions of training examples are erased, and the remaining portions are fed into the model, along with a mask channel, as additional conditioning information.
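Concretely, this amounts to building a conditioning input where the erased pixels are blanked out and the mask itself is appended as an extra channel. A minimal sketch of that preprocessing step, under the assumption that the mask marks kept pixels with 1 and erased pixels with 0:

```python
import torch

def make_inpainting_input(x, mask):
    """Build the inpainting conditioning input (sketch).

    x:    (B, 3, H, W) training image
    mask: (B, 1, H, W), 1 where pixels are kept, 0 where erased
    """
    erased = x * mask                        # zero out the region to be filled in
    return torch.cat([erased, mask], dim=1)  # (B, 4, H, W): RGB + mask channel
```

At sampling time the same format lets a user erase any region of a generated image and have the model fill it in consistently with the prompt.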

OpenAI said it “trained noise-aware CLIP models with an image encoder f(x_t, t) that receives noised images x_t and is otherwise trained with the same objective as the original CLIP model.”
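A noise-aware encoder is what makes CLIP guidance possible: at each sampling step, the gradient of the image-caption similarity with respect to the noised image is used to nudge the reverse-process mean towards the caption. A hedged sketch of that gradient computation, where `clip_image_encoder` and `clip_text_emb` are hypothetical stand-ins for the noise-aware encoder and a pre-computed caption embedding:

```python
import torch

def clip_guidance_grad(clip_image_encoder, clip_text_emb, x_t, t, scale=1.0):
    """Gradient used for CLIP guidance of a diffusion sampler (sketch)."""
    x_t = x_t.detach().requires_grad_(True)
    img_emb = clip_image_encoder(x_t, t)   # noise-aware image embedding f(x_t, t)
    sim = (img_emb * clip_text_emb).sum()  # dot-product similarity with the caption
    grad = torch.autograd.grad(sim, x_t)[0]
    return scale * grad  # added to the reverse-process mean at each step
```

This is the alternative guidance strategy that classifier-free guidance ultimately beat in the paper's evaluations.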

GLIDE vs DALL-E

GLIDE was compared against DALL-E using the researchers' human evaluation protocol. As per the paper, three sets of comparisons between DALL-E and GLIDE were done:

  • Both models compared with no CLIP reranking
  • CLIP reranking used only for DALL-E
  • CLIP reranking used for DALL-E, with GLIDE samples additionally projected through the discrete VAE used by DALL-E
Image: GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

The team said that the evaluations were done using two temperatures for the DALL-E model. The results show that GLIDE was preferred by human evaluators in all settings.
