Tech innovator OpenAI has closed out 2021 with a bang: the release of GLIDE (Guided Language to Image Diffusion for Generation and Editing), a new 3.5-billion-parameter text-to-image generation model that outperforms DALL-E. At the beginning of 2021, the company released DALL-E, a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions using a dataset of text-image pairs. For GLIDE, OpenAI has trained a smaller model on a filtered dataset and released the code and weights.
In the accompanying paper, the researchers report that samples generated with classifier-free guidance are both photorealistic and reflect a broad range of world knowledge. Human judges preferred these samples to DALL-E's 87% of the time when evaluated for photorealism, and 69% of the time when evaluated for caption similarity.
CLIP and classifier-free guidance
In the paper, titled “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, the researchers describe training a 3.5-billion-parameter diffusion model that uses a text encoder to condition on natural-language descriptions. They then compared CLIP guidance and classifier-free guidance as ways of steering diffusion models towards text prompts. CLIP (Contrastive Language-Image Pre-training) is a neural network that learns transferable visual models from natural-language supervision. The researchers found that classifier-free guidance yields higher-quality images under both human and automated evaluations.
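Classifier-free guidance mixes the model's conditional and unconditional noise predictions at sampling time. The sketch below shows only that mixing rule; the toy NumPy arrays stand in for real diffusion-model outputs and are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Combine conditional and unconditional noise predictions.

    eps_hat = eps_uncond + s * (eps_cond - eps_uncond)
    With s = 1 this reduces to the conditional prediction; s > 1
    pushes samples further toward the text condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions (stand-ins for the diffusion model's outputs).
eps_cond = np.array([0.5, -0.2])
eps_uncond = np.array([0.1, 0.0])

guided = classifier_free_guidance(eps_cond, eps_uncond, guidance_scale=3.0)
print(guided)  # the caption-favoured direction is amplified
```

Larger guidance scales trade sample diversity for fidelity to the caption, which is why the paper evaluates guidance strength with both human and automated metrics.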
The model also offers editing capabilities alongside zero-shot generation, allowing humans to iteratively refine model samples until they match more complex prompts. The team further fine-tuned the model to perform image inpainting; edits produced by the model match the style and lighting of the surrounding context, including convincing shadows and reflections.
How GLIDE was trained
The paper says the researchers trained a 3.5-billion-parameter text-conditional diffusion model at 64 × 64 resolution, plus a 1.5-billion-parameter text-conditional upsampling diffusion model to increase the resolution to 256 × 256. For CLIP guidance, they trained a noise-aware 64 × 64 ViT-L CLIP model. To condition on text, they encoded the caption into a sequence of K tokens and fed these tokens into a Transformer model. The Transformer's output is used in two ways: the final token embedding replaces the class embedding in the ADM model, and the last layer of token embeddings is separately projected to the dimensionality of each attention layer throughout the ADM model, then concatenated to the attention context at each layer, as per the paper.
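The two uses of the text Transformer's output can be sketched with plain NumPy. The dimensions, random projections, and the stand-in image context below are illustrative assumptions, not the paper's actual widths.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d_text, d_attn = 8, 16, 32  # token count and widths (illustrative only)
token_embeddings = rng.standard_normal((K, d_text))  # Transformer's last layer

# Use 1: the final token embedding stands in for a class embedding in ADM.
class_embedding = token_embeddings[-1]

# Use 2: project all K token embeddings to the attention layer's width
# and concatenate them to that layer's existing attention context.
proj = rng.standard_normal((d_text, d_attn))          # per-layer projection
text_context = token_embeddings @ proj                # (K, d_attn)
image_context = rng.standard_normal((64, d_attn))     # stand-in for image tokens
attn_context = np.concatenate([image_context, text_context], axis=0)

print(class_embedding.shape, attn_context.shape)  # (16,) (72, 32)
```

In the real model this projection and concatenation is repeated at every attention layer, so each layer can attend to the caption tokens directly.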
The model is trained on the same dataset as DALL-E. Its visual component is scaled to 512 channels, while the text-encoding Transformer uses 24 residual blocks of width 2048, accounting for roughly 1.2 billion parameters.
The researchers also fine-tuned the model to perform inpainting: random regions of training examples are erased, and the remaining portions are fed into the model, along with a mask channel, as additional conditioning information.
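The masking step described above can be sketched as follows. The rectangular-erasure scheme, image size, and helper name are illustrative assumptions; only the idea of concatenating the masked image with a mask channel comes from the paper.

```python
import numpy as np

def make_inpainting_input(image, rng):
    """Erase a random rectangular region of `image` and return the
    masked image concatenated with a mask channel (1 = kept, 0 = erased)."""
    h, w, c = image.shape
    mask = np.ones((h, w, 1), dtype=image.dtype)
    y0, x0 = rng.integers(0, h // 2), rng.integers(0, w // 2)
    mask[y0:y0 + h // 4, x0:x0 + w // 4] = 0.0  # erase a quarter-size patch
    masked = image * mask
    # Model input: masked image plus the mask as an extra channel.
    return np.concatenate([masked, mask], axis=-1)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
model_input = make_inpainting_input(image, rng)
print(model_input.shape)  # (64, 64, 4)
```

During fine-tuning the model learns to fill the zeroed region so that it agrees with the visible context.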
OpenAI said it “trained noise-aware CLIP models with an image encoder f(x_t, t) that receives noised images x_t and is otherwise trained with the same objective as the original CLIP model.”
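At sampling time, CLIP guidance perturbs the denoising step along the gradient of the image-caption similarity f(x_t, t) · g(caption). The sketch below uses a toy linear "encoder" so the gradient stays analytic; the real encoders are neural networks, and all names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_emb = 6, 4

# Toy linear stand-in for the noise-aware image encoder: f(x_t, t) = W @ x_t.
W = rng.standard_normal((d_emb, d_img))
caption_embedding = rng.standard_normal(d_emb)  # g(caption), assumed given

def clip_guidance_grad(x_t):
    # Gradient of f(x_t) . g with respect to x_t is W^T g for a linear encoder.
    return W.T @ caption_embedding

x_t = rng.standard_normal(d_img)
scale = 2.0
mu = x_t.copy()  # stand-in for the diffusion model's predicted mean
mu_guided = mu + scale * clip_guidance_grad(x_t)  # nudge toward the caption
```

Training the CLIP encoder on noised images is what makes this gradient meaningful partway through the diffusion process, when x_t is still noisy.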
GLIDE vs DALL-E
GLIDE was compared against DALL-E using the paper's human evaluation protocol. Three sets of comparisons were carried out, as per the paper:
- No CLIP reranking for either model
- CLIP reranking for DALL-E only
- CLIP reranking for DALL-E, with GLIDE samples also projected through the discrete VAE used by DALL-E
The evaluations were run at two sampling temperatures for the DALL-E model. In every setting, human evaluators preferred GLIDE.