Text-to-image generation is having its moment in AI. Just a month back, OpenAI announced DALL·E 2, a successor to its original DALL·E, which could generate highly realistic images from short text prompts. DALL·E 2 was a breakthrough among text-to-image generators: it could draw implausible-sounding scenes in a range of different styles, like a photograph or a painting. The tool stretched the limits of human imagination and could produce a scene within seconds. But it appears that DALL·E 2 will have to pass the baton to Google AI's Imagen, which was released earlier this week.
Google Research's Brain Team published a paper titled 'Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding' along with the announcement. The team tested the model against a number of other text-to-image competitors, including DALL·E 2, GLIDE and VQ-GAN+CLIP. Demonstrating higher photorealism, Imagen outclassed its rivals by a mile.
Imagen uses a frozen T5-XXL encoder to map the input text into embeddings. These embeddings condition a diffusion model that generates a 64×64 image, which then flows through two super-resolution diffusion models that upsample it to 256×256 and finally 1024×1024.
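As a rough illustration of that flow, the sketch below chains the three stages together in Python. The objects and method names (`text_encoder.encode`, `base_model.sample` and so on) are hypothetical placeholders standing in for the actual models, not Imagen's real API; only the data flow follows the paper.

```python
# Illustrative sketch of Imagen's cascaded text-to-image pipeline.
# All objects and method names are hypothetical placeholders; only the
# data flow (text -> 64x64 -> 256x256 -> 1024x1024) follows the paper.

def generate_image(prompt, text_encoder, base_model, sr_256, sr_1024):
    # 1. A frozen T5-XXL encoder maps the prompt to text embeddings.
    text_embeddings = text_encoder.encode(prompt)

    # 2. A text-conditional diffusion model samples a 64x64 image.
    image_64 = base_model.sample(text_embeddings, resolution=64)

    # 3. Two super-resolution diffusion models, also conditioned on the
    #    text embeddings, upsample the result in two stages.
    image_256 = sr_256.sample(image_64, text_embeddings, resolution=256)
    image_1024 = sr_1024.sample(image_256, text_embeddings, resolution=1024)
    return image_1024
```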
Source: Google AI blog, how Imagen uses the text encoder to produce images

Comparison with GLIDE and DALL·E 2
The study introduced a new set of text prompts, called DrawBench, to evaluate the quality of text-to-image models. The benchmark is comprehensive, probing many aspects of a model, such as compositionality, cardinality and spatial relationships, across 11 categories of prompts. These categories cover, among other things, colours, the number of objects in a scene, text to be rendered in the scene and interactions between objects.
DrawBench also includes prompts that are longer and more creative, or that use rarely seen words, to test how well models handle such commands. This pushes the models' ability to generate imagery that is more imaginative and outlandish.
On scores given by human raters, Imagen outperformed DALL·E 2, GLIDE, VQ-GAN+CLIP and Latent Diffusion by a wide margin, on accounts of both image fidelity and image-text alignment.
Source: Research, Comparison between Imagen and DALL·E 2, GLIDE, VQ-GAN+CLIP and Latent Diffusion on DrawBench
Source: Research, Comparison between Imagen and DALL·E 2, GLIDE etc. on COCO
The models were also tested on the standard benchmark for judging text-to-image models, the COCO evaluation set. The main automated metrics were FID (Fréchet Inception Distance), which measures image fidelity, and the CLIP score, which measures image-text alignment. For both the DrawBench and COCO benchmarks, evaluation also relied on human raters, because FID can't fully capture perceptual quality and CLIP is limited at counting.
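For context, the CLIP score is essentially the cosine similarity between CLIP's embedding of a generated image and its embedding of the prompt, averaged over many samples. A minimal sketch of that similarity, assuming the embeddings have already been produced by a CLIP model:

```python
import numpy as np

def clip_similarity(image_embedding: np.ndarray, text_embedding: np.ndarray) -> float:
    """Cosine similarity between a CLIP image embedding and a CLIP text embedding."""
    image_embedding = image_embedding / np.linalg.norm(image_embedding)
    text_embedding = text_embedding / np.linalg.norm(text_embedding)
    return float(np.dot(image_embedding, text_embedding))
```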
Source: Research, Random Imagen samples generated from prompts across DrawBench categories
For photorealism, the human raters were given a reference image and a model-generated image and asked to choose the more photorealistic of the two; the preference rate is the percentage of times raters picked the model-generated image over the reference. For text-image alignment, raters scored both the reference image and the model-generated image on how well each matched the caption.
While Imagen wasn't trained on COCO, it achieved a state-of-the-art zero-shot FID of 7.27, better than DALL·E 2 and other models that were trained on COCO. The metric, zero-shot FID-30K, pulls 30K prompts at random from the benchmark set; the samples the model generates for these prompts are compared with reference images from the full set. Imagen also achieved a photorealism preference rate of 39.2 per cent against reference images. In scenes with no people, however, Imagen's preference rate went up to 43.6 per cent, which showed that the model's ability is limited when it comes to generating photorealistic images of people. Because the COCO test doesn't offer deep insight into the differences between the models, DrawBench was developed.
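To make sense of the 7.27 figure, recall that FID compares the statistics of Inception-network features extracted from generated and reference images, so lower is better. A minimal sketch of the computation, assuming the feature means and covariances have already been estimated:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(mu_real, sigma_real, mu_gen, sigma_gen) -> float:
    """FID between Gaussians fitted to Inception features of real and generated images."""
    diff = mu_real - mu_gen
    covmean = sqrtm(sigma_real @ sigma_gen)
    # Numerical error can introduce a tiny imaginary component.
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_real + sigma_gen - 2.0 * covmean))
```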
How did Imagen beat DALL·E 2?
- There are a number of things Imagen did differently to surpass other text-to-image models, like the remarkable DALL·E 2. Imagen was trained using Google's largest T5 text encoder, T5-XXL, whose encoder has 4.6 billion parameters. The research shows that scaling up the text encoder improves text-image alignment and image fidelity to a great extent; in fact, it finds that scaling the pretrained text encoder matters far more than scaling the diffusion model. Scaling the U-Net diffusion model still improves sample quality, but a bigger text encoder has a greater overall impact.
- The study also introduced dynamic thresholding, a new diffusion sampling technique applied at each sampling step to prevent pixels from saturating. The method makes images look more photorealistic, especially when samples are drawn with large classifier-free guidance weights (see the first sketch after this list).
- Imagen also used another diffusion technique, noise conditioning augmentation, which makes the super-resolution models aware of how much noise has been added to their conditioning images and consequently more robust. The technique led to better image fidelity and contributed to Imagen's strong FID and CLIP scores (a second sketch follows below).
- Imagen employed the U-Net architecture for the base 64×64 diffusion model and modified it in several ways to make it more efficient. The modified U-Net uses less memory, converges faster and yields better sample quality with faster inference.
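The idea behind dynamic thresholding, mentioned in the list above, is simple enough to sketch. The NumPy version below follows the description in the paper: at each sampling step the predicted clean image is clipped to a percentile of its own absolute pixel values and rescaled, so large guidance weights don't push pixels into saturation. The default percentile here is an assumption for illustration, not a prescription.

```python
import numpy as np

def dynamic_threshold(x0_pred: np.ndarray, percentile: float = 99.5) -> np.ndarray:
    """Dynamic thresholding applied to the predicted clean image at a sampling step."""
    # s is a chosen percentile of the absolute pixel values.
    s = np.percentile(np.abs(x0_pred), percentile)
    # Only act when pixels overshoot the nominal [-1, 1] range.
    s = max(s, 1.0)
    # Clip to [-s, s] and rescale, pushing saturated pixels inward
    # instead of letting them pile up at the boundary.
    return np.clip(x0_pred, -s, s) / s
```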
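Noise conditioning augmentation can be sketched in a similar spirit. The version below simply corrupts the low-resolution conditioning image with Gaussian noise at a randomly sampled level and returns that level so it can be fed to the super-resolution model as an extra signal; the simple variance-preserving blend used here is an assumption and may differ from the paper's exact schedule.

```python
import numpy as np

def noise_condition_augment(low_res_image: np.ndarray, rng: np.random.Generator):
    """Corrupt a low-res conditioning image and report how much noise was added."""
    # Sample an augmentation level in [0, 1); 0 means no corruption.
    aug_level = rng.uniform(0.0, 1.0)
    noise = rng.standard_normal(low_res_image.shape)
    # Variance-preserving mix of signal and noise (illustrative only).
    noisy = np.sqrt(1.0 - aug_level) * low_res_image + np.sqrt(aug_level) * noise
    # The model is conditioned on aug_level, so it knows the corruption strength.
    return noisy, aug_level
```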
However, just like DALL·E 2, Imagen has not been released to the public, and with good reason. The study noted that the risks associated with biases in large language models still remain, and that training text-to-image generators on large web-scraped datasets reproduces dangerous stereotypes. Thomas Wolf, co-founder and Chief Science Officer at Hugging Face, commented in this regard, pointing out how this hinders research in the text-to-image area. Wolf suggested that it was possible, and even preferable, for the datasets to be made public so they could be improved through a collective effort.