After DALL.E 2 gained massive popularity not just among the tech community but also among artists, students, and other hobbyists, it was clear that text-to-image generators are the real deal. The positive response prompted others to develop their own versions of such tools, the best-known examples being Midjourney and DALL.E Mini (now called Craiyon). Google, one of the leading companies in AI research, also released its own text-to-image generation tool, Imagen. It received rave reviews at launch; today, however, Imagen's popularity pales in comparison to that of the tools mentioned above.
When Imagen proved to be better
Imagen was introduced by Google as a text-to-image diffusion model ‘with an unprecedented degree of photorealism’ and a ‘deep level of language understanding.’ For this tool, Google’s team used a generic language model, such as T5, that is pretrained on text-only corpora. This approach proved effective at encoding text for image synthesis. The team also found that increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment more than increasing the size of the image diffusion model.
Imagen demonstrated superior results. It achieved a state-of-the-art FID score of 7.27 on the COCO dataset without being trained on COCO. Google claims that human raters found the Imagen samples to be on par with the COCO data in image-text alignment. Google also announced DrawBench, a benchmark for text-to-image models. Using this newly introduced benchmark, the team compared Imagen with recent methods such as DALL.E 2, Latent Diffusion Models, and VQ-GAN+CLIP. Imagen outperformed the other models in terms of sample quality and image-text alignment.
Other tools are accessible
DALL.E 2’s popularity seems unsurpassable. This tool from OpenAI definitely had the first-mover advantage. Its predecessor, DALL.E, was introduced in 2021, when text-to-image generation was a relatively untouched field. In July this year, OpenAI released DALL.E in Beta. With this, users can buy additional DALL.E credits at USD 15 for 460 images, over and above the monthly free credits. OpenAI has also granted users full usage rights to commercialise the images that they create with DALL.E, including the right to reprint, sell and merchandise.
Another popular text-to-image generation tool that has created a lot of positive buzz is Stable Diffusion. It is available for use via a web interface: a user just needs to log in and start generating images from text prompts. It is similar to DALL.E but has additional options to fine-tune the outcome. Stable Diffusion can be run locally on the user's system or in the cloud; it is expected to be released on GitHub in the coming days.
Another popular tool is Midjourney, created by a research company of the same name. The tool can be used on the company's Discord channel, but the number of free images is limited to 25. Once you surpass this limit, you are required to pay USD 10 per month for 200 images or take a standard membership at USD 30 per month for unlimited use. Midjourney also allows corporate use of the generated images under a special enterprise membership.
Google is being cautious
In the blog post announcing Imagen, Google devoted a section to the ethical challenges facing this tool. The team notes that open-sourcing the code and demos carries potential risks of misuse. “At this time we have decided not to release code or a public demo. In future work we will explore a framework for responsible externalisation that balances the value of external auditing with the risks of unrestricted open-access,” the blog noted.
The team also admitted that the model's large data requirements led them to rely on web-scraped datasets, which were mostly uncurated. This approach helps drive algorithmic advances but also perpetuates social stereotypes and harmful associations, especially towards marginalised groups. For Imagen, Google utilised the LAION-400M dataset, an open and freely accessible dataset that contains large portions of uncurated data. In fact, the dataset's official website notes that it was developed for research purposes and is ‘not meant for real-world production or application.’
According to Google, for the reasons mentioned above, Imagen carries a risk of furthering harmful stereotypes and representations, which makes it unsuitable for public release until strong safeguards are in place.