NVIDIA’s Text-to-Image Model eDiffi Completes the Picture

Unlike conventional diffusion models, eDiffi trains an ensemble of expert denoisers, each specialised for a different stage of the denoising process.

AI text-to-image generators have become as commonplace as opinions: just as everyone has an opinion, every tech company worth its salt has its own text-to-image generator. Microsoft-backed OpenAI has ‘DALL·E 2’, Google has ‘Imagen’ and Meta has ‘Make-A-Scene’, while buzzy startups like Emad Mostaque’s Stability AI have ‘Stable Diffusion’. Now, US semiconductor design giant NVIDIA has entered the mix with its text-to-image model called ‘ensemble diffusion for images’, or ‘eDiffi’. However, unlike the open-source Stable Diffusion and the publicly accessible DALL·E 2, eDiffi isn’t open to the public.

Some old, some new

Diffusion models synthesise images through an iterative denoising process that gradually turns random noise into an image. Traditionally, a single model is trained to denoise across the entire range of noise levels. eDiffi instead trains a group of expert denoisers, each specialised for a different interval of the denoising process. Along with the announcement, NVIDIA released a research paper titled ‘eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers’, which claimed that this simplified the sampling process.

Denoising involves solving a reverse differential equation, during which a denoising network is called many times. NVIDIA wanted the model to be easily scalable, which is hard when a larger network makes every one of those calls more expensive at test time. The study found that eDiffi could scale up without increasing the test-time computational complexity of sampling, because only one expert denoiser runs at each step.
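The routing idea can be illustrated with a toy sketch. It assumes two hypothetical experts that split the timestep range at 0.5 and a made-up scalar in place of an image; the names `Expert`, `pick_expert` and `sample`, the interval boundaries and the update rule are illustrative choices, not NVIDIA's implementation. What it shows is that exactly one expert is called per step, so the number of network calls is the same as for a single-model sampler.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Expert:
    name: str
    t_low: float   # lowest timestep this expert handles (inclusive)
    t_high: float  # highest timestep this expert handles (exclusive)
    denoise: Callable[[float, float], float]  # (noisy value, timestep) -> clean estimate


def pick_expert(experts: List[Expert], t: float) -> Expert:
    """Return the expert whose interval contains timestep t (t in [0, 1])."""
    for expert in experts:
        if expert.t_low <= t < expert.t_high:
            return expert
    return experts[-1]  # t == 1.0 belongs to the last, highest-noise expert


def sample(experts: List[Expert], x_start: float, num_steps: int) -> Tuple[float, List[str]]:
    """Toy sampler: walk from t = 1 towards t = 0, calling one expert per step."""
    x = x_start
    calls = []
    for steps_left in range(num_steps, 0, -1):
        t = steps_left / num_steps
        expert = pick_expert(experts, t)
        estimate = expert.denoise(x, t)
        # Move a fraction of the way towards the estimate; the last step lands on it.
        x = x + (estimate - x) / steps_left
        calls.append(expert.name)
    return x, calls


# Hypothetical experts, listed from low noise to high noise.
EXPERTS = [
    Expert("low-noise", 0.0, 0.5, lambda x, t: 3.14),   # stands in for fine-detail refinement
    Expert("high-noise", 0.5, 1.0, lambda x, t: 3.0),   # stands in for coarse structure
]

final_value, calls = sample(EXPERTS, x_start=10.0, num_steps=4)
print(calls)        # one expert name per sampling step
print(final_value)
```

Adding more experts to `EXPERTS` increases the total capacity of the ensemble, but `sample` still makes `num_steps` denoiser calls.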



Which model is the best?

The paper concluded that eDiffi outperformed competitors like DALL·E 2, Make-A-Scene, GLIDE and Stable Diffusion on the Fréchet Inception Distance, or FID, a metric for the quality of AI-generated images in which a lower score is better. eDiffi’s FID score was also slightly better than those of Google’s Imagen and Parti. However, while each new model seems to better the previous one in accuracy and quality, it must be noted that researchers cherry-pick examples to showcase their best illustrations.
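For readers unfamiliar with FID, the sketch below shows the underlying distance in a deliberately simplified one-dimensional form. Real FID fits Gaussians to Inception-v3 feature vectors and computes ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^½); with a single feature this reduces to (μ_r − μ_g)² + (σ_r − σ_g)². The function name `frechet_distance_1d` and the sample values are made up for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def frechet_distance_1d(real: Sequence[float], generated: Sequence[float]) -> float:
    """Squared Fréchet distance between Gaussians fitted to two 1-D samples."""
    mu_r, mu_g = mean(real), mean(generated)
    sigma_r, sigma_g = stdev(real), stdev(generated)
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2


# Made-up scalar "features" standing in for Inception activations.
real_features = [1.0, 2.0, 3.0, 4.0]
close_model = [1.5, 2.5, 3.5, 4.5]   # statistics near the real data
far_model = [3.0, 4.0, 5.0, 6.0]     # statistics further from the real data

print(frechet_distance_1d(real_features, close_model))
print(frechet_distance_1d(real_features, far_model))
```

The closer model gets the smaller distance, which is why a lower FID indicates generated images whose feature statistics sit nearer to those of real images.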


The model’s best configuration was then compared with DALL·E 2 and Stable Diffusion, both of which are publicly available text-to-image generative models. The comparison found that the other two models mixed up attributes between entities and ignored some attributes altogether, whereas eDiffi correctly modelled the attributes of every entity.

(Image comparison: the first image was generated by Stable Diffusion, the second by DALL·E 2 and the third by eDiffi.)

When it came to rendering text inside images, which has been a sticking point for most text-to-image generators, both Stable Diffusion and DALL·E 2 tended to misspell or even ignore words, while eDiffi generated the text accurately.

With long descriptions, eDiffi was also shown to handle long-range dependencies much better than DALL·E 2 and Stable Diffusion, which suggests that it retains details from earlier in a prompt better than the other two.

New features added

NVIDIA’s eDiffi feeds its text-to-image model with inputs from several pretrained text encoders. It combines the CLIP text encoder, which is trained to align the embedding of a text with the embedding of its matching image, with the T5 text encoder, which is trained for language modelling. While DALL·E 2 uses only CLIP and Imagen uses only T5, eDiffi uses both encoders in the same model.

This enables eDiffi to produce substantially different images from the same text input, depending on which embeddings are used. CLIP text embeddings lend a stylised look to the generated images, but the output often misses details in the text. T5 text embeddings, on the other hand, yield better individual objects rather than a style. By using the two together, eDiffi was able to produce images with both qualities.
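A rough sketch of how two sets of text embeddings could condition one model is given below. It is an assumption for illustration, not the architecture from the paper: `toy_encode` is a made-up stand-in for the real CLIP and T5 encoders, both are given the same tiny width, and the two token sequences are simply concatenated so a denoiser could attend to both. The `use_clip` and `use_t5` flags zero out one encoder to mimic the CLIP-only and T5-only settings.

```python
from typing import List

Vector = List[float]
WIDTH = 4  # toy embedding width shared by both stand-in encoders


def toy_encode(prompt: str, seed: int) -> List[Vector]:
    """Hypothetical stand-in for a pretrained text encoder: one toy vector per word."""
    vectors = []
    for word in prompt.split():
        code = sum(ord(ch) for ch in word) + seed
        vectors.append([((code * (k + 1)) % 97) / 97.0 for k in range(WIDTH)])
    return vectors


def build_conditioning(prompt: str, use_clip: bool = True, use_t5: bool = True) -> List[Vector]:
    """Return the token sequence a denoiser would attend to for this prompt."""
    clip_tokens = toy_encode(prompt, seed=1)  # placeholder for CLIP text embeddings
    t5_tokens = toy_encode(prompt, seed=2)    # placeholder for T5 text embeddings
    if not use_clip:
        clip_tokens = [[0.0] * WIDTH for _ in clip_tokens]
    if not use_t5:
        t5_tokens = [[0.0] * WIDTH for _ in t5_tokens]
    # Concatenate along the sequence axis: CLIP tokens first, then T5 tokens.
    return clip_tokens + t5_tokens


both = build_conditioning("a red cube")
clip_only = build_conditioning("a red cube", use_t5=False)
print(len(both), len(clip_only))
```

Keeping the sequence length fixed and zeroing the unused half means the same denoiser can be run in all three settings without changing its input shape.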

The model was also tested on standard datasets. On MS-COCO, CLIP+T5 embeddings led to much better trade-off curves than either used individually. On the Visual Genome dataset, T5 embeddings alone performed better than CLIP embeddings alone. The study found that the more descriptive the text prompt, the more T5 outperforms CLIP. Overall, however, a blend of the two worked best.

eDiffi also offers what the paper calls ‘style transfer’. CLIP image embeddings are first extracted from a reference image and used as a style reference vector. When style conditioning is enabled, the model generates an image that follows both the reference style and the caption. When style conditioning is disabled, it generates images in a natural style.

The study also compared images generated solely from CLIP text embeddings with images generated solely from T5 text embeddings. The former often contained the correct objects in the foreground but with blurry fine-grained details, while the latter sometimes showed incorrect objects.

eDiffi also introduced a feature called ‘Paint with Words’, which lets users control where objects appear in the image. Users select a phrase in the text prompt and scribble on the canvas to mark where that object should go. The model then produces an image that matches both the input map or sketch and the caption.
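The article does not spell out how the scribbles reach the model. One way to get this behaviour, assumed here purely for illustration, is to add a bias to the cross-attention scores so that image locations inside a painted region attend more strongly to the selected phrase. The function names, the toy scores and the `weight` value below are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores: List[List[float]],
                      mask: List[List[float]],
                      weight: float) -> List[List[float]]:
    """Bias attention towards painted tokens.

    scores[i][j]: raw attention score between image location i and prompt token j.
    mask[i][j]:   1.0 if the user painted token j over location i, else 0.0.
    """
    return [
        softmax([s + weight * m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]


# Two image locations, three prompt tokens, all raw scores equal.
raw_scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# The user scribbled token 1 over location 0 only.
paint_mask = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

weights = attention_weights(raw_scores, paint_mask, weight=2.0)
print(weights[0])  # location 0 now favours token 1
print(weights[1])  # location 1 is unchanged
```

A larger `weight` pulls the image more firmly towards the sketch, while a weight of zero falls back to ordinary text-only attention.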


Poulomi Chatterjee
Poulomi is a Technology Journalist with Analytics India Magazine. Her fascination with tech and eagerness to dive into new areas led her to the dynamic world of AI and data analytics.
