DALL-E 2, Midjourney, and Stable Diffusion, the beasts of generative AI, were the highlights of 2022. Input a text prompt, and the models generate the desired art within minutes, if not seconds. It is safe to say that they are still touted as among the greatest AI breakthroughs of recent times.
These text-to-image generative models rely on diffusion, a technique built on probabilistic estimation. For image generation, this means adding noise to an image and then denoising it, applying different parameters along the way to guide and mould the output. Models built this way are called ‘denoising diffusion models’.
The concept of generating images using diffusion models originates in physics, more specifically non-equilibrium thermodynamics, which deals with systems that have not yet settled into equilibrium, such as a fluid or gas that is still spreading out. Let’s look at how researchers drew the inspiration and technique for image generation from something outside machine learning.
Uniformity of Noise
To begin with an example: if we put a small drop of red paint in a glass of water, it initially looks like a red blob. Gradually, the drop spreads until the whole glass of water takes on a pale red tint.
In the probabilistic estimation method, if you want to estimate the probability of finding a molecule of red paint anywhere in the glass, you have to start from the moment the paint first touches the water and begins spreading. The distribution at that stage is complex and very hard to track. But once the colour has spread completely, the water is uniformly pale red. This uniform distribution is comparatively easy to describe with a mathematical expression.
Non-equilibrium thermodynamics can describe each step of this spreading process, and because every step is small, the process can be reversed step by step to return to the original complex state: the pale red glass of water turns back into clear water with a drop of red paint.
In 2015, Jascha Sohl-Dickstein applied this principle of diffusion from physics to generative modelling. Diffusion methods for generating images start by taking the training data, a set of complex images (the drop of red paint), and turning it into noise (the pale red glass of water). Then, the machine is trained to reverse the process and convert the noise back into images.
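The paint analogy can be put into numbers with a minimal Python sketch. This is only an illustration under stated assumptions, not something taken from the research: the glass is reduced to a row of 10 cells, each paint molecule randomly moves one cell left, one cell right, or stays put at every step, and the function names are invented for this example.

```python
import random


def simulate_drop(num_particles=1000, width=10, steps=0, seed=0):
    """Random-walk a drop of 'paint' and return how many particles sit in each cell."""
    rng = random.Random(seed)
    # Every particle starts in the same cell: the concentrated drop.
    positions = [width // 2] * num_particles
    for _ in range(steps):
        for i in range(num_particles):
            candidate = positions[i] + rng.choice((-1, 0, 1))
            # The glass has walls: a particle that would leave stays where it is.
            if 0 <= candidate < width:
                positions[i] = candidate
    counts = [0] * width
    for position in positions:
        counts[position] += 1
    return counts


def distance_from_uniform(counts):
    """Total absolute deviation from a perfectly even spread, as a fraction of all particles."""
    total = sum(counts)
    even_share = total / len(counts)
    return sum(abs(count - even_share) for count in counts) / total


if __name__ == "__main__":
    for steps in (0, 10, 500):
        counts = simulate_drop(steps=steps)
        print(steps, counts, round(distance_from_uniform(counts), 3))
```

At step 0 all the paint sits in one cell, and after a few hundred steps the counts are close to even across the glass. That near-uniform end state is the simple distribution the article refers to.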
You can read the paper: Deep Unsupervised Learning using Nonequilibrium Thermodynamics
In his work, Sohl-Dickstein explains the process of creating the model. The algorithm picks an image from the training dataset and adds noise to it, step by step. Each pixel of the image has a value, so an image with a million pixels is a single point in a million-dimensional space. As noise is added, each pixel value drifts away from its value in the original image. Follow this for all the images in the dataset, and the space becomes a simple box of noise. This process of converting images into a box of noise is the forward process.
Turning this into a generative model is where the neural network comes in. Feed the noisy images to the network and train it to predict the slightly less noisy images that came one step earlier. Along the way, the model’s parameters are adjusted so that it eventually turns the noise into an image resembling the original complex input data.
The final trained network does not need any more input data: it starts from a sample of the simple noise distribution and turns it directly into an image that resembles the training dataset.
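The forward process can be sketched in a few lines of Python. The sketch rests on assumptions the article does not spell out: the noise is Gaussian, the amount added per step follows a linear schedule from 0.0001 to 0.02 over 1,000 steps (the values used in the DDPM paper discussed later), and the image is a short list of pixel values. All function names are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variance added at each step (assumed linear schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal(betas):
    """Fraction of the original image's variance that survives up to each step."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar_t, rng):
    """Jump straight to a given step: shrink the image and mix in Gaussian noise."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * p + noise_scale * rng.gauss(0.0, 1.0) for p in pixels]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_signal(linear_beta_schedule(1000))
    image = [0.1, 0.5, 0.9]  # a toy three-pixel "image"
    for t in (0, 250, 999):
        print(t, round(alpha_bars[t], 5), add_noise(image, alpha_bars[t], rng))
```

Because a sum of Gaussian steps is itself Gaussian, the noisy image at any step can be drawn in one shot instead of looping through every earlier step. By the last step almost none of the original signal is left, which is the "box of noise".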
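The reverse process is harder to show without a trained network, so the sketch below cheats in a clearly labelled way. It assumes the "images" are single numbers drawn from a bell curve with a known mean and spread, in which case the best possible noise prediction can be written as a formula and used in place of the neural network. The update rule is the DDPM-style one from the 2020 paper discussed later in this article, not necessarily Sohl-Dickstein's original formulation, and all names are hypothetical.

```python
import math
import random


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule plus the cumulative surviving signal."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def ideal_noise_prediction(x_t, alpha_bar_t, data_mean, data_std):
    """Stand-in for the trained network: expected noise in x_t for Gaussian data."""
    marginal_var = alpha_bar_t * data_std ** 2 + (1.0 - alpha_bar_t)
    centred = x_t - math.sqrt(alpha_bar_t) * data_mean
    return math.sqrt(1.0 - alpha_bar_t) * centred / marginal_var


def generate_sample(data_mean, data_std, betas, alpha_bars, rng):
    """Start from pure noise and denoise one small step at a time."""
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(len(betas))):
        predicted_noise = ideal_noise_prediction(x, alpha_bars[t], data_mean, data_std)
        # Remove the predicted noise for this step and undo the shrinkage.
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * predicted_noise)
        mean /= math.sqrt(1.0 - betas[t])
        # Every step except the last adds a little fresh randomness back.
        fresh_noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * fresh_noise
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    betas, alpha_bars = make_schedule()
    samples = [generate_sample(2.0, 0.5, betas, alpha_bars, rng) for _ in range(200)]
    print(sum(samples) / len(samples))
```

No training data is touched during generation: each sample starts as a random number and ends up distributed roughly like the data the predictor was built for. In a real model, `ideal_noise_prediction` would be replaced by a neural network trained on images.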
Story Behind Diffusion
These diffusion models could generate images but were still miles behind GANs in quality and speed. There was still a lot of work to be done to reach the likes of DALL-E.
In 2019, Yang Song, a doctoral student at Stanford who had no knowledge of Sohl-Dickstein’s work, published a paper in which he generated images by estimating the gradient of the data distribution rather than the probability distribution itself. The technique worked by adding noise to each image in the dataset and then predicting the original image through gradients of the distribution. The image quality his method achieved was several times better than that of earlier methods, but generation was painfully slow.
In 2020, Jonathan Ho, a PhD graduate of the University of California, Berkeley, was working on diffusion models and came across both Sohl-Dickstein’s and Song’s research papers. Because of his interest in the field, he kept working on diffusion models even after completing his doctorate, and thought that combining the two methods with the advances made in neural networks over the years would do the trick.
To his delight, it worked! The same year, Ho published a paper titled “Denoising Diffusion Probabilistic Models”, commonly referred to as ‘DDPM’. The method matched or surpassed previous image generation techniques, including GANs, in image quality. It laid the foundation for generative models like DALL-E, Stable Diffusion, and Midjourney.
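To give a feel for generating samples from gradients alone, here is a minimal Python sketch. The article does not name the sampling procedure, so this assumes Langevin dynamics, where each step nudges a random starting point uphill along the gradient of the log-probability and adds a little noise. In place of a neural network that estimates the gradient, it uses the exact gradient for a one-dimensional bell curve. The function names and step settings are hypothetical.

```python
import math
import random


def gaussian_score(x, mean, std):
    """Gradient of the log-density of a bell curve: points back toward the mean."""
    return -(x - mean) / std ** 2


def langevin_sample(score_fn, rng, num_steps=500, step_size=0.01):
    """Follow the gradient with added noise, starting from an arbitrary point."""
    x = rng.uniform(-3.0, 3.0)
    for _ in range(num_steps):
        drift = 0.5 * step_size * score_fn(x)
        jitter = math.sqrt(step_size) * rng.gauss(0.0, 1.0)
        x = x + drift + jitter
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    score = lambda x: gaussian_score(x, mean=2.0, std=0.5)
    samples = [langevin_sample(score, rng) for _ in range(200)]
    print(sum(samples) / len(samples))
```

The sampler never evaluates the probability itself, only its gradient, yet the samples end up clustered around the mean with roughly the right spread. The hundreds of small steps needed per sample also hint at why the approach was slow.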
The Missing Ingredient
Now that we had models that could generate images, linking them to text commands was the next important step, and it is the defining feature of modern generative models.
Large Language Models (LLMs) were also on the rise around the same time, with BERT, GPT-3, and many others doing for text what GANs and diffusion models do for images.
In 2021, Ho and his colleague Tim Salimans of Google Research combined LLMs with image-generating diffusion models. This was possible because LLMs are also generative models, trained on text from the internet instead of images, that predict words by learning a probability distribution. The combination was achieved through guided diffusion, which means steering the diffusion process with text encoded by LLMs.
These generative models, when guided by LLMs, became the text-to-image models that generate images from text inputs.
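The article does not give a formula for guidance, so the Python sketch below assumes one widely used formulation, classifier-free guidance. At each denoising step the network predicts the noise twice, once with the text prompt and once without, and the final prediction is pushed further in the direction the text suggests. The two predictions here are plain lists standing in for network outputs, and the function and parameter names are hypothetical.

```python
def apply_guidance(noise_without_text, noise_with_text, guidance_scale):
    """Blend two noise predictions, exaggerating the change the text prompt causes.

    guidance_scale = 0 ignores the prompt, 1 uses the text-conditioned
    prediction as is, and values above 1 follow the prompt more strongly.
    """
    return [
        unconditional + guidance_scale * (conditional - unconditional)
        for unconditional, conditional in zip(noise_without_text, noise_with_text)
    ]


if __name__ == "__main__":
    without_text = [0.0, 1.0]  # toy prediction with no prompt
    with_text = [1.0, 3.0]     # toy prediction given the prompt
    print(apply_guidance(without_text, with_text, 2.0))  # [2.0, 5.0]
```

The guided prediction then takes the place of the plain noise prediction in each denoising step. Raising the scale generally trades variety for closer adherence to the prompt.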