The Physics of AI Art: A Look at Diffusion Models

Modern text-to-image art generators are based on a principle from physics, and the story behind it is quite interesting.

DALL-E 2, Midjourney, and Stable Diffusion, the beasts of generative AI, were the highlights of 2022. Input a text prompt, and these models generate the desired art within minutes, if not seconds. It is safe to say that they are still touted as among the greatest AI breakthroughs of recent times.

These text-to-image models are built on diffusion, a technique rooted in probabilistic estimation. For image generation, this means gradually adding noise to an image and then learning to denoise it, applying conditioning along the way to guide and mould the output. Models built this way are called 'denoising diffusion models'.

Read: Diffusion Models: From Art to State-of-the-art


The concept of generating images using diffusion models originates from the world of physics, more specifically non-equilibrium thermodynamics, which studies how systems such as fluids and gases spread out and exchange energy when they are not in equilibrium. Let's look at how researchers drew the inspiration and technique for image generation from a field outside machine learning.

Uniformity of Noise

To begin with an example: if we put a small drop of red paint in a glass of water, it initially looks like a red blob. Eventually, the drop spreads, gradually turning the whole glass of water a pale red.

In the probabilistic estimation method, if you want to estimate the probability of finding a molecule of red paint anywhere in the glass, you would have to sample the distribution from the moment the colour first touches the water and starts spreading. This is a complex state that is very hard to track. But once the colour has spread completely, the water is a uniform pale red. This uniform distribution is comparatively easy to describe with a mathematical expression.

Non-equilibrium thermodynamics can track each step of this spreading and diffusion process, and understand it well enough to reverse it in small steps back to the original complex state: turning the pale red glass of water back into clear water with a single drop of red paint.

In 2015, Jascha Sohl-Dickstein took this principle of diffusion from physics and applied it to generative modelling. Diffusion methods for generating images start with the training data (the drop of red paint), a set of complex images, and turn them into noise (the pale red glass of water). A machine is then trained to reverse the process and convert the noise back into images.


You can read the paper: Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Diffusion Process

In his work, Sohl-Dickstein explains the process of creating the model. The algorithm picks an image from the training dataset and adds noise to it, step by step. Each image, with one value per pixel, is a point in a million-dimensional space, and with every added step of noise that point drifts further from the original image. Follow this for all the images in the dataset, and the space becomes a simple box of noise. This process of converting images into a box of noise is the forward process.
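The forward process can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact setup: the linear noise schedule and step count below are assumed values chosen for clarity.

```python
import numpy as np

# Illustrative linear noise schedule over T steps (assumed, not the paper's values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # noise added at each step
alpha_bar = np.cumprod(1.0 - betas)      # fraction of original signal left at step t

def forward_noise(x0, t, rng=np.random):
    """Sample a noisy version x_t of the clean image x0 at step t.

    As t grows, alpha_bar[t] shrinks toward zero, so x_t approaches
    pure Gaussian noise -- the 'box of noise'."""
    eps = rng.randn(*x0.shape)           # fresh Gaussian noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

x0 = np.random.rand(8, 8)                # stand-in for a tiny greyscale image
xt, eps = forward_noise(x0, T - 1)       # by the final step, almost all noise
```

By the final step, `alpha_bar` is nearly zero, so almost nothing of the original image survives in `xt`.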

Now, to turn this into a generative model, the neural network comes in. Take the box of noise and ask a trained network to predict the image that came one step earlier and had less noise. Along the way, the model's parameters are tuned so that, step by step, the noise turns into an image resembling the original complex input data.
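The training signal behind that tuning can be sketched as: noise an image, have the network guess the exact noise that was added, and penalise the squared error. The `model` below is a stand-in lambda, not a real network, and the schedule is the same illustrative assumption as before.

```python
import numpy as np

def denoising_loss(model, x0, t, alpha_bar, rng=np.random):
    """One training example of the denoising objective: corrupt x0 to
    step t, then score the model on recovering the noise that was added."""
    eps = rng.randn(*x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_pred = model(xt, t)                  # the network's guess at the noise
    return np.mean((eps - eps_pred) ** 2)    # mean squared error

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = np.random.rand(8, 8)
untrained = lambda xt, t: np.zeros_like(xt)  # a model that predicts no noise at all
loss = denoising_loss(untrained, x0, 500, alpha_bar)
```

Training drives this loss toward zero by adjusting the network until its noise predictions match the noise actually injected.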

The final trained network needs no further input data: starting from samples of the noise distribution, it generates images that resemble the training dataset.
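Generation is then just the reverse loop. A sketch of one common sampling scheme (ancestral sampling, with the same assumed linear schedule and a stand-in model):

```python
import numpy as np

def sample(model, shape, betas, rng=np.random):
    """Start from pure Gaussian noise and step backwards to t = 0,
    removing a little of the predicted noise at each step."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.randn(*shape)                    # the initial box of noise
    for t in reversed(range(len(betas))):
        eps_pred = model(x, t)               # predicted noise at this step
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
        if t > 0:                            # re-inject a little noise, except at the end
            x = x + np.sqrt(betas[t]) * rng.randn(*shape)
    return x

betas = np.linspace(1e-4, 0.02, 200)
dummy_model = lambda x, t: np.zeros_like(x)  # stand-in for a trained network
img = sample(dummy_model, (8, 8), betas)
```

With a real trained network in place of `dummy_model`, each pass through the loop peels away one step of noise until an image emerges.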

Story Behind Diffusion

These diffusion models could generate images but were still miles behind GANs in quality and speed. There was a lot of work left to do to reach the likes of DALL-E.


In 2019, Yang Song, a doctoral student at Stanford who had no knowledge of Sohl-Dickstein's work, published a paper in which he generated images using gradient estimation of the data distribution instead of the probability distribution itself. The technique added noise to each image in the dataset and then recovered the original image by following the gradients of the distribution. The image quality his method produced was several times better than earlier approaches, but generation was painfully slow.
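The "gradient of the distribution" here is what the literature calls the score function, the gradient of the log-density with respect to the data. A quick numerical illustration for a 1-D Gaussian, where the score has the closed form -x/σ² (the values of σ and x below are arbitrary):

```python
import numpy as np

sigma = 2.0
# Log-density of a Gaussian N(0, sigma^2).
log_p = lambda x: -0.5 * x**2 / sigma**2 - 0.5 * np.log(2 * np.pi * sigma**2)

x = 1.3
h = 1e-5
# Score via central finite difference vs. the closed form -x / sigma^2.
score_numeric = (log_p(x + h) - log_p(x - h)) / (2 * h)
score_exact = -x / sigma**2
```

Song's score-based models learn this gradient for the image distribution at many noise levels, then follow it to walk noisy samples back toward realistic data.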

In 2020, Jonathan Ho, a PhD graduate from the University of California, was working on diffusion models and came across both Sohl-Dickstein's and Song's research papers. Out of interest in the field, he continued working on diffusion models even after completing his doctorate, reasoning that combining the two methods with the advances neural networks had made over the years would do the trick.


To his delight, it worked. That same year, Ho published a paper titled 'Denoising Diffusion Probabilistic Models', commonly referred to as DDPM. The method surpassed earlier image generation techniques in quality, eventually rivalling even GANs, and laid the foundation for generative models like DALL-E, Stable Diffusion, and Midjourney.

The Missing Ingredient

Now that there were models that could generate images, linking them to text prompts was the next important step, and it is the defining feature of modern-day generative models.

Large language models (LLMs) were also on the rise around the same time, with BERT, GPT-3, and many others doing for text what GANs and diffusion models were doing for images.

In 2021, Ho and his colleague Tim Salimans of Google Research combined language models with image-generating diffusion models. This was possible because LLMs are themselves generative models, trained on text from the internet instead of images, predicting words from a learned probability distribution. The combination was achieved through guided diffusion: steering the denoising process with a representation of the text prompt.
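One widely used form of this steering, the classifier-free guidance that Ho and Salimans introduced, blends the model's noise prediction with and without the text condition. A minimal sketch, with assumed toy values in place of real network outputs:

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions.

    w = 0 ignores the text prompt entirely; larger w pushes the
    sample more strongly toward images matching the prompt."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])   # toy prediction given the text condition
eps_u = np.array([0.0, 0.0])   # toy prediction with no condition
guided = classifier_free_guidance(eps_c, eps_u, w=2.0)
```

The blended prediction is then used in the sampling loop in place of the plain one, so every denoising step nudges the image toward the prompt.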

Guided by text in this way, diffusion models became the text-to-image models we know today, generating images directly from text inputs.

Read: GANs in The Age of Diffusion Models

Mohit Pandey
Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.
