
# The Physics of AI Art: A Look at Diffusion Models

Modern text-to-image art generators are based on a principle of physics, and the story behind them is quite interesting.

DALL-E 2, Midjourney, and Stable Diffusion, the beasts of generative AI, were the highlights of 2022. Input a text prompt, and these models generate the desired art within minutes, if not seconds. It is safe to say that they are still touted as among the greatest AI breakthroughs of recent times.

These text-to-image generative models are built on the diffusion method, which rests on probabilistic estimation. For image generation, this means adding noise to an image and then denoising it, applying parameters along the way to guide and mould the output. Models built this way are called 'denoising diffusion models'.


The concept of generating images with diffusion models originates from the world of physics, more specifically non-equilibrium thermodynamics, which deals with how fluids and gases compress and spread as energy disperses. Let's look at how the researchers got the inspiration and technique for image generation right by understanding something outside of machine learning.

## Uniformity of Noise

To begin with an example: if we put a small drop of red paint in a glass of water, it initially looks like a blob of red in the water. Eventually the drop spreads, gradually turning the whole glass of water pale red.

In the probabilistic estimation method, if you want to estimate the probability of finding a molecule of red paint anywhere in the glass of water, you have to start sampling from the moment the colour first touches the water and begins spreading. This is a complex state that is very hard to track. But once the colour has spread completely, the water turns a uniform pale red. This uniform distribution of colour is comparatively easy to capture with a mathematical expression.

Non-equilibrium thermodynamics can track each step of this spreading and diffusion process and, by understanding it, reverse it in small steps back to the original complex state: turning the pale red glass of water back into clear water with a drop of red paint.

In 2015, Jascha Sohl-Dickstein borrowed this principle of diffusion from physics and applied it to generative modelling. The method starts with the training data, a set of complex images (the drop of red paint), and gradually turns them into noise (the pale red glass of water). The machine is then trained to reverse the process and convert noise back into images.

You can read the paper: Deep Unsupervised Learning using Nonequilibrium Thermodynamics

## Diffusion Process

In his work, Sohl-Dickstein explains the process of building the model. The algorithm picks an image from the training dataset and adds noise to it, step by step. Each image, with one value per pixel, is a point in a million-dimensional space, and with every added noise step the pixels drift further from the original image. Follow this for all the images in the dataset, and the space becomes a simple box of noise. This process of converting images into a box of noise is the forward process.
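The forward process above can be sketched in a few lines of NumPy. This is a hedged illustration, not the paper's code: it uses the closed-form shortcut (popularised by the later DDPM formulation) that jumps straight to any noise step `t`, and the 4x4 "image", seed, and schedule values are made up for the example.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample a noised version of x0 at step t in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
x0 = rng.random((4, 4))                   # a tiny stand-in "image"
betas = np.linspace(1e-4, 0.02, 1000)     # illustrative linear noise schedule

xt = forward_diffuse(x0, t=999, betas=betas, rng=rng)
# After the full schedule, alpha_bar is nearly zero, so x_t is
# essentially pure noise: the "box of noise" from the text.
```

As `t` grows, `alpha_bar` shrinks toward zero, so the contribution of the original image vanishes and only noise remains.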

Turning this into a generative model is where the neural network comes in. Take the box of noise and feed it to the trained machine to predict the image that came one step earlier and had slightly less noise. Along the way, the model is fine-tuned by tweaking parameters so that the noise eventually turns into an image resembling the original complex input data.

The final trained network no longer needs any input data: it can generate images directly from sampled noise, producing results that resemble the training dataset.
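The step-by-step reverse pass described above can be sketched as a sampling loop. This is a minimal sketch under stated assumptions: `eps_model` here is a stand-in that always predicts zero noise, whereas a real system would plug in a trained neural network, and the schedule length and shapes are illustrative.

```python
import numpy as np

def reverse_sample(eps_model, shape, betas, rng):
    """Start from pure noise and walk the diffusion backwards, subtracting
    the model's predicted noise a little at each step."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                   # begin at pure noise
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)                        # network's noise estimate
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])    # mean of the earlier step
        if t > 0:                                    # no fresh noise at the end
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Stand-in "network" that predicts zero noise; a real one would be trained.
rng = np.random.default_rng(0)
sample = reverse_sample(lambda x, t: np.zeros_like(x), (4, 4),
                        np.linspace(1e-4, 0.02, 50), rng)
```

With a trained `eps_model`, each pass through the loop removes a little of the predicted noise, gradually revealing an image.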

## Story Behind Diffusion

These diffusion models were generating images but were still miles behind GANs in quality and speed. There was a lot of work left to be done to reach the likes of DALL-E.

In 2019, Yang Song, a doctoral student at Stanford with no knowledge of Sohl-Dickstein's work, published a paper in which he generated images by estimating the gradient of the data distribution (its 'score') instead of the probability distribution itself. The technique worked by adding noise to each image in the dataset and then recovering the original image by following these gradients. The image quality his method produced far surpassed earlier approaches, but sampling was painfully slow.

In 2020, Jonathan Ho, a PhD graduate from the University of California, was working on diffusion models and came across both Sohl-Dickstein's and Song's research papers. Out of interest in the field, he continued working on diffusion models even after completing his doctorate, reasoning that combining the two methods with the advances in neural networks over the years would do the trick.

To his delight, it worked! The same year, Ho published a paper titled 'Denoising Diffusion Probabilistic Models', commonly referred to as DDPM. The method surpassed all previous image generation techniques, including GANs, in quality and speed, and laid the foundation for generative models like DALL-E, Stable Diffusion, and Midjourney.
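A key part of the DDPM recipe is its strikingly simple training objective: corrupt an image to a random noise step, then score how well the network guesses the injected noise using mean squared error. A minimal NumPy sketch, where `eps_model` is a hypothetical callable standing in for the real trained network:

```python
import numpy as np

def ddpm_loss(eps_model, x0, betas, rng):
    """Simplified DDPM objective: noise x0 to a random step t, then measure
    the squared error between the injected noise and the model's guess."""
    alpha_bars = np.cumprod(1.0 - betas)
    t = int(rng.integers(len(betas)))                 # random noise step
    eps = rng.standard_normal(x0.shape)               # the injected noise
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps_model(xt, t) - eps) ** 2)

rng = np.random.default_rng(0)
loss = ddpm_loss(lambda x, t: np.zeros_like(x),       # untrained stand-in
                 rng.random((4, 4)), np.linspace(1e-4, 0.02, 100), rng)
```

Training repeats this over many images and random steps, nudging the network's parameters to drive the loss down.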

## The Missing Ingredient

Now that we had models that could generate images, linking them to text commands was the next important step, and it is the defining feature of modern-day generative models.

Large Language Models (LLMs) were on the rise around the same time, with BERT, GPT-3, and many others doing for text what GANs and diffusion models were doing for images.

In 2021, Ho and his colleague Tim Salimans of Google Research combined LLMs with image-generating diffusion models. This was possible because LLMs are themselves generative models: trained on text from the internet instead of images, they learn a probability distribution over words. The combination was achieved through guided diffusion, which means steering the diffusion process with the text processed by LLMs.

These generative models, when guided by LLMs, became the text-to-image models that generate images from text inputs.
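One common way this guidance is realised is classifier-free guidance, a technique Ho and Salimans introduced: at each denoising step the model produces one noise estimate conditioned on the text prompt and one without it, and the two are blended. A minimal sketch, with illustrative array values and guidance weight:

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    """Blend the conditional and unconditional noise predictions; the weight
    w controls how strongly the sample is pushed toward the text prompt."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w = 0 ignores the prompt entirely; w = 1 uses the conditional prediction
# as-is; w > 1 exaggerates the prompt's influence.
e_cond = np.full((2, 2), 0.5)
e_uncond = np.zeros((2, 2))
blended = guided_eps(e_cond, e_uncond, w=2.0)   # every entry becomes 1.0
```

The blended estimate then takes the place of the plain noise prediction inside the reverse sampling loop.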

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.
