Google AI has introduced two connected approaches to enhancing image synthesis quality with diffusion models: Super-Resolution via Repeated Refinements (SR3) and a class-conditional synthesis model called Cascaded Diffusion Models (CDM).
Researchers including Jonathan Ho and Chitwan Saharia of Google Research, Brain Team, worked on scaling up diffusion models; with carefully selected data augmentation techniques, they were able to outperform existing approaches, including GANs, for image synthesis.
GANs are the state of the art on most image generation tasks as measured by sample-quality metrics such as FID, Inception Score, and Precision. However, GANs capture less diversity than state-of-the-art likelihood-based models.
Additionally, GANs are difficult to train, collapsing without carefully selected hyperparameters and regularisers. As a result, considerable effort has gone into developing likelihood-based models with GAN-like sample quality. Diffusion models are a type of likelihood-based model (originally proposed in 2015) that has recently been shown to produce high-quality images.
How the models work
SR3 – the first model – is a super-resolution diffusion model that takes a low-resolution image as input and generates a corresponding high-resolution image. First, the model is trained using an image corruption process in which noise is gradually added to a high-resolution image until only pure noise remains. It then learns to reverse this process, starting from pure noise and progressively removing noise until it reaches the target distribution, guided by the supplied low-resolution image.
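The corruption process described above has a well-known closed form in standard diffusion models, which can be sketched in a few lines. This is a minimal illustration of the generic forward process, not Google's actual SR3 code; the noise schedule, image shape, and parameter values are assumptions for the example.

```python
import numpy as np

def forward_noising(x0, t, betas, rng=None):
    """Sample a noised image x_t from the clean image x0 in one step.

    x0    : clean high-resolution image (any array shape)
    t     : diffusion step index (larger t = more noise)
    betas : noise schedule, one variance increment per step
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    alpha_bar = np.cumprod(1.0 - betas)[t]  # cumulative fraction of signal kept
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise

# As t grows, alpha_bar shrinks toward 0, so x_t approaches pure noise;
# the model is trained to predict `noise` from x_t (conditioned, in SR3,
# on the low-resolution input) so the process can be run in reverse.
betas = np.linspace(1e-4, 0.02, 1000)   # illustrative linear schedule
x0 = np.ones((8, 8))                    # stand-in for a real image
x_early, _ = forward_noising(x0, 10, betas)
x_late, _ = forward_noising(x0, 999, betas)
```

Reversal then starts from pure noise and repeatedly subtracts the predicted noise, one step at a time, until a clean sample remains.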
Further, with large-scale training, SR3 achieves strong benchmark results on the super-resolution task for natural images and faces when scaling to resolutions four to eight times that of the input low-resolution image. The model was evaluated against state-of-the-art face super-resolution methods PULSE and FSRGAN: subjects are shown pairs of images and asked to identify which one they think came from a camera. Performance is measured by the confusion rate (the percentage of times subjects choose the model output over the reference image). Results are shown below.
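The confusion-rate metric itself is simple arithmetic; here is a minimal sketch, assuming each trial is recorded as True when the subject picked the model output over the real photo (the trial data below is hypothetical).

```python
def confusion_rate(picked_model_output):
    """Percentage of trials in which subjects chose the model output over
    the real reference photo. 50% means the two are indistinguishable;
    values near 0% mean the model output is easy to spot as fake."""
    return 100.0 * sum(picked_model_output) / len(picked_model_output)

# Hypothetical trial log: True = subject picked the model's image.
trials = [True, False, True, True, False, False, True, False]
rate = confusion_rate(trials)  # 50.0 for this example
```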
Researchers use SR3 models for class-conditional image generation. The second model, CDM, is a class-conditional diffusion model trained on ImageNet to generate high-resolution natural images. The researchers built CDM as a cascade of multiple diffusion models. This method entails chaining together generative models over multiple spatial resolutions: one diffusion model generates data at a low resolution, followed by a series of SR3 super-resolution diffusion models that progressively increase the resolution of the generated image up to the final resolution.
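The cascade can be sketched as a simple pipeline. The stub "models" below (nearest-neighbour upsampling standing in for trained diffusion samplers, a 32×32 base resolution, doubling per stage) are illustrative assumptions, not the actual CDM architecture:

```python
import numpy as np

def cascade_generate(base_model, sr_models, class_label):
    """Class-conditional cascade: one low-resolution generator followed
    by a chain of super-resolution models, each raising the resolution."""
    x = base_model(class_label)   # e.g. a 32x32 class-conditional sample
    for sr in sr_models:          # e.g. 32 -> 64 -> 128 -> 256
        x = sr(x)
    return x

# Stubs standing in for trained diffusion samplers.
base = lambda label: np.zeros((32, 32))
upsample2x = lambda x: np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

sample = cascade_generate(base, [upsample2x, upsample2x, upsample2x],
                          class_label=207)  # hypothetical ImageNet class id
```

The design choice is that each stage only has to solve a smaller problem (one resolution jump), which is easier to train than a single model generating high-resolution images directly.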
In addition to these models, the researchers introduced a new data augmentation technique called conditioning augmentation. It applies Gaussian noise and Gaussian blur to each super-resolution model's lower-resolution conditioning input, preventing the model from overfitting to that input and resulting in higher sample quality for CDM.
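A train-time sketch of conditioning augmentation, assuming a single-channel NumPy image; the kernel radius, blur sigma, and noise level here are illustrative choices, not the paper's settings:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius=3):
    """A normalised 1-D Gaussian kernel for separable blurring."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def augment_conditioning(lowres, blur_sigma=1.0, noise_std=0.1, rng=None):
    """Blur, then add Gaussian noise to, the low-res conditioning image.
    Applied only during training, so the super-resolution model cannot
    overfit to the exact artifacts of the previous stage's outputs."""
    rng = rng if rng is not None else np.random.default_rng(0)
    k = gaussian_kernel1d(blur_sigma)
    # Separable Gaussian blur: convolve rows, then columns.
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, lowres)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)
    return blurred + noise_std * rng.standard_normal(lowres.shape)
```

At sampling time the conditioning input is passed through unmodified; the augmentation exists only to make training robust.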
A separate paper, titled ‘Diffusion Models Beat GANs on Image Synthesis’ by Prafulla Dhariwal and Alex Nichol of OpenAI, has shown that diffusion models can achieve image sample quality superior to that of state-of-the-art generative models, though they still have some limitations.
According to the paper, although diffusion models are an extremely promising direction for generative modelling, they are still slower than GANs at sampling time because they require multiple denoising steps. One promising work in this direction, by Luhman and Luhman, explored distilling the DDIM sampling process into a single-step model. Samples from the single-step model do not yet match GANs, but they are much better than those of previous single-step likelihood-based models. Future work in this direction may completely close the sampling-speed gap between diffusion models and GANs without sacrificing image quality.