Of late, AI artists and AI-generated artwork have been mushrooming. Platforms like Ultraleap-backed ‘Midjourney’, OpenAI’s ‘DALL-E 2’, Meta’s ‘Make-A-Scene’, Hugging Face’s ‘DALL-E Mini’ (now ‘Craiyon🖍’) and others are redefining design and visualisation as we know them. However, most of these platforms are open to users on an invite-only basis.
One free and open-source software (FOSS) tool that recently gained popularity is ‘Disco Diffusion’, a CLIP-guided diffusion model that converts text to images. The user supplies a compilation of words called a ‘prompt’, and CLIP is used to judge how closely each intermediate image matches the look those words describe. The latest version (v5.6) adds a portrait generator.
Created by Somnai and augmented by Gandamu, the diffusion code is hosted as a Google Colab notebook. The model is as flexible as the VQGAN ImageNet and WikiArt models in creating vibrant pieces.
In generative modelling, a diffusion model is a model that learns to remove noise from an image, step by step, until a clean and detailed picture emerges.
Diffusion models were first proposed in 2015. Interest in them has been renewed recently, owing to their training stability and promising sample quality in audio and image generation. They offer potentially favourable results compared to other deep generative models.
Diffusion models work by corrupting the training data with Gaussian noise, gradually removing the details in the data until it becomes pure noise, and then training a neural network to reverse this corruption process. Running the reversed process synthesises data from pure noise by slowly reducing the noise until a clean sample remains.
The process can be interpreted as an ‘optimisation algorithm’ that follows the gradient of the data density to produce likely samples.
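The two ideas above can be sketched in a few lines of Python. This is a toy illustration, not the code of any particular model: the linear noise schedule, its start and end values, and all function names are assumptions chosen for the example. The first part adds Gaussian noise to a handful of "pixel" values in closed form. The second part follows the gradient of a log-density (the score) with a little injected noise, a procedure known as Langevin dynamics. Here the score of a simple Gaussian is written by hand, whereas a real diffusion model trains a neural network to estimate it.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_signal(betas):
    """alpha_bar[t]: share of the original signal variance left after step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: shrink the data and add Gaussian noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in x0]

def langevin_sample(score, x, step_size, num_steps, rng):
    """Move along the gradient of the log-density, plus a little noise."""
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step_size * score(x) + math.sqrt(2.0 * step_size) * noise
    return x

rng = random.Random(0)
alpha_bars = cumulative_signal(linear_beta_schedule(1000))

pixels = [0.9, -0.3, 0.5]                        # made-up "image" values
slightly_noisy = forward_noise(pixels, 10, alpha_bars, rng)
almost_pure_noise = forward_noise(pixels, 999, alpha_bars, rng)

# Toy target: a unit-variance Gaussian centred on 3, whose score is -(x - 3).
sample = langevin_sample(lambda x: -(x - 3.0), x=-10.0,
                         step_size=0.01, num_steps=2000, rng=rng)
```

By the last step almost none of the original signal is left, which is why generation can start from pure noise. The Langevin walk starts far from the target at -10 and drifts toward the region where samples are likely.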
Google’s latest research takes a leap toward resolving the image resolution issue of diffusion models by linking SR3 and CDM. Adding a unique data set and widening the model helps it produce better results than existing models.
SR3 is a super-resolution diffusion model that takes a low-resolution image as input and constructs a corresponding high-resolution image from pure noise. It is trained on an image corruption process in which noise is progressively added to a high-resolution image.
CDM is a class-conditional diffusion model trained on ImageNet data to create high-resolution images. As ImageNet is a highly complex data set, the researchers cascade multiple diffusion models to build CDM.
The researchers mentioned that this method chains together multiple generative models spanning several spatial resolutions: one diffusion model generates low-resolution data, followed by a series of SR3 super-resolution diffusion models that raise the resolution step by step.
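The shape of such a cascade can be sketched as a simple pipeline in which each stage consumes the previous stage's output. Everything in this sketch is a stand-in: `sample_base` returns random values rather than running a trained low-resolution model, `upsample_2x` is plain nearest-neighbour upsampling rather than a learned SR3 stage, and the sizes are arbitrary. Only the control flow is meant to reflect the description above.

```python
import random

def sample_base(size, rng):
    """Stand-in for a low-resolution diffusion model: a size x size grid."""
    return [[rng.random() for _ in range(size)] for _ in range(size)]

def upsample_2x(image):
    """Stand-in for one super-resolution stage (nearest-neighbour, not learned)."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def run_cascade(base_image, stages):
    """Feed each stage's output into the next and record every resolution."""
    image = base_image
    sizes = [len(image)]
    for stage in stages:
        image = stage(image)
        sizes.append(len(image))
    return image, sizes

rng = random.Random(0)
final_image, sizes = run_cascade(sample_base(8, rng),
                                 [upsample_2x, upsample_2x, upsample_2x])
print(sizes)
```

Splitting the work this way lets the first model concentrate on overall composition at a small size, while later stages only have to add detail.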
The samples generated by CDM are evaluated using the Fréchet Inception Distance (FID) score and the classification accuracy score, both of which measure the quality of the images the model creates.
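For intuition, FID compares the statistics of real and generated images: it fits a Gaussian to the feature vectors of each set and measures the Fréchet distance between the two Gaussians, so lower is better. The real metric uses high-dimensional features from an Inception network. The sketch below is a one-dimensional special case, in which the distance reduces to the squared difference of the means plus the squared difference of the standard deviations. The function name and the sample values are made up for illustration.

```python
import statistics

def frechet_distance_1d(real, generated):
    """Fréchet distance between two 1-D Gaussians fitted to the samples."""
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    sd_r, sd_g = statistics.pstdev(real), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2

real_features = [0.0, 1.0, 2.0, 3.0]        # made-up feature values
shifted_features = [2.0, 3.0, 4.0, 5.0]     # same spread, mean moved by 2

same = frechet_distance_1d(real_features, real_features)
shifted = frechet_distance_1d(real_features, shifted_features)
```

Identical sets score zero, and the score grows as the generated statistics drift away from the real ones.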
Overall, the ultra-high-resolution images generated by SR3 surpass those of GANs in human evaluation. Moreover, CDM’s samples greatly exceed the current top methods, BigGAN-deep and VQ-VAE-2, on both FID score and classification accuracy score.
With SR3 and CDM, the performance of diffusion models has been pushed to the state-of-the-art on super-resolution and class-conditional ImageNet generation benchmarks.
The process of creating paintings by ‘Disco Diffusion’ can be broadly divided into the following steps:
- Open the program
- Set parameters such as the image size, the number of process maps, and the number of generated images
- Write crisp prompts in English, start the run, and then wait for the AI to compute and produce the painting
The generated pieces can be located in the user’s ‘Google Drive’.
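As a rough idea of what the steps above look like in the notebook, here is a sketch of a settings block. The setting names (`text_prompts`, `width_height`, `steps`, `n_batches`, `batch_name`) are written from memory of the Disco Diffusion notebook and may differ between versions, and the values are arbitrary examples. The `snap_to_multiple` helper is hypothetical. It reflects the assumption that image dimensions should be multiples of 64.

```python
# Setting names recalled from the Disco Diffusion notebook; treat them as
# assumptions and check them against the version you are running.
settings = {
    "text_prompts": {
        0: ["A lighthouse on a cliff at sunset, oil painting",
            "warm colour scheme"],
    },
    "width_height": [1280, 768],   # image size in pixels
    "steps": 250,                  # number of denoising steps per image
    "n_batches": 4,                # number of images to generate
    "batch_name": "lighthouse_test",
}

def snap_to_multiple(width_height, multiple=64):
    """Hypothetical helper: round each side down to a multiple of `multiple`."""
    return [(side // multiple) * multiple for side in width_height]

settings["width_height"] = snap_to_multiple(settings["width_height"])
```

More steps and larger images generally mean longer runs, so it can help to test a prompt at a small size first.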
Not just pictures
YouTube creator ‘DoodleChaos’ created a full-length music video using Disco Diffusion V5.2 Turbo.
In the description, he explains that he added keyframes for camera motion throughout the generated motion picture and manually synchronised it to the beat.
Furthermore, he specified changes to the art style at different moments in the song. Since many of the lyrics are non-specific, even a human illustrator would have difficulty representing them visually. To make the lyrics easier for the AI to interpret, he rewrote them to be more concrete, for example by specifying a setting.
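Keyframed camera motion of the kind described above can be sketched as follows. The string format `"0: (0), 100: (5)"` mirrors how animation values are understood to be written in the Disco Diffusion notebook, but that is an assumption, and the parser and interpolation here are an independent illustration rather than the notebook's own code. The creator says he synchronised the video to the beat manually, so `beat_to_frame` is a hypothetical helper showing how a beat number could be mapped to a frame number.

```python
import re

def parse_keyframes(spec):
    """Turn a string like '0: (0), 100: (5)' into {0: 0.0, 100: 5.0}."""
    pattern = r"(\d+)\s*:\s*\(\s*(-?\d+(?:\.\d+)?)\s*\)"
    frames = {int(f): float(v) for f, v in re.findall(pattern, spec)}
    if not frames:
        raise ValueError(f"no keyframes found in {spec!r}")
    return frames

def value_at(frames, frame):
    """Linearly interpolate the keyframed value at a given frame number."""
    keys = sorted(frames)
    if frame <= keys[0]:
        return frames[keys[0]]
    if frame >= keys[-1]:
        return frames[keys[-1]]
    for left, right in zip(keys, keys[1:]):
        if left <= frame <= right:
            t = (frame - left) / (right - left)
            return frames[left] + t * (frames[right] - frames[left])

def beat_to_frame(beat, bpm, fps):
    """Hypothetical helper: frame number on which a given beat falls."""
    return round(beat * 60.0 / bpm * fps)

pan = parse_keyframes("0: (0), 100: (5)")   # sideways camera movement
halfway = value_at(pan, 50)
drop_frame = beat_to_frame(beat=8, bpm=120, fps=12)
```

Placing a keyframe on the frame where a beat falls is one way to make a camera move land on the music.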
Useful resources for Diffusion models
Zippy’s Disco Diffusion Cheatsheet v0.3 presents every setting for Disco Diffusion in layman’s terms.
Disco Diffusion Modifiers by weirdwonderfulai.art consists of modifiers, like artist names, which are keywords that guide the image generation in a certain direction.
Disco Diffusion 70+ Artist Studies, also by weirdwonderfulai.art, collects samples of generated art for 600+ artists in one place. The samples were contributed by many people who experimented with generating art and submitted their findings.
Development in the domain
Meta’s recent AI concept, ‘Make-A-Scene’, generates imagery using text plus simple sketching.
“Make-A-Scene empowers people to create images using text prompts and freeform sketches. Prior image-generating AI systems typically used text descriptions as input, but the results could be difficult to predict”, according to Meta.