Top 4 DALL·E alternatives: text-to-image generators


“Creativity is just connecting things,” said Steve Jobs. He was channelling his inner Einstein (coincidentally another Walter Isaacson muse), who coined ‘combinatory play’ to describe the inner workings of creative thought. OpenAI took the hint and built a text-to-image generator, DALL·E.

OpenAI has nearly got creativity down to a science. The astronaut riding a horse in a photorealistic style, or teddy bears mixing sparkling chemicals as mad scientists in the style of a 1990s Saturday-morning cartoon, are good cases in point. The ultra-imaginative DALL·E has become the talk of the town in a short time. Below, we look at similar models making the rounds in the world of AI.


DALL·E

In 2020, OpenAI introduced GPT-3 and, a year later, DALL·E, a 12-billion-parameter model built on GPT-3. DALL·E was trained to generate images from text descriptions, and the latest release, DALL·E 2, generates even more realistic and accurate images at 4x the resolution. The model takes natural-language captions and uses a dataset of text-image pairs to create realistic images. It can also take an image and create different variations inspired by the original.

DALL·E leverages the ‘diffusion’ process to learn the relationship between images and text descriptions. In diffusion, the model starts with a pattern of random dots and gradually alters that pattern towards an image as it recognises aspects of the prompt. Diffusion models have emerged as a promising generative modelling framework and push the state of the art in image and video generation. The guidance technique is used in diffusion to improve sample fidelity and photorealism. DALL·E is made up of two major parts: a discrete autoencoder that represents images in a compressed latent space, and a transformer that learns the correlations between language and this discrete image representation. Evaluators were asked to compare 1,000 image generations from each model, and DALL·E 2 was preferred over DALL·E 1 for its caption matching and photorealism.
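The denoising loop described above can be sketched in toy form. Everything here is illustrative: `toy_denoiser` is a hypothetical stand-in for the learned, text-conditioned denoising network, and a real diffusion model predicts noise over a fixed schedule rather than interpolating towards a known target.

```python
import numpy as np

def toy_denoiser(sample, target, strength=0.1):
    """Hypothetical denoiser: nudge the sample a fraction towards the target.
    A trained model would instead predict and subtract noise, guided by text."""
    return sample + strength * (target - sample)

def reverse_diffusion(target, steps=50, seed=0):
    """Start from pure random noise and repeatedly apply the denoiser."""
    rng = np.random.default_rng(seed)
    sample = rng.standard_normal(target.shape)  # the initial pattern of random dots
    for _ in range(steps):
        sample = toy_denoiser(sample, target)
    return sample

target = np.full((8, 8), 0.5)  # a flat grey "image" standing in for the goal
result = reverse_diffusion(target)
print(np.abs(result - target).mean())  # residual shrinks towards zero with more steps
```

The key intuition survives the simplification: generation runs the noising process in reverse, so the sample is refined step by step rather than produced in one shot.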

DALL·E is currently only a research project and is not available in OpenAI’s API.

DALL·E outputs for ‘an armchair in the shape of an avocado’

CLIP

Earlier, the OpenAI research team introduced an open-source text-image tool, CLIP. The neural network, Contrastive Language-Image Pre-training, was trained on 400 million pairs of images and text. The tool efficiently learns visual concepts from natural language supervision and can be applied to classification by providing the names of the visual categories to be recognised. In a paper introducing the model, the OpenAI research team wrote about CLIP’s ability to perform various tasks during pretraining, including optical character recognition (OCR), geo-localisation, action recognition, and more. CLIP has proven to be highly efficient, flexible, and more generalised. Furthermore, it is far less expensive to train, given that CLIP relies on text-image pair datasets already available on the internet. It can adapt to perform a broader range of visual classification tasks.
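The zero-shot classification idea described above can be sketched as follows, assuming the image and text embeddings have already been produced by CLIP’s encoders. The vectors and labels below are made up for illustration; in practice they would come from the real model.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is closest to the image embedding."""
    scores = [cosine_sim(image_emb, t) for t in text_embs]
    return labels[int(np.argmax(scores))]

# Hypothetical 2-D embeddings; real CLIP embeddings are high-dimensional.
labels = ["a photo of a dog", "a photo of a cat"]
text_embs = [np.array([1.0, 0.1]), np.array([0.1, 1.0])]
image_emb = np.array([0.9, 0.2])

print(zero_shot_classify(image_emb, text_embs, labels))  # → a photo of a dog
```

This is why CLIP needs no task-specific training data for a new classifier: swapping in a different list of label captions is enough to define a new classification task.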

ruDALL-E

ruDALL-E takes a short text description and generates images based on it. The model understands a wide range of concepts and generates completely new images of objects that do not exist in the real world. The Russian take on OpenAI’s model, ruDALL-E, is built on ruGPT-3, which was trained on 600GB of Russian text. The ruDALL-E model boasts 1.3 billion parameters and a YTTM text tokeniser with a dictionary of 16,000 tokens. It leverages a custom VQGAN model that converts an image into a sequence of 32×32 discrete codes. There are two running models of the tool: Malevich (XL), trained with 1.3 billion parameters and an image encoder, and Kandinsky (XXL), with 12 billion parameters. On running the former with the same text input as the DALL·E example, “an armchair in the shape of an avocado”, ruDALL-E was found to understand that the avocado should define the chair’s shape.
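The VQGAN tokenisation step mentioned above can be sketched in simplified form. The encoder here is faked with random codes; a real VQGAN maps image patches to entries in a learned codebook, and the codebook size below is an assumption for illustration.

```python
import numpy as np

def fake_vqgan_encode(image, grid=32, codebook_size=8192, seed=0):
    """Hypothetical stand-in for a VQGAN encoder: in reality each cell of the
    32x32 grid would be the index of the nearest learned codebook vector."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, codebook_size, size=(grid, grid))

def to_token_sequence(code_grid):
    """Flatten the 2-D code grid row by row into a 1-D token sequence that a
    transformer can model alongside the text tokens."""
    return code_grid.flatten().tolist()

codes = fake_vqgan_encode(image=None)
tokens = to_token_sequence(codes)
print(len(tokens))  # 32 * 32 = 1024 image tokens
```

This flattening is what lets a language-model-style transformer treat image generation as next-token prediction over a mixed text-plus-image sequence.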

ruDALL-E outputs for ‘an armchair in the shape of an avocado’

X-LXMERT

Created by AI2 Labs, X-LXMERT is an extension of LXMERT, a transformer for visual and language connections. The tool comes with training refinements and enhanced image-generation capabilities, rivalling models specialised in image generation. X-LXMERT has three key refinements: discretising visual representations, using uniform masking with a large range of masking ratios, and aligning the right pretraining datasets to the right objectives. On their project page, the X-LXMERT research team explained the sampling procedure: “We employ Gibbs sampling to iteratively sample features at different spatial locations. In contrast to text generation, where left-to-right is considered a natural order, there is no natural order for generating images.”
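The Gibbs-sampling loop the team describes can be sketched in toy form. Here `sample_position` is a hypothetical stand-in for the model’s conditional distribution over discrete visual codes; a real sampler would condition on the text and on every other grid cell rather than drawing uniformly at random.

```python
import random

def sample_position(grid, r, c, codebook_size, rng):
    """Hypothetical conditional sampler for one grid cell. The real model
    predicts this cell's code from the caption and the rest of the grid."""
    return rng.randrange(codebook_size)

def gibbs_generate(height=8, width=8, codebook_size=512, sweeps=4, seed=0):
    """Initialise a random grid of visual codes, then repeatedly resample
    each spatial location, visiting positions in a random order."""
    rng = random.Random(seed)
    grid = [[rng.randrange(codebook_size) for _ in range(width)]
            for _ in range(height)]
    positions = [(r, c) for r in range(height) for c in range(width)]
    for _ in range(sweeps):
        rng.shuffle(positions)  # no natural left-to-right order for images
        for r, c in positions:
            grid[r][c] = sample_position(grid, r, c, codebook_size, rng)
    return grid

grid = gibbs_generate()
print(len(grid), len(grid[0]))  # 8 8
```

The random visiting order is the point of the quote: unlike text, where tokens are emitted left to right, every spatial location is repeatedly revisited until the grid of codes settles into a coherent image.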

Images created by X-LXMERT

GLID-3

GLID-3 is a combination of OpenAI’s GLIDE, the latent diffusion technique and OpenAI’s CLIP. The code is a modified version of guided diffusion and is trained on photographic-style images of people. It is a relatively small model. Compared to DALL·E, GLID-3 is less capable of producing imaginative images for a given prompt.


Avi Gopani
Avi Gopani is a technology journalist who analyses industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories, curated with a focus on the evolving technologies of artificial intelligence and data analytics.
