Top 4 DALL.E alternatives, text-to-image generators

DALL·E 2 was preferred over DALL·E 1 for its caption matching and photorealism.

‘Creativity is just connecting things,’ said Steve Jobs. He was channelling his inner Einstein (coincidentally, another Walter Isaacson muse), who had come up with ‘combinatory play’ to explain the inner workings of creative thought. OpenAI took the hint and built a text-to-image generator, DALL.E.

OpenAI has got creativity down to a science. Almost! An astronaut riding a horse in a photorealistic style, or teddy bears mixing sparkling chemicals as mad scientists in the style of a 1990s Saturday morning cartoon, are good cases in point. The ultra-imaginative DALL.E has become the talk of the town in a short time. Below, we look at similar models making the rounds in the world of AI.


In 2020, OpenAI introduced GPT-3 and, a year later, DALL.E, a 12-billion-parameter model built on GPT-3. DALL.E was trained to generate images from text descriptions, and the latest release, DALL.E 2, generates even more realistic and accurate images with 4x better resolution. The model takes natural language captions and, having learned from a dataset of text-image pairs, creates realistic images. Additionally, it can take an image and create different variations inspired by the original.


DALL.E 2 leverages the ‘diffusion’ process to learn the relationship between images and text descriptions. In diffusion, the model starts with a pattern of random dots and gradually alters it towards an image as it recognises specific aspects of that image. Diffusion models have emerged as a promising generative modelling framework and have pushed the state of the art in image and video generation tasks. A guidance technique is used during diffusion to improve sample fidelity and photorealism. The original DALL.E, by contrast, is made up of two major parts: a discrete autoencoder that accurately represents images in a compressed latent space, and a transformer that learns the correlations between language and this discrete image representation. When evaluators were asked to compare 1,000 image generations from each model, DALL·E 2 was preferred over DALL·E 1 for its caption matching and photorealism.
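To make the idea of starting from random dots and nudging them towards an image more concrete, here is a minimal toy sketch in plain Python. It is not OpenAI's implementation: the function names are hypothetical, the "noise prediction" is a stand-in that already knows the target (a real model would predict it with a neural network conditioned on the caption), and the guidance formula shown is classifier-free guidance, which is assumed here as one common form of the guidance technique.

```python
import random


def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance (assumed form): move the noise estimate
    from the unconditional prediction towards the caption-conditioned one.
    scale=0 gives the unconditional estimate, scale=1 the conditional one,
    and scale>1 exaggerates the effect of the caption."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def toy_denoise(target, steps=50, seed=0):
    """Start from random 'dots' and remove a fraction of the noise per step.

    Stand-in assumption: the 'predicted noise' is simply the difference
    between the current sample and the known target."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise
    for t in range(steps, 0, -1):
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # Remove 1/t of the remaining noise; the final step (t=1) removes all of it.
        x = [xi - ni / t for xi, ni in zip(x, predicted_noise)]
    return x


target_pixels = [0.2, 0.8, 0.5, 0.1]  # a made-up 4-pixel "image"
result = toy_denoise(target_pixels)
```

In this toy, `result` ends up at `target_pixels` (up to floating-point error) because the stand-in predictor is perfect; with a learned predictor, the guidance scale trades sample diversity for fidelity to the caption.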

DALL-E is currently only a research project, and is not available in OpenAI’s API. 


DALL.E outputs for ‘an armchair in the shape of an avocado’


Earlier, the OpenAI research team introduced an open-source text-image tool, CLIP (Contrastive Language-Image Pre-training). The neural network was trained on 400 million pairs of images and text. The tool efficiently learns visual concepts from natural language supervision and can be applied to classification by providing the names of the visual categories to be recognised. In the paper introducing the model, the OpenAI research team wrote that CLIP learns to perform various tasks during pretraining, including optical character recognition (OCR), geo-localisation, action recognition, and more. CLIP has proven to be highly efficient, flexible, and general. Furthermore, it is far less expensive, given that CLIP relies on text-image pairs already available on the internet, and it can adapt to perform a broader range of visual classification tasks.
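Classification "by providing the names of the visual categories" can be sketched as follows: embed the image and each category name, then pick the name whose embedding is most similar to the image's. The sketch below is only an illustration in plain Python. The three-dimensional embeddings are made up by hand (a real CLIP model would produce them with its image and text encoders), the function names are hypothetical, and the temperature of 0.07 is an assumed value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, temperature=0.07):
    """Return the best-matching label and a label -> probability mapping."""
    labels = list(text_embeddings)
    logits = [cosine(image_embedding, text_embeddings[l]) / temperature
              for l in labels]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


# Hand-made stand-ins for encoder outputs (assumption, not real CLIP vectors).
text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_embedding, text_embeddings)
```

Because the categories are supplied as text at inference time, changing the classification task only means changing the list of names; no retraining is involved in this scheme.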


ruDALL.E takes a short description and generates images based on it. The model understands a wide range of concepts and generates completely new images and objects that do not exist in the real world. The Russian take on OpenAI's model, ruDALL.E, is based on ruGPT-3, which was trained on 600GB of Russian text. The Russian ruDALL.E model boasts 1.3 billion parameters and a YTTM text tokeniser with a dictionary of 16,000 tokens. It leverages a custom VQGAN model that converts an image into a 32×32 sequence of tokens. There are two versions of the tool: Malevich (XL), with 1.3 billion parameters and an image encoder, and Kandinsky (XXL), with 12 billion parameters. On running the former model with the same text input as the latest DALL.E example, “an armchair in the shape of an avocado”, ruDALL.E was found to understand how to combine a chair and an avocado through shape.

ruDALL.E outputs for ‘an armchair in the shape of an avocado’
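The step in which an image becomes a 32×32 grid of discrete codes can be pictured as vector quantisation: each patch feature is replaced by the index of its nearest entry in a learned codebook. The following plain-Python sketch illustrates that idea only. The tiny four-entry codebook and the patch features are invented for the example, and the function name is hypothetical; it is not ruDALL.E's VQGAN code.

```python
def quantise(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


# Made-up 2-d codebook; a real VQGAN codebook is learned and far larger.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

grid_side = 32
# Made-up patch features, one per cell of the 32x32 grid.
features = [
    [(i % 2) * 0.9, ((i // 2) % 2) * 0.8]
    for i in range(grid_side * grid_side)
]

tokens = quantise(features, codebook)  # one integer code per grid cell
```

A 32×32 grid yields 1,024 codes per image, which is what allows a transformer to treat the image as a sequence in much the same way as it treats text tokens.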


Created by AI2 Labs, X-LXMERT is an extension of LXMERT, a transformer for visual and language connections. The tool comes with training refinements and enhanced image generation capabilities, rivalling models specialised in image generation. X-LXMERT has three key refinements: discretising visual representations, using uniform masking with a large range of masking ratios, and aligning the right pretraining datasets to the right objectives. On their project page, the X-LXMERT research team explained the sampling procedure as follows: “We employ Gibbs sampling to iteratively sample features at different spatial locations. In contrast to text generation, where left-to-right is considered a natural order, there is no natural order for generating images.”

Images created by X-LXMERT
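The Gibbs sampling idea quoted above, resampling one spatial location at a time while holding the rest fixed, can be illustrated with a small toy. This is a sketch under stated assumptions, not X-LXMERT's code: the function name is hypothetical, and the conditional distribution (which simply favours codes already held by neighbouring cells) is a stand-in for the model's learned prediction given the text and the other visual features.

```python
import random


def gibbs_sample_grid(size=4, num_codes=3, sweeps=5, seed=0):
    """Iteratively resample each grid cell conditioned on its neighbours."""
    rng = random.Random(seed)
    # Start from a random assignment of discrete codes.
    grid = [[rng.randrange(num_codes) for _ in range(size)] for _ in range(size)]
    positions = [(r, c) for r in range(size) for c in range(size)]

    for _ in range(sweeps):
        # Images have no natural ordering, so visit the cells in random order.
        rng.shuffle(positions)
        for r, c in positions:
            # Stand-in conditional: weight = 1 + number of 4-neighbours
            # currently holding that code.
            weights = [1.0] * num_codes
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < size and 0 <= nc < size:
                    weights[grid[nr][nc]] += 1.0
            grid[r][c] = rng.choices(range(num_codes), weights=weights)[0]
    return grid


sampled = gibbs_sample_grid()
```

Each cell is always drawn conditioned on the current state of the others, which is the defining property of Gibbs sampling; only the conditional distribution would change in a real model.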


GLID-3 is a combination of OpenAI’s GLIDE, the Latent Diffusion technique and OpenAI’s CLIP. The code is a modified version of guided diffusion and is trained on photographic-style images of people. It is a relatively small model. Compared to DALL.E, GLID-3 is less capable of producing imaginative images for given prompts.


Avi Gopani
Avi Gopani is a technology journalist who seeks to analyse industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.
