Comprehensive Guide to DALL-E By OpenAI: Creating Images from Text


Transformers are all the attention we need right now!

OpenAI has recently released DALL-E, a text-to-image generation model based on the transformer architecture. The name blends the surrealist painter Salvador Dalí and the robot from WALL-E. DALL-E is a neural network that creates images from text captions expressed in natural language. It is a 12-billion-parameter autoregressive transformer (a version of GPT-3) trained on 250 million image-text pairs collected from the internet. Evaluated zero-shot on the MS-COCO dataset, that is, without using any of its training labels, DALL-E generates high-quality images. The model is flexible enough to combine different concepts in plausible ways, such as creating anthropomorphized versions of animals, rendering text, and performing some types of image-to-image translation.

The DALL-E paper was published by OpenAI researchers Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.

Here is an example of high-quality images generated from text.

Overview of DALL-E

DALL-E is a transformer language model: an autoregressive transformer is trained to model the text and image tokens as a single stream of data. The overall approach can be viewed as maximizing the evidence lower bound (ELB) on the joint likelihood of the model distribution over images, captions, and image tokens. Using pixels directly as image tokens would require an inordinate amount of memory for high-resolution images. In addition, likelihood objectives tend to prioritize short-range dependencies between pixels, so much of the model's capacity would be spent on high-frequency detail rather than on the low-frequency structure that makes objects visually recognizable to us. To address both issues, the training procedure is divided into two stages:

Stage 1: Train a discrete variational autoencoder (dVAE) to compress each 256 × 256 RGB image into a 32 × 32 grid of image tokens, each element of which can assume 8192 possible values. This reduces the context size of the transformer by a factor of 192 without a large degradation in visual quality.

Comparison of original images (top) and reconstructions from the discrete VAE (bottom). The encoder downsamples the spatial resolution by a factor of 8. While details (e.g., the texture of the cat’s fur, the writing on the storefront, and the thin lines in the illustration) are sometimes lost or distorted, the main features of the image are still typically recognizable. We use a large vocabulary size of 8192 to mitigate the loss of information. (Source)
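The figures quoted for stage 1 can be checked with a few lines of arithmetic. This is only a sanity check of the numbers stated above; the variable names are chosen here for illustration and do not come from the DALL-E code.

```python
# Context-size arithmetic for stage 1, using only the figures quoted above.
image_side = 256      # input resolution (pixels per side)
channels = 3          # RGB
grid_side = 32        # dVAE token grid (tokens per side)
vocab_size = 8192     # possible values per image token

pixel_values = image_side * image_side * channels   # numbers needed per raw image
image_tokens = grid_side * grid_side                # tokens needed per encoded image
compression_factor = pixel_values // image_tokens   # reduction in context size
downsample_factor = image_side // grid_side         # reduction per spatial dimension
bits_per_token = vocab_size.bit_length() - 1        # 8192 == 2 ** 13

print(pixel_values, image_tokens, compression_factor, downsample_factor, bits_per_token)
```

The factor of 192 counts every colour channel of every pixel as one context element, while the factor of 8 in the caption refers to each spatial dimension separately.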

Stage 2: Concatenate up to 256 BPE-encoded text tokens with the 32 × 32 = 1024 image tokens, and train an autoregressive transformer to model the joint distribution over the text and image tokens.
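A minimal sketch of how such a single token stream could be assembled is given below. It is not the authors' implementation: the helper `build_stream`, the padding id, the text vocabulary size, and the idea of shifting image ids past the text ids are all assumptions made for illustration. Only the lengths (256 text tokens, 1024 image tokens) and the 8192 image codes come from the description above.

```python
import random

TEXT_LEN = 256          # maximum number of BPE text tokens (from the text above)
IMAGE_LEN = 32 * 32     # 1024 image tokens from the dVAE grid
IMAGE_VOCAB = 8192      # values each image token can take
TEXT_VOCAB = 16384      # assumption: text vocabulary size used only in this sketch
PAD_ID = 0              # assumption: hypothetical padding id for short captions


def build_stream(text_tokens, image_tokens):
    """Concatenate text and image tokens into one fixed-length sequence."""
    if len(text_tokens) > TEXT_LEN:
        raise ValueError("too many text tokens")
    if len(image_tokens) != IMAGE_LEN:
        raise ValueError("expected 1024 image tokens")
    # Pad the caption so the image tokens always start at position 256.
    padded_text = list(text_tokens) + [PAD_ID] * (TEXT_LEN - len(text_tokens))
    # Shift image ids past the text ids so both share one vocabulary.
    shifted_image = [TEXT_VOCAB + t for t in image_tokens]
    return padded_text + shifted_image


rng = random.Random(0)
text = [rng.randrange(1, TEXT_VOCAB) for _ in range(12)]         # stand-in caption
image = [rng.randrange(IMAGE_VOCAB) for _ in range(IMAGE_LEN)]   # stand-in dVAE codes
stream = build_stream(text, image)
print(len(stream))
```

An autoregressive transformer trained on sequences like `stream` predicts each token from the ones before it, so at generation time the caption is supplied first and the 1024 image tokens are sampled one by one.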

Note: The details of the dVAE are given in the appendix of the research paper.
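The lower bound mentioned in the overview can be written out as follows. The notation follows the DALL-E paper and is not defined elsewhere in this article, so treat the symbols as assumptions of this sketch: x is an image, y its caption, z the image tokens, q_φ the distribution over tokens produced by the dVAE encoder, p_θ the distribution over images produced by the dVAE decoder, p_ψ the joint distribution over text and image tokens modelled by the transformer, and β a weight on the KL term.

```latex
\ln p_{\theta,\psi}(x, y) \;\geq\;
\mathbb{E}_{z \sim q_\phi(z \mid x)}
\Big( \ln p_\theta(x \mid y, z)
\;-\; \beta \, D_{\mathrm{KL}}\big( q_\phi(y, z \mid x),\; p_\psi(y, z) \big) \Big)
```

Read this way, stage 1 maximizes the bound with respect to φ and θ (training the dVAE on images alone), and stage 2 keeps φ and θ fixed while learning ψ (training the transformer on the token stream).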

Apart from generating images from scratch, this approach can also regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt.

Applications of DALL-E

  • Controlling attributes
  • Drawing multiple objects 
  • Visualizing perspective and three-dimensionality 
  • Visualizing internal and external structure
  • Inferring contextual details
  • Applications of preceding capabilities
  • Combining unrelated concepts
  • Animal illustrations
  • Zero-shot visual reasoning
  • Geographic knowledge
  • Temporal knowledge

Requirements & Installation

The package we are going to install is the PyTorch implementation of the discrete VAE used for DALL-E. You can install it via pip.

!pip install DALL-E

Demo of Using the Pretrained dVAE for DALL-E

  1. Import all the required packages and modules.
 import io
 import os, sys
 import requests
 import PIL
 import torch
 import torchvision.transforms as T
 import torchvision.transforms.functional as TF
 from dall_e import map_pixels, unmap_pixels, load_model
 from IPython.display import display, display_markdown 
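  2. Make helper functions:

a) For downloading the image

 target_image_size = 256
 def download_image(url):
     # Fetch the image and fail early on HTTP errors.
     resp = requests.get(url)
     resp.raise_for_status()
     return PIL.Image.open(io.BytesIO(resp.content))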

b) For preprocessing the downloaded image

 def preprocess(img):
     s = min(img.size)
     if s < target_image_size:
         raise ValueError(f'min dim for image {s} < {target_image_size}')
     r = target_image_size / s
     s = (round(r * img.size[1]), round(r * img.size[0]))
     img = TF.resize(img, s, interpolation=PIL.Image.LANCZOS)
     img = TF.center_crop(img, output_size=2 * [target_image_size])
     img = torch.unsqueeze(T.ToTensor()(img), 0)
     return map_pixels(img) 
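To make the geometry and the pixel mapping in `preprocess` concrete, here is a small pure-Python sketch. It assumes that `map_pixels` squeezes values from [0, 1] into [0.1, 0.9] and that `unmap_pixels` inverts and clamps that mapping; the margin of 0.1 and the helper names `map_pixel`, `unmap_pixel`, and `resized_shape` are assumptions made here for illustration, not the package's API.

```python
EPS = 0.1  # assumption: margin used by the package's pixel mapping


def map_pixel(v, eps=EPS):
    """Squeeze a pixel value from [0, 1] into [eps, 1 - eps]."""
    return (1 - 2 * eps) * v + eps


def unmap_pixel(v, eps=EPS):
    """Invert map_pixel and clamp the result back into [0, 1]."""
    return min(1.0, max(0.0, (v - eps) / (1 - 2 * eps)))


def resized_shape(width, height, target=256):
    """Return (height, width) after scaling the shorter side to `target`."""
    r = target / min(width, height)
    return round(r * height), round(r * width)


# A 640 x 480 image is scaled so its shorter side becomes 256 pixels;
# the centre crop then trims the longer side down to 256 as well.
print(resized_shape(640, 480))
print(map_pixel(0.0), map_pixel(1.0))
print(unmap_pixel(map_pixel(0.25)))
```

Keeping pixel values away from exactly 0 and 1 is a common way to avoid numerical problems when a model's output is passed through a sigmoid, which is what the reconstruction step does before calling `unmap_pixels`.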
  3. Load the models for the encoder and decoder.
 # This can be changed to a GPU, e.g. 'cuda:0'.
 dev = torch.device('cpu')
 # For faster load times, download these files locally and use the local paths instead.
 enc = load_model("", dev)
 dec = load_model("", dev) 
  4. Download the image from the URL and preprocess it.
 x = preprocess(download_image(''))
 display_markdown('Original image:')
 display(T.ToPILImage(mode='RGB')(x[0]))

The input image is:

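  5. Reconstruct the image.
 import torch.nn.functional as F
 # Encode the image and pick the most likely code at each grid position.
 z_logits = enc(x)
 z = torch.argmax(z_logits, axis=1)
 # One-hot encode the codes and move the vocabulary axis to the channel dimension.
 z = F.one_hot(z, num_classes=enc.vocab_size).permute(0, 3, 1, 2).float()
 # Decode, keep the first three output channels, and map back to [0, 1].
 x_stats = dec(z).float()
 x_rec = unmap_pixels(torch.sigmoid(x_stats[:, :3]))
 x_rec = T.ToPILImage(mode='RGB')(x_rec[0])
 display_markdown('Reconstructed image:')
 display(x_rec)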
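The argmax and one-hot steps in the reconstruction code can be illustrated on a toy example without PyTorch. The sketch below uses nested lists in place of tensors, a vocabulary of 3 codes instead of 8192, and a 2 × 2 grid instead of 32 × 32; the helpers `argmax_tokens` and `one_hot` are hypothetical stand-ins written for this illustration.

```python
def argmax_tokens(logits):
    """logits[k][i][j] -> grid[i][j] holding the code k with the largest logit."""
    vocab, rows, cols = len(logits), len(logits[0]), len(logits[0][0])
    return [[max(range(vocab), key=lambda k: logits[k][i][j]) for j in range(cols)]
            for i in range(rows)]


def one_hot(grid, vocab):
    """grid[i][j] -> planes[k][i][j], 1.0 where grid[i][j] == k and 0.0 elsewhere."""
    return [[[1.0 if tok == k else 0.0 for tok in row] for row in grid]
            for k in range(vocab)]


# Toy logits in channel-first layout: one 2 x 2 plane of scores per code.
toy_logits = [
    [[0.1, 2.0], [0.3, 0.0]],   # scores for code 0
    [[1.5, 0.2], [0.1, 0.4]],   # scores for code 1
    [[0.2, 0.1], [0.9, 3.0]],   # scores for code 2
]
tokens = argmax_tokens(toy_logits)      # one discrete code per grid position
planes = one_hot(tokens, vocab=3)       # channel-first one-hot planes
print(tokens)
print(planes)
```

In the real pipeline, `tokens` corresponds to the 32 × 32 grid of codes that the transformer would model, and `planes` corresponds to the channel-first one-hot tensor passed to the decoder.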

The output image after reconstruction will look like this:


In this post, we have given an overview of DALL-E, a very simple method for text-to-image generation based on an autoregressive transformer. 

Note: All the images except the output are taken from official sources.

Official code, documentation, and tutorials are available at:

Aishwarya Verma
A data science enthusiast and a post-graduate in Big Data Analytics. Creative and organized with an analytical bent of mind.
