Comprehensive Guide to DALL-E By OpenAI: Creating Images from Text

Aishwarya Verma

Transformers are all the attention we need right now!

OpenAI recently released DALL-E, a text-to-image generation model based on the transformer architecture. The name is a blend of the surrealist painter Salvador Dalí and the robot WALL-E. DALL-E is a neural network that creates images from text descriptions expressed in natural language. It is a 12-billion-parameter autoregressive transformer (a version of GPT-3) trained on 250 million image-text pairs collected from the internet. DALL-E generates high-quality images on the MS-COCO dataset zero-shot, that is, without using any of the dataset's training labels. Thanks to the model's flexibility, DALL-E can combine different concepts in plausible ways: it can create anthropomorphized versions of animals, render text, and perform some types of image-to-image translation.

The DALL-E paper was published by OpenAI researchers Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.

Here is an example of high-quality images generated from text.

Overview of DALL-E

DALL-E is a transformer language model: the goal is to train an autoregressive transformer that models the text and image tokens as a single stream of data. The overall approach can be viewed as maximizing the evidence lower bound (ELBO) on the joint likelihood of the model distribution over images, captions, and the tokens of the encoded image. Using pixels directly as image tokens would require an inordinate amount of memory for high-resolution images. In addition, likelihood objectives tend to prioritize short-range dependencies between pixels, so much of the modeling capacity would be spent on high-frequency details rather than on the low-frequency structure that makes objects visually recognizable to us. To address this, the training procedure is divided into two stages:

Stage 1: Train a discrete variational autoencoder (dVAE) to compress each 256 × 256 RGB image into a 32 × 32 grid of image tokens, each element of which can assume 8192 possible values. This reduces the context size of the transformer by a factor of 192 without a large degradation in visual quality.
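
The factor of 192 follows from simple counting: a 256 × 256 RGB image has 256 · 256 · 3 values, while the dVAE emits 32 · 32 tokens. A quick check in Python (the variable names are only illustrative):

```python
import math

# Values needed to describe one image at the pixel level (height * width * RGB channels).
pixel_values = 256 * 256 * 3

# Tokens produced by the dVAE encoder for the same image (32 x 32 grid).
image_tokens = 32 * 32

# Reduction in the transformer's context size.
reduction = pixel_values // image_tokens

# Each token takes one of 8192 values, i.e. log2(8192) bits.
bits_per_token = int(math.log2(8192))

print(reduction)        # 192
print(bits_per_token)   # 13
```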

Comparison of original images (top) and reconstructions from the discrete VAE (bottom). The encoder downsamples the spatial resolution by a factor of 8. While details (e.g., the texture of the cat’s fur, the writing on the storefront, and the thin lines in the illustration) are sometimes lost or distorted, the main features of the image are still typically recognizable. We use a large vocabulary size of 8192 to mitigate the loss of information. (Source)

Stage 2: Concatenate up to 256 BPE-encoded text tokens with the 32 × 32 = 1024 image tokens, and train an autoregressive transformer to model the joint distribution over the text and image tokens.
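
A minimal sketch of how the two token types could be laid out as one stream. The text vocabulary size of 16384 is the figure reported in the paper rather than in this article, and the helper `build_stream`, the single padding id, and the vocabulary offset are illustrative assumptions, not OpenAI's actual implementation.

```python
TEXT_VOCAB = 16384      # BPE vocabulary size reported in the paper (not stated in this article)
IMAGE_VOCAB = 8192      # dVAE codebook size
MAX_TEXT_TOKENS = 256
IMAGE_TOKENS = 32 * 32
PAD_ID = 0              # hypothetical padding id, for illustration only


def build_stream(text_ids, image_ids):
    """Concatenate text and image tokens into one sequence for the transformer."""
    if len(text_ids) > MAX_TEXT_TOKENS:
        raise ValueError("caption is longer than 256 tokens")
    if len(image_ids) != IMAGE_TOKENS:
        raise ValueError("expected 1024 image tokens")
    # Pad the caption so that image tokens always start at position 256.
    padded_text = list(text_ids) + [PAD_ID] * (MAX_TEXT_TOKENS - len(text_ids))
    # Shift image ids past the text vocabulary so the two id ranges never collide.
    shifted_image = [TEXT_VOCAB + i for i in image_ids]
    return padded_text + shifted_image


stream = build_stream([5, 17, 42], [0] * 1023 + [8191])
print(len(stream))   # 1280
print(stream[-1])    # 24575
```

With this layout the transformer sees at most 256 + 1024 = 1280 positions per example, and an id alone is enough to tell whether a token belongs to the caption or to the image.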

Note: The details of the dVAE are given in the appendix of the research paper.

Apart from generating an image from scratch, this approach can also regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt.

Applications of DALL-E

  • Controlling attributes
  • Drawing multiple objects 
  • Visualizing perspective and three-dimensionality 
  • Visualizing internal and external structure
  • Inferring contextual details
  • Applications of preceding capabilities
  • Combining unrelated concepts
  • Animal illustrations
  • Zero-shot visual reasoning
  • Geographic knowledge
  • Temporal knowledge

Requirements & Installation

The package we are going to install is the PyTorch implementation of the discrete VAE used for DALL-E. You can install it via pip.

!pip install DALL-E

Demo of Using the Pretrained dVAE for DALL-E

  1. Import all the required packages and modules.
 import io
 import os, sys
 import requests
 import PIL
 import torch
 import torchvision.transforms as T
 import torchvision.transforms.functional as TF
 from dall_e import map_pixels, unmap_pixels, load_model
 from IPython.display import display, display_markdown 
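  2. Make helper functions:

a) For downloading the image

 target_image_size = 256
 def download_image(url):
     resp = requests.get(url)
     resp.raise_for_status()
     # Decode the downloaded bytes into a PIL image.
     return PIL.Image.open(io.BytesIO(resp.content))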

b) For preprocessing the downloaded image

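 def preprocess(img):
     # Reject images whose shorter side is smaller than the target size.
     s = min(img.size)
     if s < target_image_size:
         raise ValueError(f'min dim for image {s} < {target_image_size}')
     # Scale so the shorter side equals the target size (PIL size is (width, height)).
     r = target_image_size / s
     s = (round(r * img.size[1]), round(r * img.size[0]))
     img = TF.resize(img, s, interpolation=PIL.Image.LANCZOS)
     # Crop the central square and convert to a batched tensor in [0, 1].
     img = TF.center_crop(img, output_size=2 * [target_image_size])
     img = torch.unsqueeze(T.ToTensor()(img), 0)
     # Rescale pixel values to the range the dVAE expects.
     return map_pixels(img)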
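
To see what the resize and crop in preprocess do to the image dimensions, here is a torch-free sketch of the same arithmetic. The helper name `resized_shape` and the 1000 × 667 example size are made up for illustration.

```python
TARGET = 256  # same value as target_image_size above


def resized_shape(width, height, target=TARGET):
    """Return the (height, width) that preprocess would pass to TF.resize."""
    shortest = min(width, height)
    if shortest < target:
        raise ValueError(f"min dim for image {shortest} < {target}")
    ratio = target / shortest
    # The shorter side lands on `target`; the longer side keeps the aspect ratio.
    return (round(ratio * height), round(ratio * width))


h, w = resized_shape(1000, 667)
# The center crop then trims the longer side equally from both ends.
trimmed_per_side = (max(h, w) - TARGET) // 2

print(h, w)              # 256 384
print(trimmed_per_side)  # 64
```

So a landscape photo is first shrunk until its height is 256 pixels, and the center crop then discards 64 columns on the left and on the right to leave a 256 × 256 square.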
  3. Load the encoder and decoder models.
 # This can be changed to a GPU, e.g. 'cuda:0'.
 dev = torch.device('cpu')
 # For faster load times, download these files locally and use the local paths instead.
 enc = load_model("", dev)
 dec = load_model("", dev) 
  4. Download the image from the URL and preprocess it.
 x = preprocess(download_image(''))
 display_markdown('Original image:')
 display(T.ToPILImage(mode='RGB')(x[0]))

The input image is:

  5. Reconstruct the image.
 import torch.nn.functional as F
 # Encode: pick the most likely codebook entry at each grid position.
 z_logits = enc(x)
 z = torch.argmax(z_logits, axis=1)
 z = F.one_hot(z, num_classes=enc.vocab_size).permute(0, 3, 1, 2).float()
 # Decode: the first three output channels give the RGB reconstruction.
 x_stats = dec(z).float()
 x_rec = unmap_pixels(torch.sigmoid(x_stats[:, :3]))
 x_rec = T.ToPILImage(mode='RGB')(x_rec[0])
 display_markdown('Reconstructed image:')
 display(x_rec)
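
The encoding part of this step can be traced at toy scale without loading the models. The sketch below mimics taking the argmax over the vocabulary axis and then one-hot encoding the result, using plain Python lists. The 4-entry vocabulary and the logit values are invented for illustration; the real encoder uses 8192 entries on a 32 × 32 grid and works on tensors.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def one_hot(index, num_classes):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


# Toy "z_logits": one list of vocabulary logits per grid position
# (2 positions, vocabulary of 4).
z_logits = [
    [0.1, 2.5, -1.0, 0.3],
    [1.2, 0.0, 0.4, 3.1],
]

# Hard assignment: each position is replaced by its most likely token id.
tokens = [argmax(logits) for logits in z_logits]

# The decoder consumes one-hot vectors rather than raw ids.
z = [one_hot(t, num_classes=4) for t in tokens]

print(tokens)  # [1, 3]
print(z[0])    # [0.0, 1.0, 0.0, 0.0]
```

The token ids in `tokens` are what the transformer would model in Stage 2; the one-hot form is only needed to feed the dVAE decoder.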

The output image after reconstruction will look like:


In this post, we have given an overview of DALL-E, a simple approach to text-to-image generation based on an autoregressive transformer.

Note : All the images except for the output are taken from official sources.

Official Codes, Documentation & Tutorials are available at : 


Copyright Analytics India Magazine Pvt Ltd
