OpenAI’s DALL.E Can Create Images From Text Prompts

OpenAI has introduced DALL.E, a 12-billion parameter version of GPT-3 trained to generate images from text prompts.

The name is a play on the surrealist painter Salvador Dalí and the Pixar movie WALL.E. DALL.E is a transformer language model that receives both the text and the image as a single stream of data containing up to 1280 ‘tokens’.

Simply put, a token is any symbol from a discrete vocabulary. For example, each letter of the English alphabet, from A to Z, is a token from a 26-letter vocabulary. DALL.E’s vocabulary, however, contains tokens for both text and image concepts.
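
To make the idea of a single stream concrete, here is a minimal Python sketch. It assumes the split OpenAI has described for DALL.E: up to 256 text tokens followed by 1024 image tokens, which adds up to 1280. The function name `build_stream`, the padding id, and the toy token ids are hypothetical; the sketch only illustrates the layout of the stream, not OpenAI's actual tokenisers.

```python
from typing import List

MAX_TEXT_TOKENS = 256     # assumed text budget, per OpenAI's description
NUM_IMAGE_TOKENS = 1024   # assumed 32 x 32 grid of image tokens
MAX_STREAM_LENGTH = MAX_TEXT_TOKENS + NUM_IMAGE_TOKENS  # 1280
PAD_ID = 0                # hypothetical padding token id


def build_stream(text_tokens: List[int], image_tokens: List[int]) -> List[int]:
    """Concatenate text and image tokens into one fixed-length sequence."""
    if len(text_tokens) > MAX_TEXT_TOKENS:
        raise ValueError("caption is too long")
    if len(image_tokens) != NUM_IMAGE_TOKENS:
        raise ValueError("expected exactly 1024 image tokens")
    # Pad the caption so the image tokens always start at the same position.
    padding = [PAD_ID] * (MAX_TEXT_TOKENS - len(text_tokens))
    return text_tokens + padding + image_tokens


caption = [17, 42, 7]            # toy ids standing in for a tokenised caption
image = [5] * NUM_IMAGE_TOKENS   # toy ids standing in for image tokens
stream = build_stream(caption, image)
print(len(stream))  # 1280
```

Because the model sees one sequence, it can be trained like any other language model: predict the next token, whether that token belongs to the caption or to the image.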

DALL.E can render an image from scratch and also alter aspects of an image using text prompts.

Credit: OpenAI

For example, in the above image, the text prompt is “a pentagon green frame”. Altering any of the three aspects here, shape (pentagon), colour (green), or object (frame), would generate a different set of images.

Changing the caption to “triangle red clock”, for instance, makes DALL.E produce the following set of images.

Credit: OpenAI
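
The way such prompts vary along shape, colour, and object can be sketched in a few lines of Python. This is only an illustration of how the caption variants combine: the `make_prompts` helper and the "a {shape} {colour} {object}" template are assumptions modelled on the example wording above, and the code builds strings without calling any image model.

```python
from itertools import product
from typing import List

shapes = ["pentagon", "triangle"]
colours = ["green", "red"]
objects = ["frame", "clock"]


def make_prompts(shapes: List[str], colours: List[str], objects: List[str]) -> List[str]:
    """Build one caption for every shape/colour/object combination."""
    return [f"a {s} {c} {o}" for s, c, o in product(shapes, colours, objects)]


prompts = make_prompts(shapes, colours, objects)
print(len(prompts))  # 8
print(prompts[0])    # a pentagon green frame
print(prompts[-1])   # a triangle red clock
```

Two values for each of the three aspects already give eight distinct captions, each of which would lead to a different set of images.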

The DALL.E model can also work with multiple objects in an image. For example, given the text prompt “a hedgehog wearing a yellow hat, red gloves, blue jacket, and pink pants”, the model must not only compose each of the pieces correctly (hedgehog, hat, gloves, jacket, and pants) but also establish the correct association between each object and its colour: a yellow hat, red gloves, a blue jacket, and pink pants on a hedgehog, without mixing them up.

The OpenAI team has also tested DALL.E’s capabilities in other specific situations, such as generating 3D imagery, cross-sectional views, and images based on contextual text captions.

Interestingly, DALL.E was often seen to take creative liberties, generating images rich in details that the prompt did not explicitly ask for. This sets it apart from 3D rendering engines, which require specific and unambiguous inputs.

The DALL.E model can also perform image-to-image translation tasks based on prompts.

CLIP

Following the announcement of DALL.E, the OpenAI research team has also demonstrated a neural network called Contrastive Language-Image Pre-training, or CLIP. This neural network has been trained on 400 million pairs of images and text.

In a paper introducing the model, the OpenAI research team wrote, “We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pretraining, including optical character recognition (OCR), geo-localisation, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models.”

OpenAI describes CLIP as highly efficient, flexible, and general. Further, CLIP is far less expensive to build than typical deep learning vision models, which depend on manually labelled datasets. CLIP instead relies on text-image pairs already available on the internet and can adapt to a broader range of visual classification tasks without requiring additional training examples.
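
Classifying without additional training examples can be illustrated with a small Python sketch of contrastive zero-shot classification: each candidate label is written as a caption, and the caption whose embedding is most similar to the image embedding wins. The three-dimensional vectors below are made-up stand-ins for the outputs of an image encoder and a text encoder, and `zero_shot_classify` and the temperature value are hypothetical choices for this sketch, not OpenAI's code.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_vec: List[float],
                       label_vecs: Dict[str, List[float]],
                       temperature: float = 0.07) -> Dict[str, float]:
    """Turn image-caption similarities into probabilities with a softmax."""
    scores = {label: cosine(image_vec, vec) / temperature
              for label, vec in label_vecs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy embeddings (assumed values, not real encoder outputs).
label_vecs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_vec = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_vec, label_vecs)
best = max(probs, key=probs.get)
print(best)  # a photo of a dog
```

Switching to a new classification task only means writing a new list of captions; no labelled examples or retraining are involved in this setup.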

Two Sides To A Story

The DALL.E model is already creating ripples in the research community. The OpenAI team counts on the capabilities of this model to deliver a broader societal impact. The team is also looking at DALL.E’s potential influence on specific processes and professions.

However, DALL.E is not exactly foolproof. Experts fear that, like the GPT-3 model, both DALL.E and CLIP can reinforce racial and gender stereotypes. A bias test found the CLIP model was likely to miscategorise people under 20 as criminals or non-humans. Further, the model was more likely to label men as criminals than women.


Shraddha Goled

I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.