A picture is worth a thousand words. But is it really? With text-to-image generation, a few words may be enough to create a thousand pictures.
In April 2022, OpenAI caused an uproar with the launch of its latest model, ‘DALL-E 2’, which uses text prompts to create breathtaking, high-quality images. The Google Brain team followed suit with ‘Imagen’, a model based on diffusion models with deep language understanding that creates stunning images in different styles, ranging from brush-based illustrations to high-definition pictures.
Meta, in contrast, challenged the monotony of the text-to-image generation process with its own AI model, ‘Make-a-scene’, which takes not only text prompts but also sketches to create high-definition visual masterpieces on a digital canvas.
Meta’s ‘Make-a-scene’ model demonstrates the empowering use of technology to augment human creativity with the help of artificial intelligence.
Innovating further by enabling users to insert visual prompts along with text prompts, Meta was able to alter the current dynamics of the AI text-to-image generation process. However, it remains debatable whether Meta’s improved AI model would be able to hold its own against conventional text-to-image models.
How does ‘Make-a-scene’ work?
The model uses an autoregressive transformer that integrates the conventional use of text and image tokens. This model also introduces implicit conditioning over ‘scene tokens’—optionally controlled and derived from segmentation maps. These segmentation tokens are either generated independently by a transformer during inference or extracted directly from the input image—providing the option to include additional constraints over the AI-generated image.
Unlike many GAN-based models, which use segmentation for explicit conditioning, ‘Make-a-scene’ uses segmentation tokens for implicit conditioning, so the generated image is not strictly tied to the segmentation layout. In practice, this increases the variety of samples generated by Meta’s model.
‘Make-a-scene’ generates images upon being given a text prompt as well as an optional sketch that the AI model then references as a segmentation map.
Meta’s researchers went beyond the scene-based approach and improved both the general and the perceived quality of image generation by improving the representation of the token space. Several modifications to the tokenisation process emphasise aspects that matter to human perception, such as salient objects and faces.
To avoid the need for filtering after image generation while also improving generation quality and text alignment, the model employs ‘classifier-free’ guidance.
A closer look at the workings of ‘Make-a-scene’ reveals four distinct elements unique to Meta’s method:
- Scene representation and tokenisation: This consists of a blend of three complementary semantic segmentation groups—panoptic, human and face. Such combinations allow the neural network to learn how to generate the semantic layout and implement various conditions in the generation of the final image.
- Identifying human preference in token space with explicit losses: In transformer-based image generation, the quality of the generated images has an inherent upper bound, a consequence of the ‘tokenisation reconstruction’ method. To mitigate this, Meta’s model introduces several modifications to the image reconstruction and segmentation methods, such as face-aware vector quantisation, face emphasis in the scene space and object-aware vector quantisation.
- Scene-based transformer: This method relies on an autoregressive transformer with three independent, consecutive token spaces: text, scene and image. Before the scene-based transformer is trained, the encoded token sequence corresponding to each text-scene-image triplet is extracted using the corresponding encoders, producing a single sequence. The relevant tokens are then generated by the transformer, and encoded and decoded by the corresponding networks.
(Image source: scene-based method high-level architecture)
- Transformer classifier-free guidance: This process guides an unconditional sample toward a conditional sample. To support unconditional sampling, the transformer is fine-tuned while the text prompts are randomly replaced with padding tokens. During inference, two parallel token streams are then generated: a conditional stream based on the text prompt, and an unconditional stream based on an empty text prompt initialised with padding tokens.
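The classifier-free guidance item above can be illustrated with a minimal Python sketch. It assumes the usual guidance blend, in which the next-token logits start from the unconditional stream and are pushed toward the conditional stream by a scale factor; the article does not give the exact formula, so the blend, the function names and the `guidance_scale` parameter are illustrative assumptions rather than Meta's implementation. The logits are made-up numbers over a three-token vocabulary.

```python
import math
import random


def guided_logits(cond_logits, uncond_logits, guidance_scale):
    """Start at the unconditional logits and move toward the conditional
    ones by `guidance_scale` (assumed name; 1.0 means plain conditional)."""
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("both streams must score the same vocabulary")
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(cond_logits, uncond_logits, guidance_scale, rng):
    """Sample one token id from the guided distribution."""
    probs = softmax(guided_logits(cond_logits, uncond_logits, guidance_scale))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy logits (hypothetical values, not model output).
cond = [2.0, 0.5, 0.1]    # stream conditioned on the text prompt
uncond = [1.0, 1.0, 0.1]  # stream conditioned on padding tokens only

blended = guided_logits(cond, uncond, guidance_scale=3.0)
token = sample_next_token(cond, uncond, 3.0, random.Random(0))
print(blended, token)
```

With a scale above 1, tokens that the text prompt makes more likely gain probability mass relative to plain conditional sampling, which is the sense in which the unconditional sample is guided toward the conditional one.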
Meta reports state-of-the-art results for the model, based on in-depth comparisons with GLIDE, DALL-E, CogView and XMC-GAN using both human evaluation and numerical metrics.
(Image source: arxiv.org)
Furthermore, the model demonstrates new creative capabilities that stem from the enhanced controllability of Meta’s method.
In order to assess the effect of each new creative capability, a transformer with four billion parameters is used to generate a sequence of 256 text tokens, 256 scene tokens and 1024 image tokens. These tokens are then decoded to 256×256 or 512×512 pixel images.
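The token counts above imply some simple arithmetic, sketched here in Python. The counts and image sizes come from the figures quoted in this article; the idea that scene and image tokens form a square grid is an assumption made for illustration, and the helper names are hypothetical.

```python
import math

# Token counts quoted in the article.
TEXT_TOKENS = 256
SCENE_TOKENS = 256
IMAGE_TOKENS = 1024


def total_sequence_length():
    """Length of one text + scene + image token sequence."""
    return TEXT_TOKENS + SCENE_TOKENS + IMAGE_TOKENS


def grid_side(num_tokens):
    """Side of a square token grid (assumes the tokens tile a square)."""
    side = math.isqrt(num_tokens)
    if side * side != num_tokens:
        raise ValueError(f"{num_tokens} tokens do not form a square grid")
    return side


def pixels_per_token_side(image_size, num_tokens):
    """Width in pixels of the patch covered by one token."""
    side = grid_side(num_tokens)
    if image_size % side != 0:
        raise ValueError("image size is not a multiple of the grid side")
    return image_size // side


print(total_sequence_length())
print(grid_side(IMAGE_TOKENS), grid_side(SCENE_TOKENS))
print(pixels_per_token_side(256, IMAGE_TOKENS),
      pixels_per_token_side(512, IMAGE_TOKENS))
```

Under the square-grid assumption, the full sequence is 1,536 tokens long, the 1024 image tokens form a 32×32 grid, and each image token covers an 8×8 pixel patch at 256×256 or a 16×16 patch at 512×512.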
Not open source yet
To further research and development efforts, Meta allowed access to the demo version of ‘Make-a-scene’ for certain well-known artists experienced in using state-of-the-art generative AI models. This list of artists includes Sofia Crespo, Scott Eaton, Alexander Reben and Refik Anadol.
These artists then integrated the demo model into their own creative processes to provide feedback along with several captivating images.
(Image source: facebook.blog)
Sofia Crespo, an AI-artist who focuses on fusing technology with nature, used Make-a-scene’s sketch and text prompts to create a hybrid image of a jellyfish in the shape of a flower. She noted that the freeform drawing capabilities in the model helped bring her imagination onto the digital canvas at a much quicker pace.
“It’s going to help move creativity a lot faster and help artists work with interfaces that are more intuitive.”—Sofia Crespo
Another artist, Scott Eaton, a creative technologist and educator, used Make-a-scene to compose deliberately while exploring variations with different prompts.
“Make-a-scene provides a level of control that’s been missing in other SOTA generative AI systems. Text prompting alone is very constrained, often like wandering in the dark. Being able to control the composition is a powerful extension for artists and designers.”—Scott Eaton
Researcher and roboticist Alexander Reben took a different approach to testing the model. He used text prompts generated by another AI model, created a sketch to interpret the text and fed both the text and the sketch into the ‘Make-a-scene’ model.
“It made quite a difference to be able to sketch things in, especially to tell the system where you wanted things to give it suggestions of where things should go, but still be surprised at the end.”—Alexander Reben