Text-to-image generative models like OpenAI’s DALL-E 2 are attracting significant attention because of their ability to produce images from text prompts alone. While DALL-E 2 is the most popular, there are other budding AI image generators such as Midjourney, Craiyon, Meta’s ‘Make-A-Scene’ and Google’s ‘Imagen’.
Now, it seems that Microsoft also wants a share of the ‘AI image generator’ pie. Recently, Microsoft Research Asia introduced NUWA-Infinity, a multimodal generative model designed to generate high-quality images and videos from any given text, image or video input.
NUWA-Infinity
In its research paper titled ‘NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis’, Microsoft said that it evaluated NUWA-Infinity on five high-resolution visual synthesis tasks:
- Unconditional Image Generation
- Text-to-Image
- Text-to-Video
- Image Animation
- Image Outpainting
Compared to its predecessor ‘NUWA’, which also covers images and videos, NUWA-Infinity has superior visual synthesis capabilities in terms of resolution and variable-size generation.
Since NUWA-Infinity is focused on generating high-resolution images and long-duration videos, most existing datasets cannot be used for training or evaluation. Hence, the team built four new high-resolution datasets to train the model.
The team further revealed that they will pre-train the next version of NUWA-Infinity with more collected visual data and report its generalisation capabilities on open-domain inputs.
But the biggest draw is that NUWA-Infinity can generate videos from text. It can generate previously unseen videos from a simple prompt, and it can also generate videos from sketches. These open-domain videos are temporally consistent.
Furthermore, it can predict the next frames in a video. Given an input image, be it a landscape or a human face, NUWA-Infinity will predict the frames that follow.
Another striking aspect of NUWA-Infinity is that it can generate images with resolutions as high as 38912 × 2048. Higher resolution implies not only more detail but also wider views.
(Image source: Microsoft)
(Image source: Microsoft)
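To put the 38912 × 2048 figure in perspective, the short sketch below works out the pixel count and aspect ratio. The 1024 × 1024 square used for comparison is an assumption chosen here for illustration and does not come from Microsoft's paper; `describe_resolution` is a hypothetical helper name.

```python
from math import gcd


def describe_resolution(width: int, height: int) -> dict:
    """Return the pixel count, megapixels and reduced aspect ratio of an image size."""
    divisor = gcd(width, height)
    pixels = width * height
    return {
        "pixels": pixels,
        "megapixels": round(pixels / 1_000_000, 1),
        "aspect_ratio": (width // divisor, height // divisor),
    }


# Resolution quoted in the article for NUWA-Infinity.
nuwa = describe_resolution(38912, 2048)

# Assumed comparison size (not from the article): a 1024 x 1024 square image.
square = describe_resolution(1024, 1024)

# How many of the comparison squares fit into the panoramic image.
squares_covered = nuwa["pixels"] // square["pixels"]

print(nuwa)
print("Equivalent 1024 x 1024 squares:", squares_covered)
```

Under these assumptions, the image is a 19:1 panorama of roughly 79.7 megapixels, which is why higher resolution here translates into a much wider view and not only finer detail.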
How does it fare against its competitors?
Firstly, what sets NUWA-Infinity apart from its competitors is that it is designed to generate not only high-quality images but also videos from a given text, image, or video, something that none of its competitors is capable of.
“Compared to DALL-E 2, Imagen and MidJourney, NUWA-Infinity can generate high-resolution images with arbitrary sizes and support long-duration video generation”, says Microsoft.
DALL-E 2 generates an image embedding from an input text using either an autoregressive or a diffusion model, and then uses a diffusion model to produce the output image. Google’s Imagen uses a frozen large-scale pre-trained language model, ‘T5-XXL’, to encode each input text and uses two diffusion models to generate high-resolution images based on the text embeddings.
However, neither of these diffusion-based text-to-image generation methods can support arbitrarily sized image generation, as the size of the output images is fixed before training and inference.
NUWA-Infinity introduces an ‘autoregressive over autoregressive’ mechanism into the generation procedure, which enables it to generate variable-size images and videos, Microsoft explained.
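The idea of one autoregressive loop running over another can be illustrated with a toy sketch. This is not Microsoft's implementation: the "model" below is a random stand-in, and every name and constant (`next_token`, `generate_patch`, `generate`, `VOCAB_SIZE`, `TOKENS_PER_PATCH`) is hypothetical. It only shows the control flow, in which an outer loop produces patches one after another and an inner loop produces the tokens inside each patch, so the output size is decided at generation time.

```python
import random
from typing import List

VOCAB_SIZE = 16        # hypothetical size of the visual-token vocabulary
TOKENS_PER_PATCH = 4   # hypothetical number of tokens in one patch


def next_token(context: List[int], rng: random.Random) -> int:
    """Stand-in for a trained model that picks the next token from its context."""
    # A real model would predict a distribution from `context`; this toy only
    # mixes the last context token with noise so the example stays runnable.
    last = context[-1] if context else 0
    return (last + rng.randrange(VOCAB_SIZE)) % VOCAB_SIZE


def generate_patch(previous_patches: List[List[int]], rng: random.Random) -> List[int]:
    """Inner autoregressive loop: emit one patch token by token."""
    context = [token for patch in previous_patches for token in patch]
    patch: List[int] = []
    for _ in range(TOKENS_PER_PATCH):
        patch.append(next_token(context + patch, rng))
    return patch


def generate(num_patches: int, seed: int = 0) -> List[List[int]]:
    """Outer autoregressive loop: emit patches one by one, each conditioned on the last."""
    rng = random.Random(seed)
    patches: List[List[int]] = []
    for _ in range(num_patches):
        patches.append(generate_patch(patches, rng))
    return patches


# The same procedure yields outputs of different sizes; only the loop count changes.
small = generate(num_patches=3)
wide = generate(num_patches=12)
print(len(small), len(wide))
```

Because each patch depends only on the patches generated before it, asking for more patches extends the output without redoing the earlier ones. In this toy, the first three patches of `wide` match `small` when the same seed is used.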
NUWA-Infinity can extend images to create larger, higher-resolution versions. Microsoft demonstrated this by stretching Vincent van Gogh’s painting ‘The Starry Night’. The AI model is able to stretch the image without compromising its quality.
Original vs NUWA-Infinity generated
(Original artwork: Vincent van Gogh)
(Stretched image: NUWA-Infinity)
Furthermore, NUWA-Infinity is also capable of bringing static images to life with highly realistic results. It can turn an image into a video with eye-catching vividness.
(Still image)
(Moving image generated by NUWA-Infinity)
When it comes to public availability, AI models like DALL-E 2 and Midjourney are offered at different price points. NUWA-Infinity, however, is currently not available to the public; it is available only to selected individuals and for research purposes.
Google has decided against releasing Imagen to the public due to risks of misuse. Similarly, Meta’s Make-A-Scene would be open exclusively to specific AI artists.
The internet loves AI image generators
Recently, OpenAI, a company in which Microsoft has also invested, announced that it would start selling DALL-E 2 to a million people on its waiting list. Even prior to this, users who had access to DALL-E 2 were using the AI to generate creative images through prompts and were posting them on social media.
Most recently, a TikTok user used the prompt ‘selfie at the end of the world’ on DALL-E 2 and posted the results, creating a social media buzz. The results, however, could be unpleasant for some, as they have an apocalyptic feel.
DALL-E AI asked to create the "last selfie taken on Earth" and it’s rather unpleasant. pic.twitter.com/haKXGnlL0n
— Schrödinger's Witch (@rarebeverage) July 31, 2022
Max Woolf, Data Scientist at BuzzFeed, also took to Twitter recently to show off his experiment with DALL-E 2. Woolf used the prompt ‘Darth Vader wearing a tuxedo with his prom date in awkward prom photos’ and the results were fascinating, to say the least.
(Image source: Max Woolf)
Microsoft hopes that NUWA-Infinity will help visual content creators save time, cut costs, and increase productivity and creativity.