When OpenAI announced DALL-E in 2021, the internet fell in love with the text-to-image AI generator, and it helped bring AI into the mainstream. While its successor, DALL-E 2, is the most popular, other budding AI image generators such as Midjourney, Craiyon, and Imagen are also emerging.
Development in the text-to-video segment, however, has faced several hurdles.
The computation cost of text-to-video generation is far higher, which makes training from scratch nearly unaffordable. The lack of relevant datasets adds to the problem. However, researchers across the globe are now slowly breaking these barriers.
Let’s look at some of the most recent, noteworthy developments in this space.
Stable Diffusion teams up with Runway
Stable Diffusion is a new text-to-image generator launched in August 2022, and it is completely open source.
In an interview with Yannic Kilcher, Stability AI founder Emad Mostaque said, “DALL-E 2 was a fantastic experience, but Stable Diffusion is about 30 times more efficient and runs on a consumer graphics card for DALL-E 2 level image quality.”
He further added, “This model generates images in about three seconds on 5 gigabytes of VRAM whereas other image models require like 40 gigabytes or 20 gigabytes of VRAM, and they’re super slow.”
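To put the consumer-hardware claim in context, here is a minimal sketch of running Stable Diffusion locally. It assumes the Hugging Face diffusers library and the CompVis/stable-diffusion-v1-4 checkpoint, neither of which is specified in the article, and actual speed and memory use will vary with the GPU and settings.

import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline in half precision to keep VRAM usage low (an assumed setup,
# not necessarily the configuration Mostaque describes).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
pipe.enable_attention_slicing()  # smaller memory footprint at a slight speed cost

# Generate a single 512x512 image from a text prompt and save it.
image = pipe("a man playing tennis, cinematic lighting").images[0]
image.save("tennis.png")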
However, what caught everyone’s attention was a tweet by Patrick Esser, a research scientist at Runway, announcing that Stable Diffusion would soon be coming to Runway for text-to-video editing.
Esser attached a two-minute clip where different prompts were used to generate videos of a man playing tennis, and that was more than enough to create a social media buzz.
#stablediffusion text-to-image checkpoints are now available for research purposes upon request at https://t.co/7SFUVKoUdl

Working on a more permissive release & inpainting checkpoints.

Soon™ coming to @runwayml for text-to-video-editing pic.twitter.com/7XVKydxTeD

— Patrick Esser (@pess_r) August 11, 2022
While these announcements have certainly received a lot of attention, we will still have to wait and see what the team actually delivers.
DeepMind’s ‘Transframer’ can generate coherent 30-second videos
Recently, DeepMind announced ‘Transframer’, a new model that unifies a broad range of tasks, from image segmentation and view synthesis to video interpolation.
Transframer is a general-purpose generative framework that can handle many image and video tasks in a probabilistic setting. New work shows it excels in video prediction and view synthesis, and can generate 30s videos from a single image: https://t.co/wX3nrrYEEa 1/ pic.twitter.com/gQk6f9nZyg
— DeepMind (@DeepMind) August 15, 2022
“Transframer is state-of-the-art on a variety of video generation benchmarks, is competitive with the strongest models on few-shot view synthesis, and can generate coherent 30-second videos from a single image without any explicit geometric information,” the researchers explained in a blog post.

Microsoft’s ‘NUWA-Infinity’ can generate high-quality videos from any given prompt
In July 2022, Microsoft Research Asia introduced ‘NUWA-Infinity’, a multimodal generative model designed to generate high-quality images and videos from any given text, image, or video input.
Along with text-to-image generation, NUWA-Infinity can generate unseen videos from simple text prompts, as well as videos from sketches. It is also capable of generating temporally consistent, open-domain videos.
NUWA-Infinity, like others of its kind, is currently unavailable to the public. It is, however, available to select individuals for research purposes.
‘CogVideo’ is the largest AI text-to-video generator
Released earlier in 2022, ‘CogVideo’ is possibly the first and largest open-source AI text-to-video generator. The model can generate high-resolution (480×480) videos.
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

github: https://t.co/1JuOHU7puc pic.twitter.com/Wilcq2Xxb9

— AK (@_akhaliq) May 29, 2022
“Here, we present a large-scale pre-trained text-to-video generative model, CogVideo, which is of 9.4 billion parameters and trained on 5.4 million text-video pairs. We build CogVideo based on a pre-trained text-to-image model, CogView2, in order to inherit the knowledge learned from the text-image pre-training,” the research paper stated.
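One way to picture how a text-to-image backbone can be reused for video is to generate sparse key frames from the prompt and then fill in the frames between them. The sketch below is a simplified illustration of that idea only, not CogVideo’s actual pipeline or API: generate_keyframe and interpolate_frames are hypothetical placeholders standing in for the pre-trained backbone and a learned interpolation stage.

import numpy as np

def generate_keyframe(prompt: str, rng: np.random.Generator) -> np.ndarray:
    # Hypothetical stand-in for a pre-trained text-to-image backbone;
    # here it just returns random 480x480 RGB pixels.
    return rng.random((480, 480, 3))

def interpolate_frames(a: np.ndarray, b: np.ndarray, n: int) -> list:
    # Hypothetical stand-in for a learned frame-interpolation stage;
    # here, a naive linear blend between two key frames.
    return [(1 - t) * a + t * b for t in np.linspace(0.0, 1.0, n + 2)[1:-1]]

def text_to_video(prompt: str, num_keyframes: int = 5, fill: int = 7) -> list:
    # Toy pipeline: sparse key frames from the prompt, then intermediate frames in between.
    rng = np.random.default_rng(0)
    keyframes = [generate_keyframe(prompt, rng) for _ in range(num_keyframes)]
    video = []
    for a, b in zip(keyframes, keyframes[1:]):
        video.append(a)
        video.extend(interpolate_frames(a, b, fill))
    video.append(keyframes[-1])
    return video

frames = text_to_video("a man playing tennis")
print(len(frames), frames[0].shape)  # 33 frames of 480x480 RGB in this toy setup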

Roadblocks
A major roadblock besides high computing costs is the lack of accurate datasets. In fact, VATEX, the largest multilingual video description dataset, contains only about 41,250 videos and 825,000 captions.
The VATEX dataset contains videos in English as well as Chinese, whereas most of the other datasets are only available in English.
The researchers for CogVideo noted, “The scarcity and weak relevance of text-video datasets hinder the model understanding complex movement semantics.” Therefore, the team built the model on top of a pre-trained text-to-image model, ‘CogView2’, to inherit its knowledge.
Similarly, the NUWA-Infinity team also concluded that most existing datasets could not be used for training or evaluation. Hence, they developed four new high-resolution datasets to train their model.
What we think
The recent developments in generative AI imagery and video are exciting. Earlier in 2022, the US-based food processing company ‘Heinz’ used DALL-E for its ‘Draw Ketchup’ campaign.
Likewise, the US-based fashion and entertainment magazine ‘Cosmopolitan’ used DALL-E 2 to design one of its magazine covers.
When it comes to image-to-video generation, these AI models will also have implications, especially in visual effects and CGI. With time, they are only going to become more sophisticated, and we are likely to see far superior AI text-to-image and text-to-video generators.
However, this also raises concerns similar to those associated with deepfakes. While the development of these AI models is encouraging, there should also be measures in place to counter their misuse.