When AI research lab OpenAI introduced DALL-E, its pioneering text-to-image (TTI) model, it took the internet by storm. Since then, several big players have invested in this space with versions of their own, and 'generative AI' has become a common term in the AI ecosystem. Further, as users kept experimenting with TTI, we got text-to-video, text-to-3D, and now text-to-music.
Let’s take a look at the top nine text-to-video creators and how they came about!
Tune-A-Video
To solve the problem of one-shot video generation, where only a single text–video pair is available for training an open-domain text-to-video (TTV) generator, researchers from Show Lab at the National University of Singapore built Tune-A-Video. It leverages pre-trained text-to-image (TTI) diffusion models and extends their spatial self-attention to the spatiotemporal domain using a customised sparse-causal attention mechanism.
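For intuition, here is a minimal, hypothetical sketch of the sparse-causal attention idea: each frame's queries attend only to keys and values drawn from the first frame and the previous frame. The projection layers (`to_q`, `to_k`, `to_v`) are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sparse_causal_attention(x, to_q, to_k, to_v):
    """x: (frames, tokens, dim); to_q / to_k / to_v: linear projections (e.g. torch.nn.Linear)."""
    frames, tokens, dim = x.shape
    out = []
    for i in range(frames):
        q = to_q(x[i])                                   # queries come from the current frame
        kv_source = torch.cat([x[0], x[max(i - 1, 0)]])  # keys/values only from first + previous frame
        k, v = to_k(kv_source), to_v(kv_source)
        attn = F.softmax(q @ k.T / dim ** 0.5, dim=-1)   # scaled dot-product attention
        out.append(attn @ v)
    return torch.stack(out)
```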
"Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation is out. github: https://t.co/pqclO38LSv" — AK (@_akhaliq), January 29, 2023
NUWA-Infinity
Microsoft-developed NUWA-Infinity is a multimodal generative model that can create high-resolution images or long-duration videos of arbitrary size from text, image, or video inputs. It generates open-domain videos through its 'autoregressive over autoregressive generation' mechanism, which handles variable-size generation by modelling dependencies between patches at the global level and dependencies between visual tokens within each patch at the local level.
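As a rough illustration of the 'autoregressive over autoregressive' idea, the sketch below uses hypothetical `patch_model` and `token_model` stand-ins: an outer loop generates patches in order, and an inner loop generates the visual tokens inside each patch.

```python
# Conceptual sketch only: an outer, patch-level autoregression wrapped around an
# inner, token-level autoregression. Both models are hypothetical placeholders.
def generate_canvas(num_patches, tokens_per_patch, patch_model, token_model, prompt):
    patches = []
    for _ in range(num_patches):                       # global, patch-level autoregression
        context = patch_model(prompt, patches)         # summarise previously generated patches
        tokens = []
        for _ in range(tokens_per_patch):              # local, token-level autoregression
            next_token = token_model(context, tokens)  # next visual token given patch context
            tokens.append(next_token)
        patches.append(tokens)
    return patches
```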
"NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis. Compared to DALL·E, Imagen and Parti, it generates high-resolution images of arbitrary sizes and supports long-duration video generation. abs: https://t.co/vZz48wA5z2 | project page: https://t.co/PtzdAq2nvN" — AK (@_akhaliq), July 21, 2022
GenmoAI
GenmoAI is a tool that generates media from text input. It can produce a variety of outputs, such as videos, animations, vector graphics, and more. Its advanced video capabilities include a 'dynamism' feature that adjusts the amount of noise added between frames, and the tool also lets you chain multiple text prompts together to form a story-like narrative.
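The snippet below is a purely illustrative toy, not Genmo's implementation, showing how a 'dynamism'-style knob could scale the noise mixed into each successive latent so that higher values produce larger frame-to-frame change.

```python
import numpy as np

def latent_walk(start_latent, num_frames, dynamism, rng=np.random.default_rng(0)):
    """start_latent: initial latent vector; dynamism in [0, 1] controls per-frame noise."""
    frames = [start_latent]
    for _ in range(num_frames - 1):
        noise = rng.standard_normal(start_latent.shape)
        frames.append(frames[-1] + dynamism * noise)  # more noise -> more motion between frames
    return np.stack(frames)
```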
"Dynamic Art Forms: Unlike other image generation tools, Genmo goes beyond traditional 2D images to allow you to create videos, animations, vector design assets, and more. One-stop resource with no limitations!" — Shubham Saboo (@Saboo_Shubham_), February 1, 2023
Make-A-Video
Meta released Make-A-Video, a TTV model that generates high-definition, high frame-rate videos by combining a TTI model with spatiotemporally factorised diffusion. The team also paired text with image information to remove the requirement for paired text–video data, thereby opening the door to training the tool on a far larger quantity of video content.
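One common way to factorise spatiotemporal computation, sketched below as an assumption rather than Meta's actual layers, is a pseudo-3D block: a 2D spatial convolution applied per frame followed by a 1D temporal convolution applied across frames.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Illustrative factorised block: spatial mixing per frame, then temporal mixing per pixel."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                      # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = self.spatial(x)                    # 2D convolution within each frame
        x = x.reshape(b, f, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        x = self.temporal(x)                   # 1D convolution across frames
        x = x.reshape(b, h, w, c, f).permute(0, 3, 4, 1, 2)
        return x                               # back to (batch, channels, frames, height, width)
```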
"We're pleased to introduce Make-A-Video, our latest in #GenerativeAI research! With just a few words, this state-of-the-art AI system generates high-quality videos from text prompts. Have an idea you want to see? Reply w/ your prompt using #MetaAI and we'll share more results." — Meta AI (@MetaAI), September 29, 2022
Make-A-Video 3D (MAV3D)
Building on Meta's Make-A-Video method for 2D generation, the company has developed MAV3D (Make-A-Video3D) for generating 3D dynamic scenes from text descriptions. The new model optimises a 4D dynamic Neural Radiance Field (NeRF) for scene appearance, density, and motion consistency. The text-to-video (TTV) model used in the process is trained only on text–image pairs and unlabelled videos.
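A hedged sketch of the general recipe follows, with hypothetical `render_clip`, `video_loss`, and `sample_view` stand-ins: render short clips from the dynamic NeRF and use a frozen text-to-video model as a critic whose loss updates the NeRF's parameters.

```python
import torch

def optimise_scene(dynamic_nerf, render_clip, video_loss, sample_view, prompt, steps=1000):
    """All callables here are hypothetical stand-ins for the components described above."""
    optimiser = torch.optim.Adam(dynamic_nerf.parameters(), lr=1e-3)
    for _ in range(steps):
        camera, times = sample_view()                    # random viewpoint and time window
        clip = render_clip(dynamic_nerf, camera, times)  # differentiable render of a short clip
        loss = video_loss(clip, prompt)                  # score the clip against the text prompt
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    return dynamic_nerf
```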
"Text-To-4D Dynamic Scene Generation. Presents MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. proj: https://t.co/KSq8okuWPJ | abs: https://t.co/qbAlZYOFTr" — Aran Komatsuzaki (@arankomatsuzaki), January 27, 2023
CogVideo
CogVideo, built by researchers at Tsinghua University, is a pre-trained transformer model for generating high-resolution (480×480) videos from text. It claims to be the largest and the first open-source model of its kind, with 9.4 billion parameters, and it allows control over the intensity of changes during video generation.
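The control over change intensity can be pictured as conditioning the autoregressive token generator on a frame-rate indicator; the sketch below is a hypothetical simplification, with `token_model` standing in for the transformer.

```python
# Illustrative only: lower frame rates pack more motion between consecutive generated frames.
def generate_video_tokens(token_model, text_tokens, frame_rate_token, num_tokens):
    tokens = [frame_rate_token]                        # conditioning token controlling change per frame
    for _ in range(num_tokens):
        next_token = token_model(text_tokens, tokens)  # next visual token given text + history
        tokens.append(next_token)
    return tokens[1:]
```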
"CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. github: https://t.co/1JuOHU7puc" — AK (@_akhaliq), May 29, 2022
Imagen Video
Google entered the TTV space with Imagen Video, a cutting-edge video synthesis model that generates high-quality videos (1280×768 at 24 frames per second) from written prompts. Imagen Video first turns a text prompt into a low-resolution base video (16 frames at 24×48 pixels, 3 fps), which a cascade of super-resolution models then upscales. Its capabilities include creating videos in the style of well-known artwork, rotating 3D objects while preserving their structure, and animating text.
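A minimal sketch of such a cascaded pipeline, assuming hypothetical model objects rather than Google's API: a base model produces a short, low-resolution clip, which alternating spatial and temporal super-resolution stages then upscale.

```python
def cascade_generate(base_model, spatial_sr_models, temporal_sr_models, prompt):
    """All model objects are hypothetical stand-ins for the stages described above."""
    video = base_model(prompt)                 # e.g. a 16-frame clip at 24x48 pixels, 3 fps
    for spatial_sr, temporal_sr in zip(spatial_sr_models, temporal_sr_models):
        video = spatial_sr(video, prompt)      # increase per-frame resolution
        video = temporal_sr(video, prompt)     # increase frame rate / frame count
    return video                               # high-resolution, high frame-rate video after the final stage
```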
"Meet Imagen Video and Phenaki, two research approaches for text-to-video generation. By combining diffusion & sequence learning techniques, we can generate videos that are super-res at the frame level and coherent in time. https://t.co/O7gGzb9knW | https://t.co/Uc0krTyTvk" — Google AI (@GoogleAI), November 2, 2022
Phenaki
Phenaki is a TTV model that synthesises realistic videos from a series of text prompts. It compresses the video into a compact representation of discrete tokens; because the tokenizer is causal in time, it can handle video representations of variable length. Video tokens are generated by a bidirectional masked transformer conditioned on pre-computed text tokens, and are then de-tokenised to produce the actual video. Phenaki can make arbitrarily long videos from a series of open-domain prompts, such as time-variable text or stories, and its authors explore the creation of entire films from such prompts.
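Putting the pieces together, here is a hedged sketch of the pipeline, with all component names as hypothetical stand-ins: encode the prompt, iteratively fill in masked video tokens with the bidirectional transformer, and de-tokenise the result into frames.

```python
MASK = -1  # placeholder id for a masked video token

def generate_clip(text_encoder, masked_transformer, detokenizer, prompt, num_tokens, steps=12):
    """All components here are hypothetical stand-ins for the parts described above."""
    text_tokens = text_encoder(prompt)
    video_tokens = [MASK] * num_tokens                                 # start from an all-masked clip
    for _ in range(steps):                                             # iterative parallel decoding
        video_tokens = masked_transformer(text_tokens, video_tokens)   # fill in / refine masked tokens
    return detokenizer(video_tokens)                                   # discrete tokens -> video frames
```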
"Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions. Generating videos from text, with prompts that can change over time, and videos that can be as long as multiple minutes. abs: https://t.co/gsZrW80Aax | project page: https://t.co/mIzxeMRKk8" — AK (@_akhaliq), September 29, 2022
Transframer
Google's DeepMind introduced Transframer, a unified framework for image modelling and vision tasks based on probabilistic frame prediction, covering tasks such as video interpolation, view synthesis, and image segmentation. The framework demonstrates the potential of probabilistic image models in multi-task computer vision, outperforming comparable models on video-generation benchmarks and achieving strong results on eight tasks, including semantic segmentation and image classification. Built from U-Net and Transformer components, Transframer can create coherent 30-second videos from a single image.
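That 30-second-video capability can be understood as iterative frame prediction; the sketch below, with a hypothetical `frame_model` standing in for the U-Net and Transformer predictor, rolls a video out from a single image by repeatedly sampling the next frame conditioned on recent frames.

```python
def rollout_video(frame_model, first_frame, num_frames, context_size=4):
    """frame_model is a hypothetical stand-in for a probabilistic next-frame predictor."""
    frames = [first_frame]
    for _ in range(num_frames - 1):
        context = frames[-context_size:]      # most recent frames serve as conditioning context
        frames.append(frame_model(context))   # sample the next frame from the predictive model
    return frames
```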
"Transframer is a general-purpose generative framework that can handle many image and video tasks in a probabilistic setting. New work shows it excels in video prediction and view synthesis, and can generate 30s videos from a single image: https://t.co/wX3nrrYEEa" — DeepMind (@DeepMind), August 15, 2022
Due to high processing costs, the scarcity of high-quality text–video data, and the unpredictable length of videos, creating videos from text remains challenging, even as tech firms actively work towards improving it.
From healthcare to gaming, generative AI has the potential to revamp and enhance different sectors and aspects of daily life by automating repetitive, time-consuming tasks and offering new, far more personalised solutions.