When AI research lab OpenAI introduced DALL-E, the first text-to-image (TTI) model of its kind, it took the internet by storm. Since then, several big players have invested in this space, coming up with their own versions, and 'generative AI' became a common term in the AI ecosystem. Further, as users kept experimenting with TTI, we got text-to-video, text-to-3D, and now text-to-music.
Let’s take a look at the top nine text-to-video creators and how they came about!
To solve the problem of one-shot video generation, where only a single text–video pair is available for training an open-domain text-to-video (TTV) generator, researchers from Show Lab at the National University of Singapore built Tune-A-Video. Leveraging pre-trained text-to-image (TTI) diffusion models, Tune-A-Video extends spatial self-attention to the spatiotemporal domain through a customised sparse-causal attention mechanism.
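The sparse-causal idea can be pictured like this: instead of letting every frame attend to every other frame, each frame's queries attend only to the keys and values of the first frame and the frame immediately before it. The NumPy sketch below is a toy illustration of that attention pattern only; the shapes, weights, and function names are illustrative and not Tune-A-Video's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_causal_attention(frames_qkv):
    """frames_qkv: one (Q, K, V) triple per frame, each of shape (tokens, dim).

    Frame i attends only to the keys/values of frame 0 (the anchor frame)
    and frame i-1 (its immediate predecessor), not to all frames."""
    outputs = []
    for i, (q, _, _) in enumerate(frames_qkv):
        # Gather keys/values from the first and the previous frame only.
        sources = [0] if i == 0 else sorted({0, i - 1})
        k = np.concatenate([frames_qkv[j][1] for j in sources], axis=0)
        v = np.concatenate([frames_qkv[j][2] for j in sources], axis=0)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        outputs.append(attn @ v)
    return outputs

rng = np.random.default_rng(0)
tokens, dim, n_frames = 4, 8, 5
frames = [tuple(rng.normal(size=(tokens, dim)) for _ in range(3))
          for _ in range(n_frames)]
out = sparse_causal_attention(frames)
print(len(out), out[0].shape)  # 5 (4, 8)
```

The attention cost stays constant per frame regardless of video length, which is what makes extending a spatial TTI model to video tractable.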
NUWA-Infinity, developed by Microsoft, is a multimodal generative model that can create high-resolution images or long-duration videos of arbitrary size from text, image, or video inputs. It generates open-domain videos through its 'autoregressive over autoregressive generation' mechanism, which handles the variable-size generation task by modelling dependencies between patches at a global patch level and dependencies between visual tokens within each patch at a local token level.
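The 'autoregressive over autoregressive' mechanism can be pictured as two nested generation loops: an outer loop that emits patches conditioned on all earlier patches, and an inner loop that emits the visual tokens of each patch conditioned on the tokens already emitted. The sketch below illustrates only that control flow; the `next_token` function is a deterministic toy stand-in, not NUWA-Infinity's real model.

```python
def next_token(global_ctx, local_ctx):
    """Toy stand-in for the real transformer: derives the next visual token
    from the sizes of the global (patch-level) and local (token-level)
    contexts. The actual model attends over both levels of context."""
    return (7 * len(global_ctx) + 3 * len(local_ctx)) % 256

def generate_patch(previous_patches, patch_len=4):
    """Local autoregressive loop: emit one patch's tokens in order, each
    conditioned on earlier patches and earlier tokens in this patch."""
    patch_tokens = []
    for _ in range(patch_len):
        patch_tokens.append(next_token(previous_patches, patch_tokens))
    return patch_tokens

def generate_canvas(n_patches=3):
    """Global autoregressive loop: emit patches one after another, each
    conditioned on all previously generated patches."""
    patches = []
    for _ in range(n_patches):
        patches.append(generate_patch(patches))
    return patches

canvas = generate_canvas(3)
print(canvas)  # [[0, 3, 6, 9], [7, 10, 13, 16], [14, 17, 20, 23]]
```

Because the outer loop can keep appending patches indefinitely, the same scheme covers images and videos of arbitrary size.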
GenmoAI is a tool that turns text input into a variety of outputs, such as videos, animations, vector graphics, and more. Its advanced video capabilities include a dynamism feature that adjusts the amount of noise added between frames. Additionally, the tool lets you chain multiple text prompts together to form a story-like narrative.
Meta released Make-A-Video, a TTV model that generates high-definition, high frame-rate videos by pairing a TTI model with spatiotemporally factorised diffusion models. In addition, the team behind Make-A-Video combined text with image information to remove the requirement for paired text–video data, thereby opening the door to training the tool on a far larger quantity of video content.
Make-A-Video 3D (MAV3D)
Building on Meta's Make-A-Video method for 2D generation, Meta developed MAV3D (Make-A-Video3D) for generating 3D dynamic scenes from text descriptions. The model uses a 4D dynamic Neural Radiance Field (NeRF) optimised for scene appearance, density, and motion consistency. The text-to-video (TTV) model used in the process is trained only on text–image pairs and unlabelled videos.
CogVideo, available on Hugging Face, is a pre-trained transformer model for generating high-resolution (480×480) videos from text. With 9.4 billion parameters, it claims to be the largest and first open-source model of its kind. CogVideo also allows control over the intensity of changes during video generation.
Google entered the TTV space with Imagen Video, a cutting-edge video synthesis model that generates high-quality videos (1280×768 at 24 frames per second) from written prompts. Imagen Video first transforms a text prompt into a low-resolution base video (16 frames at 24×48 pixels, three fps), with impressive capabilities that include rendering videos in the style of well-known artwork, rotating 3D objects while maintaining their structure, and animating text.
Phenaki is a TTV model that uses a series of text prompts to synthesise realistic video. It compresses the video into a compact representation of discrete tokens, and because this tokenizer is causal in time, it can handle token representations of videos of varying length. A bidirectional masked transformer conditioned on pre-computed text tokens generates the video tokens, which are then de-tokenized to obtain the actual video. Phenaki can make arbitrarily long videos from a series of open-domain prompts, such as time-variable text or stories, and its creators claim to explore the creation of films using such time-variable prompts.
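The tokenize/de-tokenize step can be illustrated with a toy vector-quantisation sketch: each frame vector is mapped to its nearest entry in a discrete codebook, and de-tokenization is simply a codebook lookup. This shows only the quantisation idea; Phenaki's actual causal video tokenizer and the masked transformer that predicts tokens from text are omitted, and the codebook size and shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 8))  # 32 discrete codes, 8 dims each (toy values)

def tokenize(video):
    """Map each frame vector to its nearest codebook entry (a discrete token).
    Running frame by frame keeps the scheme causal in time, so videos of any
    length yield token sequences of matching length."""
    dists = ((video[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def detokenize(tokens):
    """Look the discrete tokens back up in the codebook to recover frames."""
    return codebook[tokens]

video = rng.normal(size=(6, 8))   # 6 "frames" of 8 features each
tokens = tokenize(video)          # one discrete token per frame
recon = detokenize(tokens)
print(tokens.shape, recon.shape)  # (6,) (6, 8)
```

Generating in this discrete token space, rather than in pixel space, is what makes producing long videos tractable for the transformer.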
Google's DeepMind introduced Transframer, a unified framework for image modelling and vision tasks that uses probabilistic frame prediction for tasks such as video interpolation, view synthesis, and image segmentation. The framework demonstrates the potential of probabilistic image models in multi-task computer vision, outperforming other models on video generation benchmarks and achieving strong results on eight tasks, including semantic segmentation and image classification. Built from U-Net and Transformer components, Transframer can create coherent 30-second videos from a single image.
Due to high processing costs, the scarcity of high-quality text–video data, and the unpredictable length of videos, creating videos from text remains challenging, even though tech firms are actively working to improve it.
From healthcare to gaming, generative AI has the potential to revamp and enhance different sectors and aspects of daily life by automating repetitive, time-consuming tasks, offering innovative solutions, and making services far more personalised.