
6 Text-to-Video Generative AI Models 

Over two years, text-to-video AI models have evolved from producing noisy clips to hyper-realistic results.


Soon after DALL-E gave rise to text-to-image AI, companies went a step further and started building text-to-video models. Over two years, the landscape has evolved from noisy outputs to hyper-realistic results generated from text prompts.

While the results may still be imperfect, several models today display a high degree of controllability and the ability to generate footage in various artistic styles. 

Here are six of the latest text-to-video AI models you could try out. 

Sora

ChatGPT creator OpenAI just showcased Sora, their new text-to-video model. Everyone’s excited since the model has “a deep understanding of language” and can generate “compelling characters that express vibrant emotions”. People on social media are flipping out over how realistic the videos look, calling it a total game-changer.

But before releasing it to the public, the AI startup is taking measures to ensure safety. They also admit that Sora has some hiccups, like struggling to keep scenes physically consistent and to tell left from right. [Sam Altman Brings CRED Founder Kunal Shah’s Wild Imagination to Life with Sora]

Click here to know more. 

Lumiere

Google’s got this video generation AI called Lumiere, powered by a new diffusion model known as Space-Time U-Net, or STUNet for short. According to Ars Technica, Lumiere doesn’t mess around with stitching together still frames; instead, it figures out where things are in a video (that’s the space part) and tracks how they move and change at the same time (that’s the time part). 

It’s like one smooth process, no need for puzzle pieces.
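For the curious, here is a rough, illustrative sketch of that space-plus-time idea: a tiny PyTorch module that downsamples a clip across frames and pixels at once, then upsamples it back. This is not Lumiere’s actual architecture (which isn’t public); the layer sizes and names below are made up purely to show the concept.

```python
# A toy illustration of a "space-time" U-Net: the clip is downsampled across
# frames (time) and pixels (space) in one shot, processed at that coarse
# resolution, and upsampled back. This is NOT Lumiere's code (the model is
# not public); it only shows the idea of handling a whole clip at once
# instead of stitching still frames together.
import torch
import torch.nn as nn

class TinySpaceTimeUNet(nn.Module):
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        # Stride 2 on (frames, height, width): time and space shrink together.
        self.down = nn.Conv3d(channels, hidden, kernel_size=3, stride=2, padding=1)
        self.mid = nn.Conv3d(hidden, hidden, kernel_size=3, padding=1)
        # Transposed conv restores the full duration and resolution.
        self.up = nn.ConvTranspose3d(hidden, channels, kernel_size=4, stride=2, padding=1)

    def forward(self, video):                 # video: (batch, channels, frames, height, width)
        h = torch.relu(self.down(video))      # coarse in both space and time
        h = torch.relu(self.mid(h))
        return self.up(h)                     # back to the original clip shape

clip = torch.randn(1, 3, 16, 64, 64)          # a dummy 16-frame, 64x64 clip
print(TinySpaceTimeUNet()(clip).shape)        # torch.Size([1, 3, 16, 64, 64])
```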

Lumiere is not yet available for the public to play around with. But it hints at Google’s knack for building an AI video powerhouse that could outshine generally available models like Runway and Pika. Within two years, Google has made a significant leap in AI video generation.

Click here to know more.

VideoPoet

VideoPoet is a large language model trained on a colossal dataset of videos, images, audio, and text. This model can pull off a variety of video generation tasks, from turning text or images into videos to video stylisation, video inpainting and outpainting, and video-to-audio generation.

The model is built on a straightforward idea: convert any autoregressive language model into a video-generating system. Autoregressive language models can crank out text and code like nobody’s business. But they hit a roadblock when it comes to video. 

To tackle that, VideoPoet rolls with multiple tokenisers that can turn video, image, and audio clips into a language it understands.
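Here’s a toy illustration of that idea in PyTorch: a clip is chopped into patches, each patch is snapped to its nearest entry in a codebook to get a discrete token id, and the resulting token sequence is handed to an ordinary next-token predictor. The codebook lookup and the tiny language model below are stand-ins, not VideoPoet’s actual tokenisers or transformer.

```python
# A toy illustration of tokenising a video so a plain autoregressive language
# model can handle it. The nearest-neighbour codebook stands in for VideoPoet's
# learned tokenisers, and the tiny "language model" is a placeholder.
import torch
import torch.nn as nn

VOCAB = 1024            # size of the visual "vocabulary"
PATCH = 8               # 8x8 spatial patches; real tokenisers also compress time

def video_to_tokens(video, codebook):
    """Split a clip into patches and map each patch to its nearest code id."""
    b, c, t, h, w = video.shape
    p = video.reshape(b, c, t, h // PATCH, PATCH, w // PATCH, PATCH)
    p = p.permute(0, 2, 3, 5, 1, 4, 6).reshape(b, -1, c * PATCH * PATCH)
    dists = torch.cdist(p, codebook.unsqueeze(0))   # distance to every code vector
    return dists.argmin(dim=-1)                     # (batch, num_tokens) of token ids

codebook = torch.randn(VOCAB, 3 * PATCH * PATCH)    # stand-in for a learned codebook
clip = torch.randn(1, 3, 4, 32, 32)                 # dummy 4-frame, 32x32 clip
tokens = video_to_tokens(clip, codebook)

# Once everything is token ids, any next-token predictor can model it,
# exactly as it would model text or code.
lm = nn.Sequential(nn.Embedding(VOCAB, 128), nn.Linear(128, VOCAB))
next_token_logits = lm(tokens)[:, -1]
print(tokens.shape, next_token_logits.shape)        # (1, 64) and (1, 1024)
```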

Click here to know more.

Emu Video

Emu Video, Meta’s AI model, generates video in two steps. First, it makes a picture from text. Then, it uses that text and image to create a high-quality video. The researchers achieved this by optimising noise schedules for diffusion and using multi-stage training. 
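Emu Video itself isn’t publicly available, but the same factorised, two-stage recipe can be roughed out with open models from Hugging Face’s diffusers library, as in the sketch below. The model names are stand-ins, and unlike Emu Video the second stage here conditions only on the image rather than on the text and image together.

```python
# A sketch of the two-stage, factorised recipe using open models from
# Hugging Face's diffusers library. Emu Video itself is not released; the
# models below are stand-ins, and this second stage conditions only on the
# image, not on the text prompt as well.
import torch
from diffusers import StableDiffusionPipeline, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

prompt = "a red fox running through fresh snow at sunrise"

# Stage 1: text -> image
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = t2i(prompt).images[0]

# Stage 2: image -> short video clip
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")
frames = i2v(image.resize((1024, 576)), decode_chunk_size=8).frames[0]

export_to_video(frames, "fox.mp4", fps=7)
```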

In human evaluations, 81% of evaluators preferred it over Google’s Imagen Video, 90% picked it over NVIDIA’s PYoCo, and 96% said it was better than Meta’s own Make-A-Video. Not just that, it even beats commercial options like RunwayML’s Gen-2 and Pika Labs.

Notably, their factorised approach is well suited for animating images based on user text prompts, with evaluators preferring it over prior work 96% of the time.

Click here to know more. 

Phenaki

The team behind Phenaki Video used MaskGIT to produce text-guided videos in PyTorch. The model can generate videos guided by text prompts that run up to two minutes long. 

Instead of just trusting the predicted probabilities, the paper suggests a tweak: bringing in an extra critic that decides what to mask at each sampling iteration. This helps determine which parts to focus on during the video-making process. It’s like having a second opinion.
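Here’s a rough PyTorch sketch of what that critic-guided, MaskGIT-style sampling loop could look like. The tiny generator and critic are untrained placeholders and the masking schedule is assumed; this is not the phenaki-pytorch API, just the shape of the idea.

```python
# Critic-guided MaskGIT-style sampling, sketched: at each step the generator
# fills in every masked token, then a separate critic scores how plausible
# each filled token looks, and the least plausible ones are masked again for
# the next round. The generator/critic here are toy placeholders.
import math
import torch
import torch.nn as nn

VOCAB, SEQ, DIM, MASK_ID = 1024, 64, 128, 1024   # one extra id reserved for [MASK]

generator = nn.Sequential(nn.Embedding(VOCAB + 1, DIM), nn.Linear(DIM, VOCAB))
critic = nn.Sequential(nn.Embedding(VOCAB + 1, DIM), nn.Linear(DIM, 1))

@torch.no_grad()
def sample(steps=8):
    tokens = torch.full((1, SEQ), MASK_ID, dtype=torch.long)   # start fully masked
    for step in range(steps):
        logits = generator(tokens)                             # predict every position
        sampled = torch.distributions.Categorical(logits=logits).sample()
        tokens = torch.where(tokens == MASK_ID, sampled, tokens)

        # Cosine schedule: re-mask fewer tokens as sampling progresses.
        n_mask = int(SEQ * math.cos(math.pi / 2 * (step + 1) / steps))
        if n_mask == 0:
            break
        # The critic, not the generator's own probabilities, decides what to redo.
        scores = critic(tokens).squeeze(-1)                    # higher = more plausible
        remask = scores.topk(n_mask, largest=False).indices    # least plausible positions
        tokens.scatter_(1, remask, MASK_ID)
    return tokens

video_tokens = sample()
print(video_tokens.shape)                                      # torch.Size([1, 64])
```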

The model is versatile and open for researchers to train on both text-to-image and text-to-video. They can start with images and then fine-tune on video, with unconditional training also supported.

Click here to know more. 

CogVideo

A group of researchers from Tsinghua University in Beijing developed CogVideo, a large-scale pretrained text-to-video generative model. They built it on top of a pre-trained text-to-image model called CogView2 to exploit the knowledge it had learned during pre-training. 

Now, computer artist Glenn Marshall tried it out. He was so impressed that he said directors might lose their jobs to this technology. The short film he made with CogVideo, ‘The Crow’, performed well and was even in the running for the BAFTA Awards.

Click here to know more.


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.