
6 Text-to-Video Generative AI Models 

Over two years, text-to-video AI models have evolved from producing noisy clips to hyper-realistic results.


Soon after DALL-E gave rise to text-to-image AI, companies went a step further and started building text-to-video models. Over two years, the landscape has evolved from noisy outputs to hyper-realistic results generated from text prompts.

While the results may still be imperfect, several models today display a high degree of controllability and the ability to generate footage in various artistic styles. 

Here are six of the latest text-to-video AI models you could try out. 

Sora

ChatGPT creator OpenAI just showcased Sora, their new text-to-video model. Everyone’s excited since the model has “a deep understanding of language” and can generate “compelling characters that express vibrant emotions”. People on social media are flipping out over how realistic the videos look, calling it a total game-changer.

But before releasing it to the public, the AI startup is taking measures to ensure safety. They also admit that Sora has some hiccups, like struggling to keep scenes physically consistent and to tell left from right. [Sam Altman Brings CRED Founder Kunal Shah’s Wild Imagination to Life with Sora]

Click here to know more. 

Lumiere

Google’s got this video generation AI called Lumiere, powered by a new diffusion model known as Space-Time U-Net, or STUNet for short. According to Ars Technica, Lumiere doesn’t mess around with stitching together still frames; instead, it figures out where things are in a video (that’s the space part) and tracks how they move and change at the same time (that’s the time part). 

It’s like one smooth process, no need for puzzle pieces.
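For the curious, here is a rough, illustrative sketch of that space-plus-time idea: a tiny PyTorch module that downsamples a clip across frames and pixels at once, then upsamples it back. This is not Lumiere’s actual architecture (which isn’t public); the layer sizes and names below are made up purely to show the concept.

```python
# A toy illustration of a "space-time" U-Net: the clip is downsampled across
# frames (time) and pixels (space) in one shot, processed at that coarse
# resolution, and upsampled back. This is NOT Lumiere's code (the model is
# not public); it only shows the idea of handling a whole clip at once
# instead of stitching still frames together.
import torch
import torch.nn as nn

class TinySpaceTimeUNet(nn.Module):
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        # Stride 2 on (frames, height, width): time and space shrink together.
        self.down = nn.Conv3d(channels, hidden, kernel_size=3, stride=2, padding=1)
        self.mid = nn.Conv3d(hidden, hidden, kernel_size=3, padding=1)
        # Transposed conv restores the full duration and resolution.
        self.up = nn.ConvTranspose3d(hidden, channels, kernel_size=4, stride=2, padding=1)

    def forward(self, video):                 # video: (batch, channels, frames, height, width)
        h = torch.relu(self.down(video))      # coarse in both space and time
        h = torch.relu(self.mid(h))
        return self.up(h)                     # back to the original clip shape

clip = torch.randn(1, 3, 16, 64, 64)          # a dummy 16-frame, 64x64 clip
print(TinySpaceTimeUNet()(clip).shape)        # torch.Size([1, 3, 16, 64, 64])
```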

Lumiere is not yet available for the public to play around with. But it hints at Google’s knack for building an AI video powerhouse that could outshine generally available models like Runway and Pika. Within two years, Google has made a significant leap in AI video generation.

Click here to know more.

VideoPoet

VideoPoet is a large language model trained on a colossal dataset of videos, images, audio, and text. This model can pull off a variety of video generation tasks, from turning text or images into videos to video stylisation, video inpainting and outpainting, and video-to-audio generation.

The model is built on a straightforward idea: convert any autoregressive language model into a video-generating system. Autoregressive language models can crank out text and code like nobody’s business. But they hit a roadblock when it comes to video. 

To tackle that, VideoPoet rolls with multiple tokenisers that can turn video, image, and audio clips into a language it understands.
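Here’s a toy illustration of that idea in PyTorch: a clip is chopped into patches, each patch is snapped to its nearest entry in a codebook to get a discrete token id, and the resulting token sequence is handed to an ordinary next-token predictor. The codebook lookup and the tiny language model below are stand-ins, not VideoPoet’s actual tokenisers or transformer.

```python
# A toy illustration of tokenising a video so a plain autoregressive language
# model can handle it. The nearest-neighbour codebook stands in for VideoPoet's
# learned tokenisers, and the tiny "language model" is a placeholder.
import torch
import torch.nn as nn

VOCAB = 1024            # size of the visual "vocabulary"
PATCH = 8               # 8x8 spatial patches; real tokenisers also compress time

def video_to_tokens(video, codebook):
    """Split a clip into patches and map each patch to its nearest code id."""
    b, c, t, h, w = video.shape
    p = video.reshape(b, c, t, h // PATCH, PATCH, w // PATCH, PATCH)
    p = p.permute(0, 2, 3, 5, 1, 4, 6).reshape(b, -1, c * PATCH * PATCH)
    dists = torch.cdist(p, codebook.unsqueeze(0))   # distance to every code vector
    return dists.argmin(dim=-1)                     # (batch, num_tokens) of token ids

codebook = torch.randn(VOCAB, 3 * PATCH * PATCH)    # stand-in for a learned codebook
clip = torch.randn(1, 3, 4, 32, 32)                 # dummy 4-frame, 32x32 clip
tokens = video_to_tokens(clip, codebook)

# Once everything is token ids, any next-token predictor can model it,
# exactly as it would model text or code.
lm = nn.Sequential(nn.Embedding(VOCAB, 128), nn.Linear(128, VOCAB))
next_token_logits = lm(tokens)[:, -1]
print(tokens.shape, next_token_logits.shape)        # (1, 64) and (1, 1024)
```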

Click here to know more.

Emu Video

Emu Video, Meta’s AI model, generates video in two steps. First, it makes a picture from text. Then, it uses that text and image to create a high-quality video. The researchers achieved this by optimising noise schedules for diffusion and using multi-stage training. 
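Emu Video itself isn’t publicly available, but the same factorised, two-stage recipe can be roughed out with open models from Hugging Face’s diffusers library, as in the sketch below. The model names are stand-ins, and unlike Emu Video the second stage here conditions only on the image rather than on the text and image together.

```python
# A sketch of the two-stage, factorised recipe using open models from
# Hugging Face's diffusers library. Emu Video itself is not released; the
# models below are stand-ins, and this second stage conditions only on the
# image, not on the text prompt as well.
import torch
from diffusers import StableDiffusionPipeline, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

prompt = "a red fox running through fresh snow at sunrise"

# Stage 1: text -> image
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = t2i(prompt).images[0]

# Stage 2: image -> short video clip
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")
frames = i2v(image.resize((1024, 576)), decode_chunk_size=8).frames[0]

export_to_video(frames, "fox.mp4", fps=7)
```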

In human evaluations, 81% of evaluators preferred it over Google’s Imagen Video, 90% picked it over NVIDIA’s PYoCo, and 96% said it was better than Meta’s own Make-A-Video. Not just that, it even beats commercial options like RunwayML’s Gen-2 and Pika Labs.

Notably, their factorised approach is well suited for animating images based on user text prompts, with evaluators preferring it over prior work 96% of the time.

Click here to know more. 

Phenaki

The team behind Phenaki Video used MaskGIT to produce text-guided videos in PyTorch. The model can generate videos guided by text prompts that run up to two minutes long. 

Instead of just trusting the predicted probabilities, the paper suggests a tweak: bringing in an extra critic that decides what to mask at each sampling iteration. This helps determine which parts to focus on during the video-making process. It’s like having a second opinion.
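Here’s a rough PyTorch sketch of what that critic-guided, MaskGIT-style sampling loop could look like. The tiny generator and critic are untrained placeholders and the masking schedule is assumed; this is not the phenaki-pytorch API, just the shape of the idea.

```python
# Critic-guided MaskGIT-style sampling, sketched: at each step the generator
# fills in every masked token, then a separate critic scores how plausible
# each filled token looks, and the least plausible ones are masked again for
# the next round. The generator/critic here are toy placeholders.
import math
import torch
import torch.nn as nn

VOCAB, SEQ, DIM, MASK_ID = 1024, 64, 128, 1024   # one extra id reserved for [MASK]

generator = nn.Sequential(nn.Embedding(VOCAB + 1, DIM), nn.Linear(DIM, VOCAB))
critic = nn.Sequential(nn.Embedding(VOCAB + 1, DIM), nn.Linear(DIM, 1))

@torch.no_grad()
def sample(steps=8):
    tokens = torch.full((1, SEQ), MASK_ID, dtype=torch.long)   # start fully masked
    for step in range(steps):
        logits = generator(tokens)                             # predict every position
        sampled = torch.distributions.Categorical(logits=logits).sample()
        tokens = torch.where(tokens == MASK_ID, sampled, tokens)

        # Cosine schedule: re-mask fewer tokens as sampling progresses.
        n_mask = int(SEQ * math.cos(math.pi / 2 * (step + 1) / steps))
        if n_mask == 0:
            break
        # The critic, not the generator's own probabilities, decides what to redo.
        scores = critic(tokens).squeeze(-1)                    # higher = more plausible
        remask = scores.topk(n_mask, largest=False).indices    # least plausible positions
        tokens.scatter_(1, remask, MASK_ID)
    return tokens

video_tokens = sample()
print(video_tokens.shape)                                      # torch.Size([1, 64])
```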

The model is versatile and open for researchers to train on both text-to-image and text-to-video. They can start with images and then fine-tune on video, with unconditional training also supported.

Click here to know more. 

CogVideo

A group of researchers from Tsinghua University in Beijing developed CogVideo, a large-scale pretrained text-to-video generative model. They built it on top of a pre-trained text-to-image model called CogView2 to exploit the knowledge it had learned during pre-training. 

Now, computer artist Glenn Marshall tried it out. He was so impressed that he said directors might lose their jobs to this technology. The short film he made with CogVideo, ‘The Crow’, performed well and was even in the running for the BAFTA Awards.

Click here to know more.


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.