Introducing Text-to-Video Generator, Tune-A-Video

With customised Sparse-Causal Attention, Tune-A-Video expands spatial self-attention to the spatiotemporal domain using pretrained text-to-image diffusion models.

Since OpenAI introduced its text-to-image model DALL-E, the AI world has been building similar models, such as Midjourney and Imagen, to name a few. Text-to-video models like Transframer, NUWA Infinity, and CogVideo soon followed. Microsoft even recently unveiled VALL-E, a text-to-speech model.

Last month, researchers from Show Lab at the National University of Singapore came up with a text-to-video (TTV) generator called Tune-A-Video to address the problem of one-shot video generation, where only a single text-video pair is available for training an open-domain text-to-video generator. Using a customised sparse-causal attention mechanism, Tune-A-Video extends the spatial self-attention of pretrained text-to-image (TTI) diffusion models to the spatiotemporal domain.

Tune-A-Video

Check the unofficial implementation of Tune-A-Video here. 

Trained on just one sample, the model updates the projection matrices in its attention blocks to capture the relevant motion information. Tune-A-Video can then create temporally coherent videos for various applications, including changing the subject or background, modifying attributes, and transferring styles.

The researchers found that TTI models can produce images that represent verb terms well, and that extending a TTI model to generate multiple images at once demonstrates unexpectedly strong content consistency across those images.
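That cross-frame consistency is what sparse-causal attention exploits: each frame's queries attend only to keys and values built from the first frame and the immediately preceding frame, rather than the whole clip. Below is a minimal single-head sketch of the pattern in plain PyTorch; the tensor shapes, per-frame loop, and function name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sparse_causal_attention(x, w_q, w_k, w_v):
    """Sketch of sparse-causal attention over a clip.

    x: (f, n, d) tensor -- f frames, n spatial tokens per frame, d channels.
    w_q, w_k, w_v: (d, d) projection matrices (single head, for clarity).
    """
    f, n, d = x.shape
    out = []
    for i in range(f):
        q = x[i] @ w_q                                  # queries: current frame
        # Keys/values come only from the first and the previous frame,
        # which is the "sparse-causal" part of the pattern.
        ctx = torch.cat([x[0], x[max(i - 1, 0)]], dim=0)  # (2n, d)
        k = ctx @ w_k
        v = ctx @ w_v
        attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)    # (n, 2n)
        out.append(attn @ v)
    return torch.stack(out)                              # (f, n, d)
```

Compared with full spatiotemporal attention, this keeps the cost linear in the number of frames while still anchoring every frame to the first one, which helps preserve subject identity across the clip.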


Fine-tuning: a pretrained TTI model is expanded into a TTV model using its existing weights. The TTV model is then tuned on the single text-video pair to produce a one-shot TTV model.

Inference: A modified text prompt is used to generate new videos.

Given a video and text pair as input, the method fine-tunes only the projection matrices in the attention blocks, leaving the rest of the pretrained weights untouched.
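The tuning step above can be sketched as freezing the inflated network and marking only the attention projection matrices as trainable. The helper below is a hypothetical illustration that assumes diffusers-style parameter names such as `to_q`; the exact set of updated matrices follows the paper, not this sketch.

```python
import torch

def select_trainable_params(model):
    """Freeze everything except the query projections of attention blocks.

    Assumes parameters are named diffusers-style (e.g. "...attn1.to_q.weight");
    this is an illustrative sketch, not the authors' training code.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = "to_q" in name
        if param.requires_grad:
            trainable.append(param)
    return trainable
```

An optimiser would then be built over only the returned parameters, e.g. `torch.optim.AdamW(select_trainable_params(unet), lr=3e-5)`, so one-shot tuning touches a small fraction of the pretrained weights.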

Read the full paper here.

Shritama Saha
Shritama is a technology journalist keen to learn about AI and analytics. A graduate in mass communication, she is passionate about exploring the influence of data science on fashion, drug development, films, and art.
