
Hugging Face Showcases Demos Based On Open Source Text-To-Video Models, Pinpoints Flaws

Hugging Face's AI WebTV aims to make open-source text-to-video models like Zeroscope, along with the music-generation model MusicGen, more accessible.



Hugging Face, the AI developers’ go-to platform, has released AI WebTV, its latest experiment in automatic video and music synthesis. The project aims to showcase accessible, open-source text-to-video models such as Zeroscope, alongside music models like MusicGen.

The technique excels in replacing backgrounds during camera panning or rotation. It also gives users creative freedom, granting control over the number of frames in the generation process, which enables high-quality slow-motion effects. The primary video model behind the WebTV is Zeroscope V2, which the project runs through a NodeJS and TypeScript implementation.
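For illustration, the snippet below is a minimal sketch of how Zeroscope V2 can be called through Hugging Face’s diffusers library; the model id, scheduler, and parameters are assumptions taken from the public model card, not the WebTV’s own code. The num_frames argument is the knob that controls clip length and, by extension, how much footage is available for slow-motion playback.

```python
# Minimal sketch: generating one short take with Zeroscope V2 via diffusers.
# Model id and settings follow the public model card; they are illustrative,
# not Hugging Face's actual WebTV implementation.
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "a banana standing triumphantly on a pyramid of food characters, CGI, cinematic"
# More frames means a longer take, which can later be slowed down for smooth slow motion.
video_frames = pipe(
    prompt, num_inference_steps=40, height=320, width=576, num_frames=24
).frames

export_to_video(video_frames, "shot.mp4")
```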

The AI WebTV works by feeding video shot prompts to a text-to-video model, which renders them as a sequence of takes. To enhance the creative process further, a human-authored base theme and idea are first passed to a large language model, which generates diverse individual prompts for each video clip.
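The two-stage flow can be sketched roughly as follows; the choice of LLM (gpt2 here), the prompt template, and the loop are purely illustrative assumptions, not the published WebTV pipeline.

```python
# Illustrative sketch of the described flow: an LLM expands a human-written
# theme into several shot prompts, each of which would then be rendered by the
# text-to-video model (e.g. the Zeroscope sketch above).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder LLM

base_theme = "a whimsical city with cotton candy clouds and chocolate roads, Pixar style"
instruction = f"Write one short video shot description based on: {base_theme}\nShot:"

shot_prompts = [
    generator(instruction, max_new_tokens=40, do_sample=True)[0]["generated_text"]
    .split("Shot:")[-1]
    .strip()
    for _ in range(3)
]

# Each prompt would then be turned into a clip with the video model:
# for prompt in shot_prompts:
#     frames = video_pipe(prompt, num_frames=24).frames
#     export_to_video(frames, f"take_{abs(hash(prompt))}.mp4")
```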

Prompt: 3D rendered animation showing a group of food characters forming a pyramid, with a banana standing triumphantly on top. In a city with cotton candy clouds and chocolate road, Pixar’s style, CGI, ambient lighting, direct sunlight, rich color scheme, ultra realistic, cinematic, photorealistic.

Talking about the ability of text-to-video models, the HF blog, authored by Julian Bilcke, stated, “We’ve seen it with large language models and their ability to synthesize convincing content that mimics human responses, but this takes things to a whole new dimension when applied to video.”

The video sequences released along with the demo are kept short, positioning the WebTV as a tech demo rather than an actual show with art direction or programming.

Even though the advancement is being lauded, HF has pointed out a few cases where the model fails. Firstly, it can have issues with movement and direction; for instance, a clip is sometimes played in reverse. In other instances, the modifier keyword is not taken into account. Furthermore, words from the prompt are sometimes injected into the output and appear as text in the video.

Source: https://huggingface.co/blog/ai-webtv

Similar to HF’s project, Meta AI released Make-A-Video in September last year, but that model remains closed source, like the majority of services announced by the tech giant.

Read more: Meta AI Releases A Multimodal Model “CM3leon”  — But Won’t Release It


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.