AI art tools are changing the idea of creativity and getting wackier every week. In just a few years, AI art generators have gone from creating incomprehensible pictures to producing realistic content. Researchers at Meta AI have now taken a leap into generating video from prompts. The company on Thursday announced Make-A-Video, a new AI system that turns text prompts into brief, soundless video clips.
“Generative AI research is pushing creative expression forward by giving people tools to quickly and easily create new content,” Meta said in a blog post on Thursday. “With just a few words or lines of text, Make-A-Video can bring imagination to life and create one-of-a-kind videos full of vivid colours and landscapes.”
The Make-A-Video webpage includes short clips of home-video quality that look fairly realistic. These result from prompts such as “A robot dancing at Times Square” and “Hyper-realistic spaceship landing on mars”.
Apart from text-to-video generation, the tool can add motion to static images and also fill in the content between two images. Furthermore, one can also present a video, and Make-A-Video will generate different variations. Head to Make-A-Video’s web page to see more of what it can do.
The Tech Check
In his Facebook post, Mark Zuckerberg highlighted how much more difficult it is to generate videos than photos: beyond correctly generating each pixel, the system also has to predict how the pixels change over time.
The key to Make-A-Video, and the reason it has arrived sooner than anticipated, is that it builds on existing text-to-image synthesis work of the kind used in image generators like OpenAI’s DALL-E.
Instead of training the Make-A-Video model on labelled video data, Meta took image synthesis data and applied unlabeled video training data so the model learns a sense of where a text or image prompt might exist in time and space. Then it can predict what comes after the image and display the scene in motion for less than five seconds.
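The division of labour described above can be illustrated with a toy sketch. This is not Meta's architecture or code: the function names, the single-number "images" and the averaging are all hypothetical stand-ins, chosen only to show appearance being learned from captioned images while motion is learned from videos that carry no captions.

```python
from statistics import mean

# Toy data model (an assumption for illustration): an "image" is a single
# brightness value and a "video" is a list of such values over time.

def learn_appearance(captioned_images):
    """Learn what each caption looks like from (caption, image) pairs."""
    by_caption = {}
    for caption, image in captioned_images:
        by_caption.setdefault(caption, []).append(image)
    return {caption: mean(images) for caption, images in by_caption.items()}

def learn_motion(unlabeled_videos):
    """Learn the average frame-to-frame change from videos without captions."""
    deltas = [
        later - earlier
        for video in unlabeled_videos
        for earlier, later in zip(video, video[1:])
    ]
    return mean(deltas)

def generate_clip(prompt, appearance, motion, num_frames):
    """Start from the learned image for the prompt and roll it forward in time."""
    first_frame = appearance[prompt]
    return [first_frame + motion * step for step in range(num_frames)]

# Captions are only ever attached to still images.
appearance = learn_appearance([("sunrise", 10), ("sunrise", 14), ("night", 1)])
# Motion comes from footage that has no text attached at all.
motion = learn_motion([[0, 2, 4], [5, 7, 9]])

clip = generate_clip("sunrise", appearance, motion, num_frames=4)
print(clip)
```

The point of the sketch is that no captioned video is ever needed: the text prompt only selects a starting picture, and everything about how that picture changes over time comes from the unlabeled footage.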
In the paper, Meta’s researchers note that Make-A-Video is trained on pairs of images and captions, as well as unlabeled video footage. Training content was sourced from the WebVid-10M and HD-VILA-100M datasets, which contain millions of videos amounting to thousands of hours of footage. This includes footage from sites like Shutterstock and footage scraped from the web. As per the paper, the model has several technical limitations beyond blurry footage and disjointed animation. For instance, the training methods cannot learn information that might only be inferred by a human watching a video.
What can go wrong?
Make-A-Video isn’t yet available to the public. The potential it holds is visible from the preview examples, but there are worrying prospects, as with every machine learning model. A California Democrat, Anna Eshoo, expressed some of those concerns, noting in a September letter that Stable Diffusion was used “to create photos of violently beaten Asian women and pornography depicting real people”.
The Meta research team preemptively scrubbed the Make-A-Video training dataset of any NSFW imagery as well as toxic phrasing. But the opportunity for misuse of Make-A-Video is not a small one. The output of such tools could be widely used for misinformation and propaganda.
Earlier this year, a group of researchers from Tsinghua University and the Beijing Academy of Artificial Intelligence (BAAI) released CogVideo, the only other publicly available text-to-video model. Unfortunately, it has the same limitations as Meta’s newly announced model.
The tool is not available to the public yet, but you can sign up here to get on the list for any form of access later.