Of late, RunwayML has enjoyed something of a runaway success, with the trend of AI-generated videos growing exponentially. From pizza commercials to mockups of early 2000s home videos and short films, text-to-video is quickly becoming the new paradigm of generative AI.
In line with the trend, Stability AI has released a new SDK for Stable Diffusion that allows for the creation of animations. With the SDK, users can prompt with just text, an image with text, or a video with text to create output animations. What began with Meta’s Make-A-Video has now become the new frontier of generative AI algorithms. However, a few key players are suspiciously missing from the lineup.
Too little, too late
The new release from Stability AI is a software development kit that works with Stable Diffusion 2.0 and Stable Diffusion XL. The SDK can influence the output through a variety of parameters — from general-purpose parameters like style presets, cadence, and FPS (frames per second) to more in-depth parameters that influence characteristics like colours, 3D depth, and post-processing.
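To illustrate how such a parameter surface might look to a developer, here is a minimal sketch that models the two groups of knobs described above as a plain Python dataclass. The field names are illustrative assumptions, not the Stable Animation SDK’s actual parameter names or API.

```python
from dataclasses import dataclass

@dataclass
class AnimationSettings:
    """Hypothetical container mirroring the kinds of parameters the SDK
    exposes; names are illustrative, not the SDK's real identifiers."""
    # General-purpose parameters
    style_preset: str = "anime"     # overall visual style of the animation
    cadence: int = 2                # diffuse every Nth frame, interpolate the rest
    fps: int = 24                   # frames per second of the output video
    # In-depth parameters
    color_coherence: str = "LAB"    # keep colours consistent across frames
    use_depth_warping: bool = True  # 3D depth-based motion between frames
    postprocess: bool = False       # e.g. upscaling or frame interpolation

# A developer would tweak only the fields they care about:
settings = AnimationSettings(style_preset="photographic", fps=12)
print(settings)
```

The point of the sketch is the shape of the workflow: rather than a finished web app like Gen-2, the SDK hands developers a bundle of configuration to wire into their own tools.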
While this SDK is a good step forward for Stability AI, it seems the company is late to the party. Similar solutions, built on Stability’s own models, have existed in the market for a while now. Deforum, an online community of AI image creators and artists, has created a web demo for text-to-animation. However, Deforum is fairly basic: it simply blends successive Stable Diffusion-generated images into each other, creating the illusion of an animation.
The true competitor to the Stable Animation SDK is RunwayML’s Gen-2, a text-to-video service. This new model, whose paper is yet to be released, builds on Gen-1’s capabilities of style transfer and video modification to generate video from just a text prompt. Similar to the Stable Animation SDK, users can provide text, images, or videos as prompts to generate videos from scratch.
While RunwayML’s Gen-2 can only be accessed through a waitlist, it is a complete product which can be used without any technical knowledge. The Stable Animation SDK, on the other hand, is targeted at developers who wish to multiply the capabilities of Stable Diffusion’s models.
Even as video generation emerges as the next big generative AI technology, many of the companies that capitalised on text-to-image are nowhere to be found.
RunwayML: The new DALL-E?
Early last year, OpenAI released DALL-E 2, an image generation algorithm, which kickstarted a wave of innovation. Then came Midjourney, Stable Diffusion, Imagen, and more, catapulting generative AI into the mainstream. However, with the innovations surrounding text-to-video, a lot of these companies have stayed silent, especially OpenAI.
With the release of ChatGPT, and subsequently GPT-4, it seems that OpenAI is content tending its golden goose. As such, we have not seen any improvements to DALL-E apart from its integration into Bing Chat. There has also been no talk of a text-to-video model from the AI giant, counting it out of the newest wave of innovation.
Midjourney has also not provided any information on possible text-to-video algorithms, instead choosing to focus on increasing its market lead by adding new features to its image generator. However, it seems that research is leading to innovation, as it did just before the explosion of text-to-image models.
Meta’s AI research wing released a paper in September last year detailing an approach to generating video without the need for text-video data pairs. Similarly, ByteDance, the company behind TikTok, released a research paper harnessing the power of diffusion models to generate videos. While neither of these models has been released to the public, the research shows that the ideas behind these approaches are sound — backed up by the variety of generated videos on their websites.
Google, in collaboration with the Korea Advanced Institute of Science and Technology, followed suit with a paper on projected latent video diffusion models. Notably, this paper was published with code, allowing for the replication of the approach. Building on the concept of feature-to-video diffusion models, a team from Alibaba released ModelScope on HuggingFace, which is open for all to use. This is the only such service, apart from Deforum, that is open for use.
While the text-to-video market is still in its infancy, the AI-generated commercials show but an inkling of what is possible with video-generating algorithms. Meta has also released a set of generative AI tools targeted at advertisers on its platforms, so it is not implausible that Make-A-Video could be integrated into this suite in the future. Just as with any generative AI solution, the potential for innovation is boundless.