Google has introduced Lumiere, a text-to-video diffusion model designed to synthesise videos with realistic, diverse, and coherent motion. Unlike existing models, which typically generate sparse keyframes and then fill in the frames between them, Lumiere generates the entire clip in a single, temporally consistent pass, thanks to its Space-Time U-Net architecture.
The model is aimed at creative visual-content generation, producing realistic or surrealistic video clips up to five seconds long.
It can animate still images, generate video from natural-language text prompts, and perform video inpainting. It is built on a Space-Time U-Net architecture wrapped around a text-to-image (T2I) model that operates in pixel space; because the base model works at low resolution, a spatial super-resolution module is applied to produce high-resolution output.
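To make the space-time idea concrete, here is a minimal PyTorch sketch, not Lumiere's actual code, of a factorized space-time block of the kind such an architecture could stack: a per-frame spatial convolution followed by a per-pixel temporal convolution whose stride halves the number of frames. All class, variable names, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Hypothetical factorized space-time block (illustrative only)."""
    def __init__(self, channels: int):
        super().__init__()
        # Spatial conv applied per frame (kernel size 1 on the time axis).
        self.spatial = nn.Conv3d(channels, channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal conv applied per pixel (kernel size 1 on spatial axes);
        # stride 2 along time halves the number of frames.
        self.temporal = nn.Conv3d(channels, channels,
                                  kernel_size=(3, 1, 1), stride=(2, 1, 1),
                                  padding=(1, 0, 0))
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        x = self.act(self.spatial(x))
        return self.act(self.temporal(x))

video = torch.randn(1, 64, 16, 32, 32)   # 16 frames of 32x32 features
out = SpaceTimeBlock(64)(video)
print(out.shape)  # torch.Size([1, 64, 8, 32, 32]) -- time halved
```

Downsampling in time as well as space is what lets the network reason over the whole clip at once rather than frame by frame.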
Furthermore, Lumiere offers stylised generation: given a single reference image, it can produce videos in the target style by leveraging fine-tuned text-to-image model weights. The model can also animate a still image, or only a selected region of it, and can fill in missing areas with high-quality results.
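As a rough illustration of how fine-tuned T2I weights could drive stylised generation, the sketch below linearly blends a base checkpoint with a style-fine-tuned one. The function, the `alpha` parameter, and the state-dict keys are hypothetical assumptions, not Lumiere's actual API.

```python
import torch

def blend_weights(base: dict, styled: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate between base and style-fine-tuned weights.

    alpha=0 keeps the base model; alpha=1 uses the styled weights fully.
    (Illustrative sketch; keys and blending scheme are assumptions.)
    """
    return {name: (1 - alpha) * base[name] + alpha * styled[name]
            for name in base}

# Toy state dicts standing in for real T2I checkpoints.
base = {"unet.weight": torch.randn(4, 4)}
styled = {"unet.weight": base["unet.weight"] + 0.1 * torch.randn(4, 4)}
blended = blend_weights(base, styled, alpha=0.7)
```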
Lumiere does have limitations: it is not designed to generate videos consisting of multiple shots or involving transitions between scenes. Even so, it represents a significant advance in text-to-video generation. It remains a research project, and its release for broader use may depend on addressing various policy considerations.
At the time of writing, OpenAI does not offer a publicly available video generation model through its API. The company is actively researching the area, however, and there are hints that something may be in the works alongside the release of GPT-5.