Lumiere and Sora are giving AI video superpowers

Illustration source: unDraw, colorized by me.

While I've talked about AI in general and also focused a bit on how it's being used to make music, today's topic is one I thought we would have to wait much longer to see. Quite literally, to see.

Google Research's Lumiere

On January 23rd, Google Research, the research arm of Google's AI efforts, published a video to YouTube announcing Lumiere, what they call "A Space-Time Diffusion Model for Video Generation". What does that mean? Well, their research paper tells us the following:

... A text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion – a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution – an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.

If that's something you can understand clearly, well, you're really smart. From what I can gather, their architecture generates the entire video in a single pass, treating space and time together at multiple scales, while other generative models usually synthesize a handful of distant keyframes and then fill in the blanks between them as best they can, which often produces weird artifacts in the in-between frames. The downside of Google Research's approach is that the model directly produces a low-resolution video, but that's easy to overlook because of the benefits, namely the different effects they can apply to it.
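If it helps, here's how I picture the difference, as a tiny and purely illustrative Python sketch. None of this is Lumiere's actual code (which isn't public); the function names and the random arrays are just my stand-ins for real generated frames:

```python
import numpy as np

def keyframe_pipeline(num_frames: int, stride: int = 8) -> np.ndarray:
    """Cascaded approach: synthesize a few distant keyframes, then fill
    the gaps with temporal super-resolution (crudely faked here with
    linear interpolation). Each "frame" is a 16x16 grayscale array."""
    rng = np.random.default_rng(seed=0)
    keyframes = rng.random((num_frames // stride + 1, 16, 16))
    frames = []
    for i in range(num_frames):
        k, offset = divmod(i, stride)
        weight = offset / stride
        # In-between frames are only a guess at the motion between two
        # keyframes, which is where temporal inconsistencies creep in.
        frames.append((1 - weight) * keyframes[k] + weight * keyframes[k + 1])
    return np.stack(frames)

def single_pass_pipeline(num_frames: int) -> np.ndarray:
    """Lumiere-style idea: treat the whole clip as one space-time volume
    and generate every frame jointly, in a single pass, at low resolution,
    so each frame is produced with awareness of all the others. A random
    volume stands in for the actual diffusion model here."""
    rng = np.random.default_rng(seed=0)
    return rng.random((num_frames, 16, 16))  # (time, height, width), all at once

print(keyframe_pipeline(32).shape)     # (32, 16, 16)
print(single_pass_pipeline(32).shape)  # (32, 16, 16)
```

Either way, a video is better than a thousand words, so please see below what Lumiere can do to a processed video: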

I think my jaw dropped when I saw these results. Literally. The one that had me smiling like a fool was the set of different styles it applied to the lady running. Can you imagine the creative stuff you could achieve with that? With the right prompt you might get something really amazing, where imagination (and prompt-crafting ability) is your only limitation.

I would love to play with these kinds of models to create videos or GIFs I could use here on AELO to get a more consistent brand style. However, I don't think it's available to the public yet, and that's mainly because of what they call "Societal Impact":

Our primary goal in this work is to enable novice users to generate visual content in a creative and flexible way. However, there is a risk of misuse for creating fake or harmful content with our technology, and we believe that it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure a safe and fair use.

There are a lot of different examples on their website, which you can visit here.

OpenAI's Sora

I want to say this was a response to Google's publication, since there is an arms race between companies in the AI space, but I guess we'll never know. Either way, on February 15th OpenAI publicly announced Sora, their text-to-video model. From their website (edited by me for better reading flow):

Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt... [It] is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.

The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and visual style.

And yes, this is quite different from Google Research's approach, starting with the fact that it's waaaay easier to understand, and also because, obviously, OpenAI is immediately showcasing this as a product, not research. Leaving that aside, the product itself works differently and has a different scope. Sora is meant to be a straightforward text-to-video creation tool, keyword: creation. You prompt the model to create something out of thin air and it spits out a video that's meant to follow nature's rules and look realistic. And before you see the video, let me tell you: the results are impressive.

After seeing what Sora can do, I immediately thought, "Holy shit, it could easily make a movie." Did you see the Big Sur example? A part of my brain immediately started itching, searching through my memories, because it feels like I've seen that scene somewhere before. Then the astronaut with the red hat... Holy moly, that guy looks like a famous actor. And the Tokyo scenery with the sakura trees in full bloom? Chef's kiss. The kicker, though? Everything is fake. But it feels real. Like it exists. But it doesn't. Isn't that weird for you, too?

And that's the part where things start feeling dangerous. In that regard, OpenAI says it's taking action:

... Once in an OpenAI product, our text classifier will check and reject text input prompts that are in violation of our usage policies, like those that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others. We’ve also developed robust image classifiers that are used to review the frames of every video generated to help ensure that it adheres to our usage policies, before it’s shown to the user.

We’ll be engaging policymakers, educators and artists around the world to understand their concerns and to identify positive use cases for this new technology. Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time.

OpenAI is also already working with creators to see how Sora can be most helpful and to advance the model. You can see more examples of what it can do on their website.
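By the way, we can't see Sora's internal classifiers, but OpenAI does expose a public Moderation endpoint that performs this kind of pre-flight check on text. Here's a rough sketch of what rejecting a prompt before generation might look like, assuming the official openai Python package (v1+) and an API key in your environment; the helper function is my own invention, not part of any Sora API:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def prompt_passes_safety_check(prompt: str) -> bool:
    """Screen a would-be video prompt with OpenAI's public Moderation API.
    This is NOT Sora's internal classifier, just an illustration of the
    'check and reject text input prompts' step described above."""
    result = client.moderations.create(input=prompt).results[0]
    if result.flagged:
        # List the policy categories that tripped, e.g. violence or sexual.
        tripped = [name for name, hit in result.categories.model_dump().items() if hit]
        print(f"Rejected prompt, flagged categories: {tripped}")
        return False
    return True

if prompt_passes_safety_check("Drone view of waves crashing against the cliffs of Big Sur"):
    print("Prompt would be handed off to the video model.")
```

Per their own description, the real pipeline goes further and also reviews the frames of every generated video before it's shown to the user, but the gist is the same: moderate before and after generation.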

Safety check, please.

Can you imagine these tools in the wrong hands? The internet is already full of fake information and bad actors using made-up text and images to confuse us. It isn't like Google and OpenAI are unaware of this and the nefarious use cases these tools may enable, and they are actively developing safeguards to keep their products and their users safe. But we all have to be realistic: these are corporations looking to make a profit off these kinds of products. For example, OpenAI recently changed its policy to allow military applications and, while it seems like Google isn't actively doing that anymore, it has changed its mind before, even to its employees' dismay. Also, Google's former CEO, Eric Schmidt, is now working on AI-aided suicide drones.

That's not to put all the blame on these companies, though. There are already plenty of open-source text-to-video models out there, and it's just a matter of time before developers and researchers make them as capable as Lumiere and Sora, but without an easy way to track who will use them and how.

At the end of the day, it's as they say: who will guard the guards?