OpenAI Sora: Are we all seeing the same thing?
One of the most hyped AI announcements this year came last month, when OpenAI unveiled its proprietary video generation model, Sora. The hype resurfaced this week when OpenAI shared a short series of new videos generated by various creative teams. Sora does look like a generational leap over existing video generation models such as RunwayML’s Gen-2 and Stable Video. But by how much?
Even more so than with other new generative AI technologies, the gap between hype and reality feels particularly large with Sora. Let’s dive in and talk about what Sora is, what it is not, and what it will likely be good at (at least initially), based on what is publicly known.
What is Sora?
Sora is a proprietary OpenAI video generation model, currently in a research phase with limited availability.
Sora is a transformer-based diffusion model, which puts it in the same category as popular image generation models such as Stable Diffusion and video generation models such as Stable Video. Per OpenAI’s technical report, the transformer denoiser operates on “spacetime patches” of video rather than the U-Net used by earlier diffusion models; a minimal sketch of this pattern appears at the end of this section.
Video clips are generated from open-ended text prompts, similar to other text, image, and video generation models. Sora is also capable of generating video from images or existing videos, but it is not clear whether that capability is part of the current private test.
Video clips can be up to one minute long. This is a big improvement over existing models, whose clips typically top out at a few seconds.
Sora was trained on publicly available and licensed videos. OpenAI has not disclosed exactly which publicly available videos were in the training set, which, to be fair, is true of many other large models, both “open” and “closed.”
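To make “transformer-based diffusion model” a bit more concrete, here is a minimal sketch of the general diffusion-transformer pattern OpenAI describes in its technical report: a video is chopped into spacetime patch tokens, and a transformer learns to predict the noise added to those tokens, conditioned on the diffusion timestep and the text prompt. To be clear, this illustrates the published pattern, not Sora’s actual internals; every name, dimension, and conditioning choice below is an assumption.

```python
# Illustrative sketch only. Sora's internals are not public beyond OpenAI's
# high-level technical report; this shows the generic "diffusion transformer
# over spacetime patches" pattern, not Sora itself. All names and sizes
# below are assumptions.
import torch
import torch.nn as nn

class TinyVideoDiffusionTransformer(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.noise_head = nn.Linear(dim, dim)  # predicts noise per patch token

    def forward(self, noisy_patches, t_embed, text_embed):
        # Condition by prepending the diffusion-timestep and text-prompt
        # embeddings as extra tokens, then run full self-attention.
        tokens = torch.cat([t_embed, text_embed, noisy_patches], dim=1)
        hidden = self.backbone(tokens)
        return self.noise_head(hidden[:, 2:, :])  # noise for video tokens only

# A clip is flattened into "spacetime patches": small 3D blocks spanning a
# few pixels and a few frames, each embedded as one token. Here: 1 clip,
# 300 patch tokens, 512 dims each.
model = TinyVideoDiffusionTransformer()
noisy_patches = torch.randn(1, 300, 512)
t_embed = torch.randn(1, 1, 512)     # embedding of the diffusion timestep
text_embed = torch.randn(1, 1, 512)  # pooled embedding of the text prompt
pred_noise = model(noisy_patches, t_embed, text_embed)
print(pred_noise.shape)  # torch.Size([1, 300, 512])
```

At sampling time the model starts from pure noise and iteratively denoises; longer and higher-resolution clips mean more patch tokens for the transformer to attend over, which is part of what makes one-minute generations a notable jump.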
What does Sora not do?
Generate audio, including music and voice
Guarantee coherent and consistent output
Handle the many other tasks that are part of high-end video production, such as editing and color correction
Let’s take a closer look
Since Sora has not yet been released to a wider audience and no exact timelines are available, the best we can do right now is examine the example videos published by OpenAI.
First, let’s take a look at “Air Head”, a short story about a man who literally has a balloon for a head. This is the only example video with any sort of overarching narrative, but note that the writing, audio narration, and editing were all done outside of Sora. Hundreds of prompts could also have been tested for each final shot. In other words, the overall quality of the video likely owes a great deal to the skills of the production studio shy kids, not just to Sora.
While it’s impressive overall, let’s examine a few major issues, starting with consistency.
Consistency Issues
You won’t notice many of these consistency issues in the other published videos, because most do not attempt to keep the same subject across shots. Here is one issue I did spot, though, in a video created by Nik Kleverov of Native Foreign.
Generation Issues
Going back to “Air Head”, let’s look at some overall generation quality issues.
And again, let’s examine the video created by Nik Kleverov of Native Foreign.
What will Sora be good for?
Let’s first hear from the production teams that have had the chance to work with Sora:
Sora is at its most powerful when you’re not replicating the old but bringing to life new and impossible ideas we would have otherwise never had the opportunity to see.
Paul Trillo, Director
As great as Sora is at generating things that appear real - what excites us is its ability to make things that are totally surreal.
shy kids
Sora has opened up the potential to bring to life ideas I've had for years, ideas that were previously technically impossible. The ability to rapidly conceptualize at such a high level of quality is not only challenging my creative process but also helping me evolve in storytelling. It's enabling me to translate my imagination with fewer technical constraints.
Josephine Miller, Creative Director, Oraar Studio
These three quotes all tell a similar story: creative experimentation rather than reproduction of reality. The flip side, implied by these same quotes, is what we have already noted above: Sora makes a lot of easily spotted mistakes when working with realistic subjects.
What can we extrapolate as potential uses for Sora, at least initially?
Creative experimentation, as noted above
Generic b-roll footage of the kind that is typically licensed
Advanced storyboarding
Attention-grabbing visuals. Sora will be highly useful for quickly creating unique, unexpected subjects. Not every video production needs to be a cinematic masterpiece that is part of an amazing artistic narrative.
What’s missing
Let’s first hear from the OpenAI Sora team, as featured on Waveform:
Notable timestamps:
7:00 - The Sora team says they are looking into allowing for more control over generation beyond simple text prompts
12:00 - The Sora team talks a bit more about tools
Tooling is the biggest missing ingredient in my mind. But then the question becomes: does OpenAI have the resources and expertise to build a suite of powerful and nuanced controls that allow for true generative control? And if not, will the model allow for interoperability so that other parties can develop a tooling ecosystem around Sora? Such an ecosystem could emerge the same way tooling has cropped up around Stable Diffusion (and is missing from OpenAI’s DALL·E) for image generation, but that would require deeper integration options than a basic API. Without answers here, the consistency issues seem impossible to overcome, even if raw generation quality continues to improve, and use cases will necessarily be limited.
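To make the tooling gap concrete, here is a hypothetical sketch of the difference between a basic “prompt in, video out” API and the kind of control surface a third-party ecosystem would need. To be clear, Sora has no public API; every field and name below is invented for illustration.

```python
# Entirely hypothetical: Sora has no public API, and none of these fields
# are real. The point is the contrast between "prompt in, video out" and
# the deeper hooks (seeds, references, keyframes) that tooling builds on.
from dataclasses import dataclass, field

@dataclass
class BasicGenerationRequest:
    prompt: str
    duration_seconds: int = 10

@dataclass
class ToolingFriendlyRequest(BasicGenerationRequest):
    seed: int | None = None           # reproducible re-runs of a shot
    reference_images: list[str] = field(
        default_factory=list)         # pin a subject's appearance across shots
    keyframes: dict[float, str] = field(
        default_factory=dict)         # constrain frames at given timestamps (s)
    negative_prompt: str = ""         # steer away from known artifacts

# A director iterating on one shot of "Air Head" would want the second
# request shape, not the first: same seed, same balloon-man reference,
# new prompt wording.
shot = ToolingFriendlyRequest(
    prompt="a man with a yellow balloon for a head boards a city bus",
    seed=42,
    reference_images=["balloon_man_reference.png"],
    keyframes={0.0: "opening_frame.png"},
)
print(shot)
```

Whether OpenAI exposes hooks like these itself, or opens the model up enough for others to build them (the way ControlNet and fine-tuning tools grew up around Stable Diffusion), is exactly the open question above.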
What’s next
Let me reiterate that Sora is an amazing accomplishment and really does look like a major step forward for foundational video generation models. That said, I think it’s clear that between the inherent drawbacks of the model’s architecture and the lack of powerful, fine-grained controls, Sora will not take over the vast majority of video production tasks. At least not right away, and not by itself. In the same way that agents and tooling are likely the path forward for getting the most out of text and image generation models, similar advances will be required for video as well. In the immediate term, even far inferior open models with burgeoning tooling could offer more practical utility than Sora’s superior raw generation capabilities.