Key Takeaways:

  • OpenAI’s latest model, Sora, showcases impressive capabilities in text-to-video generation, enabling the creation of minute-long videos with high visual fidelity, reasonably consistent logic, and close adherence to user prompts.
  • Interestingly for fashion, while hallucinations and uncanny valley effects are still evident even in the best Sora-generated videos, the model appears able to consistently represent the visual and physical characteristics of soft fabrics: falling below the high bar set for true fabric simulation, but passing scrutiny for marketing purposes.
  • Despite the model’s impressive features, and a multi-year partnership between OpenAI and Shutterstock, little is known about Sora’s training data, which adds further fuel to the fire of concern around copyright and fair use in AI.

In an age of finger-pointing, if there’s one thing you can’t accuse OpenAI of, it’s fumbling an announcement. After casually dropping ChatGPT on the world in late 2022, and ushering in the tech sector’s biggest competitive sprint in decades, the company recently pulled back the curtain on Sora, a new and previously unannounced text-to-video model that represents a huge leap over existing consumer-grade competition like Runway and Pika.

According to OpenAI’s blog post that accompanied the announcement, Sora “can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.” Outputting at 1080p resolution, the model is able to create “complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background,” and can also understand how objects “exist in the physical world.”

There are parts of that description that stand on shaky ground (it’s probably misleading to ascribe “understanding” to this kind of diffusion model), but based on the preview videos being shared by OpenAI itself and its close cohort of testers, it’s fair to say that Sora exhibits a far, far greater ability to infer and represent characters, environments, motion, physics, and other foundational attributes in a way that (mostly) aligns with our human understanding of the world.

Video: Sora, from OpenAI.
Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

And when we dig a little deeper into the technical report that OpenAI published alongside the announcement, we find that the company is, at least right now, optimistic that the same architecture behind Sora could represent a pathway towards creating “general purpose simulators of the physical world.”

Going further, some experts have already suggested that Sora has acquired an “implicit” understanding of physics by being trained on a gigantic corpus of video data that captures the real world. According to a Senior Research Scientist at NVIDIA, Sora “learns intricate rendering, ‘intuitive’ physics, and long-horizon consistency, all by some denoising and gradient maths.”

This is, in a logical sense, a very difficult thing to parse or predict. It’s certainly true that the transformer architecture has taken generative AI a long distance, and it clearly has more runway in front of it (excuse the pun). But the idea that inference and representation can, through scaling, become simulation and knowledge is an uncomfortable and improbable one.
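For readers wondering what “some denoising and gradient maths” actually looks like, here is a deliberately toy sketch of the noise-prediction objective that diffusion models are trained on. Everything in it is an assumption for illustration (the stand-in network, the flattened clip shapes, and the simplified linear noise schedule); it is not OpenAI’s architecture or code.

```python
import torch
import torch.nn as nn

# Toy denoiser standing in for a large video diffusion model. A real model
# would be a transformer over spacetime patches, conditioned on the noise
# level and the text prompt; the sizes here are made up for illustration.
denoiser = nn.Sequential(nn.Linear(1024, 2048), nn.GELU(), nn.Linear(2048, 1024))
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

def training_step(clip: torch.Tensor) -> float:
    """One denoising step: corrupt a (batch, features) clip, predict the noise."""
    t = torch.rand(clip.shape[0], 1)            # random noise level per sample
    noise = torch.randn_like(clip)
    noisy = (1 - t) * clip + t * noise          # simplified linear noise schedule
    predicted = denoiser(noisy)                 # guess the noise that was injected
    loss = nn.functional.mse_loss(predicted, noise)  # the "denoising" objective
    optimizer.zero_grad()
    loss.backward()                             # the "gradient maths"
    optimizer.step()
    return loss.item()

# Flattened stand-in for a batch of eight short video clips.
print(training_step(torch.randn(8, 1024)))
```

The notable thing about this loop is what it lacks: there is no physics engine, no explicit model of light or cloth, just the pressure to undo noise across an enormous amount of video, which is exactly why claims of an emergent, “implicit” physical understanding are so difficult to verify from the outside.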

Video: Sora, from OpenAI.
Prompt: A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.

Is it feasible for an understanding of physics to be an emergent property of a model trained to create believable video out of noise? That seems, on the surface, to be a complete reversal of how we understand knowledge and intelligence, but let’s be real: there’s a lot about the generative AI goldrush that’s challenging assumptions in both of those areas.

And it’s difficult to argue with the results. Look hard enough and you’ll find issues with every Sora demo video, and those issues speak to a lack of physical understanding (people disappearing into walls, characters emerging out of thin air, and so on), but the best Sora videos showcase extremely accurate material behaviours. Videos like this one (from OpenAI’s official TikTok account) stretch believability in obvious ways (it’s a dog playing PC games!) and subtle ones (the blue light seemingly comes from behind the display, bypassing it), but the behaviour of the hoodie in the video is more than accurate enough to pass muster for marketing purposes.

So if we run that argument forwards, with all the concerns we’ve just raised, and land in a scenario where we can accurately simulate light, materials, and soft-body and fabric physics through a scaled-up version of this kind of architecture, is that sort of opaque, black-box simulation something we actually want? That might seem like a big thing to be pondering when most people are thinking about creating short marketing videos, but it may be something the industry (and the world) needs to reckon with in the future.

This is precisely the kind of philosophical question The Interline is going to pick up in our upcoming AI Report, but back to the here and now: in addition to being able to generate a video solely from text instructions, Sora can take an existing still image and generate a video from it, animating the image’s contents with accuracy and attention to detail. The model can also take an existing video and extend it or fill in missing frames, and even combine two standalone videos into a single shot.

Video: Sora, from OpenAI.
Prompt: A young man at his 20s is sitting on a piece of cloud in the sky, reading a book.

So perhaps you, like us at The Interline, are eager to take Sora for a spin and uncover where the limits really lie. Unfortunately, for now, Sora is only available to “red teamers” (domain experts in areas like misinformation, hateful content, and bias) who are assessing the model for potential harms and risks. OpenAI is also offering access to some visual artists, designers, and filmmakers, to gather feedback on how to make the model most helpful for creative professionals.

The company also plans to include C2PA provenance metadata (the Coalition for Content Provenance and Authenticity standard) if the model is deployed in an OpenAI product, as it did with its text-to-image tool DALL-E 3 earlier this month.
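For context, C2PA works by attaching a signed “manifest” of provenance claims to the media file itself. The snippet below is a simplified, hypothetical illustration of the kind of manifest a generative tool might embed; the field names are loosely modelled on the shape of C2PA’s JSON manifests, and it is not a spec-compliant or complete implementation (a real manifest is cryptographically signed by a conforming tool).

```python
import json
from datetime import datetime, timezone

# Hypothetical, simplified provenance manifest, loosely modelled on the
# shape of a C2PA JSON manifest. Field values are illustrative assumptions.
manifest = {
    "claim_generator": "example-video-generator/1.0",  # assumed tool name
    "title": "generated_clip.mp4",
    "assertions": [
        {
            "label": "c2pa.actions",
            "data": {
                "actions": [
                    {
                        "action": "c2pa.created",
                        # IPTC term commonly used to flag AI-generated media
                        "digitalSourceType": "trainedAlgorithmicMedia",
                        "when": datetime.now(timezone.utc).isoformat(),
                    }
                ]
            },
        }
    ],
}

print(json.dumps(manifest, indent=2))
```

The design point worth noting is that provenance travels with the asset rather than with the platform hosting it; the metadata can still be stripped, which is why C2PA is best understood as a disclosure mechanism rather than a watermark.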

Confining ourselves to speculating just about the near term, then: how useful is all of this going to be for fashion? The answer is largely going to depend both on the solutions that are built on the model going forward and on what Sora’s foundations actually are.

If OpenAI does roll Sora out in consumer and enterprise products, which is highly likely based on the release cadence of ChatGPT, its off-the-shelf iteration is not likely to be instantly valuable to brands and retailers because it won’t have been trained on their data – transporting us right back into the copyright quagmire that has plagued models like Midjourney.

Video: Sora, from OpenAI.
Prompt: Extreme close up of a 24 year old woman’s eye blinking, standing in Marrakech during magic hour, cinematic film shot in 70mm, depth of field, vivid colors, cinematic

But it’s a short hop from an off-the-shelf version of this model (or one like it) to a custom one trained on brand-specific products, as well as things like a particular fashion company’s previous campaigns. And when it comes to training on something as granular as individual products, ControlNet conditioning and the LoRA (low-rank adaptation) method could offer a compelling, lightweight alternative to full fine-tuning, arriving at an accurate, on-brand representation for static images and video, and sidestepping the problem of hallucinations to some extent; a minimal sketch of the LoRA idea follows below.
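To make the LoRA idea concrete, here is a minimal sketch in PyTorch. It is a generic illustration of low-rank adaptation under our own assumptions (the layer sizes, rank, and scaling are made up), not OpenAI’s or any vendor’s implementation: the pretrained weight matrix stays frozen while a small, trainable low-rank update is learned on top of it.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA).

    Only the factors A and B are trained, so the effective weight becomes
    W + (alpha / r) * B @ A while the pretrained W never changes.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():   # freeze the pretrained weights
            param.requires_grad = False
        # W has shape (out, in), so A is (r, in) and B is (out, r).
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: training starts at the base model
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

# Usage: adapt a single (hypothetical) projection layer of a pretrained network.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))               # (batch, features) in and out
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")    # ~12k, versus ~590k in the base layer
```

The appeal for a brand is that the expensive base model is trained once, while each adapter (one per product line or campaign aesthetic, say) is a small bundle of weights that can be trained on comparatively little proprietary imagery and swapped in at inference time.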

But on that point: just what is Sora trained on? The answer is not publicly known, but it will determine just how precarious this model’s foundations are. In other words, if Sora was trained on more than just open-source data and the fruits of OpenAI’s multi-year training partnership with Shutterstock, then this could all be just one big copyright lawsuit away from essentially falling apart.

And this is something that, until now, the AI community has (at least publicly) not wanted to confront, working on the assumption that established “fair use” structures will stand up in this entirely new era. But this week, an article came out urging (read: warning) AI companies to confront exactly this question, as they may face “legal peril” in the coming years. One of the key reasons: they may not be able to rely on the fair use precedent (notably not a precedent in the strict legal sense, since fair use is decided case by case) established in the wake of Google’s book-scanning project, which began back in 2004.

Video: Sora, from OpenAI.
Prompt: Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.

Twenty years ago this year, Google took on the ambitious task of digitising millions of books for inclusion in its book search engine. The project was met with legal opposition from authors and publishers, who claimed that copying copyrighted works without permission was unlawful, and Google defended its actions by citing the principle of fair use. But the factors that Google relied on (specifically the nature of the use, and how the use impacts the market for the original work) may not lead to the same result when applied to AI companies who are, for all intents and purposes, training on data from companies whose offerings they then directly compete with.

The question of fair use is also swirling around the recent New York Times versus OpenAI lawsuit, which centres on the Times accusing OpenAI of infringing copyright by training its large language models (LLMs) on Times content, alongside the accusation that ChatGPT can, on command, reproduce entire NYT works, suggesting that the model does, by some definition, “contain” a copy of those original works.

What makes this case intriguing is that it differs from proceedings in similar fair use cases by focusing on the value of the training data, and on a new question relating to reputational damage. As a counter to OpenAI’s fair use defence, the Times intends to point to what it claims to be the accuracy, trustworthiness, and prestige of its reporting, all of which make for a particularly valuable dataset that was used for training without any kind of commercial agreement. And in terms of reputational damage, the Times claims that harm is also being done to its brand through the “hallucinations” that ChatGPT produces: readers may rely on the information, thinking it trustworthy, when it may in fact be false.

The issue at hand, as you would probably expect given that we’ve spent this week’s analysis talking about the nature of knowledge and the gulf between representation and understanding, is that AI is taking regular, massive leaps forward, far surpassing the impact of an invention like Google Books. AI is poised to turn our comprehension of copyright upside down, alongside the compensation mechanisms for content owners.

The best from The Interline:

Dakota Murphey rethinks retail planning, and asks how automation is augmenting outdated processes.

We talk to the CEO of Optitex on why the intersection of art and science – and the pursuit of accuracy and precision – will be vital to realising the vision for 3D.

Our Editor-in-Chief asks where 3D and digital product creation are going, and what it will look like when we get there.

From digital twins to data democratisation, Mark Harrop explores the disruptive technologies reshaping the fashion value-chain.

We talk to PTTRNS’ Director of Digital Strategy on the importance of thinking big, starting small, and learning fast when it comes to realising the extended value of 3D and digital product creation.