The generative AI company behind ChatGPT and DALL-E has a new toy: Sora, a text-to-video model that can (sometimes) generate pretty convincing 60-second clips from prompts like "a stylish woman walks down a Tokyo street..." and "a movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet..."
A lot of the AI video generation we've seen so far fails to sustain a consistent reality, redesigning faces and clothing and objects from one frame to the next. Sora, however, "understands not only what the user has asked for in the prompt, but also how those things exist in the physical world," says OpenAI in its announcement post (using the word "understands" loosely).
The Sora clips are impressive. If I weren't looking closely (say, just scrolling past them on social media), I'd probably think many of them were real. The prompt "a Chinese Lunar New Year celebration video with Chinese Dragon" looks at first like typical documentary footage of a parade. But then you realize that the people are oddly proportioned and seem to be stumbling; it's like the moment in a dream when you suddenly notice that everything is a little bit wrong. Creepy.
"The current model has weaknesses," writes OpenAI. "It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark. The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory."
My favorite demonstration of Sora's weaknesses is a video in which a plastic chair begins morphing into a Cronenberg lifeform.
Sora is not currently available to the public; OpenAI says it's assessing the model's social risks and working on mitigating them.