
ML Benkyoukai: SORA

Whenever there’s an advance in charismatic domains like video generation, a media frenzy inevitably follows. In the case of SORA, for once, the frenzy may be justified: OpenAI’s new model marks a qualitative shift in what is possible in video generation and adds another point in favour of brute scaling. Leaving aside the possible impact on the stock video industry, is there anything interesting we can glean from the technical report? Any secret sauce beyond more parameters → more performance?

In this session of the AIIF’s ML Benkyoukai, we’ll study the details of video diffusion, deduce as much as we can about SORA’s architecture from what is publicly available, and engage in wild speculation as to what comes next.

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

Brooks, Peebles, et al., “Video generation models as world simulators”, OpenAI 2024
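
To make the “spacetime patches” idea in the quoted abstract a little more concrete, here is a minimal sketch of how a latent video tensor might be cut into non-overlapping spacetime patches and flattened into a token sequence for a transformer. The shapes, patch sizes, and function name are illustrative assumptions for the discussion, not Sora’s actual implementation.

import torch

def video_latents_to_patch_tokens(latents, patch_t=2, patch_h=2, patch_w=2):
    """Cut latent video codes into spacetime patches and flatten each patch
    into one token, so a transformer can attend over space and time jointly.

    latents: (batch, channels, frames, height, width) latent codes,
             e.g. the output of a video encoder (assumed shape).
    returns: (batch, num_patches, patch_dim) token sequence.
    """
    b, c, t, h, w = latents.shape
    assert t % patch_t == 0 and h % patch_h == 0 and w % patch_w == 0

    # Group the latent grid into non-overlapping (patch_t, patch_h, patch_w) blocks.
    x = latents.reshape(
        b, c,
        t // patch_t, patch_t,
        h // patch_h, patch_h,
        w // patch_w, patch_w,
    )
    # Put the block indices together and the within-block dims together.
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)
    # One token per spacetime patch.
    return x.reshape(b, -1, c * patch_t * patch_h * patch_w)

# Example: 16 latent frames on a 32x32 latent grid with 4 channels
# -> 8 * 16 * 16 = 2048 tokens, each of dimension 4 * 2 * 2 * 2 = 32.
tokens = video_latents_to_patch_tokens(torch.randn(1, 4, 16, 32, 32))
print(tokens.shape)  # torch.Size([1, 2048, 32])

Because images are just single-frame videos, the same patching scheme covers both, which is one plausible reading of how the abstract’s joint training on videos and images of variable durations, resolutions and aspect ratios could work.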
