Sora is a diffusion model21,22,23,24,25; given input noisy patches
(and conditioning information like text prompts), it’s trained to predict the original “clean” patches.
Importantly, Sora is a diffusion transformer.26 Transformers have demonstrated remarkable scaling
properties across a variety of domains, including language modeling,13,14 computer vision,15,16,17,18
and image generation.27,28,29
In this work, we find that diffusion transformers scale effectively
as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as
training progresses. Sample quality improves markedly as training compute increases.