What is a Video Diffusion Model?
A video diffusion model is a type of generative neural network that creates video by starting from random noise and iteratively removing that noise (denoising) over many steps, guided by a conditioning signal such as a text prompt or image, until coherent video frames emerge. It is the core architecture behind models like Sora 2, Veo 3.1, Kling 3, and Wan 2.6.
How It Works
The diffusion process has two phases. During training, the model learns to reverse a noise-adding process: given a real video, it adds Gaussian noise at increasing levels and trains a neural network to predict and remove that noise at each level. After training, the model can start from pure noise and denoise it step-by-step into a realistic video.
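The forward (noise-adding) half of training can be sketched in a few lines. This is a toy illustration, not any specific model's implementation: the linear beta schedule, step count, and toy "video" shape are all illustrative assumptions.

```python
import numpy as np

def make_alpha_bars(num_steps: int, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear noise schedule (assumed here)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bars, rng):
    """Sample a noisy version x_t of the clean video x0 at noise level t."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bars[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    # During training, the network sees (x_t, t) and learns to predict eps.
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars(1000)
video = rng.standard_normal((8, 16, 16))  # toy (frames, height, width) "video"
x_t, eps = add_noise(video, t=999, alpha_bars=alpha_bars, rng=rng)
# At the final step almost all signal is gone: x_t is nearly pure noise,
# which is exactly where generation starts at inference time.
```

Generation runs this in reverse: starting from pure noise, the trained network's noise prediction is subtracted out a little at each step until a clean sample remains.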
Modern video diffusion models operate in a compressed latent space rather than directly on pixels. A Variational Autoencoder (VAE) first encodes video frames into a lower-dimensional latent representation, reducing the computational cost by 8-64x. The diffusion process runs entirely in this latent space, and the VAE decoder converts the final latent back to pixel video.
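A back-of-envelope calculation shows where that savings comes from. The downsampling factors (8x per spatial dimension, 4x temporal) and channel counts (3 RGB channels in, 16 latent channels out) below are illustrative assumptions; real VAEs vary by model.

```python
def compression_ratio(frames, height, width,
                      spatial=8, temporal=4,
                      pixel_ch=3, latent_ch=16):
    """Element-count ratio between pixel video and its latent representation."""
    pixel_elems = frames * height * width * pixel_ch
    latent_elems = ((frames // temporal)
                    * (height // spatial)
                    * (width // spatial)
                    * latent_ch)
    return pixel_elems / latent_elems

# A 5-second 1080p clip at 24 fps, under the assumed compression factors:
print(compression_ratio(120, 1080, 1920))  # 48.0
```

Since attention cost grows faster than linearly with sequence length, shrinking the representation this much is what makes running a transformer over whole video clips feasible.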
The denoising backbone in state-of-the-art models is typically a Diffusion Transformer (DiT). Unlike older U-Net architectures, DiT treats video as a sequence of spacetime patches and applies multi-head self-attention across both spatial and temporal dimensions. This enables the model to maintain consistency across frames — objects keep their shape, lighting stays coherent, and motion flows naturally.
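The patching step can be sketched as follows. The patch sizes (2 frames x 16 x 16 pixels) are illustrative assumptions; the point is only that a video tensor becomes a flat sequence of tokens that self-attention can operate on.

```python
import numpy as np

def patchify(video, pt=2, ph=16, pw=16):
    """Split a (frames, height, width, channels) video into spacetime patches.

    Returns (num_tokens, token_dim): one flattened token per patch.
    """
    f, h, w, c = video.shape
    assert f % pt == 0 and h % ph == 0 and w % pw == 0
    v = video.reshape(f // pt, pt, h // ph, ph, w // pw, pw, c)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)   # bring the three patch axes together
    return v.reshape(-1, pt * ph * pw * c)  # one row per spacetime patch

tokens = patchify(np.zeros((16, 256, 256, 3)))
print(tokens.shape)  # (2048, 1536)
```

Because every token can attend to every other token, a patch in frame 1 directly influences a patch in frame 16, which is how the model keeps objects and lighting consistent over time.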
Conditioning signals (text prompts, images) are injected through cross-attention layers. A text encoder (CLIP or T5) converts the prompt into embeddings that guide the denoising at every step. Classifier-free guidance amplifies the influence of the conditioning, producing outputs that more closely match the prompt at the cost of some diversity.
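The guidance arithmetic itself is a one-liner. In this sketch the two inputs stand in for the denoising network's noise predictions with and without the prompt; the scale value is an illustrative assumption (not a documented default of any particular model).

```python
import numpy as np

def cfg(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. A scale of 1.0 recovers plain conditioning;
    larger values push the sample harder toward the prompt."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.1, 0.2])  # stand-in: prediction without the prompt
eps_c = np.array([0.3, 0.1])  # stand-in: prediction with the prompt
print(cfg(eps_u, eps_c, guidance_scale=2.0))  # [0.5 0. ]
```

Note the trade-off mentioned above: because every sample is pushed in the prompt's direction, higher scales yield closer prompt adherence but less varied outputs.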
Use Cases
- Text-to-video generation — The primary application. Models like Sora 2 and Veo 3.1 use video diffusion to generate clips from text descriptions.
- Image animation — Conditioning on a source image produces image-to-video output where the diffusion model generates plausible motion from a static starting point.
- Video super-resolution — Diffusion models can upscale low-resolution video by treating the low-res input as a noisy version of the high-res target.
- Frame interpolation — Generating intermediate frames between two keyframes to increase frame rate or create slow-motion effects.
Video Diffusion Models on Kensa
Kensa provides access to five video diffusion models, each with different architectures and strengths. Sora 2 uses a DiT backbone with spacetime patches for cinematic realism. Veo 3.1 optimizes for speed with fewer denoising steps. Kling 3 specializes in character motion through enhanced temporal modeling.
You do not need to understand the underlying architecture to use these models — Kensa abstracts the complexity. But understanding diffusion models helps you appreciate why different models produce different results. Try them on the video generator.
Frequently Asked Questions
- What is the difference between a diffusion model and a GAN for video?
- How many denoising steps does a video diffusion model use?
- Can video diffusion models generate long videos?
See Diffusion Models in Action
Try Sora 2, Veo 3.1, Kling 3, and more on Kensa. Free credits, no credit card required.
Start Generating