Video generation models like Sora can solve mazes, manipulate objects, and answer math questions - all by generating video. But how do they reason? The intuitive answer: step by step, frame by frame, like a person drawing a solution on a whiteboard.
That answer is wrong.
The paper “Demystifying Video Reasoning” shows that reasoning in video diffusion models doesn’t unfold across frames. It unfolds across denoising steps - the iterative process that turns noise into a coherent video. The authors call this Chain-of-Steps (CoS), and it fundamentally changes how we understand what these models are doing.
Background: Video as a Reasoning Medium
Recent work - notably Thinking with Video (2511.04570) - demonstrated that video generation models can solve reasoning tasks. Sora-2 achieved 40% on geometric puzzles (beating VLMs), 98.9% on GSM8K math, and 92% on MATH-500, all by generating video that visually works through the problem.
The assumed explanation was Chain-of-Frames (CoF): the model reasons sequentially, with each frame building on the previous one, like a flip-book proof. Frame 1 sets up the problem, frame 10 takes the first step, frame 30 reaches the answer.
Intuitive. Elegant. And mostly wrong.
Chain-of-Steps: Where Reasoning Actually Happens
The Core Discovery
The authors probe the internal mechanics of diffusion models (generative models that create data by reversing a noise-addition process, iteratively ‘denoising’ pure noise into a coherent image or video) during video generation and find something surprising: reasoning happens along the denoising axis, not the temporal axis.
In a diffusion process, the model starts with pure noise and iteratively refines it over many steps (typically 20-100) into a clean video; early steps establish structure, late steps add detail. The authors show that:
- Early denoising steps - the model explores multiple candidate solutions simultaneously. The latent representation contains overlapping, contradictory possibilities.
- Middle denoising steps - candidates are progressively pruned. The model narrows toward a specific solution path.
- Late denoising steps - the final answer crystallizes. Details are refined, but the solution is already committed.
This is Chain-of-Steps (CoS): reasoning as progressive convergence through denoising, not sequential construction through frames.
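This explore-prune-commit dynamic can be illustrated with a toy numerical sketch (our construction, not the paper’s probing code): several candidate latents start as pure noise and contract toward a shared solution, and the spread across candidates shrinks along the denoising axis.

```python
# Toy illustration of Chain-of-Steps convergence: 8 candidate latent
# trajectories start as noise and are pulled toward one "solution" latent.
# The inter-candidate spread is large early (many overlapping candidates)
# and collapses late (the answer is committed).
import numpy as np

rng = np.random.default_rng(0)
solution = rng.normal(size=16)        # the converged answer latent
latents = rng.normal(size=(8, 16))    # 8 candidate trajectories

spread = []                           # inter-candidate spread per step
for step in range(20):
    # one denoising step: each candidate moves a fraction toward the solution
    latents += 0.25 * (solution - latents)
    spread.append(latents.std(axis=0).mean())

# early steps: high spread (exploration); late steps: near zero (commitment)
assert spread[0] > 10 * spread[-1]
```

Here the “denoising” is a plain contraction, so the collapse is guaranteed; in a real model the pruning is learned, but the measurement (latent spread versus denoising step) is the same kind of probe.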
Why It Matters
The distinction is critical. In CoF, each frame is a reasoning step - you need many frames for complex problems, and the model must maintain logical coherence across the temporal axis. In CoS, the reasoning happens “vertically” through denoising depth, and frames are more like different spatial views of the same converged solution.
This means:
- More denoising steps = deeper reasoning, not more frames
- The model doesn’t need long videos to solve complex problems
- Reasoning quality is tied to the denoising schedule (the sequence of noise levels that governs how much noise is removed at each step), not video length
Emergent Reasoning Behaviors
Beyond CoS, the authors identify three emergent behaviors that arise spontaneously during generation:
1. Working Memory
The model maintains persistent reference points across denoising steps. When solving a maze, it doesn’t rediscover the walls at every step - it establishes their positions early and maintains them as a stable scaffold while exploring paths.
This is analogous to how humans keep relevant constraints “loaded” in working memory while reasoning through a problem.
2. Self-Correction and Enhancement
Perhaps most surprisingly, the model can recover from incorrect intermediate solutions. During early denoising steps, the latent representation may contain an incorrect path through a maze or a wrong digit in a calculation. By the middle steps, these errors can be corrected - without any explicit error-detection mechanism.
The model doesn’t just converge monotonically toward an answer. It explores, makes mistakes, and fixes them - all within the denoising trajectory.
3. Perception Before Action
Early denoising steps establish semantic grounding: understanding the scene, identifying objects, recognizing the problem structure. Only after this perceptual foundation is laid do later steps perform structured manipulation: moving pieces, drawing paths, computing results.
This mirrors the human pattern of “look before you leap” - but it emerges spontaneously from the diffusion process, not from explicit architectural design.
Functional Specialization Within the Transformer
The authors go deeper, probing what happens within a single denoising step inside the Diffusion Transformer (DiT) - the architecture combining diffusion with transformer attention that now dominates high-quality video generation (Sora, Kling, VEO):
| Layer Group | Function | Role |
|---|---|---|
| Early layers | Dense perceptual encoding | Extract visual features, understand scene structure |
| Middle layers | Reasoning execution | Perform logical operations, spatial manipulation |
| Later layers | Representation consolidation | Merge and stabilize the latent into coherent output |
This is a self-evolved functional specialization - the model wasn’t designed with separate perception and reasoning modules. The division of labor emerged from training.
The implication: within each denoising step, the model runs a mini pipeline of perceive → reason → consolidate. Across denoising steps, these mini pipelines progressively refine the solution. It’s reasoning nested within reasoning.
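The nesting can be pictured schematically (hypothetical function names; the three layer groups are stubbed with trivial callables so the control flow actually runs):

```python
# Each denoising step internally runs perceive -> reason -> consolidate
# through the DiT's layer groups; the outer loop over the schedule is the
# Chain-of-Steps refinement.
def denoise_step(latent, t, early_layers, middle_layers, late_layers):
    features = early_layers(latent, t)   # dense perceptual encoding
    plan = middle_layers(features, t)    # reasoning execution
    return late_layers(plan, t)          # representation consolidation

def generate(latent, schedule, layers):
    for t in schedule:                   # outer loop: progressive refinement
        latent = denoise_step(latent, t, *layers)
    return latent

# trivial stand-in "layer groups" just to make the sketch executable
early = lambda x, t: 0.9 * x
middle = lambda x, t: x + 0.1
late = lambda x, t: x

out = generate(2.0, schedule=[1.0, 0.5, 0.0], layers=(early, middle, late))
```

The point is the shape of the computation, not the stub arithmetic: a fixed per-step pipeline, iterated along the denoising axis.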
Practical Application: Seed Ensemble
Motivated by the CoS insight, the authors propose a simple, training-free proof-of-concept: run the same model multiple times with different random seeds and ensemble the latent trajectories.
Why does this work? Because:
- Different seeds explore different regions of the solution space during early denoising steps
- Ensembling the latent trajectories combines these explorations
- The result is a broader search over possible solutions before convergence
This is conceptually similar to self-consistency in LLMs (sample multiple answers with different random seeds, take the majority vote), but operating in the continuous latent space of diffusion models rather than on discrete tokens.
The strategy improves reasoning performance across benchmarks - without any retraining, architectural changes, or additional data.
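A minimal sketch of why averaging helps (our toy stand-in: the real video model is replaced by a hypothetical contraction-plus-noise `denoise_step`; only the combining of latent trajectories from different seeds mirrors the paper’s strategy):

```python
# Toy seed ensemble: run several trajectories from different seeds, then
# average their final latents. Seed-specific exploration noise cancels in
# the mean, pulling the ensemble closer to the solution.
import numpy as np

TARGET = np.zeros(16)  # stand-in for the "correct solution" latent

def denoise_step(z, rng, rate=0.2, noise_scale=0.1):
    # one refinement step: move toward the solution, plus stochastic search
    return z + rate * (TARGET - z) + noise_scale * rng.normal(size=z.shape)

def run_trajectory(seed, num_steps=30):
    rng = np.random.default_rng(seed)
    z = rng.normal(size=TARGET.shape)  # each seed starts from different noise
    for _ in range(num_steps):
        z = denoise_step(z, rng)
    return z

finals = [run_trajectory(s) for s in range(8)]        # 8 independent seeds
err_singles = [np.abs(z).mean() for z in finals]      # per-seed error
err_ensemble = np.abs(np.mean(finals, axis=0)).mean() # error of the ensemble

# by the triangle inequality the ensemble error is at most the mean
# single-seed error, and the independent noise makes it strictly smaller
assert err_ensemble < np.mean(err_singles)
```

In the real method the averaging happens over latent trajectories inside the diffusion process rather than over final toy vectors, but the variance-reduction logic is the same.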
What This Changes
For Video Model Design
If reasoning lives in denoising depth, then:
- Denoising schedules should be optimized for reasoning, not just visual quality
- Models could allocate more steps to “hard” regions and fewer to “easy” ones
- Architectures that deepen the per-step computation (more layers, more attention) may improve reasoning more than adding frames
For Understanding AI Reasoning
The finding that self-correction emerges spontaneously in diffusion models is striking. LLMs struggle with self-correction - they often can’t identify their own errors without external feedback. Diffusion models, operating in continuous latent space with iterative refinement, seem to develop this capability naturally.
This suggests that iterative refinement in continuous space may be a more natural substrate for certain types of reasoning than autoregressive token generation.
For Practical Applications
- Maze solving, path planning, geometric reasoning: these tasks benefit from deeper denoising, not longer videos
- The seed ensemble strategy is a training-free improvement that costs only extra generation-time compute and applies to any diffusion model
- Entropy or uncertainty in early steps could predict final reasoning quality - enabling early stopping or resampling
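The last point can be sketched with a toy probe (our construction, assuming a simple attractor model of denoising): when a problem admits multiple competing answers, trajectories from different seeds disagree, and that disagreement is measurable after only a few early steps.

```python
# Toy early-uncertainty probe: run a small batch of seed trajectories for a
# few early "denoising" steps, each pulled toward its nearest candidate
# solution. High spread across seeds flags an unresolved problem - a
# candidate for early stopping or resampling.
import numpy as np

def early_spread(attractors, num_seeds=16, num_steps=5, rate=0.3, seed=0):
    rng = np.random.default_rng(seed)
    zs = rng.normal(size=num_seeds)            # one trajectory per seed
    attractors = np.asarray(attractors, dtype=float)
    for _ in range(num_steps):
        # each trajectory moves toward whichever candidate solution is nearest
        nearest = attractors[
            np.argmin(np.abs(zs[:, None] - attractors[None, :]), axis=1)
        ]
        zs += rate * (nearest - zs)
    return float(zs.std())  # disagreement across seeds after early steps

easy = early_spread([1.0])        # one clear solution: seeds agree quickly
hard = early_spread([-1.0, 1.0])  # ambiguous: seeds split between answers

assert hard > easy  # early spread separates ambiguous from clear problems
```

In a real model the analogous signal would be entropy or variance of the latent across seeds (or across a predicted distribution) at early timesteps, used as a cheap proxy for final answer quality.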
Summary
Chain-of-Steps reframes how video diffusion models reason:
- Reasoning happens along denoising steps, not across frames
- Early steps explore multiple solutions; late steps converge to one
- Three emergent behaviors arise: working memory, self-correction, perception before action
- Within each step, transformer layers self-specialize: perceive → reason → consolidate
- A simple seed ensemble leverages these dynamics for training-free improvement
The deeper message: diffusion models aren’t just pattern matchers that happen to produce plausible video. They develop internal reasoning dynamics - exploring, correcting, and converging - that emerge from the structure of the denoising process itself. Understanding these dynamics is the first step toward harnessing them.
Links
- Based on the publication arXiv:2603.16870
- Project page: wruisi.com/demystifying_video_reasoning
- Related: Thinking with Video (arXiv:2511.04570)