Video generation models like Sora can solve mazes, manipulate objects, and answer math questions - all by generating video. But how do they reason? The intuitive answer: step by step, frame by frame, like a person drawing a solution on a whiteboard.
That answer is wrong.
The paper “Demystifying Video Reasoning” shows that reasoning in video diffusion models doesn’t unfold across frames. It unfolds across denoising steps - the iterative process that turns noise into a coherent video. The authors call this Chain-of-Steps (CoS), and it fundamentally changes how we understand what these models are doing.
Background: Video as a Reasoning Medium
Recent work - notably Thinking with Video (2511.04570) - demonstrated that video generation models can solve reasoning tasks. Sora-2 achieved 40% on geometric puzzles (beating VLMs), 98.9% on GSM8K math, and 92% on MATH-500, all by generating video that visually works through the problem.
The assumed explanation was Chain-of-Frames (CoF): the model reasons sequentially, with each frame building on the previous one, like a flip-book proof. Frame 1 sets up the problem, frame 10 takes the first step, frame 30 reaches the answer.
Intuitive. Elegant. And mostly wrong.
Chain-of-Steps: Where Reasoning Actually Happens
The Core Discovery
The authors probe the internal mechanics of diffusion models (generative models that create data by reversing a noise-addition process, iteratively ‘denoising’ pure noise into a coherent image or video) during video generation and find something surprising: reasoning happens along the denoising axis, not the temporal axis.
In a diffusion process, the model starts with pure noise and iteratively refines it over many steps (typically 20-100) into a clean video; early steps establish structure, late steps add detail. The authors show that:
- Early denoising steps - the model explores multiple candidate solutions simultaneously. The latent representation contains overlapping, contradictory possibilities.
- Middle denoising steps - candidates are progressively pruned. The model narrows toward a specific solution path.
- Late denoising steps - the final answer crystallizes. Details are refined, but the solution is already committed.
This is Chain-of-Steps (CoS): reasoning as progressive convergence through denoising, not sequential construction through frames.
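This explore-prune-commit dynamic can be illustrated with a toy numerical sketch (our construction, not the paper’s probing code): several candidate latents start as pure noise and contract toward a shared solution, and the spread across candidates shrinks along the denoising axis.

```python
# Toy illustration of Chain-of-Steps convergence: 8 candidate latent
# trajectories start as noise and are pulled toward one "solution" latent.
# The inter-candidate spread is large early (many overlapping candidates)
# and collapses late (the answer is committed).
import numpy as np

rng = np.random.default_rng(0)
solution = rng.normal(size=16)        # the converged answer latent
latents = rng.normal(size=(8, 16))    # 8 candidate trajectories

spread = []                           # inter-candidate spread per step
for step in range(20):
    # one denoising step: each candidate moves a fraction toward the solution
    latents += 0.25 * (solution - latents)
    spread.append(latents.std(axis=0).mean())

# early steps: high spread (exploration); late steps: near zero (commitment)
assert spread[0] > 10 * spread[-1]
```

Here the “denoising” is a plain contraction, so the collapse is guaranteed; in a real model the pruning is learned, but the measurement (latent spread versus denoising step) is the same kind of probe.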
Why It Matters
The distinction is critical. In CoF, each frame is a reasoning step - you need many frames for complex problems, and the model must maintain logical coherence across the temporal axis. In CoS, the reasoning happens “vertically” through denoising depth, and frames are more like different spatial views of the same converged solution.
This means:
- More denoising steps = deeper reasoning, not more frames
- The model doesn’t need long videos to solve complex problems
- Reasoning quality is tied to the denoising schedule (the sequence of noise levels that governs how much noise is removed at each step), not video length
Emergent Reasoning Behaviors
Beyond CoS, the authors identify three emergent behaviors that arise spontaneously during generation:
1. Working Memory
The model maintains persistent reference points across denoising steps. When solving a maze, it doesn’t rediscover the walls at every step - it establishes their positions early and maintains them as a stable scaffold while exploring paths.
This is analogous to how humans keep relevant constraints “loaded” in working memory while reasoning through a problem.
2. Self-Correction and Enhancement
Perhaps most surprisingly, the model can recover from incorrect intermediate solutions. During early denoising steps, the latent representation may contain an incorrect path through a maze or a wrong digit in a calculation. By the middle steps, these errors can be corrected - without any explicit error-detection mechanism.
The model doesn’t just converge monotonically toward an answer. It explores, makes mistakes, and fixes them - all within the denoising trajectory.
3. Perception Before Action
Early denoising steps establish semantic grounding: understanding the scene, identifying objects, recognizing the problem structure. Only after this perceptual foundation is laid do later steps perform structured manipulation: moving pieces, drawing paths, computing results.
This mirrors the human pattern of “look before you leap” - but it emerges spontaneously from the diffusion process, not from explicit architectural design.
Functional Specialization Within the Transformer
The authors go deeper, probing what happens within a single denoising step inside the Diffusion Transformer (DiT) - the architecture combining diffusion with transformer attention that now dominates high-quality video generation (Sora, Kling, VEO):
| Layer Group | Function | Role |
|---|---|---|
| Early layers | Dense perceptual encoding | Extract visual features, understand scene structure |
| Middle layers | Reasoning execution | Perform logical operations, spatial manipulation |
| Later layers | Representation consolidation | Merge and stabilize the latent into coherent output |
This is a self-evolved functional specialization - the model wasn’t designed with separate perception and reasoning modules. The division of labor emerged from training.
The implication: within each denoising step, the model runs a mini pipeline of perceive → reason → consolidate. Across denoising steps, these mini pipelines progressively refine the solution. It’s reasoning nested within reasoning.
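The nesting can be pictured schematically (hypothetical function names; the three layer groups are stubbed with trivial callables so the control flow actually runs):

```python
# Each denoising step internally runs perceive -> reason -> consolidate
# through the DiT's layer groups; the outer loop over the schedule is the
# Chain-of-Steps refinement.
def denoise_step(latent, t, early_layers, middle_layers, late_layers):
    features = early_layers(latent, t)   # dense perceptual encoding
    plan = middle_layers(features, t)    # reasoning execution
    return late_layers(plan, t)          # representation consolidation

def generate(latent, schedule, layers):
    for t in schedule:                   # outer loop: progressive refinement
        latent = denoise_step(latent, t, *layers)
    return latent

# trivial stand-in "layer groups" just to make the sketch executable
early = lambda x, t: 0.9 * x
middle = lambda x, t: x + 0.1
late = lambda x, t: x

out = generate(2.0, schedule=[1.0, 0.5, 0.0], layers=(early, middle, late))
```

The point is the shape of the computation, not the stub arithmetic: a fixed per-step pipeline, iterated along the denoising axis.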
Practical Application: Seed Ensemble
Motivated by the CoS insight, the authors propose a simple, training-free proof-of-concept: run the same model multiple times with different random seeds and ensemble the latent trajectories.
Why does this work? Because:
- Different seeds explore different regions of the solution space during early denoising steps
- Ensembling the latent trajectories combines these explorations
- The result is a broader search over possible solutions before convergence
This is conceptually similar to self-consistency in LLMs (sample multiple answers with different random seeds, take the majority vote), but operating in the continuous latent space of diffusion models rather than on discrete tokens.
The strategy improves reasoning performance across benchmarks - without any retraining, architectural changes, or additional data.
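A minimal sketch of why averaging helps (our toy stand-in: the real video model is replaced by a hypothetical contraction-plus-noise `denoise_step`; only the combining of latent trajectories from different seeds mirrors the paper’s strategy):

```python
# Toy seed ensemble: run several trajectories from different seeds, then
# average their final latents. Seed-specific exploration noise cancels in
# the mean, pulling the ensemble closer to the solution.
import numpy as np

TARGET = np.zeros(16)  # stand-in for the "correct solution" latent

def denoise_step(z, rng, rate=0.2, noise_scale=0.1):
    # one refinement step: move toward the solution, plus stochastic search
    return z + rate * (TARGET - z) + noise_scale * rng.normal(size=z.shape)

def run_trajectory(seed, num_steps=30):
    rng = np.random.default_rng(seed)
    z = rng.normal(size=TARGET.shape)  # each seed starts from different noise
    for _ in range(num_steps):
        z = denoise_step(z, rng)
    return z

finals = [run_trajectory(s) for s in range(8)]        # 8 independent seeds
err_singles = [np.abs(z).mean() for z in finals]      # per-seed error
err_ensemble = np.abs(np.mean(finals, axis=0)).mean() # error of the ensemble

# by the triangle inequality the ensemble error is at most the mean
# single-seed error, and the independent noise makes it strictly smaller
assert err_ensemble < np.mean(err_singles)
```

In the real method the averaging happens over latent trajectories inside the diffusion process rather than over final toy vectors, but the variance-reduction logic is the same.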
What This Changes
For Video Model Design
If reasoning lives in denoising depth, then:
- Denoising schedules should be optimized for reasoning, not just visual quality
- Models could allocate more steps to “hard” regions and fewer to “easy” ones
- Architectures that deepen the per-step computation (more layers, more attention) may improve reasoning more than adding frames
For Understanding AI Reasoning
The finding that self-correction emerges spontaneously in diffusion models is striking. LLMs struggle with self-correction - they often can’t identify their own errors without external feedback. Diffusion models, operating in continuous latent space with iterative refinement, seem to develop this capability naturally.
This suggests that iterative refinement in continuous space may be a more natural substrate for certain types of reasoning than autoregressive token generation.
For Practical Applications
- Maze solving, path planning, geometric reasoning: these tasks benefit from deeper denoising, not longer videos
- The seed ensemble strategy is a training-free improvement that costs only extra generation-time compute and applies to any diffusion model
- Entropy or uncertainty in early steps could predict final reasoning quality - enabling early stopping or resampling
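The last point can be sketched with a toy probe (our construction, assuming a simple attractor model of denoising): when a problem admits multiple competing answers, trajectories from different seeds disagree, and that disagreement is measurable after only a few early steps.

```python
# Toy early-uncertainty probe: run a small batch of seed trajectories for a
# few early "denoising" steps, each pulled toward its nearest candidate
# solution. High spread across seeds flags an unresolved problem - a
# candidate for early stopping or resampling.
import numpy as np

def early_spread(attractors, num_seeds=16, num_steps=5, rate=0.3, seed=0):
    rng = np.random.default_rng(seed)
    zs = rng.normal(size=num_seeds)            # one trajectory per seed
    attractors = np.asarray(attractors, dtype=float)
    for _ in range(num_steps):
        # each trajectory moves toward whichever candidate solution is nearest
        nearest = attractors[
            np.argmin(np.abs(zs[:, None] - attractors[None, :]), axis=1)
        ]
        zs += rate * (nearest - zs)
    return float(zs.std())  # disagreement across seeds after early steps

easy = early_spread([1.0])        # one clear solution: seeds agree quickly
hard = early_spread([-1.0, 1.0])  # ambiguous: seeds split between answers

assert hard > easy  # early spread separates ambiguous from clear problems
```

In a real model the analogous signal would be entropy or variance of the latent across seeds (or across a predicted distribution) at early timesteps, used as a cheap proxy for final answer quality.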
Summary
Chain-of-Steps reframes how video diffusion models reason:
- Reasoning happens along denoising steps, not across frames
- Early steps explore multiple solutions; late steps converge to one
- Three emergent behaviors arise: working memory, self-correction, perception before action
- Within each step, transformer layers self-specialize: perceive → reason → consolidate
- A simple seed ensemble leverages these dynamics for training-free improvement
The deeper message: diffusion models aren’t just pattern matchers that happen to produce plausible video. They develop internal reasoning dynamics - exploring, correcting, and converging - that emerge from the structure of the denoising process itself. Understanding these dynamics is the first step toward harnessing them.
Links
- Based on the publication arXiv:2603.16870
- Project page: wruisi.com/demystifying_video_reasoning
- Related: Thinking with Video (arXiv:2511.04570)