<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Computer Vision on MLLog.dev</title><link>https://mllog.dev/en/tags/computer-vision/</link><description>Recent content in Computer Vision on MLLog.dev</description><image><title>MLLog.dev</title><url>https://mllog.dev/images/default_mllog.png</url><link>https://mllog.dev/images/default_mllog.png</link></image><generator>Hugo -- 0.147.9</generator><language>en</language><lastBuildDate>Tue, 17 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://mllog.dev/en/tags/computer-vision/index.xml" rel="self" type="application/rss+xml"/><item><title>Demystifying Video Reasoning: Models Don't Think in Frames - They Think in Denoising Steps</title><link>https://mllog.dev/en/posts/demystifying-video-reasoning-chain-of-steps/</link><pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate><guid>https://mllog.dev/en/posts/demystifying-video-reasoning-chain-of-steps/</guid><description>&lt;p>Video generation models like Sora can solve mazes, manipulate objects, and answer math questions - all by generating video. But &lt;strong>how&lt;/strong> do they reason? The intuitive answer: step by step, frame by frame, like a person drawing a solution on a whiteboard.&lt;/p>
&lt;p>That answer is wrong.&lt;/p>
&lt;p>The paper &lt;strong>&amp;ldquo;Demystifying Video Reasoning&amp;rdquo;&lt;/strong> shows that reasoning in video diffusion models doesn&amp;rsquo;t unfold across frames. It unfolds &lt;strong>across denoising steps&lt;/strong> - the iterative process that turns noise into a coherent video. The authors call this &lt;strong>Chain-of-Steps (CoS)&lt;/strong>, and it fundamentally changes how we understand what these models are doing.&lt;/p></description></item><item><title>Seoul World Model: AI That Generates Video of Real Cities From Street Photos</title><link>https://mllog.dev/en/posts/seoul-world-model-city-scale-video-generation/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://mllog.dev/en/posts/seoul-world-model-city-scale-video-generation/</guid><description>&lt;p>What if you could fly a virtual camera through any street in a real city — not a game engine, not a pre-recorded video, but a freshly generated, photorealistic view based on actual street photos?&lt;/p>
&lt;p>That&amp;rsquo;s exactly what the &lt;strong>Seoul World Model (SWM)&lt;/strong> does. The paper &lt;strong>&amp;ldquo;Grounding World Simulation Models in a Real-World Metropolis&amp;rdquo;&lt;/strong> introduces a city-scale &lt;strong>world model&lt;/strong> (a neural network that learns the dynamics and visual appearance of an environment, allowing it to &amp;lsquo;imagine&amp;rsquo; new views and trajectories it has never seen directly) that generates video grounded in real geography — not in imagined scenes.&lt;/p></description></item></channel></rss>