<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Computer Vision on MLLog.dev</title><link>https://mllog.dev/en/tags/computer-vision/</link><description>Recent content in Computer Vision on MLLog.dev</description><image><title>MLLog.dev</title><url>https://mllog.dev/images/default_mllog.png</url><link>https://mllog.dev/images/default_mllog.png</link></image><generator>Hugo -- 0.147.9</generator><language>en</language><lastBuildDate>Tue, 17 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://mllog.dev/en/tags/computer-vision/index.xml" rel="self" type="application/rss+xml"/><item><title>Demystifying Video Reasoning: Models Don't Think in Frames - They Think in Denoising Steps</title><link>https://mllog.dev/en/posts/demystifying-video-reasoning-chain-of-steps/</link><pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate><guid>https://mllog.dev/en/posts/demystifying-video-reasoning-chain-of-steps/</guid><description>&lt;p>Video generation models like Sora can solve mazes, manipulate objects, and answer math questions - all by generating video. But &lt;strong>how&lt;/strong> do they reason? The intuitive answer: step by step, frame by frame, like a person drawing a solution on a whiteboard.&lt;/p>
&lt;p>That answer is wrong.&lt;/p>
&lt;p>The paper &lt;strong>&amp;ldquo;Demystifying Video Reasoning&amp;rdquo;&lt;/strong> shows that reasoning in video diffusion models doesn&amp;rsquo;t unfold across frames. It unfolds &lt;strong>across denoising steps&lt;/strong> - the iterative process that turns noise into a coherent video. The authors call this &lt;strong>Chain-of-Steps (CoS)&lt;/strong>, and it fundamentally changes how we understand what these models are doing.&lt;/p></description></item><item><title>Seoul World Model: AI That Generates Video of Real Cities From Street Photos</title><link>https://mllog.dev/en/posts/seoul-world-model-city-scale-video-generation/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://mllog.dev/en/posts/seoul-world-model-city-scale-video-generation/</guid><description>&lt;p>What if you could fly a virtual camera through any street in a real city — not a game engine, not a pre-recorded video, but a freshly generated, photorealistic view based on actual street photos?&lt;/p>
&lt;p>That&amp;rsquo;s exactly what the &lt;strong>Seoul World Model (SWM)&lt;/strong> does. The paper &lt;strong>&amp;ldquo;Grounding World Simulation Models in a Real-World Metropolis&amp;rdquo;&lt;/strong> introduces a city-scale &lt;strong>world model&lt;/strong> (a neural network that learns the dynamics and visual appearance of an environment, allowing it to &amp;lsquo;imagine&amp;rsquo; new views and trajectories it has never seen directly) that generates video grounded in real geography — not in imagined scenes.&lt;/p></description></item></channel></rss>