What if you could fly a virtual camera through any street in a real city — not a game engine, not a pre-recorded video, but a freshly generated, photorealistic view based on actual street photos?

That’s exactly what the Seoul World Model (SWM) does. The paper “Grounding World Simulation Models in a Real-World Metropolis” introduces a city-scale world model (a neural network that learns the dynamics and visual appearance of an environment, allowing it to ‘imagine’ new views and trajectories it has never seen directly) that generates video grounded in real geography, not in imagined scenes.


The Problem: From Imagination to Reality

Most video generation models operate in an imagined world. They can produce impressive visuals, but those visuals aren’t anchored to any real place. Ask a model to show you a specific street in Seoul, and it will hallucinate something plausible but wrong.

For applications like autonomous driving simulation (self-driving vehicles need accurate simulations of real-world environments to safely test decision-making without physical risk), urban planning, and navigation, you need a model that generates spatially faithful video: consistent with real geography, real buildings, real intersections.

The challenges:

  1. Temporal misalignment — street-view reference photos were taken at different times than the target scene
  2. Sparse capture — panoramic images are taken every 5–20 meters, not as continuous video
  3. Error accumulation — over hundreds of meters, small errors compound into drifting, incoherent video

Architecture: How SWM Works

Foundation

SWM fine-tunes Cosmos-Predict2.5-2B, a Diffusion Transformer (a generative architecture combining diffusion models’ iterative denoising with transformer attention, used for high-quality image and video synthesis) with 2 billion parameters, 28 blocks, and 16 attention heads. Video is compressed into a 16-channel latent space (a compressed representation: instead of raw pixels, the model operates in a lower-dimensional space where each latent vector encodes meaningful visual information) via a 3D VAE with 4× temporal and 8× spatial compression.
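The compression ratios make the latent grid size easy to estimate. A minimal sketch, assuming simple integer division (the real VAE’s padding and rounding rules may differ, and the 720×1280 resolution below is only illustrative):

```python
def latent_shape(frames, height, width, t_stride=4, s_stride=8, channels=16):
    """Approximate latent grid after 4x temporal / 8x spatial compression."""
    return (channels, frames // t_stride, height // s_stride, width // s_stride)

# e.g. a 76-frame 720x1280 clip maps to a (16, 19, 90, 160) latent grid
print(latent_shape(76, 720, 1280))
```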

Generation is autoregressive (output is produced sequentially: each new chunk of video is conditioned on the previously generated chunks, like writing a sentence word by word): the model produces video in chunks of 77 frames (teacher-forcing) or 12 frames (self-forcing), each conditioned on the previous output.
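The chunked rollout can be sketched as a plain loop. Here `denoise_chunk` is a hypothetical stand-in for the diffusion model; the 12-frame chunk size matches the self-forcing setting:

```python
def rollout(denoise_chunk, total_frames, chunk_size=12):
    """Autoregressive generation: each new chunk is conditioned on
    everything generated so far (sketch, not the real sampler)."""
    video = []
    while len(video) < total_frames:
        new_frames = denoise_chunk(context=video, n=chunk_size)
        video.extend(new_frames)
    return video[:total_frames]

# toy stand-in that just labels frames by their global index
frames = rollout(lambda context, n: [len(context) + i for i in range(n)], 30)
```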

Dual Referencing System

SWM uses two complementary ways to inject real-world information:

Geometric referencing — the nearest street-view photo is reprojected into the target viewpoint using depth estimation:

$$\mathbf{x}_{warp} = \text{Render}(\text{Unproj}(\mathbf{x}_{ref}, d_{ref}), c_{ref \to t})$$

This warped image is channel-concatenated with the noisy target latent, providing pixel-level geometric grounding.
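The Unproj/Render pipeline can be sketched as a nearest-neighbor forward warp with numpy. This is a toy version under simplifying assumptions (single-channel image, pinhole intrinsics `K`, a 4×4 relative transform `T_ref_to_tgt`); the real system uses a learned depth estimator and proper rendering:

```python
import numpy as np

def warp_reference(ref_img, depth, K, T_ref_to_tgt):
    """Sketch of x_warp = Render(Unproj(x_ref, d_ref), c_{ref->t}):
    unproject reference pixels with their depths, move the 3D points into
    the target camera frame, reproject (nearest-neighbor forward warp)."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).astype(float)
    pts = np.linalg.inv(K) @ pix.T * depth.reshape(-1)       # Unproj -> 3D
    pts = T_ref_to_tgt[:3, :3] @ pts + T_ref_to_tgt[:3, 3:]  # into target frame
    proj = K @ pts
    uv = np.round(proj[:2] / proj[2]).T.astype(int)          # Render (nearest)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & \
            (uv[:, 1] >= 0) & (uv[:, 1] < H) & (proj[2] > 0)
    out = np.zeros_like(ref_img)
    out[uv[valid, 1], uv[valid, 0]] = ref_img.reshape(-1)[valid]
    return out
```

With an identity transform and unit depth, the warp returns the reference image unchanged, which is a useful sanity check.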

Semantic referencing — K=5 nearby street-view images are encoded as single latent tokens and appended to the attention sequence. Each target frame can attend to all five references, extracting appearance cues like building facades, road textures, and vegetation.
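The mechanism is ordinary attention over an extended sequence. A toy, single-head sketch (no learned projections; the real model uses full multi-head transformer blocks):

```python
import numpy as np

def attend_with_refs(target_tokens, ref_tokens):
    """Append K reference latent tokens to the sequence; every target
    token attends over all of them to pull in appearance cues."""
    seq = np.concatenate([target_tokens, ref_tokens], axis=0)  # (T + K, d)
    scores = target_tokens @ seq.T / np.sqrt(seq.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                         # softmax
    return w @ seq                                             # (T, d)
```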

Camera Encoding

Camera pose is encoded via Plücker ray embeddings (a representation of camera rays in 3D space using 6D coordinates: each pixel gets a ray direction and a moment vector, telling the model exactly where the camera is looking), derived from camera extrinsics and intrinsics, projected through a convolutional encoder, and concatenated with the latent channels.
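The standard Plücker construction (unit ray direction plus moment vector, the cross product of the camera center with the direction) can be sketched as below; the exact convention SWM uses (pixel-center offsets, normalization) is an assumption here:

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plücker ray coordinates: (direction, moment), 6D total.
    Assumes world-to-camera extrinsics x_cam = R @ x_world + t."""
    o = -R.T @ t                                    # camera center in world frame
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs, dtype=float)], -1)
    d = pix @ np.linalg.inv(K).T @ R                # world-space ray directions
    d /= np.linalg.norm(d, axis=-1, keepdims=True)  # unit directions
    m = np.cross(np.broadcast_to(o, d.shape), d)    # moment vector o x d
    return np.concatenate([d, m], axis=-1)          # (H, W, 6)
```

For a camera at the world origin the moment vectors vanish, which makes the representation easy to sanity-check.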


Key Innovations

1. Cross-Temporal Pairing

Street-view images are captured at different times — summer vs. winter, day vs. night, with or without parked cars. Instead of treating this as a bug, SWM uses it as a feature.

By deliberately pairing reference images from different timestamps with target sequences, the model learns to extract persistent structure (buildings, roads, landmarks) while ignoring transient content (weather, vehicles, pedestrians). This is the single most impactful design choice — removing it causes the largest degradation across all metrics.
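The pairing rule amounts to deliberately sampling a reference captured far from the target’s timestamp. A hypothetical sketch (the field names and the 30-day threshold are illustrative, not from the paper):

```python
import random

def pick_cross_temporal_ref(candidates, target_time, min_gap_days=30):
    """Prefer references shot at a clearly different time, forcing the
    model to rely on persistent structure, not transient content."""
    far = [c for c in candidates if abs(c["time"] - target_time) >= min_gap_days]
    return random.choice(far if far else candidates)
```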

2. View Interpolation Pipeline

Street-view panoramas are captured every 5–20 meters — far too sparse for video training. SWM bridges the gaps with an intermittent freeze-frame strategy: each keyframe is repeated 4 consecutive times, matching the 3D VAE’s temporal stride.
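The freeze-frame strategy is straightforward to sketch: each keyframe is held for four steps, matching the VAE’s temporal stride:

```python
def freeze_frame_sequence(keyframes, repeat=4):
    """Turn sparse keyframes into a pseudo-video by repeating each frame
    `repeat` times (matching the 3D VAE's 4x temporal stride)."""
    return [frame for kf in keyframes for frame in [kf] * repeat]

# two keyframes become an 8-frame training clip
clip = freeze_frame_sequence(["pano_0", "pano_1"])
```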

This simple approach significantly outperforms channel concatenation:

| Metric | Freeze-frame | Concatenation |
|--------|--------------|---------------|
| PSNR   | 25.03        | 22.52         |
| SSIM   | 0.703        | 0.628         |
| LPIPS  | 0.162        | 0.245         |

3. Virtual Lookahead Sink

Over long trajectories (hundreds of meters), autoregressive models accumulate errors. Previous approaches used a fixed first-frame anchor, but this becomes irrelevant as the camera moves far away.

Virtual Lookahead dynamically retrieves the nearest street-view image to the chunk’s endpoint and places it as a future anchor:

$$\mathbf{Z}_{seq} = [\mathbf{Z}_{hist}; \mathbf{Z}; z_{VL}]$$

with temporal offset $\Delta_{VL} = 5$ beyond the generation window. This continuously re-grounds the model to upcoming real-world locations.
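A sketch of the sequence assembly, with 1-D positions standing in for real camera poses and strings for latents (all names hypothetical):

```python
def lookahead_sequence(z_hist, z_chunk, streetview_bank, endpoint_pos, offset=5):
    """Build [Z_hist ; Z ; z_VL]: retrieve the street-view latent nearest
    the chunk's endpoint and append it as a future anchor placed at
    temporal offset +5 beyond the generation window."""
    z_vl = min(streetview_bank, key=lambda s: abs(s["pos"] - endpoint_pos))
    anchor = {"t_offset": offset, "latent": z_vl["latent"]}
    return z_hist + z_chunk + [anchor]
```

Unlike a fixed first-frame anchor, the retrieved anchor tracks the camera: as the trajectory advances, a new nearest street-view latent is fetched for each chunk.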

On extended 1,460-frame sequences (4× the standard benchmark):

| Variant           | FID ↓ | mPSNR ↑ |
|-------------------|-------|---------|
| Virtual Lookahead | 25.13 | 13.70   |
| No sink           | 37.37 | 12.94   |

The performance gap widens substantially over longer trajectories — exactly where it matters most.


Training Data

SWM combines three data sources:

| Source            | Size                                  | Role                 |
|-------------------|---------------------------------------|----------------------|
| Seoul street-view | 440K panoramic images (from 1.2M raw) | Real-world grounding |
| CARLA synthetic   | 12.7K videos across 6 maps            | Trajectory diversity |
| Waymo             | Public driving videos                 | Scenario diversity   |

Training mix: Waymo 20%, Seoul 40%, synthetic 40%.
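The stated mix can be implemented as simple weighted sampling over the three sources; a minimal sketch:

```python
import random

def sample_source(rng=random):
    """Draw a training example's source: Seoul 40%, synthetic 40%, Waymo 20%."""
    r = rng.random()
    if r < 0.40:
        return "seoul"
    if r < 0.80:
        return "synthetic"
    return "waymo"
```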

Training setup: 10K iterations (teacher-forcing) + 6K ODE initialization (self-forcing) on 24 NVIDIA H100 GPUs, batch size 48, learning rate 4.8e-5 with AdamW.


Results: SWM vs. Six Baselines

The model was evaluated on two benchmarks entirely absent from training:

  • Busan-City-Bench: 30 sequences, 365 frames each (~100 m), in a different Korean city
  • Ann-Arbor-City-Bench: 30 sequences from the MARS dataset, captured in a US city

Quantitative Results

| Metric           | SWM (TF) | Best Baseline     | Improvement |
|------------------|----------|-------------------|-------------|
| FID (Busan)      | 28.43    | 62.14 (Lingbot)   | 2.2× better |
| FVD (Busan)      | 301.76   | 717.44 (Lingbot)  | 2.4× better |
| RotErr (Busan)   | 0.020    | 0.044 (HY-World)  | 2.2× better |
| TransErr (Busan) | 0.015    | 0.079 (HY-World)  | 5.3× better |
| FID (Ann Arbor)  | 43.97    | 57.99 (Lingbot)   | 1.3× better |

SWM trained exclusively on Seoul outperforms all baselines on both Busan and Ann Arbor without any fine-tuning, demonstrating strong zero-shot generalization (the ability to perform well on a domain never seen during training; here, a model trained on Seoul generates accurate video of Busan and Ann Arbor).

Inference Speed

The self-forcing variant runs at 15.2 fps on a single H100 GPU, fast enough for interactive applications.


Ablation Highlights

| Removed Component      | FID Impact     | TransErr Impact |
|------------------------|----------------|-----------------|
| Cross-temporal pairing | +16.31 (worst) | +0.108 (worst)  |
| Geometric referencing  | +4.58          | +0.051          |
| Synthetic data         | FVD +63        | RotErr +0.001   |
| Semantic referencing   | +1.84          | minimal         |

Cross-temporal pairing is the most critical component — without it, the model overfits to transient details in reference images.


Limitations

  1. No real video training data — SWM trains on interpolated sequences from sparse keyframes, not actual video. Real video would improve temporal coherence.
  2. Dynamic objects — vehicles and pedestrians sometimes appear/disappear abruptly (~5% of cases) because references were captured at arbitrary past times.
  3. Geographic scope — training covers ~44.8 km × 31.0 km within Seoul Metropolitan Area.

Why It Matters

Seoul World Model isn’t just a cool demo. It’s a foundation for:

  • Autonomous driving simulation — test self-driving systems in photorealistic recreations of real cities
  • Urban planning — visualize changes to real neighborhoods before building
  • Navigation — generate previews of routes from any camera angle
  • Digital twins — continuously updated visual models of cities

The key insight: grounding generation in real-world data (street-view photos + GPS coordinates) produces dramatically better results than training on imagination alone. And the model generalizes to cities it has never seen.