## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/Aradotso/trending-skills
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/seoul-world-model" ~/.claude/skills/aradotso-trending-skills-seoul-world-model \
  && rm -rf "$T"
```
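To confirm the copy landed where Claude Code discovers skills:

```bash
ls ~/.claude/skills/aradotso-trending-skills-seoul-world-model/SKILL.md
```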
## Manifest

`skills/seoul-world-model/SKILL.md`:
---
name: seoul-world-model
description: Skill for using the Seoul World Model — a world simulation model grounded in a real-world metropolis (Seoul) by Naver AI
triggers:
  - seoul world model
  - world simulation model
  - grounding world model in real city
  - street view world model
  - naver seoul simulation
  - urban world model inference
  - seoul street view generation
  - metropolis world model
---

# Seoul World Model

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

## What Is Seoul World Model?

**Seoul World Model** (by Naver AI) is a research project that grounds world simulation models in real-world urban data from Seoul, South Korea. It enables:

- **World simulation**: Generate realistic video continuations of street-level scenes in Seoul
- **Street-view interpolation**: Synthesize smooth video transitions between street-view frames
- **Urban scene understanding**: Leverage a large-scale real-world metropolis dataset for training/evaluation

The project provides:

- Model checkpoints for world simulation inference
- Synthetic training data (Seoul street-view)
- Street-view interpolation model code and checkpoints

> ⚠️ **Note**: As of March 2026, the repository is undergoing internal review. Model checkpoints, inference code, and training data are planned for release. Monitor the [project page](https://seoul-world-model.github.io/#tldr) and repository for updates.

---

## Installation

### Clone the Repository

```bash
git clone https://github.com/naver-ai/seoul-world-model.git
cd seoul-world-model
```
### Python Environment (Recommended)
```bash
# Create and activate a conda environment
conda create -n seoul-world-model python=3.10 -y
conda activate seoul-world-model

# Install dependencies (once requirements.txt is released)
pip install -r requirements.txt
```
### Common Deep Learning Dependencies (Anticipated)
Based on the project type (video generation / world models), install:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers accelerate
pip install einops timm "imageio[ffmpeg]" opencv-python
pip install numpy pillow tqdm
```
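After installing, a quick sanity check confirms the stack imports cleanly and sees the GPU (nothing here is specific to Seoul World Model):

```python
# Verify core imports and CUDA visibility.
import torch
import torchvision
import diffusers
import transformers

print(f"torch {torch.__version__}, torchvision {torchvision.__version__}")
print(f"diffusers {diffusers.__version__}, transformers {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```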
## Project Structure (Anticipated)
```
seoul-world-model/
├── README.md
├── checkpoints/        # Model weights (to be released)
├── data/               # Synthetic training data (to be released)
├── inference/          # Inference scripts (to be released)
│   ├── world_model.py
│   └── interpolation.py
├── train/              # Training code (to be released)
├── configs/            # Model and training configs
└── utils/              # Utilities
```
## Key Concepts
| Component | Description |
|---|---|
| World Simulation Model | Generates future video frames conditioned on current observations and actions (see the interface sketch after this table) |
| Street-View Interpolation | Fills in smooth transitions between sparse street-view keyframes |
| Seoul Dataset | Large-scale real-world urban driving/walking data from Seoul |
| Grounding | Training on real-world data to improve simulation realism and physical plausibility |
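
To make these components concrete, the sketch below gives one plausible shape for the two inference entry points. Every name and signature here is an assumption based on the project description, not the released API:

```python
# Hypothetical interfaces for the two announced components.
# All names, arguments, and shapes are guesses, not official.
from typing import Protocol
import torch


class WorldSimulationModel(Protocol):
    def generate(
        self,
        context: torch.Tensor,        # (B, T_ctx, C, H, W) observed frames
        num_frames: int,              # future frames to roll out
        guidance_scale: float = 7.5,  # guidance strength, if diffusion-based
    ) -> torch.Tensor:                # (B, T_out, C, H, W) generated frames
        ...


class StreetViewInterpolator(Protocol):
    def interpolate(
        self,
        frame_a: torch.Tensor,  # (B, C, H, W) first keyframe
        frame_b: torch.Tensor,  # (B, C, H, W) second keyframe
        num_intermediate: int,  # frames to synthesize in between
    ) -> torch.Tensor:          # (B, num_intermediate, C, H, W)
        ...
```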
## Inference (Anticipated API Pattern)
Once released, inference will likely follow this pattern:
### World Model Inference
```python
import torch
from PIL import Image
from torchvision import transforms

# Load model (path subject to change on release)
# from inference.world_model import SeoulWorldModel
# model = SeoulWorldModel.from_pretrained("checkpoints/world_model")
# model = model.to("cuda").eval()

# Prepare input frames
def load_frames(image_paths: list[str]) -> torch.Tensor:
    transform = transforms.Compose([
        transforms.Resize((256, 512)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])
    frames = [transform(Image.open(p).convert("RGB")) for p in image_paths]
    return torch.stack(frames).unsqueeze(0)  # (1, T, C, H, W)

# context_frames = load_frames(["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"])

# Run generation
# with torch.no_grad():
#     generated_frames = model.generate(
#         context=context_frames.cuda(),
#         num_frames=16,
#         guidance_scale=7.5,
#     )

# Save output
# save_video(generated_frames, "output.mp4", fps=10)
```
### Street-View Interpolation
```python
import torch
from PIL import Image
from torchvision import transforms

# Interpolate between two keyframe images
# from inference.interpolation import StreetViewInterpolator
# interpolator = StreetViewInterpolator.from_pretrained("checkpoints/interpolation")
# interpolator = interpolator.to("cuda").eval()

def preprocess_image(path: str, size=(256, 512)) -> torch.Tensor:
    transform = transforms.Compose([
        transforms.Resize(size),
        transforms.ToTensor(),
        transforms.Normalize([0.5] * 3, [0.5] * 3),
    ])
    return transform(Image.open(path).convert("RGB")).unsqueeze(0)

# frame_a = preprocess_image("frame_start.jpg").cuda()
# frame_b = preprocess_image("frame_end.jpg").cuda()

# with torch.no_grad():
#     interpolated = interpolator.interpolate(
#         frame_a, frame_b,
#         num_intermediate=8,
#     )

# save_video(interpolated, "interpolated.mp4", fps=8)
```
### Utility: Save Video
```python
import imageio
import numpy as np
import torch


def save_video(frames: torch.Tensor, output_path: str, fps: int = 10):
    """
    Save a tensor of frames as an MP4 video.

    Args:
        frames: Tensor of shape (T, C, H, W) in range [-1, 1] or [0, 1]
        output_path: Path to save .mp4
        fps: Frames per second
    """
    # Denormalize if in [-1, 1]
    if frames.min() < 0:
        frames = (frames + 1) / 2
    frames_np = (frames.clamp(0, 1).permute(0, 2, 3, 1).cpu().numpy() * 255).astype(np.uint8)
    with imageio.get_writer(output_path, fps=fps, codec="libx264", quality=8) as writer:
        for frame in frames_np:
            writer.append_data(frame)
    print(f"Saved video to {output_path}")
```
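A quick smoke test of the utility with random frames, no model required:

```python
# Save 16 random 256x512 frames in [-1, 1]; save_video() is defined above.
# H.264 prefers even dimensions, which 256x512 satisfies.
import torch

dummy = torch.rand(16, 3, 256, 512) * 2 - 1
save_video(dummy, "smoke_test.mp4", fps=10)
```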
## Configuration (Anticipated)
World model configs will likely be YAML-based:
```yaml
# configs/world_model.yaml (example structure)
model:
  type: "SeoulWorldModel"
  checkpoint: "checkpoints/world_model/model.ckpt"
  image_size: [256, 512]
  num_frames: 16
  temporal_stride: 2

inference:
  guidance_scale: 7.5
  num_inference_steps: 50
  seed: 42
  device: "cuda"

data:
  context_frames: 3
  fps: 10
```
Load config in Python:
```python
import yaml


def load_config(config_path: str) -> dict:
    with open(config_path, "r") as f:
        return yaml.safe_load(f)


config = load_config("configs/world_model.yaml")
print(config["model"]["checkpoint"])
```
## Environment Variables
```bash
# Set GPU device
export CUDA_VISIBLE_DEVICES=0

# Set checkpoint directory (if configurable via env)
export SEOUL_WM_CHECKPOINT_DIR=/path/to/checkpoints

# For HuggingFace model downloads (if applicable)
export HF_HOME=/path/to/hf_cache
export HUGGINGFACE_HUB_TOKEN=$HF_TOKEN  # do NOT hardcode tokens
```
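`SEOUL_WM_CHECKPOINT_DIR` is a placeholder name used by this skill, not a documented setting. If the project does end up honoring such a variable, consuming it with a fallback would look like:

```python
# Resolve the checkpoint directory from the environment, falling back to
# the in-repo default. SEOUL_WM_CHECKPOINT_DIR is hypothetical.
import os
from pathlib import Path

ckpt_dir = Path(os.environ.get("SEOUL_WM_CHECKPOINT_DIR", "checkpoints"))
world_model_ckpt = ckpt_dir / "world_model"
print(f"Loading checkpoints from: {world_model_ckpt}")
```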
## Common Patterns
### Batch Inference Over a Dataset
```python
import torch
from pathlib import Path

# Uses load_frames() and save_video() defined in the sections above.

def batch_infer(input_dir: str, output_dir: str, model):
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    scene_dirs = sorted([d for d in input_dir.iterdir() if d.is_dir()])
    for scene_dir in scene_dirs:
        frames = sorted(scene_dir.glob("*.jpg"))
        if len(frames) < 3:
            continue  # need at least 3 context frames
        context = load_frames([str(f) for f in frames[:3]])
        with torch.no_grad():
            output = model.generate(context.cuda(), num_frames=16)
        out_path = output_dir / f"{scene_dir.name}_generated.mp4"
        save_video(output.squeeze(0), str(out_path))
        print(f"Processed: {scene_dir.name}")
```
### Evaluate Temporal Consistency (Simple Proxy)
```python
import torch
import torch.nn.functional as F


def compute_frame_similarity(generated: torch.Tensor) -> float:
    """
    Simple temporal consistency proxy: average cosine similarity between
    adjacent frames. (Not true FVD, which requires I3D features and a
    reference distribution.)

    generated: (T, C, H, W)
    """
    num_frames = generated.shape[0]
    if num_frames < 2:
        raise ValueError("Need at least two frames to compare")
    similarities = []
    for t in range(num_frames - 1):
        f1 = generated[t].flatten()
        f2 = generated[t + 1].flatten()
        sim = F.cosine_similarity(f1.unsqueeze(0), f2.unsqueeze(0)).item()
        similarities.append(sim)
    return sum(similarities) / len(similarities)
```
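Usage on a clip, with random tensors standing in for model output:

```python
# A perfectly static clip scores exactly 1.0; uncorrelated noise scores
# lower. Plausible model output should sit well above the noise baseline.
import torch

noise_clip = torch.rand(16, 3, 256, 512)
static_clip = torch.full((16, 3, 256, 512), 0.5)
print(f"noise baseline: {compute_frame_similarity(noise_clip):.3f}")
print(f"static clip:    {compute_frame_similarity(static_clip):.3f}")
```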
## Troubleshooting
### CUDA Out of Memory
```python
# Reduce resolution or use mixed precision
import torch

# Run generation under bfloat16 autocast
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model.generate(context.cuda(), num_frames=8)  # fewer frames

# Or use CPU offloading (if supported by the model)
# model.enable_model_cpu_offload()
```
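Another generic memory-saving pattern for video world models is chunked rollout: generate a few frames at a time and feed the tail of each chunk back in as context. This is a sketch against the hypothetical `model.generate()` used above, not a documented feature of this project:

```python
# Chunked rollout: several small generations instead of one large one.
# Assumes the hypothetical model.generate(context, num_frames) from above.
import torch

def generate_chunked(model, context: torch.Tensor, total_frames: int,
                     chunk_size: int = 4, context_len: int = 3) -> torch.Tensor:
    chunks = []
    ctx = context  # (1, T_ctx, C, H, W)
    generated = 0
    with torch.no_grad():
        while generated < total_frames:
            chunk = model.generate(ctx, num_frames=chunk_size)
            chunks.append(chunk)
            generated += chunk.shape[1]
            # Feed the last frames back in as context for the next chunk
            ctx = chunk[:, -context_len:]
    return torch.cat(chunks, dim=1)[:, :total_frames]
```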
### Repository Still Under Review
The codebase is not yet fully released (as of March 2026). Monitor:
```bash
# Check for updates
cd seoul-world-model
git fetch origin
git log --oneline origin/main

# Watch GitHub releases
# https://github.com/naver-ai/seoul-world-model/releases
```
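The GitHub releases feed can also be polled directly with the standard REST API (it returns HTTP 404 until the repo publishes its first release):

```bash
# Query the latest release via the GitHub API
curl -s https://api.github.com/repos/naver-ai/seoul-world-model/releases/latest \
  | grep -E '"(tag_name|published_at)"'
```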
### Dependency Conflicts
```bash
# If torch conflicts arise, install in order
pip install torch==2.2.0 torchvision==0.17.0 --index-url https://download.pytorch.org/whl/cu121
pip install diffusers==0.27.0 transformers==4.39.0
```
## Resources
- Project Page: https://seoul-world-model.github.io/#tldr
- GitHub: https://github.com/naver-ai/seoul-world-model
- Paper: "Grounding World Simulation Models in a Real-World Metropolis"
## Release Checklist (Track Progress)
- [ ] Model checkpoints and inference code
- [ ] Synthetic training data
- [ ] Street-view interpolation model code and checkpoints
- [ ] Training scripts
Stay tuned to the repository for updates as these are released.