Claude-skill-registry Vram-GPU-OOM
GPU VRAM management patterns for sharing memory across services (Ollama, Whisper, ComfyUI). OOM retry logic, auto-unload on idle, and service signaling protocol.
```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Or install just this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/Vram-GPU-OOM-memory-management" ~/.claude/skills/majiayu000-claude-skill-registry-vram-gpu-oom && rm -rf "$T"
```
skills/data/Vram-GPU-OOM-memory-management/SKILL.md

GPU OOM Retry Pattern
Simple pattern for sharing GPU memory across multiple services without coordination.
Strategy
- All services try to load models normally
- Catch OOM errors
- Wait 30-60 seconds (for other services to auto-unload)
- Retry up to 3 times
- Configure all services to unload quickly when idle
Python (PyTorch / Transformers)
```python
import torch
import time

def load_model_with_retry(max_retries=3, retry_delay=30):
    for attempt in range(max_retries):
        try:
            # Your model loading code
            model = MyModel.from_pretrained("model-name")
            model.to("cuda")
            return model
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                if attempt < max_retries - 1:
                    print(f"OOM on attempt {attempt+1}, waiting {retry_delay}s...")
                    torch.cuda.empty_cache()  # Clean up cached blocks before retrying
                    time.sleep(retry_delay)
                else:
                    raise  # Give up after max retries
            else:
                raise  # Not OOM, raise immediately
```
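Usage is a drop-in replacement for a direct load (assuming the placeholder `MyModel` above is replaced with your real model class):

```python
model = load_model_with_retry(max_retries=3, retry_delay=30)
```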
ComfyUI / Flux (Python-based)
Add to your workflow/node:
```python
# In your model loading function
import torch
import time
import comfy.utils

def load_flux_model(path, max_retries=3):
    for attempt in range(max_retries):
        try:
            # Your Flux/ComfyUI loading code
            model = comfy.utils.load_torch_file(path)
            return model
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                if attempt < max_retries - 1:
                    print("GPU busy, retrying in 30s...")
                    torch.cuda.empty_cache()
                    time.sleep(30)
                else:
                    raise
            else:
                raise
```
Ollama
Ollama already handles this! Just configure quick unloading:
```ini
# In /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=30s"
```
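To apply the override (assuming Ollama runs as a systemd unit named `ollama`, the default for the official installer), a typical sequence is:

```bash
# Create the drop-in directory, then place override.conf there
sudo mkdir -p /etc/systemd/system/ollama.service.d

# Reload systemd and restart Ollama so the new keep-alive takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Verify the environment variable is set on the unit
systemctl show ollama --property=Environment
```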
Shell Scripts
For any GPU command:
```bash
#!/bin/bash
MAX_RETRIES=3
RETRY_DELAY=30

for i in $(seq 1 $MAX_RETRIES); do
    if your-gpu-command; then
        exit 0
    fi
    if [ $i -lt $MAX_RETRIES ]; then
        echo "GPU busy, retrying in ${RETRY_DELAY}s..."
        sleep $RETRY_DELAY
    fi
done

echo "Failed after $MAX_RETRIES attempts"
exit 1
```
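The loop above retries on any nonzero exit. A variant (a sketch, still using the placeholder `your-gpu-command`) that retries only when the output actually mentions an OOM:

```bash
#!/bin/bash
MAX_RETRIES=3
RETRY_DELAY=30

for i in $(seq 1 "$MAX_RETRIES"); do
    # Capture combined stdout/stderr so we can inspect the failure
    if OUT=$(your-gpu-command 2>&1); then
        echo "$OUT"
        exit 0
    fi
    if ! grep -qi "out of memory" <<<"$OUT"; then
        echo "$OUT" >&2
        echo "Non-OOM failure, not retrying" >&2
        exit 1
    fi
    if [ "$i" -lt "$MAX_RETRIES" ]; then
        echo "GPU busy, retrying in ${RETRY_DELAY}s..."
        sleep "$RETRY_DELAY"
    fi
done

echo "Failed after $MAX_RETRIES attempts"
exit 1
```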
Service Signaling Protocol (Optional Enhancement)
For better coordination, services can implement these endpoints:
1. Auto-Unload on Idle
Services can automatically unload models after idle timeout:
```python
# FastAPI example
import asyncio
import logging
import time

from fastapi import FastAPI

app = FastAPI()
logger = logging.getLogger(__name__)

current_handler = None       # holds the loaded model wrapper, if any
last_request_time = None
auto_unload_minutes = 5      # configurable

async def auto_unload_task():
    """Background task that unloads model after idle timeout."""
    global current_handler
    while True:
        await asyncio.sleep(60)  # Check every minute
        if current_handler is None or last_request_time is None:
            continue
        idle = time.time() - last_request_time
        if idle > (auto_unload_minutes * 60):
            logger.info(f"Auto-unloading model after {idle/60:.1f} minutes")
            current_handler.unload()
            current_handler = None

@app.on_event("startup")
async def startup():
    asyncio.create_task(auto_unload_task())
```
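The idle clock above depends on `last_request_time` being refreshed on every request. A minimal sketch using FastAPI middleware, reusing `app` and the globals from the example:

```python
from fastapi import Request

@app.middleware("http")
async def track_last_request(request: Request, call_next):
    """Record activity so auto_unload_task measures idle time correctly."""
    global last_request_time
    last_request_time = time.time()
    return await call_next(request)
```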
2. Request-Unload Endpoint
Allow other services to politely request unload:
@app.post("/request-unload") async def request_unload(): """Request model unload if idle.""" if current_handler is None: return {"status": "ok", "unloaded": False, "message": "No model loaded"} idle = time.time() - last_request_time # Only unload if idle for at least 30 seconds if idle < 30: return { "status": "busy", "unloaded": False, "message": f"Model in use (idle {idle:.0f}s)", "idle_seconds": idle, } # Unload the model logger.info("Unloading on request from another service") current_handler.unload() current_handler = None return { "status": "ok", "unloaded": True, "message": "Model unloaded", "idle_seconds": idle, }
3. Enhanced Status Endpoint
@app.get("/status") async def get_status(): idle = time.time() - last_request_time if last_request_time else None return { "status": "ok", "model_loaded": current_handler is not None, "idle_seconds": idle, "auto_unload_enabled": auto_unload_minutes is not None, "auto_unload_minutes": auto_unload_minutes, }
4. Using the Protocol
Before loading a large model, request other services to unload:
```python
import requests

SERVICES = [
    "http://10.99.0.3:8765",  # Invoice OCR
    # Add other services here
]

for service in SERVICES:
    try:
        resp = requests.post(f"{service}/request-unload", timeout=5)
        result = resp.json()
        if result.get("unloaded"):
            print(f"✓ {service} unloaded")
        elif result.get("status") == "busy":
            print(f"⏱ {service} busy, will retry OOM")
    except requests.RequestException:
        pass  # Service not available

# Now try to load your model (with OOM retry as backup)
```
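Putting the layers together: a short sketch that treats signaling as best effort and keeps OOM retry as the safety net (`request_peer_unloads` is the loop above wrapped in a function; `load_model_with_retry` is from the PyTorch section):

```python
def load_with_coordination():
    """Politely ask peers to free VRAM, then load with OOM retry as backup."""
    request_peer_unloads()          # best effort; peers may refuse if busy
    return load_model_with_retry()  # retries handle any remaining contention
```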
Helper script: see `request_gpu_unload.py` in the OneCuriousRabbit repo.
Key Settings
Invoice OCR (Qwen2-VL)
- ✅ OOM retry: 3x with 30s delays
- ✅ Auto-unload: 5 minutes idle (configurable via `--auto-unload-minutes`)
- ✅ Request-unload endpoint: `POST http://10.99.0.3:8765/request-unload`
Ollama
- ✅ Auto-unload: `OLLAMA_KEEP_ALIVE=30s` in systemd override
Your Other Services
- Implement OOM retry pattern (required)
- Optionally implement signaling protocol (auto-unload + request-unload endpoints)
How It Works
Passive (OOM Retry Only)
```
12:00    - Scheduled Qwen task starts, loads 4GB
12:01    - User uploads invoice, tries to load 18GB → OOM
12:01    - Invoice OCR waits 30s
12:01:30 - Qwen task finishes, auto-unloads after 30s
12:02    - Invoice OCR retry succeeds, loads 18GB
12:03    - Invoice processing completes, unloads
12:03:30 - GPU is free again
```
Active (With Signaling)
```
12:00 - User starts Flux generation
12:00 - Flux calls POST /request-unload on Invoice OCR
12:00 - Invoice OCR idle for 4 minutes → unloads immediately
12:00 - Flux loads its model (22GB) successfully
12:05 - Flux completes, auto-unloads after 5 minutes
```
Benefits of signaling:
- Faster starts (no waiting for OOM retry delays)
- More predictable behavior
- Can request unload proactively before attempting load
- OOM retry still works as fallback if service is busy