Claude-skill-registry Vram-GPU-OOM
GPU VRAM management patterns for sharing memory across services (Ollama, Whisper, ComfyUI). OOM retry logic, auto-unload on idle, and service signaling protocol.
```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Or install just this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/Vram-GPU-OOM-memory-management" ~/.claude/skills/majiayu000-claude-skill-registry-vram-gpu-oom && rm -rf "$T"
```
skills/data/Vram-GPU-OOM-memory-management/SKILL.md

GPU OOM Retry Pattern
Simple pattern for sharing GPU memory across multiple services without coordination.
Strategy
- All services try to load models normally
- Catch OOM errors
- Wait 30-60 seconds (for other services to auto-unload)
- Retry up to 3 times
- Configure all services to unload quickly when idle
Python (PyTorch / Transformers)
```python
import torch
import time

def load_model_with_retry(max_retries=3, retry_delay=30):
    for attempt in range(max_retries):
        try:
            # Your model loading code
            model = MyModel.from_pretrained("model-name")
            model.to("cuda")
            return model
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                if attempt < max_retries - 1:
                    print(f"OOM on attempt {attempt+1}, waiting {retry_delay}s...")
                    torch.cuda.empty_cache()  # Clean up cached blocks before retrying
                    time.sleep(retry_delay)
                else:
                    raise  # Give up after max retries
            else:
                raise  # Not OOM, raise immediately
```
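Usage is a drop-in replacement for a direct load (assuming the placeholder `MyModel` above is replaced with your real model class):

```python
model = load_model_with_retry(max_retries=3, retry_delay=30)
```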
ComfyUI / Flux (Python-based)
Add to your workflow/node:
```python
# In your model loading function
import torch
import time
import comfy.utils

def load_flux_model(path, max_retries=3):
    for attempt in range(max_retries):
        try:
            # Your Flux/ComfyUI loading code
            model = comfy.utils.load_torch_file(path)
            return model
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                if attempt < max_retries - 1:
                    print("GPU busy, retrying in 30s...")
                    torch.cuda.empty_cache()
                    time.sleep(30)
                else:
                    raise
            else:
                raise
```
Ollama
Ollama already handles this! Just configure quick unloading:
```ini
# In /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=30s"
```
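To apply the override (assuming Ollama runs as a systemd unit named `ollama`, the default for the official installer), a typical sequence is:

```bash
# Create the drop-in directory, then place override.conf there
sudo mkdir -p /etc/systemd/system/ollama.service.d

# Reload systemd and restart Ollama so the new keep-alive takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Verify the environment variable is set on the unit
systemctl show ollama --property=Environment
```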
Shell Scripts
For any GPU command:
```bash
#!/bin/bash
MAX_RETRIES=3
RETRY_DELAY=30

for i in $(seq 1 $MAX_RETRIES); do
    if your-gpu-command; then
        exit 0
    fi
    if [ $i -lt $MAX_RETRIES ]; then
        echo "GPU busy, retrying in ${RETRY_DELAY}s..."
        sleep $RETRY_DELAY
    fi
done

echo "Failed after $MAX_RETRIES attempts"
exit 1
```
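The loop above retries on any nonzero exit. A variant (a sketch, still using the placeholder `your-gpu-command`) that retries only when the output actually mentions an OOM:

```bash
#!/bin/bash
MAX_RETRIES=3
RETRY_DELAY=30

for i in $(seq 1 "$MAX_RETRIES"); do
    # Capture combined stdout/stderr so we can inspect the failure
    if OUT=$(your-gpu-command 2>&1); then
        echo "$OUT"
        exit 0
    fi
    if ! grep -qi "out of memory" <<<"$OUT"; then
        echo "$OUT" >&2
        echo "Non-OOM failure, not retrying" >&2
        exit 1
    fi
    if [ "$i" -lt "$MAX_RETRIES" ]; then
        echo "GPU busy, retrying in ${RETRY_DELAY}s..."
        sleep "$RETRY_DELAY"
    fi
done

echo "Failed after $MAX_RETRIES attempts"
exit 1
```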
Service Signaling Protocol (Optional Enhancement)
For better coordination, services can implement these endpoints:
1. Auto-Unload on Idle
Services can automatically unload models after idle timeout:
```python
# FastAPI example
import asyncio
import logging
import time

from fastapi import FastAPI

app = FastAPI()
logger = logging.getLogger(__name__)

current_handler = None       # holds the loaded model wrapper, if any
last_request_time = None
auto_unload_minutes = 5      # configurable

async def auto_unload_task():
    """Background task that unloads model after idle timeout."""
    global current_handler
    while True:
        await asyncio.sleep(60)  # Check every minute
        if current_handler is None or last_request_time is None:
            continue
        idle = time.time() - last_request_time
        if idle > (auto_unload_minutes * 60):
            logger.info(f"Auto-unloading model after {idle/60:.1f} minutes")
            current_handler.unload()
            current_handler = None

@app.on_event("startup")
async def startup():
    asyncio.create_task(auto_unload_task())
```
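The idle clock above depends on `last_request_time` being refreshed on every request. A minimal sketch using FastAPI middleware, reusing `app` and the globals from the example:

```python
from fastapi import Request

@app.middleware("http")
async def track_last_request(request: Request, call_next):
    """Record activity so auto_unload_task measures idle time correctly."""
    global last_request_time
    last_request_time = time.time()
    return await call_next(request)
```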
2. Request-Unload Endpoint
Allow other services to politely request unload:
@app.post("/request-unload") async def request_unload(): """Request model unload if idle.""" if current_handler is None: return {"status": "ok", "unloaded": False, "message": "No model loaded"} idle = time.time() - last_request_time # Only unload if idle for at least 30 seconds if idle < 30: return { "status": "busy", "unloaded": False, "message": f"Model in use (idle {idle:.0f}s)", "idle_seconds": idle, } # Unload the model logger.info("Unloading on request from another service") current_handler.unload() current_handler = None return { "status": "ok", "unloaded": True, "message": "Model unloaded", "idle_seconds": idle, }
3. Enhanced Status Endpoint
@app.get("/status") async def get_status(): idle = time.time() - last_request_time if last_request_time else None return { "status": "ok", "model_loaded": current_handler is not None, "idle_seconds": idle, "auto_unload_enabled": auto_unload_minutes is not None, "auto_unload_minutes": auto_unload_minutes, }
4. Using the Protocol
Before loading a large model, request other services to unload:
```python
import requests

SERVICES = [
    "http://10.99.0.3:8765",  # Invoice OCR
    # Add other services here
]

for service in SERVICES:
    try:
        resp = requests.post(f"{service}/request-unload", timeout=5)
        result = resp.json()
        if result.get("unloaded"):
            print(f"✓ {service} unloaded")
        elif result.get("status") == "busy":
            print(f"⏱ {service} busy, will retry OOM")
    except requests.RequestException:
        pass  # Service not available

# Now try to load your model (with OOM retry as backup)
```
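Putting the layers together: a short sketch that treats signaling as best effort and keeps OOM retry as the safety net (`request_peer_unloads` is the loop above wrapped in a function; `load_model_with_retry` is from the PyTorch section):

```python
def load_with_coordination():
    """Politely ask peers to free VRAM, then load with OOM retry as backup."""
    request_peer_unloads()          # best effort; peers may refuse if busy
    return load_model_with_retry()  # retries handle any remaining contention
```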
Helper script: see `request_gpu_unload.py` in the OneCuriousRabbit repo.
Key Settings
Invoice OCR (Qwen2-VL)
- ✅ OOM retry: 3x with 30s delays
- ✅ Auto-unload: 5 minutes idle (configurable via `--auto-unload-minutes`)
- ✅ Request-unload endpoint: `POST http://10.99.0.3:8765/request-unload`
Ollama
- ✅ Auto-unload: `OLLAMA_KEEP_ALIVE=30s` in systemd override
Your Other Services
- Implement OOM retry pattern (required)
- Optionally implement signaling protocol (auto-unload + request-unload endpoints)
How It Works
Passive (OOM Retry Only)
```
12:00    - Scheduled Qwen task starts, loads 4GB
12:01    - User uploads invoice, tries to load 18GB → OOM
12:01    - Invoice OCR waits 30s
12:01:30 - Qwen task finishes, auto-unloads after 30s
12:02    - Invoice OCR retry succeeds, loads 18GB
12:03    - Invoice processing completes, unloads
12:03:30 - GPU is free again
```
Active (With Signaling)
```
12:00 - User starts Flux generation
12:00 - Flux calls POST /request-unload on Invoice OCR
12:00 - Invoice OCR idle for 4 minutes → unloads immediately
12:00 - Flux loads its model (22GB) successfully
12:05 - Flux completes, auto-unloads after 5 minutes
```
Benefits of signaling:
- Faster starts (no waiting for OOM retry delays)
- More predictable behavior
- Can request unload proactively before attempting load
- OOM retry still works as fallback if service is busy