Claude-skill-registry gpu-aware-training-config

GPU-aware PPO training configuration for A100/H100. Trigger when training is slow or GPU utilization is low.

Install

Source - clone the upstream repo:

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code - install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/gpu-aware-training-config" ~/.claude/skills/majiayu000-claude-skill-registry-gpu-aware-training-config && rm -rf "$T"

Manifest: skills/data/gpu-aware-training-config/SKILL.md

Source content

GPU-Aware Training Configuration

Experiment Overview

Item        | Details
------------|-------------------------------------------------------------------------
Date        | 2025-12-18
Goal        | Fix extremely slow A100 training (FPS 4,500 vs expected 30,000-50,000)
Environment | Google Colab A100, PyTorch 2.x, CUDA
Status      | Success - 10x+ speedup achieved

Context

Training was extremely slow on an A100 Colab GPU despite using "quick_test" mode. Investigation revealed that get_auto_config(training_mode="quick_test") was returning a generic config with n_envs=256 and torch.compile=False, completely ignoring GPU capabilities.

Root Cause

The original get_auto_config() function had training modes that completely bypassed GPU detection:

# WRONG - ignores GPU capabilities
def get_auto_config(total_timesteps, training_mode="auto"):
    if training_mode == "quick_test":
        return NativePPOConfig(
            n_envs=256,           # Too low for A100!
            compile_policy=False,  # Missing 3-6x speedup!
            # ... generic settings
        )

Verified Solution

Training modes must layer on top of GPU-specific settings, not replace them:

def get_auto_config(total_timesteps=1_000_000, training_mode="auto"):
    # Step 1: ALWAYS detect GPU first
    gpu_tier = _detect_gpu_tier()  # "h100", "a100", "high", "medium", "low"

    # Step 2: Get GPU-appropriate base config
    if gpu_tier == "h100":
        config = _get_h100_base_config()
    elif gpu_tier == "a100":
        config = _get_a100_base_config()
    # ... etc

    # Step 3: Apply training mode ADJUSTMENTS (not replacements)
    if training_mode == "quick_test":
        config.total_timesteps = 10_000_000
        config.validation_interval = 25
        # BUT KEEP GPU-specific n_envs, compile_policy, etc!

    return config

GPU Configuration Matrix

GPU Tier   | n_envs | n_steps | minibatch | compile | FP8   | Expected FPS
-----------|--------|---------|-----------|---------|-------|----------------
H100-80GB  | 2048   | 512     | 8192      | True    | True  | 80,000-120,000
A100-80GB  | 2048   | 512     | 8192      | True    | False | 50,000-80,000
A100-40GB  | 1024   | 512     | 4096      | True    | False | 40,000-60,000
RTX 4090   | 1024   | 512     | 4096      | True    | False | 30,000-50,000
RTX 3090   | 512    | 512     | 2048      | True    | False | 20,000-35,000
Generic    | 256    | 512     | 2048      | False   | False | 5,000-15,000
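
For concreteness, one row of the matrix could be encoded as a base-config factory along these lines. This is a sketch only: the stand-in dataclass mirrors just the matrix columns (the real NativePPOConfig has more fields), and use_fp8 is an assumed name for the FP8 flag.

from dataclasses import dataclass

@dataclass
class NativePPOConfig:  # stand-in with only the matrix columns; the real class is larger
    n_envs: int = 256
    n_steps: int = 512
    minibatch_size: int = 2048
    compile_policy: bool = False
    use_fp8: bool = False  # assumed field name for the FP8 column

def _get_a100_base_config() -> NativePPOConfig:
    """A100-80GB row: saturate the GPU with 2048 envs, enable torch.compile, skip FP8."""
    return NativePPOConfig(
        n_envs=2048,
        n_steps=512,
        minibatch_size=8192,
        compile_policy=True,
        use_fp8=False,  # FP8 needs Hopper (H100, compute capability 9.0+)
    )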

Training Mode Adjustments

Training modes should ONLY adjust these parameters:

Mode       | timesteps | n_epochs | validation_interval | Notes
-----------|-----------|----------|---------------------|------------------
quick_test | 10M       | 10       | 25                  | Fast iteration
standard   | 50M       | 12       | 50                  | Development
production | 200M      | 15       | 100                 | Full training
extended   | 500M      | 20       | 200                 | Maximum learning
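
Expressed as code, the table becomes a small lookup that is layered on top of whatever base config the GPU tier produced. The attribute names follow the table and the solution code above; treat this as a sketch, not the actual implementation.

# Mode -> adjustments from the table; hardware-specific params are deliberately absent.
MODE_ADJUSTMENTS = {
    "quick_test": dict(total_timesteps=10_000_000,  n_epochs=10, validation_interval=25),
    "standard":   dict(total_timesteps=50_000_000,  n_epochs=12, validation_interval=50),
    "production": dict(total_timesteps=200_000_000, n_epochs=15, validation_interval=100),
    "extended":   dict(total_timesteps=500_000_000, n_epochs=20, validation_interval=200),
}

def apply_training_mode(config, training_mode):
    """Layer mode adjustments onto a GPU base config without touching n_envs, compile_policy, etc."""
    for field, value in MODE_ADJUSTMENTS.get(training_mode, {}).items():
        setattr(config, field, value)
    return config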

GPU Detection Code

import torch

def _detect_gpu_tier() -> str:
    """Detect GPU tier for optimal configuration."""
    if not torch.cuda.is_available():
        return "cpu"

    gpu_name = torch.cuda.get_device_name(0).lower()
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9

    # Check for H100 (compute capability 9.0+)
    compute_cap = torch.cuda.get_device_capability(0)
    if compute_cap[0] >= 9:
        return "h100"

    # Check for A100
    if "a100" in gpu_name:
        return "a100"

    # Tier by VRAM
    if vram_gb >= 40:
        return "high"
    elif vram_gb >= 20:
        return "medium"
    else:
        return "low"

Failed Attempts (Critical)

Attempt                                  | Why it Failed                                 | Lesson Learned
-----------------------------------------|-----------------------------------------------|---------------------------------------------
Training mode completely replaces config | Lost GPU-specific optimizations               | Modes should layer adjustments, not replace
n_envs=256 on A100                       | Only 5-12% GPU utilization                    | Need 1000+ envs for GPU saturation
compile_policy=False in quick_test       | Missing 3-6x speedup                          | Always enable torch.compile on modern GPUs
Fixed config for all GPUs                | Wasted resources or OOM errors                | Detect GPU and scale accordingly
Checking GPU only in "auto" mode         | quick_test/standard modes got generic config  | ALWAYS detect GPU, regardless of mode

Diagnostic Checklist

If training is slow, check these in order (a short diagnostic sketch follows the list):

  1. FPS < 10,000 on A100? → Check n_envs (should be 1024+)
  2. torch.compile: False? → Enable it (3-6x speedup after warmup)
  3. GPU util < 20%? → Increase n_envs
  4. Memory errors? → Decrease n_envs or minibatch_size
  5. H100 with FP8=False? → Enable FP8 for additional speedup
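
A short diagnostic sketch covering the first three checks. torch.cuda.utilization() needs pynvml installed, and the config field names are the same assumed names used elsewhere in this document.

import torch

def print_gpu_diagnostics(config) -> None:
    """Print the numbers the checklist above asks about."""
    if not torch.cuda.is_available():
        print("No CUDA device visible - training will run on CPU")
        return
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.0f} GB, "
          f"compute capability: {props.major}.{props.minor}")
    try:
        # Same number nvidia-smi reports as GPU-Util; requires pynvml.
        print(f"GPU utilization: {torch.cuda.utilization(0)}%")
    except Exception:
        print("GPU utilization unavailable (install pynvml or check nvidia-smi)")
    print(f"n_envs: {getattr(config, 'n_envs', '?')}, "
          f"compile_policy: {getattr(config, 'compile_policy', '?')}")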

Key Insights

  • GPU detection must happen FIRST, before applying training modes
  • Research shows 1000+ parallel environments needed for GPU saturation
  • torch.compile provides a 3-6x speedup but takes 10+ minutes to warm up (see the sketch after this list)
  • FP8 is only available on Hopper architecture (H100, compute capability 9.0+)
  • Training modes should adjust timesteps/epochs, NOT hardware-specific params
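
Enabling torch.compile is a one-line change. The sketch below uses a placeholder network; "reduce-overhead" is the compile mode referenced in the Quick Fix section, trading a long warmup for lower per-step launch overhead via CUDA graphs.

import torch
import torch.nn as nn

# Placeholder policy; in practice this is the PPO actor-critic module.
policy = nn.Sequential(nn.Linear(64, 256), nn.Tanh(), nn.Linear(256, 8))

if torch.cuda.is_available():
    policy = policy.cuda()
    # The first few hundred steps are slow while compilation warms up.
    policy = torch.compile(policy, mode="reduce-overhead")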

Quick Fix Command

If you see slow training on A100, the config should show:

n_envs: 1024+
torch.compile: True
compile_mode: reduce-overhead

If any of these are wrong, the get_auto_config() function isn't detecting the GPU properly.
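
A minimal runtime check for those three values, assuming the config object exposes the field names used above (compile_mode as an attribute name is an assumption):

config = get_auto_config(total_timesteps=10_000_000, training_mode="quick_test")

assert config.n_envs >= 1024, f"n_envs={config.n_envs}: GPU detection likely failed"
assert config.compile_policy, "torch.compile disabled: expect a 3-6x slowdown"
assert getattr(config, "compile_mode", None) == "reduce-overhead"
print("Config looks GPU-aware:", config.n_envs, config.compile_policy)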
