Claude-skill-registry gpu-aware-training-config

GPU-aware PPO training configuration for A100/H100. Trigger when training is slow or GPU utilization is low.

Install

Source - clone the upstream repo:

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code - install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/gpu-aware-training-config" ~/.claude/skills/majiayu000-claude-skill-registry-gpu-aware-training-config && rm -rf "$T"

Manifest: skills/data/gpu-aware-training-config/SKILL.md

Source content

GPU-Aware Training Configuration

Experiment Overview

Item        | Details
------------|-------------------------------------------------------------------------
Date        | 2025-12-18
Goal        | Fix extremely slow A100 training (FPS 4,500 vs expected 30,000-50,000)
Environment | Google Colab A100, PyTorch 2.x, CUDA
Status      | Success - 10x+ speedup achieved

Context

Training was extremely slow on an A100 Colab GPU despite using "quick_test" mode. Investigation revealed that get_auto_config(training_mode="quick_test") was returning a generic config with n_envs=256 and torch.compile=False, completely ignoring GPU capabilities.

Root Cause

The original get_auto_config() function had training modes that completely bypassed GPU detection:

# WRONG - ignores GPU capabilities
def get_auto_config(total_timesteps, training_mode="auto"):
    if training_mode == "quick_test":
        return NativePPOConfig(
            n_envs=256,           # Too low for A100!
            compile_policy=False,  # Missing 3-6x speedup!
            # ... generic settings
        )

Verified Solution

Training modes must layer on top of GPU-specific settings, not replace them:

def get_auto_config(total_timesteps=1_000_000, training_mode="auto"):
    # Step 1: ALWAYS detect GPU first
    gpu_tier = _detect_gpu_tier()  # "h100", "a100", "high", "medium", "low"

    # Step 2: Get GPU-appropriate base config
    if gpu_tier == "h100":
        config = _get_h100_base_config()
    elif gpu_tier == "a100":
        config = _get_a100_base_config()
    # ... etc

    # Step 3: Apply training mode ADJUSTMENTS (not replacements)
    if training_mode == "quick_test":
        config.total_timesteps = 10_000_000
        config.validation_interval = 25
        # BUT KEEP GPU-specific n_envs, compile_policy, etc!

    return config

GPU Configuration Matrix

GPU Tier   | n_envs | n_steps | minibatch | compile | FP8   | Expected FPS
-----------|--------|---------|-----------|---------|-------|----------------
H100-80GB  | 2048   | 512     | 8192      | True    | True  | 80,000-120,000
A100-80GB  | 2048   | 512     | 8192      | True    | False | 50,000-80,000
A100-40GB  | 1024   | 512     | 4096      | True    | False | 40,000-60,000
RTX 4090   | 1024   | 512     | 4096      | True    | False | 30,000-50,000
RTX 3090   | 512    | 512     | 2048      | True    | False | 20,000-35,000
Generic    | 256    | 512     | 2048      | False   | False | 5,000-15,000
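
For concreteness, one row of the matrix could be encoded as a base-config factory along these lines. This is a sketch only: the stand-in dataclass mirrors just the matrix columns (the real NativePPOConfig has more fields), and use_fp8 is an assumed name for the FP8 flag.

from dataclasses import dataclass

@dataclass
class NativePPOConfig:  # stand-in with only the matrix columns; the real class is larger
    n_envs: int = 256
    n_steps: int = 512
    minibatch_size: int = 2048
    compile_policy: bool = False
    use_fp8: bool = False  # assumed field name for the FP8 column

def _get_a100_base_config() -> NativePPOConfig:
    """A100-80GB row: saturate the GPU with 2048 envs, enable torch.compile, skip FP8."""
    return NativePPOConfig(
        n_envs=2048,
        n_steps=512,
        minibatch_size=8192,
        compile_policy=True,
        use_fp8=False,  # FP8 needs Hopper (H100, compute capability 9.0+)
    )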

Training Mode Adjustments

Training modes should ONLY adjust these parameters:

Mode       | timesteps | n_epochs | validation_interval | Notes
-----------|-----------|----------|---------------------|------------------
quick_test | 10M       | 10       | 25                  | Fast iteration
standard   | 50M       | 12       | 50                  | Development
production | 200M      | 15       | 100                 | Full training
extended   | 500M      | 20       | 200                 | Maximum learning
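
Expressed as code, the table becomes a small lookup that is layered on top of whatever base config the GPU tier produced. The attribute names follow the table and the solution code above; treat this as a sketch, not the actual implementation.

# Mode -> adjustments from the table; hardware-specific params are deliberately absent.
MODE_ADJUSTMENTS = {
    "quick_test": dict(total_timesteps=10_000_000,  n_epochs=10, validation_interval=25),
    "standard":   dict(total_timesteps=50_000_000,  n_epochs=12, validation_interval=50),
    "production": dict(total_timesteps=200_000_000, n_epochs=15, validation_interval=100),
    "extended":   dict(total_timesteps=500_000_000, n_epochs=20, validation_interval=200),
}

def apply_training_mode(config, training_mode):
    """Layer mode adjustments onto a GPU base config without touching n_envs, compile_policy, etc."""
    for field, value in MODE_ADJUSTMENTS.get(training_mode, {}).items():
        setattr(config, field, value)
    return config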

GPU Detection Code

import torch

def _detect_gpu_tier() -> str:
    """Detect GPU tier for optimal configuration."""
    if not torch.cuda.is_available():
        return "cpu"

    gpu_name = torch.cuda.get_device_name(0).lower()
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9

    # Check for H100 (compute capability 9.0+)
    compute_cap = torch.cuda.get_device_capability(0)
    if compute_cap[0] >= 9:
        return "h100"

    # Check for A100
    if "a100" in gpu_name:
        return "a100"

    # Tier by VRAM
    if vram_gb >= 40:
        return "high"
    elif vram_gb >= 20:
        return "medium"
    else:
        return "low"

Failed Attempts (Critical)

Attempt                                  | Why it Failed                                 | Lesson Learned
-----------------------------------------|-----------------------------------------------|---------------------------------------------
Training mode completely replaces config | Lost GPU-specific optimizations               | Modes should layer adjustments, not replace
n_envs=256 on A100                       | Only 5-12% GPU utilization                    | Need 1000+ envs for GPU saturation
compile_policy=False in quick_test       | Missing 3-6x speedup                          | Always enable torch.compile on modern GPUs
Fixed config for all GPUs                | Wasted resources or OOM errors                | Detect GPU and scale accordingly
Checking GPU only in "auto" mode         | quick_test/standard modes got generic config  | ALWAYS detect GPU, regardless of mode

Diagnostic Checklist

If training is slow, check these in order (a short diagnostic sketch follows the list):

  1. FPS < 10,000 on A100? → Check n_envs (should be 1024+)
  2. torch.compile: False? → Enable it (3-6x speedup after warmup)
  3. GPU util < 20%? → Increase n_envs
  4. Memory errors? → Decrease n_envs or minibatch_size
  5. H100 with FP8=False? → Enable FP8 for additional speedup
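
A short diagnostic sketch covering the first three checks. torch.cuda.utilization() needs pynvml installed, and the config field names are the same assumed names used elsewhere in this document.

import torch

def print_gpu_diagnostics(config) -> None:
    """Print the numbers the checklist above asks about."""
    if not torch.cuda.is_available():
        print("No CUDA device visible - training will run on CPU")
        return
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.0f} GB, "
          f"compute capability: {props.major}.{props.minor}")
    try:
        # Same number nvidia-smi reports as GPU-Util; requires pynvml.
        print(f"GPU utilization: {torch.cuda.utilization(0)}%")
    except Exception:
        print("GPU utilization unavailable (install pynvml or check nvidia-smi)")
    print(f"n_envs: {getattr(config, 'n_envs', '?')}, "
          f"compile_policy: {getattr(config, 'compile_policy', '?')}")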

Key Insights

  • GPU detection must happen FIRST, before applying training modes
  • Research shows 1000+ parallel environments needed for GPU saturation
  • torch.compile provides a 3-6x speedup but takes 10+ minutes to warm up (see the sketch after this list)
  • FP8 is only available on Hopper architecture (H100, compute capability 9.0+)
  • Training modes should adjust timesteps/epochs, NOT hardware-specific params
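
Enabling torch.compile is a one-line change. The sketch below uses a placeholder network; "reduce-overhead" is the compile mode referenced in the Quick Fix section, trading a long warmup for lower per-step launch overhead via CUDA graphs.

import torch
import torch.nn as nn

# Placeholder policy; in practice this is the PPO actor-critic module.
policy = nn.Sequential(nn.Linear(64, 256), nn.Tanh(), nn.Linear(256, 8))

if torch.cuda.is_available():
    policy = policy.cuda()
    # The first few hundred steps are slow while compilation warms up.
    policy = torch.compile(policy, mode="reduce-overhead")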

Quick Fix Command

If you see slow training on A100, the config should show:

n_envs: 1024+
torch.compile: True
compile_mode: reduce-overhead

If any of these are wrong, the get_auto_config() function isn't detecting the GPU properly.
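
A minimal runtime check for those three values, assuming the config object exposes the field names used above (compile_mode as an attribute name is an assumption):

config = get_auto_config(total_timesteps=10_000_000, training_mode="quick_test")

assert config.n_envs >= 1024, f"n_envs={config.n_envs}: GPU detection likely failed"
assert config.compile_policy, "torch.compile disabled: expect a 3-6x slowdown"
assert getattr(config, "compile_mode", None) == "reduce-overhead"
print("Config looks GPU-aware:", config.n_envs, config.compile_policy)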
