best-gpu-perf

install
source · Clone the upstream repo
git clone https://github.com/phtphtpht/gpu-perf-playbook
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/phtphtpht/gpu-perf-playbook "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skill" ~/.claude/skills/phtphtpht-gpu-perf-playbook-best-gpu-perf && rm -rf "$T"
manifest: skill/SKILL.md
source content

Best GPU Performance Skill

When to use this skill

Any time the user's task involves making AI/ML code run faster, use less memory, scale to more GPUs, or serve inference more efficiently. This includes writing new code with performance in mind, reviewing existing code for bottlenecks, debugging OOM or low utilization, and creating optimization plans.

Core Methodology: The 4-Step Loop

Every optimization follows this loop. Never skip steps.

1. MEASURE  →  2. CLASSIFY  →  3. APPLY  →  4. VERIFY
   baseline       bottleneck      targeted      regression
   + profile      type            fix           test

Step 1: Measure — Establish a reproducible baseline

  • Fix: commit hash, dependency versions, input shape/batch/seq/dtype, random seed
  • Separate warmup from steady-state (compile/autotune/cache-building costs excluded)
  • Record at minimum: throughput (tokens/s or samples/s), step time (P50/P95), peak memory
  • Use CUDA events for timing, not time.time():
import torch

def benchmark(fn, warmup=20, iters=100):
    # Warmup absorbs compile/autotune/cache-building costs so they don't pollute the measurement.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(iters):
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # GPU time in milliseconds
    times.sort()
    return {"mean_ms": sum(times) / len(times),
            "p50_ms": times[len(times) // 2],
            "p95_ms": times[int(len(times) * 0.95)]}

Step 2: Classify — Identify the bottleneck type

| Bottleneck | Symptom | Diagnostic |
|---|---|---|
| Compute-bound | GPU util high, SM busy, near peak FLOPS | Nsight Compute roofline near compute ceiling |
| Memory-bound | GPU util fluctuates, HBM bandwidth saturated | Roofline near memory ceiling; element-wise/norm ops dominate |
| Launch/Host-bound | GPU util very low, many tiny kernels, CPU busy | Nsight Systems shows kernel gaps; many <10μs kernels |
| Communication-bound | All-reduce/all-gather dominates step time | Nsight Systems shows exposed NCCL collective time |
| Data-loading-bound | GPU idle between steps | Profiler shows gaps between forward passes; CPU at 100% |

Tool hierarchy: Nsight Systems (who waits for whom) → PyTorch Profiler (which op is expensive) → Nsight Compute (why is this kernel slow)
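A minimal sketch of the middle rung (PyTorch Profiler) applied to classification, assuming a step function like the one benchmarked above; step_fn and the schedule numbers are illustrative, not prescriptive:

import torch
from torch.profiler import profile, schedule, ProfilerActivity

def profile_steps(step_fn, steps=8):
    # Skip one step, warm up two, then record five steady-state steps on CPU + CUDA.
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=5),
        record_shapes=True,
    ) as prof:
        for _ in range(steps):
            step_fn()
            prof.step()
    # Sort by GPU time: a long tail of tiny kernels suggests launch-bound;
    # element-wise/norm ops at the top suggests memory-bound.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))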

Step 3: Apply — Use the targeted fix

Consult the appropriate reference file based on the bottleneck and task:

| Topic | Reference file | When to read |
|---|---|---|
| Training optimization (data, AMP, compile, memory, checkpointing) | ../docs/training.md | Training speed, OOM, MFU |
| Inference optimization (KV cache, batching, quantization, serving) | ../docs/inference.md | Inference latency, throughput, LLM serving |
| Distributed training (DDP, FSDP, ZeRO, communication, topology) | ../docs/distributed.md | Multi-GPU, scaling efficiency |
| Kernel & low-level (Triton, CUDA, memory access, fusion) | ../docs/kernel.md | Custom ops, memory-bound hot spots |
| Profiling commands & tools | ../docs/profiling.md | Setting up profiling, interpreting results |
| Anti-patterns & code review checklist | ../docs/checklist.md | Code review, pre-commit check, debugging |

Step 4: Verify — Confirm improvement and catch regressions

  • Re-run the same baseline benchmark (a sketch follows this list)
  • Check: throughput improved? Memory within budget? Numerics stable?
  • Feature flag every optimization so it can be rolled back
  • For distributed changes: verify scaling efficiency, not just single-node speed
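A minimal sketch of the verify step, reusing the benchmark helper from Step 1; baseline_step, optimized_step, memory_budget_bytes, and the OPT_FUSED_ATTN flag name are illustrative placeholders, not part of any real API:

import os
import torch

USE_FUSED_ATTN = os.environ.get("OPT_FUSED_ATTN", "1") == "1"   # feature flag: flip to roll back

baseline = benchmark(baseline_step)     # same shapes, dtypes, and seed as the recorded baseline
candidate = benchmark(optimized_step)

assert candidate["p50_ms"] <= baseline["p50_ms"], "optimization regressed step time"
assert torch.cuda.max_memory_allocated() <= memory_budget_bytes, "over memory budget"
# Numerics: also compare losses/logits against the baseline within a bf16-appropriate tolerance.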

Priority System (P0 → P1 → P2)

Always complete all applicable P0 items before moving to P1. Only enter P2 with profiler evidence.

P0 — Default for all code (low risk, high reward)

  1. Mixed precision (BF16/FP16): Enable AMP with bf16 on Ampere+. Tensor Core utilization is the single biggest lever.
  2. torch.compile: Wrap the model (not the loop). Start with default mode.
  3. DataLoader: num_workers>0, pin_memory=True, persistent_workers=True, .to(device, non_blocking=True)
  4. Eliminate sync points: no .item(), .cpu(), or print(tensor) in the hot path (see the sketch after this list). Use torch.inference_mode() for eval.
  5. optimizer.zero_grad(set_to_none=True) and fused=True optimizer
  6. Dimension alignment: hidden dim, vocab size, batch size → multiples of 8 for fp16/bf16 Tensor Cores, ideally multiples of 64
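A minimal sketch of item 4: keep metrics on the GPU and sync once per logging interval instead of once per step. train_step, loader, and log_every are illustrative placeholders:

# .item() forces a device sync every step; accumulate on-device instead.
running_loss = torch.zeros((), device="cuda")
for step, batch in enumerate(loader):
    loss = train_step(batch)          # returns a CUDA scalar tensor
    running_loss += loss.detach()     # no sync: stays on device
    if (step + 1) % log_every == 0:
        # One sync per log interval instead of one per step.
        print(f"loss={(running_loss / log_every).item():.4f}")
        running_loss.zero_()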

P1 — Apply when profiler points to specific bottleneck

  1. FlashAttention / F.scaled_dot_product_attention with is_causal=True (see the sketch after this list)
  2. Gradient checkpointing (activation memory > parameter memory)
  3. CUDA Graphs / reduce-overhead mode (launch-bound workloads)
  4. Static shapes / bucketing (reduce recompilation)
  5. FSDP / ZeRO (model doesn't fit on single GPU)
  6. DDP no_sync() + gradient accumulation
  7. Quantization for inference (AWQ, GPTQ, bitsandbytes)
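A minimal sketch of item 1: replacing hand-written softmax(QKᵀ)V attention with SDPA so PyTorch can dispatch to a fused FlashAttention or memory-efficient kernel. The (batch, heads, seq, head_dim) layout is the standard convention assumed here:

import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    # Avoids materializing the (seq_len x seq_len) attention matrix when a fused kernel is available.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)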

P2 — Expert-level, with strong profiler evidence only

  1. Custom Triton kernels for hot-spot fusion (see the sketch after this list)
  2. NCCL topology tuning, CPU pinning, NUMA alignment
  3. Tensor Parallel / Pipeline Parallel
  4. Custom CUDA kernels
  5. Structured sparsity
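For item 1, a minimal sketch of the shape of a fused element-wise Triton kernel (add + ReLU in one pass); real hot-spot fusions follow the same load → compute → store pattern. This is an illustrative example, not taken from the reference docs:

import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # One read of x and y, one write of the result: two kernels' worth of HBM traffic in one pass.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out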

Standard Training Template

import os

import torch
from torch.utils.data import DataLoader

# --- P0 defaults ---
torch.set_float32_matmul_precision("high")          # TF32 on Ampere+
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

loader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=min(4 * num_gpus, os.cpu_count()),
    pin_memory=True,
    persistent_workers=True,
    drop_last=True,
)

model = torch.compile(model.cuda())                  # compile the model
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, fused=True)

for batch in loader:
    batch = {k: v.cuda(non_blocking=True) for k, v in batch.items()}
    optimizer.zero_grad(set_to_none=True)

    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = model(**batch)

    loss.backward()
    optimizer.step()

Quick Decision Trees

"My training is slow"

GPU util < 50%?
├─ Yes → Is CPU at 100%?
│  ├─ Yes → DataLoader bottleneck → more workers, pin_memory, pre-cache data
│  └─ No  → Launch overhead → torch.compile, larger batch, CUDA Graphs
├─ GPU util 50-80%?
│  ├─ Many small kernels? → Fusion: torch.compile, FlashAttention, Triton
│  └─ HBM bandwidth saturated? → Memory-bound: reduce reads (fusion, bf16)
└─ GPU util > 80%, MFU < 30%?
   ├─ No AMP? → Enable bf16/fp16
   ├─ Hand-written attention? → Switch to SDPA/FlashAttention
   └─ Dimensions misaligned? → Pad to multiples of 8/64

"OOM"

1. Enable AMP (fp32 → bf16 roughly halves activation memory; autocast keeps master params in fp32)
2. Gradient checkpointing on Transformer blocks
3. Reduce batch size + gradient accumulation (see the sketch after this list)
4. Reduce shape variance (bucketing, fixed padding)
5. FSDP / ZeRO to shard across GPUs
6. CPU offload (last resort — significant speed penalty)
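A minimal sketch of step 3 (smaller micro-batches plus gradient accumulation), reusing the training template above; accum_steps is an illustrative value:

accum_steps = 4   # effective batch = micro-batch size x accum_steps
optimizer.zero_grad(set_to_none=True)
for i, batch in enumerate(loader):
    batch = {k: v.cuda(non_blocking=True) for k, v in batch.items()}
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = model(**batch) / accum_steps   # scale so accumulated gradients match the large batch
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)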

"Multi-GPU scaling is bad"

1. Check: is communication exposed? (Nsight Systems timeline)
2. DDP: increase bucket_cap_mb, enable gradient_as_bucket_view
3. Gradient accumulation + no_sync() for micro-batches (see the sketch after this list)
4. FSDP: verify prefetch settings, sharding strategy
5. Network: check NVLink vs PCIe topology (nvidia-smi topo -m)
6. Data: ensure balanced batch distribution across ranks
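A minimal sketch of items 2-3 (larger buckets, bucket-view gradients, and no_sync() on non-final micro-batches), assuming the accumulation loop above and an already-initialized process group; accum_steps and the bucket size are illustrative:

import contextlib
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(
    model.cuda(),
    bucket_cap_mb=100,               # fewer, larger all-reduce buckets
    gradient_as_bucket_view=True,    # avoid an extra gradient copy
)

for i, batch in enumerate(loader):
    batch = {k: v.cuda(non_blocking=True) for k, v in batch.items()}
    last_micro_batch = (i + 1) % accum_steps == 0
    # Skip the all-reduce on every micro-batch except the last one.
    sync_ctx = contextlib.nullcontext() if last_micro_batch else ddp_model.no_sync()
    with sync_ctx:
        with torch.autocast("cuda", dtype=torch.bfloat16):
            loss = ddp_model(**batch) / accum_steps
        loss.backward()
    if last_micro_batch:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)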

Key Numbers to Remember

| GPU | BF16 Tensor TFLOPS | HBM BW (TB/s) | Memory | Arithmetic intensity crossover |
|---|---|---|---|---|
| A100 80GB | 312 | 2.0 | 80 GB | ~156 FLOP/Byte |
| H100 SXM | 989 | 3.35 | 80 GB | ~295 FLOP/Byte |
| RTX 4090 | 330 | 1.0 | 24 GB | ~330 FLOP/Byte |

Arithmetic intensity crossover = Tensor FLOPS ÷ HBM bandwidth. Operations below this ratio are memory-bound. Most element-wise ops are ~1-4 FLOP/Byte → almost always memory-bound → fusion is the answer, not more compute.
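A worked example of the crossover arithmetic using the A100 row above; the GEMM shapes are arbitrary and bf16 (2 bytes/element) is assumed:

def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    flops = 2 * m * n * k                                       # multiply-accumulates
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)   # read A and B, write C
    return flops / bytes_moved

crossover = 312e12 / 2.0e12                                  # A100 80GB: ~156 FLOP/Byte
print(gemm_arithmetic_intensity(4096, 4096, 4096))           # ~1365 -> compute-bound
print(gemm_arithmetic_intensity(1, 4096, 4096))              # ~1    -> memory-bound (decode-style GEMV)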

Memory Budget Formula

For a model with Φ parameters, training with bf16 + Adam:

Parameters:       2Φ bytes (bf16)
Gradients:        2Φ bytes (bf16)  
Optimizer (Adam): 4Φ (fp32 copy) + 4Φ (m) + 4Φ (v) = 12Φ bytes
Activations:      ∝ batch × seq_len × hidden_dim × num_layers
────────────────────────────────
Total (no activations): ~16Φ bytes
7B model → ~112 GB (won't fit on one 80GB A100 without sharding)
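The same accounting as a quick calculator; the 16-byte constant is the breakdown above (2 + 2 + 12), activations excluded:

def training_memory_gb(num_params, bytes_per_param=16):
    # bf16 params (2) + bf16 grads (2) + fp32 Adam master/m/v (12); activations not included
    return num_params * bytes_per_param / 1e9

print(training_memory_gb(7e9))   # ~112 GB -> shard with FSDP/ZeRO or offload on an 80 GB GPU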