best-gpu-perf
```bash
git clone https://github.com/phtphtpht/gpu-perf-playbook
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/phtphtpht/gpu-perf-playbook "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skill" ~/.claude/skills/phtphtpht-gpu-perf-playbook-best-gpu-perf && rm -rf "$T"
```
skill/SKILL.md

Best GPU Performance Skill
When to use this skill
Any time the user's task involves making AI/ML code run faster, use less memory, scale to more GPUs, or serve inference more efficiently. This includes writing new code with performance in mind, reviewing existing code for bottlenecks, debugging OOM or low utilization, and creating optimization plans.
Core Methodology: The 4-Step Loop
Every optimization follows this loop. Never skip steps.
```
1. MEASURE   →   2. CLASSIFY   →   3. APPLY    →   4. VERIFY
   baseline        bottleneck       targeted        regression
   + profile       type             fix             test
```
Step 1: Measure — Establish a reproducible baseline
- Fix: commit hash, dependency versions, input shape/batch/seq/dtype, random seed
- Separate warmup from steady-state (compile/autotune/cache-building costs excluded)
- Record at minimum: throughput (tokens/s or samples/s), step time (P50/P95), peak memory
- Use CUDA events for timing, not `time.time()`:
```python
import torch

def benchmark(fn, warmup=20, iters=100):
    # warmup: exclude compile/autotune/cache-building costs from the measurement
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(iters):
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds

    return {
        "mean_ms": sum(times) / len(times),
        "p50_ms": sorted(times)[len(times) // 2],
        "p95_ms": sorted(times)[int(len(times) * 0.95)],
    }
```
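For example, the harness might be used like this (a minimal sketch; the small MLP and shapes are placeholders, not part of the playbook):

```python
import torch
import torch.nn as nn

# illustrative: time a forward pass of a toy model with the benchmark() harness above
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda().to(torch.bfloat16)
x = torch.randn(64, 4096, device="cuda", dtype=torch.bfloat16)

with torch.inference_mode():
    stats = benchmark(lambda: model(x))
print(stats)   # {'mean_ms': ..., 'p50_ms': ..., 'p95_ms': ...}
```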
Step 2: Classify — Identify the bottleneck type
| Bottleneck | Symptom | Diagnostic |
|---|---|---|
| Compute-bound | GPU util high, SM busy, near peak FLOPS | Nsight Compute roofline near compute ceiling |
| Memory-bound | GPU util fluctuates, HBM bandwidth saturated | Roofline near memory ceiling; element-wise/norm ops dominate |
| Launch/Host-bound | GPU util very low, many tiny kernels, CPU busy | Nsight Systems shows kernel gaps; many <10μs kernels |
| Communication-bound | All-reduce/all-gather dominates step time | Nsight Systems shows exposed NCCL collective time |
| Data-loading-bound | GPU idle between steps | Profiler shows gaps between forward passes; CPU at 100% |
Tool hierarchy: Nsight Systems (who waits for whom) → PyTorch Profiler (which op is expensive) → Nsight Compute (why is this kernel slow)
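A minimal sketch of the middle layer (PyTorch Profiler), assuming `loader` and `train_step()` already exist in your training script:

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

# profile a few steady-state steps; wait/warmup keep compile and cache-building out of the trace
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=5, warmup=5, active=5),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./prof"),
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()
        if step >= 15:
            break

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))
```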
Step 3: Apply — Use the targeted fix
Consult the appropriate reference file based on the bottleneck and task:
| Topic | Reference file | When to read |
|---|---|---|
| Training optimization (data, AMP, compile, memory, checkpointing) | | Training speed, OOM, MFU |
| Inference optimization (KV cache, batching, quantization, serving) | | Inference latency, throughput, LLM serving |
| Distributed training (DDP, FSDP, ZeRO, communication, topology) | | Multi-GPU, scaling efficiency |
| Kernel & low-level (Triton, CUDA, memory access, fusion) | | Custom ops, memory-bound hot spots |
| Profiling commands & tools | | Setting up profiling, interpreting results |
| Anti-patterns & code review checklist | | Code review, pre-commit check, debugging |
Step 4: Verify — Confirm improvement and catch regressions
- Re-run the same baseline benchmark
- Check: throughput improved? Memory within budget? Numerics stable? (a minimal verification sketch follows this list)
- Feature flag every optimization so it can be rolled back
- For distributed changes: verify scaling efficiency, not just single-node speed
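A minimal sketch of that check, reusing `benchmark()` from Step 1 (the `verify()` helper, its thresholds, and the `baseline` dict layout are illustrative, not part of the playbook):

```python
import torch

# illustrative regression check: re-run the Step 1 benchmark and compare against a saved baseline
# `baseline` is assumed to hold {"p50_ms": ..., "peak_mem_bytes": ...} from the baseline run
def verify(fn, baseline, ref_output, new_output, max_slowdown=1.02, rtol=1e-2, atol=1e-3):
    torch.cuda.reset_peak_memory_stats()
    result = benchmark(fn)                                   # benchmark() from Step 1
    assert result["p50_ms"] <= baseline["p50_ms"] * max_slowdown, "step-time regression"
    assert torch.cuda.max_memory_allocated() <= baseline["peak_mem_bytes"], "memory regression"
    assert torch.allclose(ref_output, new_output, rtol=rtol, atol=atol), "numerics drifted"
    return result
```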
Priority System (P0 → P1 → P2)
Always complete all applicable P0 items before moving to P1. Only enter P2 with profiler evidence.
P0 — Default for all code (low risk, high reward)
- Mixed precision (BF16/FP16): Enable AMP with bf16 on Ampere+. Tensor Core utilization is the single biggest lever.
- torch.compile: Wrap the model (not the loop). Start with default mode.
- DataLoader: `num_workers>0`, `pin_memory=True`, `persistent_workers=True`, `.to(device, non_blocking=True)`
- Eliminate sync points: no `.item()`, `.cpu()`, or `print(tensor)` in the hot path. Use `torch.inference_mode()` for eval. (see the sketch after this list)
- `optimizer.zero_grad(set_to_none=True)` and `fused=True` optimizer
- Dimension alignment: Hidden dim, vocab size, batch size → multiples of 8 (fp16) or 64 (optimal)
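For the sync-point bullet above, a before/after sketch (assumes a `loader` and a `train_step()` that returns the loss tensor; names are illustrative):

```python
import torch

# BAD: .item() forces a GPU→CPU sync on every step
#   running_loss += loss.item()

# BETTER: accumulate on the GPU, sync once per logging interval
running_loss = torch.zeros((), device="cuda")
log_every = 100
for step, batch in enumerate(loader):
    loss = train_step(batch)
    running_loss += loss.detach()
    if (step + 1) % log_every == 0:
        print(f"step {step + 1}: loss {running_loss.item() / log_every:.4f}")  # single sync
        running_loss.zero_()
```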
P1 — Apply when profiler points to specific bottleneck
- FlashAttention / `F.scaled_dot_product_attention` with `is_causal=True` (see the sketch after this list)
- Gradient checkpointing (activation memory > parameter memory)
- CUDA Graphs / `reduce-overhead` compile mode (launch-bound workloads)
- Static shapes / bucketing (reduce recompilation)
- FSDP / ZeRO (model doesn't fit on a single GPU)
- DDP `no_sync()` + gradient accumulation
- Quantization for inference (AWQ, GPTQ, bitsandbytes)
P2 — Expert-level, with strong profiler evidence only
- Custom Triton kernels for hot-spot fusion (see the sketch after this list)
- NCCL topology tuning, CPU pinning, NUMA alignment
- Tensor Parallel / Pipeline Parallel
- Custom CUDA kernels
- Structured sparsity
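As an illustration of the first P2 item, a minimal Triton fusion sketch: one kernel doing an element-wise add + ReLU in a single pass over memory (the op itself is just a stand-in for your profiled hot spot):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # each program instance handles one contiguous block of elements
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)   # one read per input, one write

def fused_add_relu(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```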
Standard Training Template
```python
import os
import torch
from torch.utils.data import DataLoader

# --- P0 defaults ---
torch.set_float32_matmul_precision("high")      # TF32 matmuls on Ampere+
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

loader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=min(4 * num_gpus, os.cpu_count()),
    pin_memory=True,
    persistent_workers=True,
    drop_last=True,
)

model = torch.compile(model.cuda())             # compile the model, not the loop
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, fused=True)

for batch in loader:
    batch = {k: v.cuda(non_blocking=True) for k, v in batch.items()}
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = model(**batch)
    loss.backward()
    optimizer.step()
```
Quick Decision Trees
"My training is slow"
```
GPU util < 50%?
├─ Yes → Is CPU at 100%?
│   ├─ Yes → DataLoader bottleneck → more workers, pin_memory, pre-cache data
│   └─ No  → Launch overhead → torch.compile, larger batch, CUDA Graphs
├─ GPU util 50-80%?
│   ├─ Many small kernels?      → Fusion: torch.compile, FlashAttention, Triton
│   └─ HBM bandwidth saturated? → Memory-bound: reduce reads (fusion, bf16)
└─ GPU util > 80%, MFU < 30%?
    ├─ No AMP?                  → Enable bf16/fp16
    ├─ Hand-written attention?  → Switch to SDPA/FlashAttention
    └─ Dimensions misaligned?   → Pad to multiples of 8/64
```
"OOM"
1. Enable AMP (fp32 → bf16 = 2x memory reduction on params + activations)
2. Gradient checkpointing on Transformer blocks (see the sketch below)
3. Reduce batch size + gradient accumulation
4. Reduce shape variance (bucketing, fixed padding)
5. FSDP / ZeRO to shard across GPUs
6. CPU offload (last resort: significant speed penalty)
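A minimal sketch of item 2, assuming the model exposes its Transformer blocks as an `nn.ModuleList` (adapt the structure to your model):

```python
from torch.utils.checkpoint import checkpoint

# recompute block activations during backward instead of storing them
def forward_with_checkpointing(blocks, x):
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```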
"Multi-GPU scaling is bad"
1. Check: is communication exposed? (Nsight Systems timeline)
2. DDP: increase `bucket_cap_mb`, enable `gradient_as_bucket_view`
3. Gradient accumulation + `no_sync()` for micro-batches (see the sketch below)
4. FSDP: verify prefetch settings and sharding strategy
5. Network: check NVLink vs PCIe topology (`nvidia-smi topo -m`)
6. Data: ensure balanced batch distribution across ranks
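A minimal sketch of item 3, assuming `model` is already wrapped in `DistributedDataParallel` and `compute_loss()` is your existing loss function:

```python
import contextlib

# skip the gradient all-reduce on micro-batches; sync only on the last one
accum_steps = 4
for step, batch in enumerate(loader):
    sync_now = (step + 1) % accum_steps == 0
    ctx = contextlib.nullcontext() if sync_now else model.no_sync()
    with ctx:
        loss = compute_loss(model, batch) / accum_steps
        loss.backward()
    if sync_now:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```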
Key Numbers to Remember
| GPU | BF16 Tensor TFLOPS | HBM BW (TB/s) | Memory | Arithmetic Intensity Crossover |
|---|---|---|---|---|
| A100 80GB | 312 | 2.0 | 80 GB | ~156 FLOP/Byte |
| H100 SXM | 989 | 3.35 | 80 GB | ~295 FLOP/Byte |
| RTX 4090 | 330 | 1.0 | 24 GB | ~330 FLOP/Byte |
Arithmetic intensity crossover = Tensor FLOPS ÷ HBM bandwidth. Operations below this ratio are memory-bound. Most element-wise ops are ~1-4 FLOP/Byte → almost always memory-bound → fusion is the answer, not more compute.
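A back-of-envelope sketch of that classification for a bf16 matmul, ignoring cache reuse:

```python
# arithmetic intensity of C[M,N] = A[M,K] @ B[K,N] in bf16
def arithmetic_intensity(M, K, N, bytes_per_el=2):
    flops = 2 * M * K * N                                # one multiply-add per (m, k, n)
    traffic = bytes_per_el * (M * K + K * N + M * N)     # read A and B once, write C once
    return flops / traffic

arithmetic_intensity(4096, 4096, 4096)   # ~1365 FLOP/Byte → compute-bound on A100 (crossover ~156)
arithmetic_intensity(1, 4096, 4096)      # ~1 FLOP/Byte    → memory-bound (GEMV-shaped, decode-style)
```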
Memory Budget Formula
For a model with Φ parameters, training with bf16 + Adam:
```
Parameters:        2Φ bytes (bf16)
Gradients:         2Φ bytes (bf16)
Optimizer (Adam):  4Φ (fp32 copy) + 4Φ (m) + 4Φ (v) = 12Φ bytes
Activations:       ∝ batch × seq_len × hidden_dim × num_layers
────────────────────────────────
Total (no activations): ~16Φ bytes

7B model → ~112 GB (won't fit on one 80 GB A100 without sharding)
```
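The same formula as a quick calculator (a rough sketch; activations excluded):

```python
# rough training-memory estimate: bf16 params/grads + fp32 Adam state, activations excluded
def training_mem_gb(num_params):
    bytes_per_param = 2 + 2 + 12     # params + grads + Adam (fp32 copy, m, v)
    return num_params * bytes_per_param / 1e9

training_mem_gb(7e9)    # ≈ 112 GB, matching the 7B example above
```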