best-gpu-perf

install
source · Clone the upstream repo
git clone https://github.com/phtphtpht/gpu-perf-playbook
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/phtphtpht/gpu-perf-playbook "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skill" ~/.claude/skills/phtphtpht-gpu-perf-playbook-best-gpu-perf && rm -rf "$T"
manifest: skill/SKILL.md
source content

Best GPU Performance Skill

When to use this skill

Any time the user's task involves making AI/ML code run faster, use less memory, scale to more GPUs, or serve inference more efficiently. This includes writing new code with performance in mind, reviewing existing code for bottlenecks, debugging OOM or low utilization, and creating optimization plans.

Core Methodology: The 4-Step Loop

Every optimization follows this loop. Never skip steps.

1. MEASURE  →  2. CLASSIFY  →  3. APPLY  →  4. VERIFY
   baseline       bottleneck      targeted      regression
   + profile      type            fix           test

Step 1: Measure — Establish a reproducible baseline

  • Fix: commit hash, dependency versions, input shape/batch/seq/dtype, random seed
  • Separate warmup from steady-state (compile/autotune/cache-building costs excluded)
  • Record at minimum: throughput (tokens/s or samples/s), step time (P50/P95), peak memory
  • Use CUDA events for timing, not time.time():
import torch

def benchmark(fn, warmup=20, iters=100):
    # Warmup absorbs compile/autotune/cache-building costs so they don't pollute the measurement.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(iters):
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # GPU time in milliseconds
    times.sort()
    return {"mean_ms": sum(times) / len(times),
            "p50_ms": times[len(times) // 2],
            "p95_ms": times[int(len(times) * 0.95)]}

Step 2: Classify — Identify the bottleneck type

| Bottleneck | Symptom | Diagnostic |
|---|---|---|
| Compute-bound | GPU util high, SM busy, near peak FLOPS | Nsight Compute roofline near compute ceiling |
| Memory-bound | GPU util fluctuates, HBM bandwidth saturated | Roofline near memory ceiling; element-wise/norm ops dominate |
| Launch/Host-bound | GPU util very low, many tiny kernels, CPU busy | Nsight Systems shows kernel gaps; many <10μs kernels |
| Communication-bound | All-reduce/all-gather dominates step time | Nsight Systems shows exposed NCCL collective time |
| Data-loading-bound | GPU idle between steps | Profiler shows gaps between forward passes; CPU at 100% |

Tool hierarchy: Nsight Systems (who waits for whom) → PyTorch Profiler (which op is expensive) → Nsight Compute (why is this kernel slow)
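A minimal sketch of the middle rung (PyTorch Profiler) applied to classification, assuming a step function like the one benchmarked above; step_fn and the schedule numbers are illustrative, not prescriptive:

import torch
from torch.profiler import profile, schedule, ProfilerActivity

def profile_steps(step_fn, steps=8):
    # Skip one step, warm up two, then record five steady-state steps on CPU + CUDA.
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=5),
        record_shapes=True,
    ) as prof:
        for _ in range(steps):
            step_fn()
            prof.step()
    # Sort by GPU time: a long tail of tiny kernels suggests launch-bound;
    # element-wise/norm ops at the top suggests memory-bound.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))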

Step 3: Apply — Use the targeted fix

Consult the appropriate reference file based on the bottleneck and task:

| Topic | Reference file | When to read |
|---|---|---|
| Training optimization (data, AMP, compile, memory, checkpointing) | ../docs/training.md | Training speed, OOM, MFU |
| Inference optimization (KV cache, batching, quantization, serving) | ../docs/inference.md | Inference latency, throughput, LLM serving |
| Distributed training (DDP, FSDP, ZeRO, communication, topology) | ../docs/distributed.md | Multi-GPU, scaling efficiency |
| Kernel & low-level (Triton, CUDA, memory access, fusion) | ../docs/kernel.md | Custom ops, memory-bound hot spots |
| Profiling commands & tools | ../docs/profiling.md | Setting up profiling, interpreting results |
| Anti-patterns & code review checklist | ../docs/checklist.md | Code review, pre-commit check, debugging |

Step 4: Verify — Confirm improvement and catch regressions

  • Re-run the same baseline benchmark (a sketch follows this list)
  • Check: throughput improved? Memory within budget? Numerics stable?
  • Feature flag every optimization so it can be rolled back
  • For distributed changes: verify scaling efficiency, not just single-node speed
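A minimal sketch of the verify step, reusing the benchmark helper from Step 1; baseline_step, optimized_step, memory_budget_bytes, and the OPT_FUSED_ATTN flag name are illustrative placeholders, not part of any real API:

import os
import torch

USE_FUSED_ATTN = os.environ.get("OPT_FUSED_ATTN", "1") == "1"   # feature flag: flip to roll back

baseline = benchmark(baseline_step)     # same shapes, dtypes, and seed as the recorded baseline
candidate = benchmark(optimized_step)

assert candidate["p50_ms"] <= baseline["p50_ms"], "optimization regressed step time"
assert torch.cuda.max_memory_allocated() <= memory_budget_bytes, "over memory budget"
# Numerics: also compare losses/logits against the baseline within a bf16-appropriate tolerance.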

Priority System (P0 → P1 → P2)

Always complete all applicable P0 items before moving to P1. Only enter P2 with profiler evidence.

P0 — Default for all code (low risk, high reward)

  1. Mixed precision (BF16/FP16): Enable AMP with bf16 on Ampere+. Tensor Core utilization is the single biggest lever.
  2. torch.compile: Wrap the model (not the loop). Start with default mode.
  3. DataLoader: num_workers>0, pin_memory=True, persistent_workers=True, .to(device, non_blocking=True)
  4. Eliminate sync points: no .item(), .cpu(), or print(tensor) in the hot path (see the sketch after this list). Use torch.inference_mode() for eval.
  5. optimizer.zero_grad(set_to_none=True) and fused=True optimizer
  6. Dimension alignment: hidden dim, vocab size, batch size → multiples of 8 for fp16/bf16 Tensor Cores, ideally multiples of 64
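A minimal sketch of item 4: keep metrics on the GPU and sync once per logging interval instead of once per step. train_step, loader, and log_every are illustrative placeholders:

# .item() forces a device sync every step; accumulate on-device instead.
running_loss = torch.zeros((), device="cuda")
for step, batch in enumerate(loader):
    loss = train_step(batch)          # returns a CUDA scalar tensor
    running_loss += loss.detach()     # no sync: stays on device
    if (step + 1) % log_every == 0:
        # One sync per log interval instead of one per step.
        print(f"loss={(running_loss / log_every).item():.4f}")
        running_loss.zero_()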

P1 — Apply when profiler points to specific bottleneck

  1. FlashAttention / F.scaled_dot_product_attention with is_causal=True (see the sketch after this list)
  2. Gradient checkpointing (activation memory > parameter memory)
  3. CUDA Graphs / reduce-overhead mode (launch-bound workloads)
  4. Static shapes / bucketing (reduce recompilation)
  5. FSDP / ZeRO (model doesn't fit on single GPU)
  6. DDP no_sync() + gradient accumulation
  7. Quantization for inference (AWQ, GPTQ, bitsandbytes)
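A minimal sketch of item 1: replacing hand-written softmax(QKᵀ)V attention with SDPA so PyTorch can dispatch to a fused FlashAttention or memory-efficient kernel. The (batch, heads, seq, head_dim) layout is the standard convention assumed here:

import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    # Avoids materializing the (seq_len x seq_len) attention matrix when a fused kernel is available.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)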

P2 — Expert-level, with strong profiler evidence only

  1. Custom Triton kernels for hot-spot fusion (see the sketch after this list)
  2. NCCL topology tuning, CPU pinning, NUMA alignment
  3. Tensor Parallel / Pipeline Parallel
  4. Custom CUDA kernels
  5. Structured sparsity
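For item 1, a minimal sketch of the shape of a fused element-wise Triton kernel (add + ReLU in one pass); real hot-spot fusions follow the same load → compute → store pattern. This is an illustrative example, not taken from the reference docs:

import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # One read of x and y, one write of the result: two kernels' worth of HBM traffic in one pass.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out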

Standard Training Template

import os

import torch
from torch.utils.data import DataLoader

# --- P0 defaults ---
torch.set_float32_matmul_precision("high")          # TF32 on Ampere+
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

loader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=min(4 * num_gpus, os.cpu_count()),
    pin_memory=True,
    persistent_workers=True,
    drop_last=True,
)

model = torch.compile(model.cuda())                  # compile the model
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, fused=True)

for batch in loader:
    batch = {k: v.cuda(non_blocking=True) for k, v in batch.items()}
    optimizer.zero_grad(set_to_none=True)

    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = model(**batch)

    loss.backward()
    optimizer.step()

Quick Decision Trees

"My training is slow"

GPU util < 50%?
├─ Yes → Is CPU at 100%?
│  ├─ Yes → DataLoader bottleneck → more workers, pin_memory, pre-cache data
│  └─ No  → Launch overhead → torch.compile, larger batch, CUDA Graphs
├─ GPU util 50-80%?
│  ├─ Many small kernels? → Fusion: torch.compile, FlashAttention, Triton
│  └─ HBM bandwidth saturated? → Memory-bound: reduce reads (fusion, bf16)
└─ GPU util > 80%, MFU < 30%?
   ├─ No AMP? → Enable bf16/fp16
   ├─ Hand-written attention? → Switch to SDPA/FlashAttention
   └─ Dimensions misaligned? → Pad to multiples of 8/64

"OOM"

1. Enable AMP (fp32 → bf16 roughly halves activation memory; autocast keeps master params in fp32)
2. Gradient checkpointing on Transformer blocks
3. Reduce batch size + gradient accumulation (see the sketch after this list)
4. Reduce shape variance (bucketing, fixed padding)
5. FSDP / ZeRO to shard across GPUs
6. CPU offload (last resort — significant speed penalty)
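A minimal sketch of step 3 (smaller micro-batches plus gradient accumulation), reusing the training template above; accum_steps is an illustrative value:

accum_steps = 4   # effective batch = micro-batch size x accum_steps
optimizer.zero_grad(set_to_none=True)
for i, batch in enumerate(loader):
    batch = {k: v.cuda(non_blocking=True) for k, v in batch.items()}
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = model(**batch) / accum_steps   # scale so accumulated gradients match the large batch
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)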

"Multi-GPU scaling is bad"

1. Check: is communication exposed? (Nsight Systems timeline)
2. DDP: increase bucket_cap_mb, enable gradient_as_bucket_view
3. Gradient accumulation + no_sync() for micro-batches (see the sketch after this list)
4. FSDP: verify prefetch settings, sharding strategy
5. Network: check NVLink vs PCIe topology (nvidia-smi topo -m)
6. Data: ensure balanced batch distribution across ranks
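A minimal sketch of items 2-3 (larger buckets, bucket-view gradients, and no_sync() on non-final micro-batches), assuming the accumulation loop above and an already-initialized process group; accum_steps and the bucket size are illustrative:

import contextlib
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(
    model.cuda(),
    bucket_cap_mb=100,               # fewer, larger all-reduce buckets
    gradient_as_bucket_view=True,    # avoid an extra gradient copy
)

for i, batch in enumerate(loader):
    batch = {k: v.cuda(non_blocking=True) for k, v in batch.items()}
    last_micro_batch = (i + 1) % accum_steps == 0
    # Skip the all-reduce on every micro-batch except the last one.
    sync_ctx = contextlib.nullcontext() if last_micro_batch else ddp_model.no_sync()
    with sync_ctx:
        with torch.autocast("cuda", dtype=torch.bfloat16):
            loss = ddp_model(**batch) / accum_steps
        loss.backward()
    if last_micro_batch:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)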

Key Numbers to Remember

| GPU | BF16 Tensor TFLOPS | HBM BW (TB/s) | Memory | Arithmetic intensity crossover |
|---|---|---|---|---|
| A100 80GB | 312 | 2.0 | 80 GB | ~156 FLOP/Byte |
| H100 SXM | 989 | 3.35 | 80 GB | ~295 FLOP/Byte |
| RTX 4090 | 330 | 1.0 | 24 GB | ~330 FLOP/Byte |

Arithmetic intensity crossover = Tensor FLOPS ÷ HBM bandwidth. Operations below this ratio are memory-bound. Most element-wise ops are ~1-4 FLOP/Byte → almost always memory-bound → fusion is the answer, not more compute.
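A worked example of the crossover arithmetic using the A100 row above; the GEMM shapes are arbitrary and bf16 (2 bytes/element) is assumed:

def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    flops = 2 * m * n * k                                       # multiply-accumulates
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)   # read A and B, write C
    return flops / bytes_moved

crossover = 312e12 / 2.0e12                                  # A100 80GB: ~156 FLOP/Byte
print(gemm_arithmetic_intensity(4096, 4096, 4096))           # ~1365 -> compute-bound
print(gemm_arithmetic_intensity(1, 4096, 4096))              # ~1    -> memory-bound (decode-style GEMV)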

Memory Budget Formula

For a model with Φ parameters, training with bf16 + Adam:

Parameters:       2Φ bytes (bf16)
Gradients:        2Φ bytes (bf16)  
Optimizer (Adam): 4Φ (fp32 copy) + 4Φ (m) + 4Φ (v) = 12Φ bytes
Activations:      ∝ batch × seq_len × hidden_dim × num_layers
────────────────────────────────
Total (no activations): ~16Φ bytes
7B model → ~112 GB (won't fit on one 80GB A100 without sharding)
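The same accounting as a quick calculator; the 16-byte constant is the breakdown above (2 + 2 + 12), activations excluded:

def training_memory_gb(num_params, bytes_per_param=16):
    # bf16 params (2) + bf16 grads (2) + fp32 Adam master/m/v (12); activations not included
    return num_params * bytes_per_param / 1e9

print(training_memory_gb(7e9))   # ~112 GB -> shard with FSDP/ZeRO or offload on an 80 GB GPU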