Agent-almanac analyze-kernel-bottleneck
git clone https://github.com/pjt222/agent-almanac
T=$(mktemp -d) && git clone --depth=1 https://github.com/pjt222/agent-almanac "$T" && mkdir -p ~/.claude/skills && cp -r "$T/i18n/caveman-ultra/skills/analyze-kernel-bottleneck" ~/.claude/skills/pjt222-agent-almanac-analyze-kernel-bottleneck-f50632 && rm -rf "$T"
i18n/caveman-ultra/skills/analyze-kernel-bottleneck/SKILL.md

Analyze Kernel Bottleneck
Identify GPU kernel = compute-bound, memory-bound, latency-bound. Baseline perf → roofline classify → occupancy + compute/load ratio/tile → SASS instr mix + stall codes → smem cliff → decision matrix → right opt strategy.
Use When
- Pre-opt any CUDA kernel → baseline + classify
- After 1st working ver → ID opt path
- Underperforms vs theoretical peak
- Deciding cp.async vs larger tiles vs algorithmic restructure
In
- Required: Compiled kernel (`.cu` or `.cubin` + build cmd)
- Required: Bench harness launching via CUDA event timing
- Required: Problem dims (M, N, K for GEMM; seq_len, heads, head_dim for attention)
- Optional: Target GPU arch (default: GA104 / sm_86 / RTX 3070 Ti)
- Optional: Expected peak util % for compare
- Optional: Prior profiling data (Nsight Compute)
Do
Step 1: Baseline Perf
Run kernel w/ CUDA events (`BenchTimer`), record ms. Calc effective throughput:
- Compile if not built:
  ```
  nvcc --cubin -arch=sm_86 -O2 -o kernel.sm_86.cubin kernel.cu
  nvcc -arch=sm_86 -O2 -o bench bench.cu -lcuda -I../../phase2/common
  ```
- Run representative sizes, warmup pre-measurement: `./bench 4096 4096 4096`
- Record kernel ms from CUDA events (not wall-clock).
- Calc effective GFLOPS + BW:
  - GEMM: `effective_gflops = (2 * M * N * K) / (time_ms / 1000) / 1e9`
  - BW-limited: `effective_bw = total_bytes / (time_ms / 1000) / 1e9`
  - Flash Attention: `effective_gflops = (4 * batch * heads * seq_len^2 * head_dim) / (time_ms / 1000) / 1e9`
→ Baseline: kernel ms, effective GFLOPS, effective BW.
If err: Check launches return no err (`CHECK_CU`). Warmup pre-measurement. Dims must be large enough to saturate GPU (small → launch overhead bottleneck).
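The throughput formulas above as a minimal Python sketch (function names hypothetical; `time_ms` assumed already measured via CUDA events):

```python
def gemm_effective_gflops(m, n, k, time_ms):
    """Effective GFLOPS for GEMM: 2*M*N*K FLOPs over measured kernel time."""
    return (2 * m * n * k) / (time_ms / 1000) / 1e9

def effective_bw_gbs(total_bytes, time_ms):
    """Effective DRAM bandwidth in GB/s for a BW-limited kernel."""
    return total_bytes / (time_ms / 1000) / 1e9

def attention_effective_gflops(batch, heads, seq_len, head_dim, time_ms):
    """Flash Attention FLOP count: 4 * batch * heads * seq_len^2 * head_dim."""
    return (4 * batch * heads * seq_len**2 * head_dim) / (time_ms / 1000) / 1e9

# Hypothetical example: 4096^3 GEMM in 10 ms -> ~13744 effective GFLOPS
print(round(gemm_effective_gflops(4096, 4096, 4096, 10.0)))
```

Compare these numbers against the GA104 peaks in Step 2 to get attained fraction.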
Step 2: Roofline Classify
Arithmetic intensity vs machine balance → classify:
- Calc AI: `AI = FLOPs / bytes_loaded_from_global_memory`. Count only unique bytes from DRAM (not shared mem or register reuse).
- Lookup balance: `balance = peak_compute / peak_bandwidth`.
- Classify: `AI < balance` → memory-bound. `AI > balance` → compute-bound.
GA104 (RTX 3070 Ti) Reference:
| Resource | Peak | Unit |
|---|---|---|
| FP32 FFMA | 21.7 | TFLOPS |
| FP16 Tensor Core (HMMA) | 174 | TFLOPS |
| INT8 Tensor Core (IMMA) | 696 | TOPS |
| DRAM Bandwidth | 608 | GB/s |
| L2 Cache | 4 | MB |
| SMs | 48 | count |
Derived Balance Points:
| Precision | Balance Point (FLOP/byte) |
|---|---|
| FP32 FFMA | 21700 / 608 = 35.7 |
| FP16 TC | 174000 / 608 = 286.2 |
| INT8 TC | 696000 / 608 = 1144.7 |
- Compute attained: `attained = effective_throughput / peak_throughput`. Memory-bound → compare effective BW to 608 GB/s. Compute-bound → compare effective GFLOPS to relevant peak.
→ Classification: compute-bound, memory-bound, latency-bound (low occupancy → neither saturated) + numerical justification.
If err: Recheck byte counting. Watch redundant re-reads (e.g., 9x in direct conv2d no im2col). Neither saturated → latency-bound (Step 3).
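The AI-vs-balance classification above as a sketch (hypothetical helper; GA104 peaks from the reference table; latency-bound needs the Step 3 occupancy check and is not decided here):

```python
# GA104 peaks from the reference table above
PEAK_BW_GBS = 608.0
PEAKS_GFLOPS = {"fp32": 21700.0, "fp16_tc": 174000.0, "int8_tc": 696000.0}

def classify_roofline(flops, dram_bytes, precision="fp32"):
    """Classify by arithmetic intensity vs machine balance.
    dram_bytes = unique bytes moved from DRAM (no smem/register reuse)."""
    ai = flops / dram_bytes                           # FLOP/byte
    balance = PEAKS_GFLOPS[precision] / PEAK_BW_GBS   # machine balance point
    return ai, "memory-bound" if ai < balance else "compute-bound"

# FP32 4096^3 GEMM, assuming each of the three 4096x4096 fp32 matrices
# touches DRAM exactly once: AI ~ 682.7 >> 35.7 -> compute-bound
ai, cls = classify_roofline(2 * 4096**3, 3 * 4096 * 4096 * 4, "fp32")
```

Design note: keeping the peaks in a dict makes the wrong-peak trap (FP16 TC vs FP32, see Traps) an explicit parameter choice.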
Step 3: Occupancy
Active warps/SM from launch config + resource usage:
- Extract resource usage:
  ```
  nvcc --cubin -arch=sm_86 -O2 --resource-usage -o kernel.sm_86.cubin kernel.cu 2>&1 | grep -E 'registers|smem'
  ```
- Launch config: `warps_per_block = threads_per_block / 32`.
- Blocks/SM per limiting factor:
  - Register: `floor(65536 / (registers_per_thread * threads_per_block))`
  - Smem: `floor(available_smem_per_SM / smem_per_block)` → see Step 6 cliff
  - Warp: `floor(48 / warps_per_block)` (GA104 max: 48 warps/SM)
  - Block: 16 blocks/SM max (GA104)
- Actual blocks/SM = `min(register_limit, smem_limit, warp_limit, block_limit)`.
- Active warps/SM = `blocks_per_SM * warps_per_block`.
- Key threshold: 8 warps/SM enough for latency hiding on GA104. <8 = structural → latency-bound.
→ Occupancy table: blocks/SM, active warps/SM, limiting factor (registers, smem, warps).
If err: Check `cuFuncSetAttribute` for dynamic smem. Verify `--resource-usage` matches actual launch config. High register count → `--maxrregcount=N` (trade spills for occupancy).
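The occupancy arithmetic above as a sketch (hypothetical helper; GA104 limits hardcoded per the step):

```python
def ga104_occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Blocks/SM, active warps/SM, and limiting factor on GA104 (sm_86).
    Assumes 65536 regs/SM, 100 KB smem/SM, 48 warps/SM, 16 blocks/SM."""
    warps_per_block = threads_per_block // 32
    limits = {
        "registers": 65536 // (regs_per_thread * threads_per_block),
        "smem": (100 * 1024) // smem_per_block if smem_per_block else 16,
        "warps": 48 // warps_per_block,
        "blocks": 16,
    }
    factor = min(limits, key=limits.get)      # tightest resource wins
    blocks_per_sm = limits[factor]
    return blocks_per_sm, blocks_per_sm * warps_per_block, factor

# 128 threads, 64 regs/thread, 48 KB smem -> smem-limited: 2 blocks/SM, 8 warps/SM
blocks, warps, factor = ga104_occupancy(128, 64, 48 * 1024)
```

Feeding 56 KB instead of 48 KB shows the Step 6 cliff directly: blocks/SM drops from 2 to 1.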
Step 4: Compute/Load Ratio/Tile
Count compute instrs + load bytes/K-tile from SASS (not src):
- Disassemble: `cuobjdump -sass kernel.sm_86.cubin > kernel.sass`
- Count compute ops/tile (inner K-tile loop):
  - `grep -c 'HMMA' kernel.sass` → FP16 TC ops
  - `grep -c 'IMMA' kernel.sass` → INT8 TC ops
  - `grep -c 'FFMA' kernel.sass` → FP32 FMA
- Count global loads/tile: `grep -c 'LDG' kernel.sass` → global mem loads. Multiply by bytes/load (typically 16 bytes for LDG.128).
- Ratio: `compute_ops / load_ops` per tile.
- Classify (cp.async threshold, gpu_reflections.md Insight 2):
- High (>20:1): cp.async net-neg; warp interleaving already hides DRAM latency. Focus algorithmic. Ref: Flash Attention 64 HMMA/tile = high, cp.async -5%.
- Medium (5-20:1): cp.async may help, benchmark both paths.
- Low (<5:1): cp.async strongly beneficial; loads dominate, async copy hides latency. Ref: IGEMM 8 IMMA/tile = low, cp.async +35%.
→ Compute/load ratio + classification (high/medium/low) + cp.async rec.
If err: Count from SASS not src — compiler may fuse, eliminate, reorder. Inner loop only (K-tile iter) not entire kernel.
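The threshold classification above as a sketch (hypothetical helper; the example load counts are illustrative assumptions, not from the source kernels):

```python
def cp_async_recommendation(compute_ops, load_ops):
    """Map compute/load ratio per K-tile to a cp.async recommendation,
    per the thresholds above: >20:1 high, 5-20:1 medium, <5:1 low."""
    ratio = compute_ops / load_ops
    if ratio > 20:
        return ratio, "high", "cp.async net-negative; focus algorithmic"
    if ratio >= 5:
        return ratio, "medium", "benchmark both paths"
    return ratio, "low", "cp.async strongly beneficial"

# Illustrative (load counts assumed): 64 HMMA vs 2 loads -> high;
# 8 IMMA vs 4 loads -> low, matching the Flash Attention / IGEMM refs above.
```

Usage: feed the SASS counts from the grep steps above, not source-level counts.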
Step 5: SASS Instr Inspect
Full SASS instr mix + stall codes:
- Disassemble (if not done in Step 4): `cuobjdump -sass kernel.sm_86.cubin > kernel.sass`
- Count instr types:
  ```
  grep -c 'HMMA.16816' kernel.sass   # FP16 Tensor Core
  grep -c 'IMMA.16816' kernel.sass   # INT8 Tensor Core
  grep -c 'FFMA' kernel.sass         # FP32 fused multiply-add
  grep -c 'LDGSTS' kernel.sass       # cp.async (global->shared)
  grep -c 'LDG' kernel.sass          # Global load
  grep -c 'STS' kernel.sass          # Shared store
  grep -c 'LDS' kernel.sass          # Shared load
  grep -c 'BAR.SYNC' kernel.sass     # Barrier synchronization
  grep -c 'SHFL' kernel.sass         # Warp shuffle (reductions)
  grep -c 'MUFU' kernel.sass         # Special function unit
  ```
- Check stall codes on critical instrs:
  ```
  grep 'HMMA' kernel.sass | head -5   # Expect S08 minimum (hardware constraint)
  grep 'IMMA' kernel.sass | head -5   # Compiler emits S04, reducible to S02 via CuAssembler
  grep 'FFMA' kernel.sass | head -5   # Check for S04 (reducible to S01 on independent FFMAs)
  ```
- ID opt targets:
- HMMA S08: hardware min Ampere, no reduce. Focus elsewhere.
- IMMA S04: compiler conservative. CuAssembler → S02 (15-20% gain).
- FFMA S04: independent → S01 via CuAssembler.
- Excessive BAR.SYNC: over-sync between pipeline stages.
→ Instr count table + stall code summary + ID'd opt targets.
If err: Verify `cuobjdump` arch matches kernel compile target (both sm_86). Empty SASS output → cubin corrupt → recompile.
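The grep counting above as a Python sketch, handy when post-processing many kernels (the SASS fragment is a made-up illustration, not real `cuobjdump` output):

```python
import re
from collections import Counter

def count_sass_opcodes(sass_text, opcodes):
    """Count occurrences of each opcode mnemonic in SASS disassembly text
    (equivalent of the grep -c calls above)."""
    counts = Counter()
    for op in opcodes:
        counts[op] = len(re.findall(re.escape(op), sass_text))
    return counts

# Tiny made-up SASS fragment for illustration only:
sample = """
/*0040*/ LDGSTS.E.128 [R4], [R2] ;
/*0050*/ HMMA.16816.F32 R8, R16, R20, R8 ;
/*0060*/ HMMA.16816.F32 R12, R16, R22, R12 ;
/*0070*/ BAR.SYNC 0x0 ;
"""
counts = count_sass_opcodes(sample, ["HMMA.16816", "LDGSTS", "BAR.SYNC", "FFMA"])
```

`re.escape` matters: mnemonics like `HMMA.16816` contain regex metacharacters.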
Step 6: Smem Cliff
Smem usage crosses arch-specific occupancy cliff?
- Read smem/block from `--resource-usage` (Step 3) or `cuobjdump --res-usage kernel.sm_86.cubin`.
- Vs cliff:
- GA104 (sm_86): 100 KB max smem/SM. Cliff at 50 KB/block.
- Confirmed: 48 KB/block → 2 blocks/SM (good), 56 KB/block → 1 block/SM (2x regression).
- Above cliff (smem >50 KB/block):
- Blocks/SM drops to 1, active warps drop to warps_per_block (typically 4).
- 2x regression from exposed DRAM stalls.
- Double-buffering impact: Doubles smem. 30 KB current → 60 KB double-buf → crosses cliff. Eval async benefit vs occupancy loss.
- Record smem/block, blocks/SM, cliff crossed?
→ Smem/block + blocks/SM + explicit statement cliff crossed.
If err: Above cliff + occupancy bottleneck → change strategy: reduce tile so smem <50 KB, or accept 1 block/SM and compensate with higher compute/load ratio (more register reuse, longer K-tiles).
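The cliff check above as a sketch (hypothetical helper; 100 KB/SM and 50 KB cliff per GA104):

```python
def smem_cliff_check(smem_per_block_kb, smem_per_sm_kb=100):
    """GA104 smem occupancy: blocks/SM = floor(smem_per_SM / smem_per_block).
    Crossing 50 KB/block drops 2 blocks/SM -> 1 (2x regression)."""
    blocks_per_sm = smem_per_sm_kb // smem_per_block_kb
    over_cliff = smem_per_block_kb > smem_per_sm_kb // 2
    return blocks_per_sm, over_cliff

# 48 KB -> 2 blocks/SM (under cliff); 56 KB -> 1 block/SM (over cliff);
# double-buffering a 30 KB tile -> 60 KB -> crosses the cliff
```

This is the check to run before adding double-buffering: doubling smem silently crosses the cliff for any tile >25 KB.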
Step 7: Decision Matrix
Synthesize Steps 2-6 → opt strategy:
| Condition | Strategy |
|---|---|
| Memory-bound + low compute/load (<5:1) + smem under cliff | SW pipelining cp.async (LDGSTS). Overlap global loads w/ compute. |
| Memory-bound + high compute/load (>20:1) + 8+ warps | Warp interleaving already hides. Focus algorithmic: implicit GEMM, split-Q, im2col. |
| Compute-bound + FFMA-heavy | CuAssembler stall tighten: S04 → S01 on independent FFMAs. |
| Compute-bound + HMMA-heavy | S08 hardware min, no reduce. Increase tile reuse (larger M/N, longer K-loop). |
| Compute-bound + IMMA-heavy | CuAssembler: S04 → S02 on IMMA (compiler conservative). |
| Latency-bound (low occupancy) | Reduce smem/registers → more blocks/SM. >8 warps/SM. |
| Smem above cliff | Reduce tile or restructure → smem/block <50 KB (GA104). |
- Rank strategies by expected gain, via compute/load + occupancy data.
- Estimate gain range per strategy, how far from relevant ceiling.
- Flag conflicts: cp.async doubles smem (may cross cliff), larger tiles → register pressure (may reduce occupancy).
→ Ranked list recommended opts + predicted gain + conflicts.
If err: No clear winner → micro-benchmarks isolate each (cp.async alone, reduced tile alone) → measure actual pre-combine.
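A first-match walk of the matrix above as a sketch (hypothetical helper; real ranking also weighs estimated gains and the flagged conflicts):

```python
def pick_strategy(bound, ratio_class, smem_over_cliff, active_warps):
    """Return the first matching row of the decision matrix above.
    bound: 'compute'|'memory'|'latency'; ratio_class: 'high'|'medium'|'low'."""
    if smem_over_cliff:
        return "reduce tile or restructure: smem/block < 50 KB"
    if bound == "latency":
        return "reduce smem/registers for more blocks/SM, target >8 warps/SM"
    if bound == "memory" and ratio_class == "low":
        return "software pipelining via cp.async (LDGSTS)"
    if bound == "memory" and ratio_class == "high" and active_warps >= 8:
        return "algorithmic restructure (implicit GEMM, split-Q, im2col)"
    return "no clear winner: micro-benchmark each candidate in isolation"

# IGEMM-like case: memory-bound, low ratio, under cliff -> cp.async pipelining
```

The cliff check runs first because an over-cliff kernel invalidates the occupancy assumptions behind every other row.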
Step 8: Doc Findings
Structured bottleneck report:
- Baseline: kernel ms, effective GFLOPS + BW, problem dims.
- Roofline: AI, classification, attained fraction.
- Occupancy: blocks/SM, active warps/SM, limiting factor.
- Compute/load: ratio, classification, cp.async rec.
- SASS summary: instr counts, stall findings, CuAssembler targets.
- Smem cliff: smem/block, blocks/SM, status.
- Rec: ranked opt strategies + gain estimates.
## Bottleneck Analysis Report: [kernel_name]

### Baseline
- Problem: [dimensions]
- Kernel time: [X] ms
- Effective GFLOPS: [Y] | Effective BW: [Z] GB/s

### Roofline Classification
- Arithmetic intensity: [AI] FLOP/byte
- Balance point: [BP] FLOP/byte ([precision])
- Classification: **[compute|memory|latency]-bound**
- Attained fraction: [X]% of peak

### Occupancy
| Resource | Per Block | Limit/SM | Blocks/SM |
|----------|-----------|----------|-----------|
| Registers | [N]/thread | 65536 | [B] |
| Shared mem | [X] KB | 100 KB (cliff: 50 KB) | [B] |
| Warps | [W] | 48 | [B] |
| **Limiting** | | | **[min(B)]** |
- Active warps/SM: [W] ([sufficient|insufficient] for latency hiding)

### Compute/Load Ratio
- Compute ops/tile: [N] [HMMA|IMMA|FFMA]
- Load bytes/tile: [N] bytes ([N] LDG x [N] bytes)
- Ratio: [X]:1 — **[high|medium|low]**
- cp.async recommendation: [beneficial|neutral|detrimental]

### SASS Instruction Mix
| Instruction | Count | Notes |
|-------------|-------|-------|
| HMMA.16816 | [N] | Stall: S08 (hardware min) |
| IMMA.16816 | [N] | Stall: S04 (reducible to S02) |
| FFMA | [N] | Stall: S04 (reducible to S01) |
| LDG | [N] | |
| LDGSTS | [N] | cp.async |
| BAR.SYNC | [N] | |

### Smem Cliff
- Smem/block: [X] KB — [under|over] 50 KB cliff
- Blocks/SM: [B] — [no occupancy loss|occupancy halved]

### Recommended Optimizations (ranked)
1. [Strategy] — estimated [X-Y]% gain
2. [Strategy] — estimated [X-Y]% gain
3. [Strategy] — estimated [X-Y]% gain
→ Complete MD report consumable by kernel-optimizer agent or dev.
If err: Re-run different sizes (1024, 2048, 4096, 8192) → confirm not size-specific. Small may appear latency-bound when real bottleneck at scale is BW.
Check
- Baseline via CUDA events (not wall-clock)
- Roofline classification (compute/memory/latency bound)
- Occupancy + limiting factor
- Compute/load ratio/tile from SASS
- SASS instr mix + stall codes documented
- Smem cliff vs arch threshold
- Decision matrix + strategy rec
- Findings in structured report
Traps
- Re-read multiply: Direct conv2d reads weight 9x no im2col → byte count inflated 9x. Use actual unique bytes from DRAM, not total load instrs, for AI.
- Confuse FP16 TC peak w/ FP32: FP16 TC peak 174 TFLOPS, FP32 FFMA 21.7 TFLOPS — 8x diff. Wrong peak → roofline meaningless.
- Using 64 KB cliff not 50 KB GA104: GA104 (sm_86) 100 KB max smem/SM. Cliff 100/2 = 50 KB/block, not 64 KB. Arch-specific; other GPUs differ.
- Ignore warp interleaving when eval cp.async: 8 warps long compute (high compute/load) already hide DRAM via warp sched. cp.async → smem pressure + barrier overhead no benefit (Flash Attention -5%).
- Count instrs from src not SASS: Compiler may fuse, eliminate dead code, unroll differently, reorder. Always count from `cuobjdump -sass`.
- No warmup iters: 1st launch → JIT compile overhead + cold cache. Run 2-5 warmup iters pre-measurement.
→ `pipeline-gpu-kernel` — impl SW pipelining cp.async when memory-bound + low compute/load
→ `simulate-cpu-architecture` — complementary arch analysis of CPU-side bottlenecks in host-device workflows