Agent-almanac analyze-kernel-bottleneck
git clone https://github.com/pjt222/agent-almanac
T=$(mktemp -d) && git clone --depth=1 https://github.com/pjt222/agent-almanac "$T" && mkdir -p ~/.claude/skills && cp -r "$T/i18n/caveman-ultra/skills/analyze-kernel-bottleneck" ~/.claude/skills/pjt222-agent-almanac-analyze-kernel-bottleneck-f50632 && rm -rf "$T"
i18n/caveman-ultra/skills/analyze-kernel-bottleneck/SKILL.md

Analyze Kernel Bottleneck
Identify GPU kernel = compute-bound, memory-bound, latency-bound. Baseline perf → roofline classify → occupancy + compute/load ratio/tile → SASS instr mix + stall codes → smem cliff → decision matrix → right opt strategy.
Use When
- Pre-opt any CUDA kernel → baseline + classify
- After 1st working ver → ID opt path
- Underperforms vs theoretical peak
- Deciding cp.async vs larger tiles vs algorithmic restructure
In
- Required: Compiled kernel (`.cu` or `.cubin` + build cmd)
- Required: Bench harness launching via CUDA event timing
- Required: Problem dims (M, N, K for GEMM; seq_len, heads, head_dim for attention)
- Optional: Target GPU arch (default: GA104 / sm_86 / RTX 3070 Ti)
- Optional: Expected peak util % for compare
- Optional: Prior profiling data (Nsight Compute)
Do
Step 1: Baseline Perf
Run kernel w/ CUDA events (`BenchTimer`), record ms. Calc effective throughput:
- Compile if not built:
  ```
  nvcc --cubin -arch=sm_86 -O2 -o kernel.sm_86.cubin kernel.cu
  nvcc -arch=sm_86 -O2 -o bench bench.cu -lcuda -I../../phase2/common
  ```
- Run representative sizes, warmup pre-measurement: `./bench 4096 4096 4096`
- Record kernel ms from CUDA events (not wall-clock).
- Calc effective GFLOPS + BW:
  - GEMM: `effective_gflops = (2 * M * N * K) / (time_ms / 1000) / 1e9`
  - BW-limited: `effective_bw = total_bytes / (time_ms / 1000) / 1e9`
  - Flash Attention: `effective_gflops = (4 * batch * heads * seq_len^2 * head_dim) / (time_ms / 1000) / 1e9`
→ Baseline: kernel ms, effective GFLOPS, effective BW.
If err: Check launches return no err (`CHECK_CU`). Warmup pre-measurement. Dims must be large enough to saturate GPU (small → launch overhead bottleneck).
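The throughput formulas above as a minimal Python sketch (function names hypothetical; `time_ms` assumed already measured via CUDA events):

```python
def gemm_effective_gflops(m, n, k, time_ms):
    """Effective GFLOPS for GEMM: 2*M*N*K FLOPs over measured kernel time."""
    return (2 * m * n * k) / (time_ms / 1000) / 1e9

def effective_bw_gbs(total_bytes, time_ms):
    """Effective DRAM bandwidth in GB/s for a BW-limited kernel."""
    return total_bytes / (time_ms / 1000) / 1e9

def attention_effective_gflops(batch, heads, seq_len, head_dim, time_ms):
    """Flash Attention FLOP count: 4 * batch * heads * seq_len^2 * head_dim."""
    return (4 * batch * heads * seq_len**2 * head_dim) / (time_ms / 1000) / 1e9

# Hypothetical example: 4096^3 GEMM in 10 ms -> ~13744 effective GFLOPS
print(round(gemm_effective_gflops(4096, 4096, 4096, 10.0)))
```

Compare these numbers against the GA104 peaks in Step 2 to get attained fraction.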
Step 2: Roofline Classify
Arithmetic intensity vs machine balance → classify:
- Calc AI: `AI = FLOPs / bytes_loaded_from_global_memory`. Count only unique bytes from DRAM (not shared mem or register reuse).
- Lookup balance: `balance = peak_compute / peak_bandwidth`.
- Classify: `AI < balance` → memory-bound. `AI > balance` → compute-bound.
GA104 (RTX 3070 Ti) Reference:
| Resource | Peak | Unit |
|---|---|---|
| FP32 FFMA | 21.7 | TFLOPS |
| FP16 Tensor Core (HMMA) | 174 | TFLOPS |
| INT8 Tensor Core (IMMA) | 696 | TOPS |
| DRAM Bandwidth | 608 | GB/s |
| L2 Cache | 4 | MB |
| SMs | 48 | count |
Derived Balance Points:
| Precision | Balance Point (FLOP/byte) |
|---|---|
| FP32 FFMA | 21700 / 608 = 35.7 |
| FP16 TC | 174000 / 608 = 286.2 |
| INT8 TC | 696000 / 608 = 1144.7 |
- Compute attained: `attained = effective_throughput / peak_throughput`. Memory-bound → compare effective BW to 608 GB/s. Compute-bound → compare effective GFLOPS to relevant peak.
→ Classification: compute-bound, memory-bound, latency-bound (low occupancy → neither saturated) + numerical justification.
If err: Recheck byte counting. Watch redundant re-reads (e.g., 9x in direct conv2d no im2col). Neither saturated → latency-bound (Step 3).
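The AI-vs-balance classification above as a sketch (hypothetical helper; GA104 peaks from the reference table; latency-bound needs the Step 3 occupancy check and is not decided here):

```python
# GA104 peaks from the reference table above
PEAK_BW_GBS = 608.0
PEAKS_GFLOPS = {"fp32": 21700.0, "fp16_tc": 174000.0, "int8_tc": 696000.0}

def classify_roofline(flops, dram_bytes, precision="fp32"):
    """Classify by arithmetic intensity vs machine balance.
    dram_bytes = unique bytes moved from DRAM (no smem/register reuse)."""
    ai = flops / dram_bytes                           # FLOP/byte
    balance = PEAKS_GFLOPS[precision] / PEAK_BW_GBS   # machine balance point
    return ai, "memory-bound" if ai < balance else "compute-bound"

# FP32 4096^3 GEMM, assuming each of the three 4096x4096 fp32 matrices
# touches DRAM exactly once: AI ~ 682.7 >> 35.7 -> compute-bound
ai, cls = classify_roofline(2 * 4096**3, 3 * 4096 * 4096 * 4, "fp32")
```

Design note: keeping the peaks in a dict makes the wrong-peak trap (FP16 TC vs FP32, see Traps) an explicit parameter choice.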
Step 3: Occupancy
Active warps/SM from launch config + resource usage:
- Extract resource usage:
  ```
  nvcc --cubin -arch=sm_86 -O2 --resource-usage -o kernel.sm_86.cubin kernel.cu 2>&1 | grep -E 'registers|smem'
  ```
- Launch config: `warps_per_block = threads_per_block / 32`.
- Blocks/SM per limiting factor:
  - Register: `floor(65536 / (registers_per_thread * threads_per_block))`
  - Smem: `floor(available_smem_per_SM / smem_per_block)` → see Step 6 cliff
  - Warp: `floor(48 / warps_per_block)` (GA104 max: 48 warps/SM)
  - Block: 16 blocks/SM max (GA104)
- Actual blocks/SM = `min(register_limit, smem_limit, warp_limit, block_limit)`.
- Active warps/SM = `blocks_per_SM * warps_per_block`.
- Key threshold: 8 warps/SM enough for latency hiding on GA104. <8 = structural → latency-bound.
→ Occupancy table: blocks/SM, active warps/SM, limiting factor (registers, smem, warps).
If err: Check `cuFuncSetAttribute` for dynamic smem. Verify `--resource-usage` matches actual launch config. High register count → `--maxrregcount=N` (trade spills for occupancy).
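The occupancy arithmetic above as a sketch (hypothetical helper; GA104 limits hardcoded per the step):

```python
def ga104_occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Blocks/SM, active warps/SM, and limiting factor on GA104 (sm_86).
    Assumes 65536 regs/SM, 100 KB smem/SM, 48 warps/SM, 16 blocks/SM."""
    warps_per_block = threads_per_block // 32
    limits = {
        "registers": 65536 // (regs_per_thread * threads_per_block),
        "smem": (100 * 1024) // smem_per_block if smem_per_block else 16,
        "warps": 48 // warps_per_block,
        "blocks": 16,
    }
    factor = min(limits, key=limits.get)      # tightest resource wins
    blocks_per_sm = limits[factor]
    return blocks_per_sm, blocks_per_sm * warps_per_block, factor

# 128 threads, 64 regs/thread, 48 KB smem -> smem-limited: 2 blocks/SM, 8 warps/SM
blocks, warps, factor = ga104_occupancy(128, 64, 48 * 1024)
```

Feeding 56 KB instead of 48 KB shows the Step 6 cliff directly: blocks/SM drops from 2 to 1.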
Step 4: Compute/Load Ratio/Tile
Count compute instrs + load bytes/K-tile from SASS (not src):
- Disassemble: `cuobjdump -sass kernel.sm_86.cubin > kernel.sass`
- Count compute ops/tile (inner K-tile loop):
  - `grep -c 'HMMA' kernel.sass` → FP16 TC ops
  - `grep -c 'IMMA' kernel.sass` → INT8 TC ops
  - `grep -c 'FFMA' kernel.sass` → FP32 FMA
- Count global loads/tile: `grep -c 'LDG' kernel.sass` → global mem loads. Multiply by bytes/load (typically 16 bytes for LDG.128).
- Ratio: `compute_ops / load_ops` per tile.
- Classify (cp.async threshold, gpu_reflections.md Insight 2):
- High (>20:1): cp.async net-neg; warp interleaving already hides DRAM latency. Focus algorithmic. Ref: Flash Attention 64 HMMA/tile = high, cp.async -5%.
- Medium (5-20:1): cp.async may help, benchmark both paths.
- Low (<5:1): cp.async strongly beneficial; loads dominate, async copy hides latency. Ref: IGEMM 8 IMMA/tile = low, cp.async +35%.
→ Compute/load ratio + classification (high/medium/low) + cp.async rec.
If err: Count from SASS not src — compiler may fuse, eliminate, reorder. Inner loop only (K-tile iter) not entire kernel.
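The threshold classification above as a sketch (hypothetical helper; the example load counts are illustrative assumptions, not from the source kernels):

```python
def cp_async_recommendation(compute_ops, load_ops):
    """Map compute/load ratio per K-tile to a cp.async recommendation,
    per the thresholds above: >20:1 high, 5-20:1 medium, <5:1 low."""
    ratio = compute_ops / load_ops
    if ratio > 20:
        return ratio, "high", "cp.async net-negative; focus algorithmic"
    if ratio >= 5:
        return ratio, "medium", "benchmark both paths"
    return ratio, "low", "cp.async strongly beneficial"

# Illustrative (load counts assumed): 64 HMMA vs 2 loads -> high;
# 8 IMMA vs 4 loads -> low, matching the Flash Attention / IGEMM refs above.
```

Usage: feed the SASS counts from the grep steps above, not source-level counts.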
Step 5: SASS Instr Inspect
Full SASS instr mix + stall codes:
- Disassemble (if not done in Step 4): `cuobjdump -sass kernel.sm_86.cubin > kernel.sass`
- Count instr types:
  ```
  grep -c 'HMMA.16816' kernel.sass   # FP16 Tensor Core
  grep -c 'IMMA.16816' kernel.sass   # INT8 Tensor Core
  grep -c 'FFMA' kernel.sass         # FP32 fused multiply-add
  grep -c 'LDGSTS' kernel.sass       # cp.async (global->shared)
  grep -c 'LDG' kernel.sass          # Global load
  grep -c 'STS' kernel.sass          # Shared store
  grep -c 'LDS' kernel.sass          # Shared load
  grep -c 'BAR.SYNC' kernel.sass     # Barrier synchronization
  grep -c 'SHFL' kernel.sass         # Warp shuffle (reductions)
  grep -c 'MUFU' kernel.sass         # Special function unit
  ```
- Check stall codes on critical instrs:
  ```
  grep 'HMMA' kernel.sass | head -5   # Expect S08 minimum (hardware constraint)
  grep 'IMMA' kernel.sass | head -5   # Compiler emits S04, reducible to S02 via CuAssembler
  grep 'FFMA' kernel.sass | head -5   # Check for S04 (reducible to S01 on independent FFMAs)
  ```
- ID opt targets:
- HMMA S08: hardware min Ampere, no reduce. Focus elsewhere.
- IMMA S04: compiler conservative. CuAssembler → S02 (15-20% gain).
- FFMA S04: independent → S01 via CuAssembler.
- Excessive BAR.SYNC: over-sync between pipeline stages.
→ Instr count table + stall code summary + ID'd opt targets.
If err: Verify `cuobjdump` arch matches kernel compile target (both sm_86). Empty SASS output → cubin corrupt → recompile.
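The grep counting above as a Python sketch, handy when post-processing many kernels (the SASS fragment is a made-up illustration, not real `cuobjdump` output):

```python
import re
from collections import Counter

def count_sass_opcodes(sass_text, opcodes):
    """Count occurrences of each opcode mnemonic in SASS disassembly text
    (equivalent of the grep -c calls above)."""
    counts = Counter()
    for op in opcodes:
        counts[op] = len(re.findall(re.escape(op), sass_text))
    return counts

# Tiny made-up SASS fragment for illustration only:
sample = """
/*0040*/ LDGSTS.E.128 [R4], [R2] ;
/*0050*/ HMMA.16816.F32 R8, R16, R20, R8 ;
/*0060*/ HMMA.16816.F32 R12, R16, R22, R12 ;
/*0070*/ BAR.SYNC 0x0 ;
"""
counts = count_sass_opcodes(sample, ["HMMA.16816", "LDGSTS", "BAR.SYNC", "FFMA"])
```

`re.escape` matters: mnemonics like `HMMA.16816` contain regex metacharacters.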
Step 6: Smem Cliff
Smem usage crosses arch-specific occupancy cliff?
- Read smem/block from `--resource-usage` (Step 3) or `cuobjdump --res-usage kernel.sm_86.cubin`.
- Vs cliff:
- GA104 (sm_86): 100 KB max smem/SM. Cliff at 50 KB/block.
- Confirmed: 48 KB/block → 2 blocks/SM (good), 56 KB/block → 1 block/SM (2x regression).
- Above cliff (smem >50 KB/block):
- Blocks/SM drops to 1, active warps drop to warps_per_block (typically 4).
- 2x regression from exposed DRAM stalls.
- Double-buffering impact: Doubles smem. 30 KB current → 60 KB double-buf → crosses cliff. Eval async benefit vs occupancy loss.
- Record smem/block, blocks/SM, cliff crossed?
→ Smem/block + blocks/SM + explicit statement cliff crossed.
If err: Above cliff + occupancy bottleneck → change strategy: reduce tile so smem <50 KB, or accept 1 block/SM and compensate with higher compute/load ratio (more register reuse, longer K-tiles).
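The cliff check above as a sketch (hypothetical helper; 100 KB/SM and 50 KB cliff per GA104):

```python
def smem_cliff_check(smem_per_block_kb, smem_per_sm_kb=100):
    """GA104 smem occupancy: blocks/SM = floor(smem_per_SM / smem_per_block).
    Crossing 50 KB/block drops 2 blocks/SM -> 1 (2x regression)."""
    blocks_per_sm = smem_per_sm_kb // smem_per_block_kb
    over_cliff = smem_per_block_kb > smem_per_sm_kb // 2
    return blocks_per_sm, over_cliff

# 48 KB -> 2 blocks/SM (under cliff); 56 KB -> 1 block/SM (over cliff);
# double-buffering a 30 KB tile -> 60 KB -> crosses the cliff
```

This is the check to run before adding double-buffering: doubling smem silently crosses the cliff for any tile >25 KB.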
Step 7: Decision Matrix
Synthesize Steps 2-6 → opt strategy:
| Condition | Strategy |
|---|---|
| Memory-bound + low compute/load (<5:1) + smem under cliff | SW pipelining cp.async (LDGSTS). Overlap global loads w/ compute. |
| Memory-bound + high compute/load (>20:1) + 8+ warps | Warp interleaving already hides. Focus algorithmic: implicit GEMM, split-Q, im2col. |
| Compute-bound + FFMA-heavy | CuAssembler stall tighten: S04 → S01 on independent FFMAs. |
| Compute-bound + HMMA-heavy | S08 hardware min, no reduce. Increase tile reuse (larger M/N, longer K-loop). |
| Compute-bound + IMMA-heavy | CuAssembler: S04 → S02 on IMMA (compiler conservative). |
| Latency-bound (low occupancy) | Reduce smem/registers → more blocks/SM. >8 warps/SM. |
| Smem above cliff | Reduce tile or restructure → smem/block <50 KB (GA104). |
- Rank strategies by expected gain, via compute/load + occupancy data.
- Estimate gain range per strategy, how far from relevant ceiling.
- Flag conflicts: cp.async doubles smem (may cross cliff), larger tiles → register pressure (may reduce occupancy).
→ Ranked list recommended opts + predicted gain + conflicts.
If err: No clear winner → micro-benchmarks isolate each (cp.async alone, reduced tile alone) → measure actual pre-combine.
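A first-match walk of the matrix above as a sketch (hypothetical helper; real ranking also weighs estimated gains and the flagged conflicts):

```python
def pick_strategy(bound, ratio_class, smem_over_cliff, active_warps):
    """Return the first matching row of the decision matrix above.
    bound: 'compute'|'memory'|'latency'; ratio_class: 'high'|'medium'|'low'."""
    if smem_over_cliff:
        return "reduce tile or restructure: smem/block < 50 KB"
    if bound == "latency":
        return "reduce smem/registers for more blocks/SM, target >8 warps/SM"
    if bound == "memory" and ratio_class == "low":
        return "software pipelining via cp.async (LDGSTS)"
    if bound == "memory" and ratio_class == "high" and active_warps >= 8:
        return "algorithmic restructure (implicit GEMM, split-Q, im2col)"
    return "no clear winner: micro-benchmark each candidate in isolation"

# IGEMM-like case: memory-bound, low ratio, under cliff -> cp.async pipelining
```

The cliff check runs first because an over-cliff kernel invalidates the occupancy assumptions behind every other row.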
Step 8: Doc Findings
Structured bottleneck report:
- Baseline: kernel ms, effective GFLOPS + BW, problem dims.
- Roofline: AI, classification, attained fraction.
- Occupancy: blocks/SM, active warps/SM, limiting factor.
- Compute/load: ratio, classification, cp.async rec.
- SASS summary: instr counts, stall findings, CuAssembler targets.
- Smem cliff: smem/block, blocks/SM, status.
- Rec: ranked opt strategies + gain estimates.
## Bottleneck Analysis Report: [kernel_name]

### Baseline
- Problem: [dimensions]
- Kernel time: [X] ms
- Effective GFLOPS: [Y] | Effective BW: [Z] GB/s

### Roofline Classification
- Arithmetic intensity: [AI] FLOP/byte
- Balance point: [BP] FLOP/byte ([precision])
- Classification: **[compute|memory|latency]-bound**
- Attained fraction: [X]% of peak

### Occupancy
| Resource | Per Block | Limit/SM | Blocks/SM |
|----------|-----------|----------|-----------|
| Registers | [N]/thread | 65536 | [B] |
| Shared mem | [X] KB | 100 KB (cliff: 50 KB) | [B] |
| Warps | [W] | 48 | [B] |
| **Limiting** | | | **[min(B)]** |
- Active warps/SM: [W] ([sufficient|insufficient] for latency hiding)

### Compute/Load Ratio
- Compute ops/tile: [N] [HMMA|IMMA|FFMA]
- Load bytes/tile: [N] bytes ([N] LDG x [N] bytes)
- Ratio: [X]:1 — **[high|medium|low]**
- cp.async recommendation: [beneficial|neutral|detrimental]

### SASS Instruction Mix
| Instruction | Count | Notes |
|-------------|-------|-------|
| HMMA.16816 | [N] | Stall: S08 (hardware min) |
| IMMA.16816 | [N] | Stall: S04 (reducible to S02) |
| FFMA | [N] | Stall: S04 (reducible to S01) |
| LDG | [N] | |
| LDGSTS | [N] | cp.async |
| BAR.SYNC | [N] | |

### Smem Cliff
- Smem/block: [X] KB — [under|over] 50 KB cliff
- Blocks/SM: [B] — [no occupancy loss|occupancy halved]

### Recommended Optimizations (ranked)
1. [Strategy] — estimated [X-Y]% gain
2. [Strategy] — estimated [X-Y]% gain
3. [Strategy] — estimated [X-Y]% gain
→ Complete MD report consumable by kernel-optimizer agent or dev.
If err: Re-run different sizes (1024, 2048, 4096, 8192) → confirm not size-specific. Small may appear latency-bound when real bottleneck at scale is BW.
Check
- Baseline via CUDA events (not wall-clock)
- Roofline classification (compute/memory/latency bound)
- Occupancy + limiting factor
- Compute/load ratio/tile from SASS
- SASS instr mix + stall codes documented
- Smem cliff vs arch threshold
- Decision matrix + strategy rec
- Findings in structured report
Traps
- Re-read multiply: Direct conv2d reads weight 9x no im2col → byte count inflated 9x. Use actual unique bytes from DRAM, not total load instrs, for AI.
- Confuse FP16 TC peak w/ FP32: FP16 TC peak 174 TFLOPS, FP32 FFMA 21.7 TFLOPS — 8x diff. Wrong peak → roofline meaningless.
- Using 64 KB cliff not 50 KB GA104: GA104 (sm_86) 100 KB max smem/SM. Cliff 100/2 = 50 KB/block, not 64 KB. Arch-specific; other GPUs differ.
- Ignore warp interleaving when eval cp.async: 8 warps long compute (high compute/load) already hide DRAM via warp sched. cp.async → smem pressure + barrier overhead no benefit (Flash Attention -5%).
- Count instrs from src not SASS: Compiler may fuse, eliminate dead code, unroll differently, reorder. Always count from `cuobjdump -sass`.
- No warmup iters: 1st launch → JIT compile overhead + cold cache. Run 2-5 warmup iters pre-measurement.
→ `pipeline-gpu-kernel` — impl SW pipelining cp.async when memory-bound + low compute/load
→ `simulate-cpu-architecture` — complementary arch analysis of CPU-side bottlenecks in host-device workflows