# Agentops perf
Performance profiling, benchmarking, regression detection, and optimization. Triggers: "perf", "performance", "benchmark", "profile", "slow", "optimize", "latency", "throughput", "memory leak", "perf regression".
```bash
# Clone the full repository
git clone https://github.com/boshu2/agentops

# Or install only this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/boshu2/agentops "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/perf" ~/.claude/skills/boshu2-agentops-perf-73b9ef && rm -rf "$T"
```
## Perf Skill (`skills/perf/SKILL.md`)
Quick Ref: `/perf profile <target>` | `/perf bench <target>` | `/perf compare <baseline> <candidate>` | `/perf optimize <target>`
YOU MUST EXECUTE THIS WORKFLOW. Do not just describe it.
Performance profiling, benchmarking, regression detection, and optimization recommendations for any language runtime. Produces actionable metrics, not vague advice.
## Modes
| Mode | Command | Purpose |
|---|---|---|
| Profile | `/perf profile <target>` | Profile execution, find hotspots |
| Benchmark | `/perf bench <target>` | Create or run benchmarks |
| Compare | `/perf compare <baseline> <candidate>` | Compare two runs for regression |
| Optimize | `/perf optimize <target>` | Analyze and apply optimizations |
If no mode is specified, default to profile.
## Step 0: Detect Language and Tooling
Identify the language/runtime from file extensions, `go.mod`, `package.json`, `pyproject.toml`, `Cargo.toml`, or explicit user input. Select the profiling stack:
| Language | Benchmarking | CPU Profile | Memory Profile | Comparison |
|---|---|---|---|---|
| Go | `go test -bench` | `pprof` (cpu) | `pprof` (alloc) | `benchstat` |
| Python | `pytest-benchmark`, `timeit` | `cProfile`, `py-spy` | `memory_profiler`, `tracemalloc` | manual diff |
| Node | `tinybench`, `benchmark.js` | `node --cpu-prof`, `0x` | `node --heap-prof`, `clinic heapprofiler` | manual diff |
| Rust | `cargo bench`, `criterion` | `cargo flamegraph` | `dhat`, `valgrind massif` | `critcmp` |
| Shell | `hyperfine` | `time`, `strace -c` | N/A | built-in |
Check which tools are actually installed. If a preferred tool is missing, fall back to standard-library alternatives before asking the user to install anything.
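To make the fallback rule concrete, here is a minimal Go sketch of the detection idea; the `pickTool` helper and the specific tool pairings are illustrative, not part of the skill:

```go
package main

import (
	"fmt"
	"os/exec"
)

// pickTool returns the preferred tool if it is on PATH, otherwise the fallback.
func pickTool(preferred, fallback string) string {
	if _, err := exec.LookPath(preferred); err == nil {
		return preferred
	}
	return fallback
}

func main() {
	fmt.Println("comparison tool:", pickTool("benchstat", "manual diff"))
	fmt.Println("CLI benchmark tool:", pickTool("hyperfine", "time"))
}
```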
## Step 1: Establish Baseline
Run existing benchmarks first. If none exist, create them.
### 1a. Find Existing Benchmarks
```bash
# Go
grep -r "func Benchmark" --include="*_test.go" -l .
# Python
find . -name "test_*" -exec grep -l "benchmark\|@pytest.mark.benchmark" {} +
# Rust
grep -r "#\[bench\]" --include="*.rs" -l .
# Node
find . -name "*.bench.*" -o -name "*.benchmark.*"
```
### 1b. Run or Create Benchmarks
If benchmarks exist for the target, run them and capture output. If none exist, write benchmarks covering the target function or module.
Benchmark requirements:
- Measure wall-clock time (ops/sec or ns/op)
- Measure memory allocations (bytes/op, allocs/op)
- Run enough iterations for statistical stability (Go: `-benchtime=3s -count=5`)
- Record latency percentiles where applicable: p50, p95, p99
Save raw baseline output to `.agents/perf/baseline-YYYY-MM-DD.txt`.
### 1c. Go Benchmark Template

```go
func BenchmarkTargetFunction(b *testing.B) {
	// Setup outside the loop
	input := prepareInput()
	b.ResetTimer()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		TargetFunction(input)
	}
}
```
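The requirements above also ask for latency percentiles. One way to surface them from a Go benchmark is `testing.B.ReportMetric`; the sketch below reuses the same `prepareInput`/`TargetFunction` placeholders and needs the `sort` and `time` imports in the test file:

```go
func BenchmarkTargetLatency(b *testing.B) {
	input := prepareInput()
	latencies := make([]time.Duration, 0, b.N)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		start := time.Now() // per-call timing adds a little overhead; acceptable for a sketch
		TargetFunction(input)
		latencies = append(latencies, time.Since(start))
	}
	b.StopTimer()
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	pct := func(p float64) float64 { // nearest-rank percentile over the sorted samples
		return float64(latencies[int(p*float64(len(latencies)-1))])
	}
	b.ReportMetric(pct(0.50), "p50-ns")
	b.ReportMetric(pct(0.95), "p95-ns")
	b.ReportMetric(pct(0.99), "p99-ns")
}
```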
### 1d. Python Benchmark Template

```python
import pytest

@pytest.mark.benchmark(group="target")
def test_target_benchmark(benchmark):
    input_data = prepare_input()
    result = benchmark(target_function, input_data)
    assert result is not None
```
## Step 2: Profile and Identify Hotspots

### CPU Profiling
Find functions consuming the most CPU time.
Go:

```bash
go test -bench=BenchmarkTarget -cpuprofile=cpu.prof ./...
go tool pprof -top cpu.prof
go tool pprof -text -cum cpu.prof  # cumulative view
```
Python:

```bash
python -m cProfile -s cumulative target_script.py
# Or for running processes:
py-spy top --pid <PID>
py-spy record -o profile.svg --pid <PID>
```
### Memory Profiling
Find allocation hotspots and potential leaks.
Go:

```bash
go test -bench=BenchmarkTarget -memprofile=mem.prof ./...
go tool pprof -top -alloc_space mem.prof
```
Python:

```bash
python -m memory_profiler target_script.py
# Or with tracemalloc in code:
# tracemalloc.start(); ...; snapshot = tracemalloc.take_snapshot()
```
### I/O Profiling
Identify blocking operations in hot paths.
- Check for synchronous file I/O, network calls, or database queries inside loops.
- Look for missing connection pooling, unbuffered writes, or serial HTTP calls that could be concurrent.
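As an illustration of the last point, here is a hedged Go sketch of turning serial HTTP calls into bounded concurrent ones; `fetchSize` and the example URLs are placeholders, not code from this repo:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// fetchSize downloads one URL and returns the body length (placeholder work).
func fetchSize(url string) (int64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return io.Copy(io.Discard, resp.Body)
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b"}

	// Anti-pattern: each request blocks the loop, so total time is the sum of all calls.
	for _, u := range urls {
		if n, err := fetchSize(u); err == nil {
			fmt.Println("serial", u, n)
		}
	}

	// Better: independent requests run concurrently, bounded by a small semaphore.
	sem := make(chan struct{}, 4)
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			if n, err := fetchSize(u); err == nil {
				fmt.Println("concurrent", u, n)
			}
		}(u)
	}
	wg.Wait()
}
```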
### Hotspot Summary
After profiling, produce a ranked list:
```
HOTSPOTS (by cumulative CPU time):
1. pkg/engine.Process      42.3% (1.2s)  — main processing loop
2. pkg/engine.parseRecord  28.1% (0.8s)  — record deserialization
3. pkg/io.ReadBatch        15.7% (0.45s) — disk reads
```
## Step 3: Analyze and Recommend
Classify each finding by estimated impact:
| Impact | Criteria | Action |
|---|---|---|
| High | >20% of total time or >50% of allocations | Fix immediately |
| Medium | 5-20% of total time or notable allocation waste | Fix in this session |
| Low | <5% of total time, minor inefficiency | Log for later |
### Common Anti-Patterns
Check the profiled code against these known performance killers:
- Unnecessary allocations — allocating inside hot loops, string concatenation in loops (use `strings.Builder`/`[]byte`/`io.StringWriter`; see the Go sketch after this list)
- N+1 queries — database call per item instead of batch query
- Missing caching — recomputing expensive results that rarely change
- Blocking I/O in hot path — synchronous network/disk calls where async or buffered I/O would work
- Excessive copying — passing large structs by value, copying slices instead of slicing
- Suboptimal data structures — linear search where a map lookup works, unbounded slice growth without pre-allocation
- Lock contention — mutex held across I/O or long computation
- Regex compilation in loops — compile once, reuse the compiled pattern
- Reflection in hot paths — replace with code generation or type switches
- Unbuffered channels — causing goroutine scheduling overhead in Go
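Two of these patterns appear side by side in the hedged Go sketch below (the `JoinIDs` functions and the regex are illustrative): string concatenation plus per-iteration regex compilation, and the fix using a package-level compiled pattern and `strings.Builder`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Anti-pattern: recompiles the regex and concatenates strings inside the loop.
func slowJoinIDs(lines []string) string {
	out := ""
	for _, line := range lines {
		re := regexp.MustCompile(`id=(\d+)`) // compiled on every iteration
		if m := re.FindStringSubmatch(line); m != nil {
			out += m[1] + "," // allocates a new string each time
		}
	}
	return out
}

// Fix: compile once and reuse; accumulate output with strings.Builder.
var idRe = regexp.MustCompile(`id=(\d+)`)

func fastJoinIDs(lines []string) string {
	var b strings.Builder
	b.Grow(len(lines) * 8) // rough pre-allocation
	for _, line := range lines {
		if m := idRe.FindStringSubmatch(line); m != nil {
			b.WriteString(m[1])
			b.WriteByte(',')
		}
	}
	return b.String()
}

func main() {
	lines := []string{"id=12 ok", "id=7 ok", "no match"}
	fmt.Println(slowJoinIDs(lines))
	fmt.Println(fastJoinIDs(lines))
}
```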
For each finding, state:
- What: the specific code location and pattern
- Why: how it hurts performance (with numbers from profiling)
- Fix: concrete code change recommendation
## Step 4: Optimize (optimize mode only)
Critical rule: ONE optimization at a time.
For each optimization:
- Describe the change before making it
- Apply the single change
- Re-run the benchmark suite
- Compare results against baseline using `benchstat` (Go) or manual diff
- Keep or revert — only keep changes that measurably improve metrics
- Commit with message format: `perf(<scope>): <description> (+X% throughput)` or `perf(<scope>): <description> (-X% latency)`
### Acceptance Criteria
- Improvement must be statistically significant (p < 0.05 for `benchstat`, or >5% consistent change for manual comparison)
- No correctness regressions — all existing tests must still pass
- No readability destruction for marginal gains (<2% improvement does not justify obfuscated code)
### Optimization Order
Apply optimizations in this order (highest expected impact first):
- Algorithmic improvements (O(n^2) to O(n log n), etc.)
- Allocation reduction (pre-allocate, pool, reuse buffers)
- I/O batching and buffering
- Caching and memoization
- Concurrency improvements (parallelize independent work)
- Micro-optimizations (only if profiling confirms they matter)
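For the allocation-reduction step, here is a minimal Go sketch of pre-allocating result slices and reusing scratch buffers through `sync.Pool`; `processRecord` is a hypothetical hot-path function, not part of this skill:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// Reusable scratch buffers: avoids one heap allocation per call in a hot path.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func processRecord(rec []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()
	buf.Write(rec)
	buf.WriteString(":done")
	return buf.String()
}

func main() {
	records := [][]byte{[]byte("a"), []byte("b"), []byte("c")}
	// Pre-allocate the result slice: the final length is known up front.
	out := make([]string, 0, len(records))
	for _, r := range records {
		out = append(out, processRecord(r))
	}
	fmt.Println(out)
}
```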
## Step 5: Output Report
Write the report to `.agents/perf/YYYY-MM-DD-perf-<target>.md`.
### Report Template

```markdown
# Performance Report: <target>

Date: YYYY-MM-DD
Mode: <profile|bench|compare|optimize>
Language: <detected>

## Summary
<1-2 sentence summary of findings>

## Baseline Metrics
| Metric | Value |
|--------|-------|
| ops/sec | ... |
| ns/op | ... |
| B/op | ... |
| allocs/op | ... |
| p50 latency | ... |
| p95 latency | ... |
| p99 latency | ... |

## Hotspots
<ranked list from Step 2>

## Findings
<classified findings from Step 3>

## Optimizations Applied (if optimize mode)
| Change | Before | After | Improvement |
|--------|--------|-------|-------------|
| ... | ... | ... | +X% |

## After Metrics (if optimize mode)
<same table as baseline, with new values>

## Recommendations
<remaining opportunities not addressed in this session>
```
## Compare Mode Details

When running `/perf compare <baseline> <candidate>`:
- Locate or re-run benchmarks for both versions
- Use language-native comparison tools:
  - Go: `benchstat baseline.txt candidate.txt`
  - Rust: `critcmp baseline candidate`
  - Other: side-by-side table with percentage deltas
- Flag regressions (>5% slower or >10% more allocations) as REGRESSION
- Flag improvements (>5% faster or >10% fewer allocations) as IMPROVEMENT
- Flag statistically insignificant changes as NOISE
Output a summary table:
```
COMPARISON: baseline vs candidate

| Benchmark        | Baseline | Candidate | Delta | Verdict     |
|------------------|----------|-----------|-------|-------------|
| BenchmarkProcess | 1.2ms    | 0.9ms     | -25%  | IMPROVEMENT |
| BenchmarkParse   | 450ns    | 480ns     | +6.7% | REGRESSION  |
| BenchmarkIO      | 3.1ms    | 3.0ms     | -3.2% | NOISE       |
```
## Edge Cases

- No benchmarks and no clear target: Run `/complexity` first to identify hot paths, then benchmark those.
- Flaky benchmarks: Increase iteration count, pin to a single core (`GOMAXPROCS=1`), close competing processes.
- Cannot install profiling tools: Fall back to `time` for wall-clock and manual instrumentation for allocation counts.
- Target is a CLI command: Use `hyperfine` for wall-clock benchmarking across any language.
## See Also
- complexity — Find high-complexity code to target
- standards — Language-specific optimization patterns
- vibe — Validate optimized code quality