Claude-code-skills ln-811-performance-profiler

Profiles runtime performance with CPU, memory, and I/O metrics. Use when measuring bottlenecks before optimization.

install

- source · Clone the upstream repo:
  `git clone https://github.com/levnikolaevich/claude-code-skills`
- Claude Code · Install into `~/.claude/skills/`:
  `T=$(mktemp -d) && git clone --depth=1 https://github.com/levnikolaevich/claude-code-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills-catalog/ln-811-performance-profiler" ~/.claude/skills/levnikolaevich-claude-code-skills-ln-811-performance-profiler && rm -rf "$T"`

manifest: `skills-catalog/ln-811-performance-profiler/SKILL.md`

Paths: File paths (`shared/`, `references/`, `../ln-*`) are relative to the skills repo root. If not found at CWD, locate this SKILL.md directory and go up one level for the repo root. If `shared/` is missing, fetch files via WebFetch from `https://raw.githubusercontent.com/levnikolaevich/claude-code-skills/master/skills/{path}`.

ln-811-performance-profiler

Type: L3 Worker · Category: 8XX Optimization

Runtime profiler that executes the optimization target, measures multiple metrics (CPU, memory, I/O, time), instruments code for per-function breakdown, and produces a standardized performance map from real data.


Overview

| Aspect | Details |
|---|---|
| Input | Problem statement: target (file/endpoint/pipeline) + observed metric |
| Output | Performance map (multi-metric, per-function), suspicion stack, bottleneck classification |
| Pattern | Discover test → Baseline run → Static analysis → Deep profile → Performance map → Report |

Workflow

Phases: Test Discovery → Baseline Run → Static Analysis → Deep Profile → Performance Map → Report


Phase 0: Test Discovery/Creation

MANDATORY READ: Load `shared/references/ci_tool_detection.md` for test framework detection. MANDATORY READ: Load `shared/references/benchmark_generation.md` for auto-generating benchmarks when none exist.

Find or create commands that exercise the optimization target. Two outputs: `test_command` (profiling/measurement) and `e2e_test_command` (functional safety gate).

Step 1: Discover `test_command`

| Priority | Method | Action |
|---|---|---|
| 1 | User-provided | User specifies test command or API endpoint |
| 2 | Discover existing E2E test | Grep test files for target entry point (stop at first match) |
| 3 | Create test script | Generate per `shared/references/benchmark_generation.md` to `.hex-skills/optimization/{slug}/profile_test.sh` |

E2E discovery protocol (stop at first match):

| Priority | Method | How |
|---|---|---|
| 1 | Route-based search | Grep e2e/integration test files for entry point route |
| 2 | Function-based search | Grep for entry point function name |
| 3 | Module-based search | Grep for import of entry point module |

Test creation (if no existing test found):

| Target Type | Generated Script |
|---|---|
| API endpoint | `curl -w "%{time_total}" -o /dev/null -s {endpoint}` |
| Function | Stack-specific benchmark per `shared/references/benchmark_generation.md` |
| Pipeline | Full pipeline invocation with test input |

Step 2: Discover `e2e_test_command`

If `test_command` came from E2E discovery (Step 1, priority 2): `e2e_test_command = test_command`.

Otherwise, run the E2E discovery protocol again (same 3-priority table) to find a separate functional safety test.

If not found: `e2e_test_command = null`, and log: `WARNING: No e2e test covers {entry_point}. Full test suite serves as functional gate.`

Output

| Field | Description |
|---|---|
| `test_command` | Command for profiling/measurement |
| `e2e_test_command` | Command for functional safety gate (may equal `test_command`, or null) |
| `e2e_test_source` | Discovery method: user / route / function / module / none |

Phase 1: Baseline Run (Multi-Metric)

Run `test_command` with system-level profiling. Capture simultaneously:

| Metric | How to Capture | When |
|---|---|---|
| Wall time | `time` wrapper or test harness | Always |
| CPU time (user+sys) | `/usr/bin/time -v` or language profiler | Always |
| Memory peak (RSS) | `/usr/bin/time -v` (Max RSS) or `tracemalloc` / `process.memoryUsage()` | Always |
| I/O bytes | `/usr/bin/time -v` or structured logs | If I/O suspected |
| HTTP round-trips | Count from structured logs or application metrics | If network I/O in call graph |
| GPU utilization | `nvidia-smi --query-gpu` | Only if CUDA/GPU detected in stack |

Baseline Protocol

| Parameter | Value |
|---|---|
| Runs | 3 |
| Metric | Median |
| Warm-up | 1 discarded run |
| Output | `baseline` — multi-metric snapshot |
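The baseline protocol (one discarded warm-up run, three measured runs, median) might look like the following sketch. It captures only wall time; the other metrics would come from wrapping the command in `/usr/bin/time -v` or a language profiler as described above.

```python
import statistics
import subprocess
import time

def run_baseline(test_command, runs=3, warmup=1):
    """Run test_command warmup+runs times; return the median wall time
    (ms) over the measured runs, per the baseline protocol."""
    samples = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        subprocess.run(test_command, shell=True, check=True,
                       capture_output=True)
        elapsed_ms = (time.perf_counter() - start) * 1000
        if i >= warmup:  # discard warm-up runs
            samples.append(elapsed_ms)
    return {"wall_time_ms": statistics.median(samples), "samples": samples}
```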

Monitor for Profiler Runs (Claude Code 2.1.98+)

MANDATORY READ: Load `shared/references/monitor_integration_pattern.md`

During baseline and deep profile runs:

`Monitor(command="{test_command} 2>&1", timeout_ms=300000, description="profiler run N")`

Detect crashes or infinite loops immediately. Fallback: `Bash` with `timeout`.


Phase 2: Static Analysis → Instrumentation Points

MANDATORY READ: Load `bottleneck_classification.md`

Trace the call chain from the code and build a suspicion stack. Purpose: guide WHERE to instrument in Phase 3.

Step 1: Trace Call Chain

Starting from entry point, trace depth-first (max depth 5). At each step, READ the full function body.

Cross-service tracing: If `service_topology` is available from the coordinator and a step makes an HTTP/gRPC call to another service whose code is accessible:

| Situation | Action |
|---|---|
| HTTP call to service with code in submodule/monorepo | Follow into that service's handler: resolve route → trace handler code (depth resets to 0 for the new service) |
| HTTP call to service without accessible code | Classify as External, record latency estimate |
| gRPC/message queue to known service | Same as HTTP — follow into handler if code accessible |

Record `service: "{service_name}"` on each step to track which service owns it. The performance_map `steps` tree can span multiple services.

Depth-First Rule: If code of the called service is accessible — ALWAYS profile INSIDE. NEVER classify an accessible service as "External/slow" without profiling its internals. "Slow" is a symptom, not a diagnosis.

5 Whys for each bottleneck: Before reporting a bottleneck, chain "why?" until you reach config/architecture level:

  1. "What is slow?" → alignment service (5.9s) 2. "Why?" → 6 pairs × ~1s each 3. "Why ~1s per pair?" → O(n²) mwmf computation 4. "Why O(n²)?" → library default, not production config 5. "Why default?" →
    matching_methods
    not configured → root cause = config

Step 2: Classify & Suspicion Scan

For each step, classify by type (CPU, I/O-DB, I/O-Network, I/O-File, Architecture, External, Cache) and scan for performance concerns.

Suspicion checklist (a minimum, not a limit):

| Category | What to Look For |
|---|---|
| Connection management | Client created per-request? Missing pooling? Missing reuse? |
| Data flow | Data read multiple times? Over-fetching? Unnecessary transforms? |
| Async patterns | Sync I/O in async context? Sequential awaits without data dependency? |
| Resource lifecycle | Unclosed connections? Temp files? Memory accumulation in a loop? |
| Configuration | Hardcoded timeouts? Default pool sizes? Missing batch size config? |
| Redundant work | Same validation at multiple layers? Same data loaded twice? |
| Architecture | N+1 in a loop? Batch API unused? Cache infra unused? Sequential-when-parallel? |
| (open) | Anything else spotted — the checklist does not limit findings |

Step 2b: Suspicion Deduplication

MANDATORY READ: Load `shared/references/output_normalization.md`

After generating suspicions across all call chain steps, normalize and deduplicate per §1-§2:

- Normalize suspicion descriptions (replace specific values with placeholders)
- Group identical suspicions across different steps → merge into a single entry with `affected_steps: [list]`
- Example: "Missing connection pooling" found in steps 1.1, 1.2, 1.3 → one suspicion with `affected_steps: ["1.1", "1.2", "1.3"]`

Step 3: Verify & Map to Instrumentation Points

FOR each suspicion:
  1. VERIFY: follow code to confirm or dismiss
  2. VERDICT: CONFIRMED → map to instrumentation point | DISMISSED → log reason
  3. For each CONFIRMED suspicion, identify:
     - function to wrap with timing
     - I/O call to count
     - memory allocation to track

Profiler Selection (per stack)

| Stack | Non-invasive profiler | Invasive (if non-invasive insufficient) |
|---|---|---|
| Python | `py-spy`, `cProfile` | `time.perf_counter()` decorators |
| Node.js | `clinic`, `--prof` | `console.time()` wrappers |
| Go | `pprof` (built-in) | Usually not needed |
| .NET | `dotnet-trace` | `Stopwatch` wrappers |
| Rust | `cargo flamegraph` | `std::time::Instant` |

Stack detection: per `shared/references/ci_tool_detection.md`.
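For the Python row of the table, a non-invasive first pass can wrap the target in the stdlib `cProfile` and sort by cumulative time. Illustrative only: a real run would profile the whole `test_command` process (e.g. via `py-spy` attach), not an in-process callable.

```python
import cProfile
import io
import pstats

def profile_function(fn, *args, **kwargs):
    """Run fn under cProfile; return (result, top-10 cumulative report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()
```

`py-spy` can attach to a running process with no code changes; `cProfile` requires launching the target under the profiler but ships with the standard library.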


Phase 3: Deep Profile

Profiler Hierarchy (escalate as needed)

| Level | Tool Examples | What It Shows | When to Use |
|---|---|---|---|
| 1 | `py-spy`, `cProfile`, `pprof`, `dotnet-trace` | Function-level hotspots | Always — first pass |
| 2 | `line_profiler`, per-line timing | Line-level timing in hotspot function | Hotspot function found but cause unclear |
| 3 | `tracemalloc`, `memory_profiler` | Per-line memory allocation | Memory metrics abnormal in baseline |

Step 1: Non-Invasive Profiling (preferred)

Run `test_command` with a Level 1 profiler to get a per-function breakdown without code changes.

Step 2: Escalation Decision

After Level 1 profiler run, evaluate result against suspicion stack from Phase 2:

| Profiler Result | Action |
|---|---|
| Hotspot function identified, time breakdown confirms suspicions | DONE — proceed to Phase 4 |
| Hotspot identified but internal cause unclear (CPU vs I/O inside one function) | Escalate to Level 2 (line-level timing) |
| Memory baseline abnormal (peak or delta) | Escalate to Level 3 (memory profiler) |
| Multiple suspicions unresolved — profiler granularity insufficient | Go to Step 3 (targeted instrumentation) |
| Profiler unavailable or overhead > 20% of wall time | Go to Step 3 (targeted instrumentation) |

Stop Conditions (Profiler Escalation)

| Condition | Action |
|---|---|
| Hotspot identified with clear cause | STOP — proceed to Performance Map |
| All 3 profiler levels exhausted | STOP — build map from best available data |
| Instrumentation breaks tests | STOP — revert instrumentation, use non-invasive data only |
| Profiler overhead > 20% of wall time | STOP — skip to targeted instrumentation |

Step 3: Targeted Instrumentation (proactive)

Add timing/logging along the call stack at instrumentation points identified in Phase 2 Step 3:

1. FOR each CONFIRMED suspicion without measured data:
     Add timing wrapper around target function/I/O call
     Add counter for I/O round-trips if network/DB suspected
     (cross-service: instrument in the correct service's codebase)
2. Re-run test_command (3 runs, median)
3. Collect per-function measurements from logs
4. Record list of instrumented files (may span multiple services)

| Instrumentation Type | When | Example |
|---|---|---|
| Timing wrapper | Always for unresolved suspicions | `time.perf_counter()` around function call |
| I/O call counter | Network or DB bottleneck suspected | Count HTTP requests, DB queries in loop |
| Memory snapshot | Memory accumulation suspected | `tracemalloc.get_traced_memory()` before/after |

KEEP instrumentation in place. The executor reuses it for post-optimization per-function comparison, then cleans up after strike. Report `instrumented_files` in output.
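An invasive timing wrapper of the kind the table describes might look like this sketch. The `[profile]` log format is an assumption, chosen so per-function timings can be diffed from logs after optimization.

```python
import functools
import time
import tracemalloc

def timed(label, log=print):
    """Decorator: log wall time and peak traced memory for one call."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            tracemalloc.start()
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                _, peak = tracemalloc.get_traced_memory()
                tracemalloc.stop()
                log(f"[profile] {label} wall_ms={elapsed_ms:.1f} "
                    f"mem_peak_bytes={peak}")
        return wrapper
    return decorate
```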


Phase 4: Build Performance Map

Standardized format — feeds into `.hex-skills/optimization/{slug}/context.md` for downstream consumption.

```yaml
performance_map:
  test_command: "uv run pytest tests/automated/e2e/test_example.py -s"
  baseline:
    wall_time_ms: 7280
    cpu_time_ms: 850
    memory_peak_mb: 256
    memory_delta_mb: 45
    io_read_bytes: 1200000
    io_write_bytes: 500000
    http_round_trips: 13
  steps:                          # service field present only in multi-service topology
    - id: "1"
      function: "process_job"
      location: "app/services/job_processor.py:45"
      service: "api"              # optional — which service owns this step
      wall_time_ms: 7200
      time_share_pct: 99
      type: "function_call"
      children:
        - id: "1.1"
          function: "translate_binary"
          wall_time_ms: 7100
          type: "function_call"
          children:
            - id: "1.1.1"
              function: "tikal_extract"
              service: "tikal"    # cross-service: code traced into submodule
              wall_time_ms: 2800
              type: "http_call"
              http_round_trips: 1
            - id: "1.1.2"
              function: "mt_translate"
              service: "mt-engine"
              wall_time_ms: 3500
              type: "http_call"
              http_round_trips: 13
  bottleneck_classification: "I/O-Network"
  bottleneck_detail: "13 sequential HTTP calls to MT service (3500ms)"
  top_bottlenecks:
    - { step: "1.1.2", type: "I/O-Network", share: "48%" }
    - { step: "1.1.1", type: "I/O-Network", share: "38%" }
```

Phase 5: Report

Report Structure

```yaml
profile_result:
  entry_point_info:
    type: <string>                     # "api_endpoint" | "function" | "pipeline"
    location: <string>                 # file:line
    route: <string|null>               # API route (if endpoint)
    function: <string>                 # Entry point function name
  performance_map: <object>            # Full map from Phase 4
  bottleneck_classification: <string>  # Primary bottleneck type
  bottleneck_detail: <string>          # Human-readable description
  top_bottlenecks:
    - step, type, share, description
  optimization_hints:                  # CONFIRMED suspicions only (Phase 2)
    - hint with evidence
  suspicion_stack:                     # Full audit trail (confirmed + dismissed)
    - category: <string>
      location: <string>
      description: <string>
      verdict: <string>               # "confirmed" | "dismissed"
      evidence: <string>
      verification_note: <string>
  e2e_test:
    command: <string|null>             # E2E safety test command (from Phase 0)
    source: <string>                   # user / route / function / module / none
  instrumented_files: [<string>]       # Files with active instrumentation (empty if non-invasive only)
  wrong_tool_indicators: []            # Empty = proceed, non-empty = exit
```

Wrong Tool Indicators

| Indicator | Condition |
|---|---|
| `external_service_no_alternative` | 90%+ measured time in external service, no batch/cache/parallel path |
| `within_industry_norm` | Measured time within expected range for operation type |
| `infrastructure_bound` | Bottleneck is hardware (measured via system metrics) |
| `already_optimized` | Code already uses best patterns (confirmed by suspicion scan) |
Error Handling

| Error | Recovery |
|---|---|
| Cannot resolve entry point | Block: "file/function not found at {path}" |
| Test command fails on unmodified code | Block: "test fails before profiling — fix test first" |
| Profiler not available for stack | Fall back to invasive instrumentation (Phase 3 Step 3) |
| Instrumentation breaks tests | Revert immediately: `git checkout -- .` |
| Call chain too deep (> 5 levels) | Stop at depth 5, note truncation |
| Cannot classify step type | Default to "Unknown", use measured time |
| No I/O detected (pure CPU) | Classify as CPU, focus on algorithm profiling |

References

- `bottleneck_classification.md` — classification taxonomy
- `latency_estimation.md` — latency heuristics (fallback for static-only mode)
- `shared/references/ci_tool_detection.md` — stack/tool detection
- `shared/references/benchmark_generation.md` — benchmark templates per stack

Runtime Summary Artifact

MANDATORY READ: Load `shared/references/coordinator_summary_contract.md`

Emit an `optimization-worker` summary envelope.

Managed mode:

- `ln-810` passes a deterministic `runId` and the exact `summaryArtifactPath`
- write the summary to the provided `summaryArtifactPath`

Standalone mode:

- omit `runId` and `summaryArtifactPath`
- write `.hex-skills/runtime-artifacts/runs/{run_id}/optimization-worker/ln-811--{identifier}.json`

Definition of Done

  • Test command discovered or created for optimization target
  • E2E safety test discovered (or documented as unavailable)
  • Baseline measured: wall time, CPU, memory (3 runs, median)
  • Call graph traced and function bodies read
  • Suspicion stack built: each suspicion verified and mapped to instrumentation point
  • Deep profile completed (non-invasive preferred, invasive if needed)
  • Instrumented files reported (cleanup deferred to executor)
  • Performance map built in standardized format (real measurements)
  • Top 3 bottlenecks identified from measured data
  • Wrong tool indicators evaluated from real metrics
  • optimization_hints contain only CONFIRMED suspicions with measurement evidence
  • Report prepared with measured findings
  • Optimization profile artifact written to the shared location

Version: 3.0.0 Last Updated: 2026-03-15