Install
source · Clone the upstream repo
git clone https://github.com/Intense-Visions/harness-engineering
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/claude-code/otel-performance-insights" ~/.claude/skills/intense-visions-harness-engineering-otel-performance-insights && rm -rf "$T"
manifest:
agents/skills/claude-code/otel-performance-insights/SKILL.md
OpenTelemetry Performance Insights
Identify performance bottlenecks using trace analysis, histogram metrics, and span timing patterns
When to Use
- Debugging slow API responses — which operation in the chain is the bottleneck
- Setting and monitoring SLOs for latency (p50, p95, p99)
- Profiling database query performance across the application
- Comparing performance before and after a change (deployment comparison)
Instructions
- Use histogram metrics for latency tracking. Record duration of HTTP requests, database queries, and business operations.
- Add span attributes that enable dimensional analysis: endpoint, method, status, operation type.
- Instrument N+1 query detection by counting database spans per parent span.
- Track queue wait time separately from processing time (see the sketch right after this list).
- Set up SLO-based alerts on p99 latency, not just average.
- Use span links to connect batch processing back to triggering requests (sketched after the code examples below).
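First, a minimal sketch of the queue-wait split described above; the queue name, Job shape, and enqueuedAt stamp are assumptions for illustration, not part of the examples that follow.

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('order-service');

// Hypothetical job shape: the producer stamps each job with its enqueue time.
interface Job {
  enqueuedAt: number; // epoch ms, recorded by the producer
  run: () => Promise<void>;
}

const queueWait = meter.createHistogram('queue.wait.duration', {
  description: 'Time a job spent waiting in the queue before a worker picked it up',
  unit: 'ms',
});
const queueProcess = meter.createHistogram('queue.process.duration', {
  description: 'Time a worker spent actually processing the job',
  unit: 'ms',
});

async function handleJob(job: Job) {
  // Wait time: now minus when the producer enqueued the job
  queueWait.record(Date.now() - job.enqueuedAt, { 'queue.name': 'orders' });
  const start = performance.now();
  try {
    await job.run();
  } finally {
    // Processing time, recorded separately so the two can diverge in dashboards
    queueProcess.record(performance.now() - start, { 'queue.name': 'orders' });
  }
}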
// Performance-instrumented service
import { trace, metrics } from '@opentelemetry/api';
import type { NextFunction, Request, Response } from 'express';
import { db } from './db'; // the application's database client (import path is illustrative)

const meter = metrics.getMeter('order-service');

const requestDuration = meter.createHistogram('http.server.request.duration', {
  description: 'HTTP request duration',
  unit: 'ms',
});
const dbQueryDuration = meter.createHistogram('db.query.duration', {
  description: 'Database query duration',
  unit: 'ms',
});
const dbQueryCounter = meter.createCounter('db.query.count', {
  description: 'Number of database queries per request',
  unit: '1',
});

// Middleware that tracks request performance.
// Note: patching the shared db.query per request is illustrative only; under
// concurrent requests the per-request counters would interleave.
export async function performanceMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = performance.now();
  const parentSpan = trace.getActiveSpan();

  // Track DB query count for N+1 detection
  let queryCount = 0;
  const originalQuery = db.query;
  db.query = async (...args: any[]) => {
    queryCount++;
    const qStart = performance.now();
    try {
      return await originalQuery.apply(db, args);
    } finally {
      dbQueryDuration.record(performance.now() - qStart, {
        'db.operation': args[0]?.split(' ')[0] || 'unknown',
      });
    }
  };

  res.on('finish', () => {
    const duration = performance.now() - start;
    const route = req.route?.path || req.path;

    requestDuration.record(duration, {
      'http.method': req.method,
      'http.route': route,
      'http.status_code': res.statusCode,
    });
    dbQueryCounter.add(queryCount, { 'http.route': route });

    // Flag potential N+1 queries
    if (queryCount > 10 && parentSpan) {
      parentSpan.addEvent('performance.warning', {
        'warning.type': 'n_plus_one',
        'db.query_count': queryCount,
        'http.route': route,
      });
    }

    db.query = originalQuery;
  });

  next();
}
// SLO monitoring with histograms
const sloLatencyTarget = 500; // 500 ms target

// In dashboard queries (PromQL example):
// Error budget:
//   1 - (histogram_quantile(0.99, rate(http_server_request_duration_bucket[5m])) / 500)
// SLO compliance:
//   sum(rate(http_server_request_duration_bucket{le="500"}[5m]))
//   / sum(rate(http_server_request_duration_count[5m]))
// Waterfall analysis helper — log span timing breakdown
import type { ReadableSpan } from '@opentelemetry/sdk-trace-base';

function analyzeTrace(spans: ReadableSpan[]): void {
  const root = spans.find((s) => !s.parentSpanId);
  if (!root) return;

  // ReadableSpan.duration is an HrTime tuple: [seconds, nanoseconds]
  const toMs = (d: [number, number]) => d[0] * 1000 + d[1] / 1e6;
  const totalMs = toMs(root.duration);

  const breakdown = spans
    .filter((s) => s !== root)
    .map((s) => ({
      name: s.name,
      duration: toMs(s.duration),
      percentage: ((toMs(s.duration) / totalMs) * 100).toFixed(1),
    }))
    .sort((a, b) => b.duration - a.duration);

  console.table(breakdown); // Shows which spans consume the most time in the request
}
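And a minimal sketch of the span-links instruction above; the QueuedItem shape, and the assumption that producers capture a span context at enqueue time, are illustrative.

import { trace } from '@opentelemetry/api';
import type { SpanContext } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

// Each queued item carries the span context captured when it was enqueued,
// e.g. via trace.getActiveSpan()?.spanContext() in the producer (assumed shape).
interface QueuedItem {
  payload: unknown;
  origin: SpanContext;
}

function processBatch(items: QueuedItem[]) {
  // Links tie the batch span back to every triggering request without
  // pretending the batch is a child of any single one of them.
  const span = tracer.startSpan('batch.process', {
    links: items.map((item) => ({ context: item.origin })),
  });
  try {
    // ... process items ...
  } finally {
    span.end();
  }
}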
Details
Key performance metrics to track:
| Metric | Type | Purpose |
|---|---|---|
| `http.server.request.duration` | Histogram | Overall API latency |
| `db.query.duration` | Histogram | Database performance |
| `http.client.request.duration` | Histogram | Outgoing call latency |
| `db.query.count` | Counter | N+1 query detection |
| `http.server.active_requests` | UpDownCounter | Concurrency tracking |
Percentile analysis: Average latency hides outliers. Always track p50 (median), p95 (most users), and p99 (worst case):
- p50 = 50ms, p99 = 500ms — tail latency problem (some requests hit a slow path)
- p50 = 200ms, p99 = 220ms — consistently slow (all requests are slow)
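These read straight off the request-duration histogram defined earlier; a PromQL sketch in the same comment style as the SLO example above (the 5m window is an arbitrary choice):

// Percentile queries over http.server.request.duration (PromQL):
// p50: histogram_quantile(0.50, rate(http_server_request_duration_bucket[5m]))
// p95: histogram_quantile(0.95, rate(http_server_request_duration_bucket[5m]))
// p99: histogram_quantile(0.99, rate(http_server_request_duration_bucket[5m]))
// A wide p50-to-p99 gap points at a slow path; a narrow gap at uniformly slow code.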
Trace-based analysis pattern:
- Filter traces by latency (find traces > p99 threshold)
- Compare slow trace waterfall with fast trace waterfall
- Identify the span that differs — that is your bottleneck
- Check span attributes for clues (specific query, specific user, specific payload size)
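A hypothetical driver for this workflow, reusing the analyzeTrace helper above; findTraces stands in for whatever query API your trace backend exposes and is not a real client library.

import type { ReadableSpan } from '@opentelemetry/sdk-trace-base';

// findTraces is a hypothetical backend query helper; the filter fields are illustrative.
declare function findTraces(filter: {
  route: string;
  minDurationMs?: number;
  maxDurationMs?: number;
  limit: number;
}): Promise<{ spans: ReadableSpan[] }[]>;

async function compareSlowAndFast(route: string, p99Ms: number) {
  const [slow] = await findTraces({ route, minDurationMs: p99Ms, limit: 1 });
  const [fast] = await findTraces({ route, maxDurationMs: p99Ms / 5, limit: 1 });
  analyzeTrace(slow.spans); // waterfall of a slow request
  analyzeTrace(fast.spans); // waterfall of a fast one; the span that differs is the bottleneck
}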
Common bottleneck patterns:
- Sequential calls: Spans that should be parallel are sequential — use Promise.all
- N+1 queries: Many small DB spans instead of one batch query — use DataLoader or JOINs
- Missing cache: Same data fetched repeatedly — add caching with cache-hit span events (sketched after this list)
- Large payloads: Long serialization spans — paginate or compress
- Connection pool exhaustion: Long wait times before DB span starts — increase pool size
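As a sketch of the missing-cache fix above: record hits and misses as span events so repeated fetches are visible in the waterfall. Here cache and db are assumed application clients, and the key scheme is illustrative.

import { trace } from '@opentelemetry/api';

// Assumed application clients, declared only so the sketch is self-contained.
declare const cache: {
  get(key: string): Promise<unknown>;
  set(key: string, value: unknown): Promise<void>;
};
declare const db: { query(sql: string, params: unknown[]): Promise<unknown> };

async function getProduct(id: string) {
  const span = trace.getActiveSpan();
  const key = `product:${id}`;
  const cached = await cache.get(key);
  // Hit/miss events make repeated fetches of the same data easy to spot
  span?.addEvent(cached ? 'cache.hit' : 'cache.miss', { 'cache.key': key });
  if (cached) return cached;
  const product = await db.query('SELECT * FROM products WHERE id = $1', [id]);
  await cache.set(key, product);
  return product;
}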
Deployment comparison: Tag spans with the deployment version. Compare p99 latency between versions to detect regressions immediately.
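One way to apply the version tag, assuming the Node SDK: service.version is the standard OpenTelemetry resource attribute for this, and DEPLOY_VERSION is an assumed environment variable set by the release pipeline.

import { Resource } from '@opentelemetry/resources';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';

// Stamp every span from this process with the running deployment's version.
// (SDK 1.x style; newer @opentelemetry/resources versions expose
// resourceFromAttributes instead of the Resource class.)
const provider = new NodeTracerProvider({
  resource: new Resource({
    'service.name': 'order-service',
    'service.version': process.env.DEPLOY_VERSION ?? 'unknown',
  }),
});
provider.register();

// In the backend, group p99 latency by service.version and compare adjacent releases.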
Source
https://opentelemetry.io/docs/concepts/signals/traces/#span-events
Process
- Read the instructions and examples in this document.
- Apply the patterns to your implementation, adapting to your specific context.
- Verify your implementation against the details and edge cases listed above.
Harness Integration
- Type: knowledge — this skill is a reference document, not a procedural workflow.
- No tools or state — consumed as context by other skills and agents.
Success Criteria
- The patterns described in this document are applied correctly in the implementation.
- Edge cases and anti-patterns listed in this document are avoided.