Awesome-omni-skill distributed-tracing
Use when implementing distributed tracing, understanding trace propagation, or debugging cross-service issues. Covers OpenTelemetry, span context, and trace correlation.
install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/distributed-tracing-melodic-software" ~/.claude/skills/diegosouzapw-awesome-omni-skill-distributed-tracing-5c2a73 && rm -rf "$T"
manifest:
skills/devops/distributed-tracing-melodic-software/SKILL.md
Distributed Tracing
Patterns and practices for implementing distributed tracing across microservices and understanding request flows in distributed systems.
When to Use This Skill
- Implementing distributed tracing in microservices
- Debugging cross-service request issues
- Understanding trace propagation
- Choosing tracing infrastructure
- Correlating logs, metrics, and traces
Why Distributed Tracing?
Problem: a request flows through multiple services. How do you debug when something fails?

```
Without tracing:
User → API → ??? → ??? → Error somewhere

With tracing:
User → API (50ms) → OrderService (20ms) → PaymentService (ERROR: timeout)
└── Full visibility into the request flow
```
Core Concepts
Traces, Spans, and Context
```
Trace: End-to-end request journey
├── Span: Single operation within a service
│   ├── SpanID: Unique identifier
│   ├── ParentSpanID: Link to parent span
│   ├── TraceID: Shared across all spans
│   ├── Operation Name: What is being done
│   ├── Start/End Time: Duration
│   ├── Status: Success/Error
│   ├── Attributes: Key-value metadata
│   └── Events: Point-in-time annotations
│
└── Context: Propagated across service boundaries
    ├── TraceID
    ├── SpanID
    ├── Trace Flags
    └── Trace State
```
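The span model above can be sketched in a few lines of plain Python. This is illustrative, not a real SDK API: `Span` and `child_of` are hypothetical names. The key invariants are that every span in a trace shares the same `trace_id`, and a child span links to its parent via `parent_span_id`.

```python
# Illustrative span model: shared trace_id, parent links via span_id.
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str                      # shared across all spans in the trace
    name: str                          # operation name
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_span_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)

def child_of(parent: Span, name: str) -> Span:
    """Start a child span: same trace, parent link set to the caller's span."""
    return Span(trace_id=parent.trace_id, name=name, parent_span_id=parent.span_id)

root = Span(trace_id=uuid.uuid4().hex, name="GET /api/orders")
db = child_of(root, "db.query orders")
assert db.trace_id == root.trace_id and db.parent_span_id == root.span_id
```

A backend reconstructs the trace tree purely from these three IDs, which is why they must survive every hop between services.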
Trace Visualization
```
TraceID: abc123

Service A (API Gateway)
├──────────────────────────────────────────────────────┤ 200ms
│
└─► Service B (Order Service)
    ├───────────────────────────────────┤ 150ms
    │
    ├─► Service C (Inventory)
    │   ├───────────────┤ 50ms
    │
    └─► Service D (Payment)
        ├───────────────────────┤ 80ms
        │
        └─► External API
            ├─────────┤ 60ms
```
OpenTelemetry
Overview
OpenTelemetry is a unified observability framework. Components:

```
Application
  ├── SDK
  ├── Tracer Provider
  └── Meter Provider
        │
        ▼
   OTLP Exporter
        │
        ▼
   Collector (optional)
        │
  ┌─────┼──────┐
  ▼     ▼      ▼
Jaeger  Zipkin  Tempo
```
Trace Context Propagation
HTTP headers (W3C Trace Context):

```
traceparent: 00-{trace-id}-{span-id}-{flags}
tracestate:  vendor1=value1,vendor2=value2

Example:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             │  │                                │                └─ flags (01 = sampled)
             │  └─ trace id (128-bit)            └─ parent span id
             └─ version
```

Propagation across services:

```
┌─────────────┐                       ┌─────────────┐
│  Service A  │ ─── HTTP ───────────► │  Service B  │
│             │  traceparent: 00-...  │             │
│ Create Span │                       │ Extract     │
│ Inject      │                       │ Create Span │
└─────────────┘                       └─────────────┘
```
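The header format above is simple enough to parse by hand, which is useful when debugging propagation. A minimal sketch (the `parse_traceparent` helper is illustrative; in practice an SDK propagator does this for you):

```python
# Minimal sketch: splitting a W3C "traceparent" header into its fields.
# Per the spec: 2-hex version, 32-hex trace-id, 16-hex span-id, 2-hex flags.

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into version, trace_id, span_id, sampled."""
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),  # bit 0 = sampled flag
    }

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(ctx["trace_id"])  # 0af7651916cd43dd8448eb211c80319c
print(ctx["sampled"])   # True
```

If Service B ever starts a fresh trace instead of continuing one, checking what arrives in this header is usually the first debugging step.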
Span Attributes
Semantic conventions (standard attributes):

HTTP:
- http.method: GET, POST, etc.
- http.url: Full URL
- http.status_code: 200, 404, 500
- http.route: /users/{id}

Database:
- db.system: postgresql, mysql
- db.statement: SELECT * FROM...
- db.operation: query, insert

RPC:
- rpc.system: grpc
- rpc.service: OrderService
- rpc.method: CreateOrder

Custom:
- user.id: 12345
- order.total: 99.99
- feature.flag: experiment_v2
Tracing Backends
Jaeger
Features:
- Open source (CNCF)
- Built-in UI
- Multiple storage backends
- OpenTelemetry native

Architecture:

```
┌────────────┐   ┌───────────┐   ┌─────────────────┐
│   Agent    │──►│ Collector │──►│     Storage     │
│ (optional) │   │           │   │   (Cassandra/   │
└────────────┘   └───────────┘   │ Elasticsearch)  │
                                 └─────────────────┘
                                         │
                                         ▼
                                 ┌───────────────┐
                                 │ Query Service │
                                 └───────────────┘
                                         │
                                         ▼
                                 ┌───────────────┐
                                 │      UI       │
                                 └───────────────┘
```
Zipkin
Features:
- Mature, battle-tested
- Simple architecture
- Low resource overhead
- Good ecosystem support

Best for:
- Simpler setups
- Lower-resource environments
- Teams already familiar with Zipkin
Grafana Tempo
Features:
- Object storage backend (cheap)
- Deep Grafana integration
- Log-based trace discovery
- Exemplars support

Best for:
- Grafana-heavy environments
- Cost-sensitive deployments
- Large-scale trace volumes
Cloud Native Options
| Provider | Service | Integration |
|---|---|---|
| AWS | X-Ray | Native AWS services |
| GCP | Cloud Trace | Native GCP services |
| Azure | Application Insights | Native Azure services |
| Datadog | APM | Full-stack observability |
Sampling Strategies
Why Sample?
High-traffic systems generate millions of spans. Storing all of them is expensive and often unnecessary.

Sampling: collect a subset of traces.
Goal: keep enough data to debug issues while managing costs.
Sampling Types
1. Head-based sampling (at trace start):
   - Decision made when the trace begins
   - Consistent across services
   - Simple, but may miss rare events

2. Tail-based sampling (after the trace completes):
   - Decision made after seeing the full trace
   - Can keep interesting traces (errors, slow requests)
   - Requires buffering spans
   - More complex infrastructure

3. Priority sampling:
   - Assign priority based on attributes
   - Keep all errors, sample normal traffic
Sampling Strategies
Rate-based:
- Sample 10% of all traces
- Simple, predictable cost

Priority-based:
- 100% of errors
- 100% of slow requests (>1s)
- 5% of normal requests

Adaptive:
- Adjust the rate based on traffic
- Target a specific number of traces/second
- Handle traffic spikes
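Rate-based head sampling works by making the keep/drop decision a pure function of the trace ID, so every service reaches the same verdict with no coordination. A minimal sketch of the idea (OpenTelemetry's `TraceIdRatioBased` sampler works similarly, though the exact bits it compares differ; this version is illustrative):

```python
# Head-based ratio sampling sketch: the decision depends only on the
# trace id, so all services in the trace agree without coordination.

def sample(trace_id: str, ratio: float) -> bool:
    """Keep a trace iff its id falls in the lowest `ratio` slice of id space."""
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound  # compare the lower 64 bits

# Same id → same decision, whichever service asks.
tid = "0af7651916cd43dd8448eb211c80319c"
assert sample(tid, 1.0) is True   # 100% sampling keeps everything
assert sample(tid, 0.0) is False  # 0% drops everything
```

This determinism is why head-based sampling never produces partial traces, while naive per-service random sampling would.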
Correlation Patterns
Logs-Traces-Metrics
Three pillars of observability:

```
Logs ◄──────────► Traces ◄──────────► Metrics
  │    trace_id      │    exemplars      │
  │    span_id       │                   │
  └──────────────────┴───────────────────┘
```

Correlation:
1. Add trace_id/span_id to log entries
2. Add exemplars (trace links) to metrics
3. Click from metric → trace → logs
Log Correlation
Structured log with trace context:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "message": "Payment failed",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "service": "payment-service",
  "user_id": "12345",
  "error": "Card declined"
}
```

Query in a log aggregator: `trace_id:"abc123def456"` → see all logs for this request.
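Producing that log shape is straightforward once the trace context is in hand. A minimal sketch (the `log` helper is hypothetical; in practice a logging formatter pulls the IDs from your tracing SDK's current span):

```python
# Sketch: emit a structured log line carrying the current trace context,
# matching the field names in the example above.
import json
from datetime import datetime, timezone

def log(level: str, message: str, trace_id: str, span_id: str, **fields) -> str:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "trace_id": trace_id,  # lets the aggregator join logs to the trace
        "span_id": span_id,
        **fields,
    }
    return json.dumps(entry)

line = log("ERROR", "Payment failed", "abc123def456", "789xyz",
           service="payment-service", error="Card declined")
```

The important part is that `trace_id` is a top-level, indexed field, so the aggregator query above stays cheap even at high log volume.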
Exemplars (Metrics to Traces)
Metric with exemplar:

```
http_request_duration{service="api"} = 2.5s
└── exemplar: trace_id=abc123
```

When latency spikes:
1. See the metric spike in a dashboard
2. Click on the data point
3. Jump directly to the slow trace
4. See exactly what caused the latency
Instrumentation Patterns
Automatic Instrumentation
Zero-code instrumentation covers:
- HTTP clients/servers
- Database clients
- Message queues
- gRPC

Pros: easy, comprehensive
Cons: less control, more noise
Manual Instrumentation
Add spans for business logic (OpenTelemetry Python):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.items", len(items))
    result = process(order)
    if result.error:
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(result.error)
```

Pros: precise, business-relevant
Cons: more code to write and maintain
Hybrid Approach (Recommended)
1. Auto-instrument infrastructure:
   - HTTP, database, and queue calls
2. Manually instrument business logic:
   - Key operations
   - Business metrics
   - Error context
Best Practices
Span Design
Good span names:
- HTTP GET /api/orders/{id}
- ProcessPayment
- db.query users

Bad span names:
- Handler (too generic)
- /api/orders/12345 (cardinality explosion)
- doStuff (meaningless)
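The cardinality problem above is usually solved by templating the route before it becomes a span name. A minimal sketch, assuming id-like path segments are purely numeric (the `span_name` helper is illustrative; HTTP frameworks typically expose the matched route template directly):

```python
# Sketch: collapse raw URL paths into low-cardinality span names by
# replacing numeric segments with a placeholder, so every order shares
# one span name instead of creating one name per order id.
import re

def span_name(method: str, path: str) -> str:
    templated = re.sub(r"/\d+", "/{id}", path)
    return f"HTTP {method} {templated}"

print(span_name("GET", "/api/orders/12345"))  # HTTP GET /api/orders/{id}
```

With a bounded set of span names, backends can aggregate latency and error rates per operation instead of drowning in one-off names.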
Attribute Guidelines
Do: - Use semantic conventions - Add business context (user_id, order_id) - Keep cardinality low - Include error details Don't: - Add PII (personally identifiable info) - Use high-cardinality values as attributes - Add large payloads - Include sensitive data
Performance Considerations
1. Use async span export
2. Sample appropriately
3. Limit attribute count
4. Use span processor batching
5. Consider span limits
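Point 4 is worth unpacking: batching means spans are buffered and flushed in one exporter call instead of one network round trip per span. Real SDKs (e.g. OpenTelemetry's `BatchSpanProcessor`) do this on a background thread with a timeout; this single-threaded sketch with a hypothetical `BatchingExporter` class just shows the buffering logic:

```python
# Sketch of batched span export: buffer finished spans, flush as a batch.

class BatchingExporter:
    def __init__(self, export_fn, max_batch: int = 512):
        self.export_fn = export_fn   # e.g. one HTTP POST per batch
        self.max_batch = max_batch
        self.buffer = []

    def on_end(self, span) -> None:
        """Called whenever a span finishes."""
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        """Export everything buffered so far in a single call."""
        if self.buffer:
            self.export_fn(self.buffer)
            self.buffer = []

batches = []
exp = BatchingExporter(batches.append, max_batch=3)
for i in range(7):
    exp.on_end(f"span-{i}")
exp.flush()  # drain the remainder on shutdown
# batches now holds three batches of sizes 3, 3, and 1
```

The shutdown `flush()` matters in real deployments too: without it, the tail of a process's spans is silently dropped.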
Troubleshooting with Traces
Common Patterns
Finding slow requests:
1. Query traces by duration > threshold
2. Identify the slow spans
3. Check span attributes for context

Finding errors:
1. Query traces by status = ERROR
2. See the error span and its context
3. Check exception details

Finding dependencies:
1. View the service map built from traces
2. Identify critical paths
3. Find hidden dependencies
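The first two query patterns reduce to simple filters over span data. A sketch over an in-memory list of span dicts (a stand-in for a real backend's query API; the field names are assumptions):

```python
# Sketch: the "slow requests" and "errors" queries as plain filters.

spans = [
    {"trace_id": "t1", "name": "GET /api/orders", "duration_ms": 1200, "status": "OK"},
    {"trace_id": "t2", "name": "ProcessPayment", "duration_ms": 80, "status": "ERROR"},
    {"trace_id": "t3", "name": "db.query users", "duration_ms": 40, "status": "OK"},
]

slow = [s for s in spans if s["duration_ms"] > 1000]   # duration > threshold
errors = [s for s in spans if s["status"] == "ERROR"]  # status = ERROR

print([s["trace_id"] for s in slow])    # ['t1']
print([s["trace_id"] for s in errors])  # ['t2']
```

Backends express the same filters in their own query languages (e.g. Jaeger's UI filters or TraceQL in Tempo), but the mental model is identical.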
Related Skills
- observability-patterns — three pillars overview
- slo-sli-error-budget — using traces for SLIs
- incident-response — using traces in incidents