Awesome-omni-skill distributed-tracing
Use when debugging microservice latency or request flows, or when implementing tracing with OpenTelemetry, Jaeger, or Tempo.
install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/distributed-tracing" ~/.claude/skills/diegosouzapw-awesome-omni-skill-distributed-tracing-e4b28f && rm -rf "$T"
manifest:
skills/devops/distributed-tracing/SKILL.md (source content)
Distributed Tracing
Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.
Do not use this skill when
- The task is unrelated to distributed tracing
- You need a different domain or tool outside this scope
Instructions
```dot
digraph tracing_workflow {
  rankdir=TB;
  node [shape=box, style=filled, fillcolor="#f9f9f9"];
  start [label="Distributed Tracing Task", shape=oval, fillcolor="#ffdce0"];
  scope [label="1. Define Scope\n(Identify target services/flows)", fillcolor="#e2f0cb"];
  infra [label="2. Deploy / Check Infra\n(Jaeger, Tempo, Collectors)", fillcolor="#ffebbb"];
  instrument [label="3. Instrument Applications\n(OpenTelemetry SDKs)", fillcolor="#c7ceea"];
  propagate [label="4. Propagate Context\n(Inject/Extract Headers)", fillcolor="#fff3cd"];
  validate [label="5. Analyze & Validate\n(Query Traces on Dashboard)", fillcolor="#d4edda"];
  end [label="Done", shape=oval, fillcolor="#d4edda"];
  start -> scope;
  scope -> infra;
  infra -> instrument;
  instrument -> propagate;
  propagate -> validate;
  validate -> end;
}
```
MANDATORY WORKFLOW:
- Scope the effort (which services and request flows are involved).
- Ensure infrastructure (Jaeger/Tempo) is running and accessible.
- Instrument the target applications using the OpenTelemetry SDK (Span creation, Tags, Events).
- Verify HTTP/gRPC Context Propagation is active across service boundaries.
- Validate outcomes by querying and visualizing traces to solve the bottleneck or bug.
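The final validation step can be checked programmatically against Jaeger's HTTP query API (`/api/traces` on port 16686). A minimal sketch, assuming a Jaeger host named `jaeger` and a service named `my-service` (both illustrative); the summarizing function itself only needs the response JSON:

```python
# Sketch: summarize traces returned by Jaeger's /api/traces query endpoint.
def summarize_traces(payload: dict) -> list:
    """Reduce a Jaeger query response to traceID, span count, max duration."""
    summaries = []
    for t in payload.get("data", []):
        spans = t.get("spans", [])
        summaries.append({
            "traceID": t.get("traceID"),
            "spanCount": len(spans),
            "maxDurationUs": max((s.get("duration", 0) for s in spans), default=0),
        })
    return summaries

# Against a live Jaeger instance this would be (hypothetical host):
#   import requests
#   payload = requests.get("http://jaeger:16686/api/traces",
#                          params={"service": "my-service", "limit": 20}).json()
sample = {"data": [{"traceID": "abc123",
                    "spans": [{"duration": 100000}, {"duration": 40000}]}]}
print(summarize_traces(sample))
```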
Purpose
Track requests across distributed systems to understand latency, dependencies, and failure points.
Use this skill when
- Debugging latency issues
- Understanding service dependencies
- Identifying bottlenecks
- Tracing error propagation
- Analyzing request paths
Distributed Tracing Concepts
Trace Structure
```
Trace (Request ID: abc123)
  ↓
Span (frontend) [100ms]
  ↓
Span (api-gateway) [80ms]
  ├→ Span (auth-service) [10ms]
  └→ Span (user-service) [60ms]
      └→ Span (database) [40ms]
```
Key Components
- Trace - End-to-end request journey
- Span - Single operation within a trace
- Context - Metadata propagated between services
- Tags - Key-value pairs for filtering
- Logs - Timestamped events within a span
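How these components fit together can be illustrated with a minimal, dependency-free model. This is a conceptual sketch, not the OpenTelemetry API: each span is one operation, carries tags, and nests child spans to form the trace tree shown above:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Toy span: one operation with tags and child spans (concept only)."""
    name: str
    duration_ms: int
    tags: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def render(span, depth=0):
    """Render the span tree the way a trace viewer would display it."""
    lines = [f"{'  ' * depth}Span ({span.name}) [{span.duration_ms}ms]"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines

trace_tree = Span("frontend", 100, children=[
    Span("api-gateway", 80, children=[
        Span("auth-service", 10),
        Span("user-service", 60, tags={"db.system": "postgresql"},
             children=[Span("database", 40)]),
    ]),
])
print("\n".join(render(trace_tree)))
```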
Jaeger Setup
Kubernetes Deployment
```bash
# Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF
```
Docker Compose
```yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"  # UI
      - "14268:14268"  # Collector
      - "14250:14250"  # gRPC
      - "9411:9411"    # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
```
Reference: See references/jaeger-setup.md
Application Instrumentation
OpenTelemetry (Recommended)
Python (Flask)
```python
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize tracer
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("user.count", 100)
        # Business logic
        users = fetch_users_from_db()
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query
        return query_database()
```
Node.js (Express)
```javascript
const { trace } = require('@opentelemetry/api');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { Resource } = require('@opentelemetry/resources');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer
const provider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'my-service' }),
});
const exporter = new JaegerExporter({ endpoint: 'http://jaeger:14268/api/traces' });
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  const tracer = trace.getTracer('my-service');
  const span = tracer.startSpan('get_users');
  try {
    const users = await fetchUsers();
    span.setAttributes({ 'user.count': users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});
```
Go
```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
		jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
	))
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func getUsers(ctx context.Context) ([]User, error) {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(ctx, "get_users")
	defer span.End()

	span.SetAttributes(attribute.String("user.filter", "active"))

	users, err := fetchUsersFromDB(ctx)
	if err != nil {
		span.RecordError(err)
		return nil, err
	}
	span.SetAttributes(attribute.Int("user.count", len(users)))
	return users, nil
}
```
Reference: See references/instrumentation.md
Context Propagation
HTTP Headers
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
```
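The `traceparent` value follows the W3C Trace Context format, `version-traceid-spanid-flags`. A minimal parser makes the fields explicit:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,         # 2 hex chars
        "trace_id": trace_id,       # 32 hex chars: id of the whole trace
        "parent_span_id": span_id,  # 16 hex chars: id of the calling span
        "sampled": int(flags, 16) & 0x01 == 0x01,  # sampled flag bit
    }

print(parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"))
```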
Propagation in HTTP Requests
Python
```python
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects trace context
response = requests.get('http://downstream-service/api', headers=headers)
```
Node.js
```javascript
const axios = require('axios');
const { context, propagation } = require('@opentelemetry/api');

const headers = {};
propagation.inject(context.active(), headers);
axios.get('http://downstream-service/api', { headers });
```
Tempo Setup (Grafana)
Kubernetes Deployment
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200
    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:
    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:latest
          args:
            - -config.file=/etc/tempo/tempo.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config
```
Reference: See assets/jaeger-config.yaml.template
Sampling Strategies
Probabilistic Sampling
```yaml
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01
```
Rate Limiting Sampling
```yaml
# Sample max 100 traces per second
sampler:
  type: ratelimiting
  param: 100
```
Parent-Based Ratio Sampling

```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample based on trace ID (deterministic); child spans follow the
# sampling decision already made by their parent
sampler = ParentBased(root=TraceIdRatioBased(0.01))
```
Trace Analysis
Finding Slow Requests
Jaeger Query:
service=my-service duration > 1s
Finding Errors
Jaeger Query:
service=my-service error=true tags.http.status_code >= 500
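The same filter can be applied client-side when post-processing exported spans. A sketch over span dictionaries, assuming Jaeger's JSON shape where tags are a list of `{key, value}` pairs:

```python
def is_error_span(span: dict) -> bool:
    """Mirror the Jaeger query: error=true or an HTTP 5xx status tag."""
    tags = {t["key"]: t["value"] for t in span.get("tags", [])}
    if tags.get("error") is True:
        return True
    status = tags.get("http.status_code")
    return status is not None and int(status) >= 500

spans = [
    {"tags": [{"key": "http.status_code", "value": 200}]},
    {"tags": [{"key": "http.status_code", "value": 503}]},
    {"tags": [{"key": "error", "value": True}]},
]
print([is_error_span(s) for s in spans])  # [False, True, True]
```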
Service Dependency Graph
Jaeger automatically generates service dependency graphs showing:
- Service relationships
- Request rates
- Error rates
- Average latencies
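Jaeger derives these graphs from parent/child relationships between spans that belong to different services. The aggregation can be sketched as follows (the flat span records are an illustrative simplification of real trace data):

```python
from collections import Counter

def build_dependencies(spans: list) -> Counter:
    """Count caller->callee edges between services from parent/child links."""
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        if parent and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges

spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "api-gateway"},
    {"span_id": "c", "parent_id": "b", "service": "auth-service"},
    {"span_id": "d", "parent_id": "b", "service": "user-service"},
]
print(build_dependencies(spans))
```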
Best Practices
- Sample appropriately (1-10% in production)
- Add meaningful tags (user_id, request_id)
- Propagate context across all service boundaries
- Log exceptions in spans
- Use consistent naming for operations
- Monitor tracing overhead (<1% CPU impact)
- Set up alerts for trace errors
- Implement distributed context (baggage)
- Use span events for important milestones
- Document instrumentation standards
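Baggage, mentioned above, travels as a W3C `baggage` header of comma-separated key=value pairs. In practice you would use the `opentelemetry.baggage` API, but the wire format itself is simple enough to sketch directly:

```python
from urllib.parse import quote, unquote

def encode_baggage(items: dict) -> str:
    """Serialize key/value pairs into a W3C baggage header value."""
    return ",".join(f"{quote(k)}={quote(str(v))}" for k, v in items.items())

def decode_baggage(header: str) -> dict:
    """Parse a baggage header back into a dict (ignores list-member properties)."""
    out = {}
    for pair in header.split(","):
        if "=" in pair:
            k, _, v = pair.partition("=")
            out[unquote(k.strip())] = unquote(v.strip().split(";")[0])
    return out

header = encode_baggage({"user.id": "42", "tenant": "acme corp"})
print(header)                  # user.id=42,tenant=acme%20corp
print(decode_baggage(header))
```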
Integration with Logging
Correlated Logs
```python
import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id
    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
```
Troubleshooting
No traces appearing:
- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs
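A quick TCP reachability check against the collector endpoint narrows down the first two causes before digging into sampling or application logs. The host and port below are illustrative; substitute your collector's address:

```python
import socket

def endpoint_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Try a TCP connect to the collector; False on DNS failure or refusal."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. the Jaeger collector HTTP port (hypothetical host):
print(endpoint_reachable("jaeger", 14268, timeout=1.0))
```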
High latency overhead:
- Reduce sampling rate
- Use batch span processor
- Check exporter configuration
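Why batching helps can be illustrated with a toy processor: buffering spans cuts exporter round-trips roughly by the batch-size factor. This is a conceptual sketch, not the OpenTelemetry `BatchSpanProcessor`, and the numbers are illustrative:

```python
class ToyBatchProcessor:
    """Buffer finished spans and export them in batches."""
    def __init__(self, export_fn, max_batch_size=512):
        self.export_fn = export_fn
        self.max_batch_size = max_batch_size
        self.buffer = []
        self.export_calls = 0

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.export_fn(self.buffer)
            self.export_calls += 1
            self.buffer = []

exported = []
proc = ToyBatchProcessor(exported.append, max_batch_size=100)
for i in range(1000):
    proc.on_end(f"span-{i}")
proc.flush()
print(proc.export_calls)  # 10 batched exports instead of 1000 round-trips
```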
Reference Files
- Jaeger installation: references/jaeger-setup.md
- Instrumentation patterns: references/instrumentation.md
- Jaeger configuration: assets/jaeger-config.yaml.template
Related Skills
- For metrics: prometheus-configuration
- For visualization: grafana-dashboards
- For latency SLOs: slo-implementation
Execution Layer
Trigger Conditions
- Use when debugging microservice latency or request flows, or when implementing tracing with OpenTelemetry, Jaeger, or Tempo.
Prerequisites
- Confirm that the current task matches this skill's scope.
- Read the key steps in this file and confirm that command paths are based on real files in the repository.
- If external tools or credentials are required, run a minimal availability check first (e.g. --help or a version check).
Execution Steps
- First confirm boundaries and deliverables against this skill's workflow section.
- Execute the smallest verifiable step first, then expand incrementally to the full implementation.
- Record key commands, inputs, and results along the way as reviewable evidence.
- If routing conflicts with AGENTS.md arise, project-level conventions and the task goal take precedence.
Evidence of Completion
- Provide key commands and output summaries; attach log or report file paths where necessary.
- List affected files and core changes, ensuring a one-to-one mapping to requirements.
- State explicitly whether validation passed and which risks remain uncovered.
Failure Fallback
- On failure, preserve the environment and error messages first, then locate the root cause and retry.
- If a fallback plan is needed, state its impact scope, rollback path, and compensating measures.