Awesome-omni-skill Implementing Observability

Instrument the application with Logging, Metrics, and Tracing (OpenTelemetry) to understand system behavior and debug production issues.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/implementing-observability" ~/.claude/skills/diegosouzapw-awesome-omni-skill-implementing-observability-698f4f && rm -rf "$T"
manifest: skills/development/implementing-observability/SKILL.md

Implementing Observability

Goal

Make the system's internal state inferable from its external outputs. Answer "Why is it slow?" and "Why did it fail?" without SSH-ing into a server.

When to Use

  • Before launching to production.
  • When debugging a performance bottleneck.
  • When integrating a new microservice or external API.

Instructions

1. Structured Logging

Text logs are hard to query. Use JSON.

  • Context: Every log must have trace_id, request_id, and user_id.
  • Levels: INFO for normal ops, WARN for handled issues, ERROR for unhandled crashes.
{"level": "info", "msg": "User logged in", "user_id": 123, "trace_id": "abc-123"}
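One way to produce log lines of this shape is a custom formatter on Python's standard logging module. The sketch below is a minimal example, not the skill's own implementation: the JsonFormatter class and build_logger helper are hypothetical names, and it assumes the context fields are passed through the extra argument on each call.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Context fields named in the text above; a missing field is emitted as null.
    CONTEXT_FIELDS = ("trace_id", "request_id", "user_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Hypothetical helper: a logger whose only handler writes JSON lines."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    # Context travels in `extra`, so the message text itself stays constant and queryable.
    log.info(
        "User logged in",
        extra={"user_id": 123, "trace_id": "abc-123", "request_id": "req-1"},
    )
```

Passing trace_id here is also what makes the log line joinable with a trace later on.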

2. Distributed Tracing (OpenTelemetry)

Trace a request across boundaries (Frontend -> API -> DB).

  • Instrument HTTP clients and server frameworks.
  • Visualize the "waterfall" to find the slow span.

3. Golden Signals (Metrics)

Track the four key metrics for every service:

  • Latency: Time to serve a request.
  • Traffic: Request rate (RPS).
  • Errors: Rate of 5xx responses.
  • Saturation: CPU/Memory/Disk usage.
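The four signals can be computed from very little raw data. The sketch below summarizes one observation window for one service. The Request record and golden_signals function are hypothetical, the choice of p95 for latency is an assumption (the text above does not name a percentile), and saturation is passed in as a CPU fraction measured elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(requests: list[Request], window_s: float, cpu_fraction: float) -> dict:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        return {
            "latency_p95_ms": 0.0,
            "traffic_rps": 0.0,
            "error_rate": 0.0,
            "saturation": cpu_fraction,
        }
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95 in integer arithmetic: ceil(0.95 * n) without float rounding.
    rank = (95 * len(durations) + 99) // 100
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": cpu_fraction,
    }


if __name__ == "__main__":
    window = [Request(10.0, 200)] * 19 + [Request(500.0, 503)]
    print(golden_signals(window, window_s=10.0, cpu_fraction=0.42))
```

A real service would export these through a metrics library and let Prometheus do the aggregation; the arithmetic stays the same.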

4. Alerting

Alert on symptoms (High Error Rate), not causes (High CPU).

  • Page: If Error Rate > 1% for 5 minutes.
  • Ticket: If Disk Usage > 80%.
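The two rules above can be sketched as plain functions. This is an illustration of the logic, not a replacement for an alerting system: the function names are hypothetical, and it assumes one error-rate reading per minute, so "for 5 minutes" becomes "the last five readings".

```python
def should_page(
    error_rates: list[float],
    threshold: float = 0.01,
    sustained_samples: int = 5,
) -> bool:
    """True when the most recent readings all exceed the threshold.

    Assumes one reading per minute, oldest first.
    """
    if len(error_rates) < sustained_samples:
        return False  # not enough history to call the condition sustained
    return all(rate > threshold for rate in error_rates[-sustained_samples:])


def should_ticket(disk_usage: float, threshold: float = 0.80) -> bool:
    """True when disk usage (as a fraction of capacity) exceeds the threshold."""
    return disk_usage > threshold


if __name__ == "__main__":
    print(should_page([0.02, 0.03, 0.02, 0.04, 0.02]))  # sustained breach
    print(should_page([0.02, 0.03, 0.00, 0.04, 0.02]))  # one healthy minute resets it
    print(should_ticket(0.85))
```

Requiring every reading in the window to breach the threshold is what keeps a brief spike from paging anyone.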

Constraints

✅ Do

  • DO: Use OpenTelemetry standards for portability.
  • DO: Correlate logs and traces (inject trace ID into logs).
  • DO: Sample high-volume traces (10%) to save costs, but keep 100% of errors.
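The sampling rule can be sketched as a single decision function. Keeping every error means the decision has to be made once the outcome of the trace is known, which is usually done in a collector after spans finish rather than at the start of the request. The keep_trace name is hypothetical, and hashing the trace id is one assumed way to make the decision deterministic across services.

```python
import hashlib


def keep_trace(trace_id: str, had_error: bool, sample_rate: float = 0.10) -> bool:
    """Keep all failed traces and roughly `sample_rate` of the successful ones."""
    if had_error:
        return True
    # Hash the trace id into [0, 1) so the same trace always gets the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


if __name__ == "__main__":
    kept = sum(keep_trace(f"trace-{i}", had_error=False) for i in range(10_000))
    print(f"kept {kept} of 10000 successful traces")
    print(keep_trace("trace-failed", had_error=True))
```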

❌ Don't

  • DON'T: Log PII (Emails, Passwords, Credit Cards).
  • DON'T: Create alerts that auto-resolve in seconds (flapping).
  • DON'T: Rely solely on "system up" checks; check "business logic working".
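As a safety net for the PII rule, a logging filter can scrub obvious patterns before a record is written. The sketch below is deliberately crude: the RedactPII class is a hypothetical name, the regular expressions are assumptions that catch only email addresses and card-like digit runs, and passwords cannot be detected this way at all, so the primary defense remains not passing them to the logger.

```python
import logging
import re

# Rough patterns; they will miss some PII and may mask other long digit runs.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


class RedactPII(logging.Filter):
    """Replace email addresses and card-like numbers in the rendered message."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        message = _EMAIL.sub("[email]", message)
        message = _CARD.sub("[card]", message)
        # Store the scrubbed text and drop args so it is not formatted twice.
        record.msg, record.args = message, ()
        return True  # never drop the record, only rewrite it


if __name__ == "__main__":
    logger = logging.getLogger("payments")
    handler = logging.StreamHandler()
    handler.addFilter(RedactPII())
    logger.addHandler(handler)
    logger.warning("Charge failed for %s on card %s", "a@b.com", "4111 1111 1111 1111")
```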

Output Format

  • docker-compose.yml with Prometheus/Grafana/Jaeger (for dev).
  • Code instrumentation (e.g., tracing.py).

Dependencies

  • backend/managing-flask-middleware/SKILL.md (where instrumentation lives)
  • shared/debugging/SKILL.md