Marketplace observability-monitoring
Structured logging, metrics, distributed tracing, and alerting strategies
install
source · Clone the upstream repo
git clone https://github.com/aiskillstore/marketplace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiskillstore/marketplace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ariegoldkin/observability-monitoring" ~/.claude/skills/aiskillstore-marketplace-observability-monitoring && rm -rf "$T"
manifest: skills/ariegoldkin/observability-monitoring/SKILL.md
Observability & Monitoring Skill
Comprehensive frameworks for implementing observability including structured logging, metrics, distributed tracing, and alerting.
When to Use
- Setting up application monitoring
- Implementing structured logging
- Adding metrics and dashboards
- Configuring distributed tracing
- Creating alerting rules
- Debugging production issues
Three Pillars of Observability
    ┌─────────────────┬─────────────────┬─────────────────┐
    │      LOGS       │     METRICS     │     TRACES      │
    ├─────────────────┼─────────────────┼─────────────────┤
    │ What happened   │ How is system   │ How do requests │
    │ at specific     │ performing      │ flow through    │
    │ point in time   │ over time       │ services        │
    └─────────────────┴─────────────────┴─────────────────┘
Structured Logging
Log Levels
| Level | Use Case |
|---|---|
| ERROR | Unhandled exceptions, failed operations |
| WARN | Deprecated API, retry attempts |
| INFO | Business events, successful operations |
| DEBUG | Development troubleshooting |
Best Practice
    // Good: Structured with context
    logger.info('User action completed', {
      action: 'purchase',
      userId: user.id,
      orderId: order.id,
      duration_ms: 150
    });

    // Bad: String interpolation
    logger.info(`User ${user.id} completed purchase`);
See templates/structured-logging.ts for Winston setup and request middleware.
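To make the level table and the structured-versus-interpolated contrast concrete, here is a minimal, dependency-free sketch of a JSON logger with a level threshold. It is not the Winston template; `createLogger`, `LEVEL_ORDER`, and the in-memory `lines` sink are hypothetical names chosen for illustration.

```typescript
type Level = "error" | "warn" | "info" | "debug";
type Context = Record<string, unknown>;

// Lower number = higher severity, following the order of the level table.
const LEVEL_ORDER: Record<Level, number> = { error: 0, warn: 1, info: 2, debug: 3 };

// `write` receives one serialized entry per call (stdout, a file, a test buffer).
function createLogger(minLevel: Level, write: (line: string) => void) {
  const log = (level: Level, message: string, context: Context = {}): void => {
    // Drop entries that are more verbose than the configured threshold.
    if (LEVEL_ORDER[level] > LEVEL_ORDER[minLevel]) return;
    // One JSON object per line keeps every field queryable downstream.
    write(JSON.stringify({ timestamp: new Date().toISOString(), level, message, ...context }));
  };
  return {
    error: (m: string, c?: Context) => log("error", m, c),
    warn: (m: string, c?: Context) => log("warn", m, c),
    info: (m: string, c?: Context) => log("info", m, c),
    debug: (m: string, c?: Context) => log("debug", m, c),
  };
}

// Usage: collect lines in memory so the output can be inspected.
const lines: string[] = [];
const logger = createLogger("info", (line) => lines.push(line));
logger.info("User action completed", { action: "purchase", userId: "u-1", orderId: "o-9", duration_ms: 150 });
logger.debug("Cart recalculated", { items: 3 }); // filtered out at the "info" threshold
```

Because the context is a separate object rather than part of the message string, a log backend can filter on `userId` or aggregate `duration_ms` without parsing free text.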
Metrics Collection
RED Method (Rate, Errors, Duration)
Essential metrics for any service:
- Rate - Requests per second
- Errors - Failed requests per second
- Duration - Request latency distribution
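As a rough illustration of the three RED numbers, the sketch below derives them from a window of raw request records. In practice a metrics library would maintain these incrementally; `RequestRecord`, `summarizeRed`, and the rule that only 5xx responses count as failures are assumptions made for this example.

```typescript
// One record per handled request; the field names are illustrative.
interface RequestRecord {
  status: number;     // HTTP status code
  durationMs: number; // time spent handling the request
}

interface RedSummary {
  ratePerSec: number;   // Rate
  errorsPerSec: number; // Errors
  p50Ms: number;        // Duration (median)
  p95Ms: number;        // Duration (tail)
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function summarizeRed(records: RequestRecord[], windowSec: number): RedSummary {
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);
  // Assumption: only 5xx responses count as failed requests.
  const failed = records.filter((r) => r.status >= 500).length;
  return {
    ratePerSec: records.length / windowSec,
    errorsPerSec: failed / windowSec,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
  };
}

// Four requests observed over a 2-second window.
const sample: RequestRecord[] = [
  { status: 200, durationMs: 40 },
  { status: 200, durationMs: 55 },
  { status: 500, durationMs: 900 },
  { status: 200, durationMs: 70 },
];
const red = summarizeRed(sample, 2);
```

Reporting duration as percentiles rather than an average keeps a single slow request (900 ms here) visible in the tail instead of being diluted by the fast ones.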
Prometheus Buckets
    // HTTP request latency
    buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]

    // Database query latency
    buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]
See templates/prometheus-metrics.ts for full metrics configuration.
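The bucket lists above are upper bounds in seconds. Prometheus histograms are cumulative: each bucket counts every observation less than or equal to its bound, and an implicit +Inf bucket counts all of them. The following dependency-free sketch shows that counting rule; `Histogram`, `createHistogram`, and `observe` are hypothetical helpers, not the prom-client API.

```typescript
// Upper bounds in seconds, copied from the HTTP latency list above.
const HTTP_BUCKETS = [0.01, 0.05, 0.1, 0.5, 1, 2, 5];

interface Histogram {
  bounds: number[]; // upper bounds ("le" values), ascending
  counts: number[]; // cumulative counts, one per bound, plus a final +Inf slot
  sum: number;      // total of all observed values
}

function createHistogram(bounds: number[]): Histogram {
  return { bounds: bounds.slice(), counts: bounds.map(() => 0).concat([0]), sum: 0 };
}

function observe(h: Histogram, value: number): void {
  h.sum += value;
  // Cumulative semantics: increment every bucket whose bound is >= value.
  h.bounds.forEach((bound, i) => {
    if (value <= bound) h.counts[i] += 1;
  });
  h.counts[h.bounds.length] += 1; // the +Inf bucket counts everything
}

const latency = createHistogram(HTTP_BUCKETS);
[0.004, 0.03, 0.2, 0.2, 7].forEach((seconds) => observe(latency, seconds));
// latency.counts is now [1, 2, 2, 4, 4, 4, 4, 5]
```

The 7-second observation lands only in the +Inf slot, which is why the largest finite bound should sit above the slowest latency you still want to resolve.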
Distributed Tracing
OpenTelemetry Setup
Auto-instrument common libraries:
- Express/HTTP
- PostgreSQL
- Redis
Manual Spans
    tracer.startActiveSpan('processOrder', async (span) => {
      try {
        span.setAttribute('order.id', orderId);
        // ... work
      } finally {
        span.end(); // always end the span, even if the work throws
      }
    });
See templates/opentelemetry-tracing.ts for full setup.
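Spans in different services join one trace only if the trace context travels with each request. The skill does not say which propagation format its template uses; the sketch below assumes the W3C Trace Context `traceparent` header, which OpenTelemetry supports, and handles version `00` only. `parseTraceparent` and `formatTraceparent` are hypothetical helpers, not OpenTelemetry functions, and the ids are example values.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex chars, shared by every span in the trace
  parentId: string; // 16 lowercase hex chars, the calling span's id
  sampled: boolean; // lowest bit of the flags byte
}

// version "00" - trace id - parent id - flags
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed headers so the caller can start a new trace instead.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, traceId, parentId, flags] = match;
  // All-zero ids are invalid in the W3C format.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}

const incoming = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
// A downstream call keeps the trace id and swaps in the current span's id.
const outgoing = incoming
  ? formatTraceparent({ ...incoming, parentId: "b7ad6b7169203331" })
  : null;
```

With auto-instrumentation enabled this header handling happens inside the instrumented HTTP libraries, so the sketch is mainly useful for reading headers while debugging a broken trace.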
Alerting Strategy
Severity Levels
| Level | Response Time | Examples |
|---|---|---|
| Critical (P1) | < 15 min | Service down, data loss |
| High (P2) | < 1 hour | Major feature broken |
| Medium (P3) | < 4 hours | Increased error rate |
| Low (P4) | Next day | Warnings |
Key Alerts
| Alert | Condition | Severity |
|---|---|---|
| ServiceDown | for 1m | Critical |
| HighErrorRate | 5xx > 5% for 5m | Critical |
| HighLatency | p95 > 2s for 5m | High |
| LowCacheHitRate | < 70% for 10m | Medium |
See templates/alerting-rules.yml for Prometheus alerting rules.
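Each condition in the Key Alerts table pairs a threshold with a duration ("for 5m"). The duration means the condition must hold at every evaluation across that span before the alert fires, which filters out short spikes. The sketch below models that behaviour in plain TypeScript; `evaluateAlert` and the `Sample` shape are hypothetical, and a real deployment would express this in the alerting rules file instead.

```typescript
interface Sample {
  timestampSec: number; // evaluation time
  value: number;        // e.g. the 5xx ratio at that time
}

type AlertState = "inactive" | "pending" | "firing";

// Walks evaluations in time order and reports the state after the last one.
function evaluateAlert(samples: Sample[], threshold: number, forSec: number): AlertState {
  let breachedSince: number | null = null;
  let state: AlertState = "inactive";
  for (const s of samples) {
    if (s.value > threshold) {
      // Remember when the breach started and keep that time while it lasts.
      if (breachedSince === null) breachedSince = s.timestampSec;
      state = s.timestampSec - breachedSince >= forSec ? "firing" : "pending";
    } else {
      // Any healthy evaluation resets the timer.
      breachedSince = null;
      state = "inactive";
    }
  }
  return state;
}

// HighErrorRate from the table: 5xx > 5% for 5m, evaluated once a minute.
const spike: Sample[] = [0, 60, 120].map((t) => ({ timestampSec: t, value: 0.08 }));
const sustained: Sample[] = [0, 60, 120, 180, 240, 300].map((t) => ({ timestampSec: t, value: 0.08 }));

const spikeState = evaluateAlert(spike, 0.05, 300);         // "pending"
const sustainedState = evaluateAlert(sustained, 0.05, 300); // "firing"
```

Longer durations reduce noise at the cost of slower detection, which is the trade-off behind giving LowCacheHitRate 10 minutes and ServiceDown only 1.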
Health Checks
Kubernetes Probes
| Probe | Purpose | Endpoint |
|---|---|---|
| Liveness | Is app running? | |
| Readiness | Ready for traffic? | |
| Startup | Finished starting? | |
Readiness Response
    {
      "status": "healthy|degraded|unhealthy",
      "checks": {
        "database": { "status": "pass", "latency_ms": 5 },
        "redis": { "status": "pass", "latency_ms": 2 }
      },
      "version": "1.0.0",
      "uptime": 3600
    }
See templates/health-checks.ts for implementation.
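The readiness payload combines per-dependency checks into one overall status. The source shows only the "pass" check value and does not state how the overall status is derived, so the sketch below assumes a simple policy: any "fail" yields unhealthy, any "warn" yields degraded, otherwise healthy. The "warn" and "fail" values, the function names, and the 200/503 mapping are all assumptions for illustration.

```typescript
type CheckStatus = "pass" | "warn" | "fail"; // "warn" and "fail" are assumed values
type OverallStatus = "healthy" | "degraded" | "unhealthy";

interface CheckResult {
  status: CheckStatus;
  latency_ms: number;
}

interface ReadinessResponse {
  status: OverallStatus;
  checks: Record<string, CheckResult>;
  version: string;
  uptime: number; // seconds since start
}

// Assumed policy: the worst individual check decides the overall status.
function aggregateStatus(checks: Record<string, CheckResult>): OverallStatus {
  const statuses = Object.keys(checks).map((name) => checks[name].status);
  if (statuses.some((s) => s === "fail")) return "unhealthy";
  if (statuses.some((s) => s === "warn")) return "degraded";
  return "healthy";
}

function buildReadiness(
  checks: Record<string, CheckResult>,
  version: string,
  uptime: number,
): ReadinessResponse {
  return { status: aggregateStatus(checks), checks, version, uptime };
}

// Assumed mapping: a degraded service still accepts traffic, an unhealthy one does not.
function httpStatusFor(status: OverallStatus): number {
  return status === "unhealthy" ? 503 : 200;
}

const readiness = buildReadiness(
  { database: { status: "pass", latency_ms: 5 }, redis: { status: "warn", latency_ms: 250 } },
  "1.0.0",
  3600,
);
```

Kubernetes HTTP probes treat any status code from 200 through 399 as success, so returning 503 for the unhealthy case is what actually removes the pod from rotation.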
Observability Checklist
Implementation
- JSON structured logging
- Request correlation IDs
- RED metrics (Rate, Errors, Duration)
- Business metrics
- Distributed tracing
- Health check endpoints
Alerting
- Service outage alerts
- Error rate thresholds
- Latency thresholds
- Resource utilization alerts
Dashboards
- Service overview
- Error analysis
- Performance metrics
Extended Thinking Triggers
Use Opus 4.5 extended thinking for:
- Incident investigation - Correlating logs, metrics, traces
- Alert tuning - Reducing noise, catching real issues
- Architecture decisions - Choosing monitoring solutions
- Performance debugging - Cross-service latency analysis
Templates Reference
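| Template | Purpose |
|---|---|
| templates/structured-logging.ts | Winston logger with request middleware |
| templates/prometheus-metrics.ts | HTTP, DB, cache metrics with middleware |
| templates/opentelemetry-tracing.ts | Distributed tracing setup |
| templates/alerting-rules.yml | Prometheus alerting rules |
| templates/health-checks.ts | Liveness, readiness, startup probes |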