SWE Observability Gap Hunt

Repository: ckorhonen/swe-skills · Skill: swe:observability-gap-hunt · File: skills/observability-gap-hunt/SKILL.md

To install, clone the repository:

```sh
git clone https://github.com/ckorhonen/swe-skills
```

Or copy the skill into your local skills directory in one step:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/ckorhonen/swe-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/observability-gap-hunt" ~/.claude/skills/ckorhonen-swe-skills-swe-observability-gap-hunt && rm -rf "$T"
```
What This Skill Does
Use this skill to find observability blind spots in a repository and turn them into a small, reviewable backlog.
The job is to identify where a service or code path lacks enough signal to be operated confidently, then rank the smallest improvements that would materially improve detection, diagnosis, or alerting.
This is not a performance tuning skill. The focus is telemetry coverage and operational visibility.
When To Use
Use this skill when the user wants to:
- audit logs, metrics, traces, alerts, or dashboards for gaps
- check whether important workflows are observable enough to operate safely
- find missing deployment-linked telemetry or runbook coverage
- run a recurring observability review over time
Do Not Use
Do not use this skill for:
- live incident response or active root-cause analysis
- generic latency, throughput, or performance optimization
- broad application code review with no observability goal
- redesigning the entire monitoring stack
- replacing existing observability tooling without repository evidence
Inputs To Confirm
Confirm or infer:
- repository, service, or package scope
- whether the user wants a report-only pass or a small backlog of follow-up work
- which observability systems are available locally or in connected tools
- whether recent incidents, deploys, or operational pain points matter
- any no-touch areas or known noisy surfaces
If scope is unclear, narrow it to the smallest service or package set that still fits the request.
Tooling Stance
This skill is tool-agnostic.
Use the strongest available evidence sources, such as:
- application code and middleware
- logging helpers and structured log patterns
- metrics emitters, counters, timers, and labels
- tracing spans and context propagation
- alert rules and SLO definitions
- dashboard or panel configuration
- deploy manifests, release hooks, or health checks
- runbooks or operational docs tied to the service
If external telemetry systems are available, cross-check repo evidence against them. If they are not available, say so and stay grounded in what the repo shows.
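As a concrete illustration of the first few evidence types, the sketch below shows the kind of structured-logging helper whose presence (or absence) counts as signal. This is a hypothetical example, not code from any particular repository; the module, event, and field names are invented.

```python
# Hypothetical example of the kind of repo evidence to look for: a small
# structured-logging helper. Module, event, and field names are invented.
import json
import logging
import sys

logger = logging.getLogger("order_service")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)


def log_event(event: str, **fields) -> None:
    """Emit one JSON log line so fields stay machine-queryable."""
    logger.info(json.dumps({"event": event, **fields}))


# A call site like this is positive evidence of structured logging:
log_event("payment_failed", request_id="abc-123", provider="stripe", retryable=True)
```

The inverse pattern, bare print calls or free-text messages with no stable fields, is the kind of weak signal Step 3 treats as a gap.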
Parallelization Rule
Create one session per cleanly separated service, package, or deployment unit when the environment supports parallel work.
- run only on disjoint surfaces
- keep each session bounded to one service or package
- return raw evidence from each session, not just conclusions
- deduplicate and rank the final backlog centrally
If surfaces overlap heavily, keep the pass serial and smaller.
Instructions
Step 1: Identify The Audit Units
Split the repository into practical observability units, such as:
- services
- apps
- packages
- jobs or workers
- deployable components
Focus on units where missing telemetry would materially hurt detection or diagnosis.
Step 2: Map Critical Workflows And Failure Paths
For each unit, locate the paths that matter most operationally:
- request entry points
- background jobs
- retries and idempotency boundaries
- state transitions
- external integrations
- error handling and fallback paths
These are the places where missing telemetry is most costly.
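To make the cost concrete, here is a hedged sketch of a path this step would flag: a retry loop around an external integration that emits no telemetry at all. Every name here is invented, and call_payment_provider is a stand-in for a real integration call.

```python
# Hypothetical retry boundary with no telemetry: the kind of failure path
# this step flags. All names are invented; call_payment_provider stands in
# for a real external integration.
import time


def call_payment_provider(payload: dict) -> dict:
    """Stand-in for an external HTTP call that can fail transiently."""
    raise ConnectionError("provider unreachable")


def charge_with_retries(payload: dict, attempts: int = 3) -> dict:
    last_error = None
    for attempt in range(attempts):
        try:
            return call_payment_provider(payload)
        except ConnectionError as err:
            # Silent retry: no log, no counter, no span. An operator cannot
            # see how often this fires or which attempt finally succeeded.
            last_error = err
            time.sleep(2 ** attempt)
    raise last_error
```

Retry or fallback code like this, with no accompanying log, counter, or span, is a strong candidate blind spot.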
Step 3: Inspect Existing Observability Signals
Look for concrete evidence of:
- structured logs with useful fields
- metrics on the main success, latency, and failure paths
- traces or spans around important boundaries
- alerts or SLOs tied to the unit
- dashboards or panels that show the unit's health
- deploy or release checks that confirm the unit is live and healthy
Treat weak signal as a real gap if it would slow down diagnosis or hide regressions.
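For contrast, the sketch below shows roughly what adequate coverage around one boundary can look like, assuming a Prometheus-style metrics client and the OpenTelemetry tracing API are available; the skill itself is tool-agnostic, and every metric, span, and helper name here is hypothetical.

```python
# Illustrative "well-covered" boundary, assuming prometheus_client and the
# opentelemetry API are installed. Metric, span, and helper names are
# hypothetical, not taken from any repository under audit.
import json
import logging

from opentelemetry import trace
from prometheus_client import Counter, Histogram

logger = logging.getLogger("order_service")
tracer = trace.get_tracer(__name__)

CHARGES = Counter("charge_requests_total", "Charge attempts by outcome", ["outcome"])
LATENCY = Histogram("charge_latency_seconds", "Latency of charge calls")


def do_charge(payload: dict) -> dict:
    """Stand-in for the real payment integration."""
    return {"status": "ok"}


def charge(payload: dict) -> dict:
    with tracer.start_as_current_span("payments.charge"):  # trace the boundary
        with LATENCY.time():                               # latency on every path
            try:
                result = do_charge(payload)
                CHARGES.labels(outcome="success").inc()    # success signal
                return result
            except Exception:
                CHARGES.labels(outcome="failure").inc()    # failure signal
                logger.error(json.dumps({"event": "charge_failed"}))
                raise
```

Even with instrumentation like this in place, Step 3 still asks whether any alert or dashboard actually consumes these signals.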
Step 4: Rank Blind Spots By Operational Value
Prioritize gaps that would most improve:
- detection of failures
- diagnosis speed
- alert quality
- operational confidence after deploys
Prefer gaps that are:
- small enough to implement in a focused follow-up
- local to one service or package
- easy to verify with targeted checks
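One lightweight way to keep the ranking honest is to score each gap on the axes above. The sketch below is one possible structure, with invented fields and weights; it is an illustration, not a prescribed rubric.

```python
# One possible way to structure the ranking. The fields and the scoring
# formula are illustrative, not part of the skill's contract.
from dataclasses import dataclass


@dataclass
class Gap:
    surface: str
    detection_gain: int   # 1-5: how much faster would failures be detected?
    diagnosis_gain: int   # 1-5: how much faster could a cause be found?
    effort: int           # 1-5: how large is the smallest practical fix?

    @property
    def score(self) -> float:
        # Favor high operational value and small, local fixes.
        return (self.detection_gain + self.diagnosis_gain) / self.effort


gaps = [
    Gap("checkout retry loop", detection_gain=4, diagnosis_gain=5, effort=1),
    Gap("batch importer", detection_gain=2, diagnosis_gain=2, effort=4),
]
backlog = sorted(gaps, key=lambda g: g.score, reverse=True)
```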
Step 5: Propose A Tight Backlog
Turn the strongest gaps into ticket-shaped recommendations.
Each item should include:
- the surface or workflow affected
- the evidence that observability is weak or missing
- the specific telemetry gap
- why it matters operationally
- the smallest practical follow-up
- any validation or rollout notes
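For shape, a single backlog item might read like this (the service and gap are hypothetical, continuing the retry-loop example above):
- surface: checkout service, payment retry loop
- evidence: charge_with_retries retries silently, with no logs, metrics, or spans
- gap: no failure counter or per-attempt structured log on the external payment call
- why it matters: provider degradation stays invisible until users report it
- smallest fix: add a labeled failure counter and one structured log line per retry
- validation: force a failure in staging and confirm the counter and log appear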
Do not turn this into a redesign plan for the monitoring platform.
Step 6: Call Out Unknowns Clearly
If telemetry systems, dashboards, or alerts are not accessible, say that explicitly.
Distinguish between:
- directly observed gaps
- gaps inferred from repo evidence
- areas you could not verify
Step 7: Leave A Repeatable Next Pass
If the repo has many units, leave a short backlog for the next scheduled pass instead of trying to cover everything in one run.
Output Requirements
Provide a report with these sections:
- Scope audited
- Observability signals reviewed
- Ranked blind spots
- Proposed follow-up backlog
- Unknowns or limits
For each ranked gap, include:
- unit or surface
- evidence
- missing signal
- operational impact
- smallest recommended fix
- priority
If there are no material gaps, say so plainly and explain what coverage appears adequate.
Quality Bar
- Stay focused on telemetry coverage, not product performance tuning.
- Ground claims in concrete repo or tooling evidence.
- Prefer a small, high-value backlog over a broad monitoring wish list.
- Label unknowns and inferences honestly.
- Keep recommendations local, actionable, and easy to validate.