Awesome-omni-skill monitor
Observe application health and gather feedback after deployment. Use to validate success criteria, collect metrics, and feed issues back to the backlog. Closes the feedback loop.
Install
Source: clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code: install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/monitor" ~/.claude/skills/diegosouzapw-awesome-omni-skill-monitor-396baa && rm -rf "$T"
Manifest: skills/devops/monitor/SKILL.md
Monitor
Observe deployed application health and gather feedback to inform future development.
Purpose
Monitoring completes the feedback loop:
- Validate success criteria from intent
- Collect performance and health metrics
- Gather user feedback
- Identify issues for the backlog
- Ensure deployed features work as expected
Input
Expect from orchestrator:
- Deployed application URL/endpoints
- Success criteria from intent
- Metrics to track
- Monitoring duration/frequency
Process
1. Health monitoring
Check application health continuously:
```yaml
health_checks:
  - name: "API availability"
    endpoint: "/health"
    method: "GET"
    expected_status: 200
    interval: "60s"
    timeout: "5s"
  - name: "Database connectivity"
    endpoint: "/health/db"
    method: "GET"
    expected_status: 200
    interval: "60s"
  - name: "External dependencies"
    endpoint: "/health/deps"
    method: "GET"
    expected_status: 200
    interval: "300s"
```
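Under these definitions, a probe passes only when the response status matches `expected_status` exactly. A minimal sketch in Python (the `HealthCheck` type and `evaluate` helper are illustrative, not part of the skill):

```python
from dataclasses import dataclass

@dataclass
class HealthCheck:
    name: str
    endpoint: str
    expected_status: int
    interval_s: int  # how often to probe, in seconds

def evaluate(check: HealthCheck, actual_status: int) -> bool:
    # A probe passes only on an exact status match; a 3xx or 5xx
    # response counts as a failure even if the endpoint responded.
    return actual_status == check.expected_status

api = HealthCheck("API availability", "/health", 200, 60)
```

How the probe is actually performed (HTTP client, timeout handling) is left to the surrounding tooling.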
2. Performance metrics
Collect performance data:
```yaml
metrics:
  - name: "response_time_p50"
    source: "apm"
    threshold: 200   # ms
    alert_above: 500
  - name: "response_time_p99"
    source: "apm"
    threshold: 1000  # ms
    alert_above: 2000
  - name: "error_rate"
    source: "logs"
    threshold: 0.01  # 1%
    alert_above: 0.05
  - name: "throughput"
    source: "apm"
    baseline: 100    # req/s
    alert_below: 50
  - name: "cpu_usage"
    source: "infrastructure"
    threshold: 70    # %
    alert_above: 90
  - name: "memory_usage"
    source: "infrastructure"
    threshold: 80    # %
    alert_above: 95
```
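Each metric carries a soft `threshold` and a hard `alert_above` bound, which implies three bands: ok, warn, alert. A hypothetical classifier making that semantics explicit (the `classify` helper is an assumption, not part of the skill):

```python
def classify(value: float, threshold: float, alert_above: float) -> str:
    # Mirrors the metric config above: below the soft threshold is "ok",
    # between threshold and alert_above is "warn", at or past the hard
    # bound is "alert".
    if value >= alert_above:
        return "alert"
    if value >= threshold:
        return "warn"
    return "ok"
```

For metrics with `alert_below` semantics (such as throughput), the comparison is simply inverted.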
3. Success criteria validation
Verify intent success criteria are met:
```yaml
success_validation:
  - criterion: "Users can register successfully"
    metric: "registration_success_rate"
    target: ">= 99%"
    actual: "99.5%"
    status: "pass"
  - criterion: "Page loads under 2 seconds"
    metric: "page_load_time_p95"
    target: "< 2000ms"
    actual: "1850ms"
    status: "pass"
  - criterion: "Zero critical errors in first 24h"
    metric: "critical_error_count"
    target: "= 0"
    actual: "0"
    status: "pass"
```
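Target strings of this shape can be checked mechanically. A sketch, assuming units (`%`, `ms`) are stripped before comparison; the `OPS` table and `criterion_met` helper are hypothetical:

```python
import operator

# Map the comparison tokens used in target strings to Python operators.
OPS = {">=": operator.ge, "<=": operator.le, "<": operator.lt,
       ">": operator.gt, "=": operator.eq}

def criterion_met(target: str, actual: float) -> bool:
    # Targets look like ">= 99" or "< 2000" (units stripped beforehand):
    # operator token, a space, then the numeric bound.
    op, _, value = target.partition(" ")
    return OPS[op](actual, float(value))
```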
4. Error tracking
Monitor for errors and exceptions:
```yaml
error_tracking:
  sources:
    - "application_logs"
    - "error_reporting_service"  # Sentry, Bugsnag, etc.
    - "browser_console"
  categories:
    - severity: "critical"
      pattern: "uncaught exception|fatal error|500"
      action: "immediate_alert"
    - severity: "error"
      pattern: "error|failed|exception"
      action: "aggregate_and_review"
    - severity: "warning"
      pattern: "warning|deprecated|slow"
      action: "log_for_review"
```
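Note that the categories overlap (a line matching "fatal error" also matches "error"), so order matters: the first matching pattern wins. A minimal Python sketch of that ordering, with the patterns copied from the config above (the `severity` helper itself is illustrative):

```python
import re

# Patterns from the error_tracking config, most severe first.
CATEGORIES = [
    ("critical", re.compile(r"uncaught exception|fatal error|500", re.I)),
    ("error", re.compile(r"error|failed|exception", re.I)),
    ("warning", re.compile(r"warning|deprecated|slow", re.I)),
]

def severity(line: str) -> str:
    # First matching pattern wins, so "critical" is tried before "error".
    for name, pattern in CATEGORIES:
        if pattern.search(line):
            return name
    return "info"
```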
5. User feedback collection
Gather feedback from various sources:
```yaml
feedback_sources:
  - source: "support_tickets"
    integration: "zendesk"
    filter: "tag:new-feature"
  - source: "user_surveys"
    integration: "typeform"
    trigger: "post_feature_use"
  - source: "analytics_events"
    integration: "mixpanel"
    events: ["feature_used", "feature_abandoned", "error_encountered"]
  - source: "social_mentions"
    integration: "twitter_api"
    keywords: ["@ourapp", "ourapp bug"]
```
6. Analyse and categorise
Process collected data:
```yaml
analysis:
  - type: "error_pattern"
    description: "Timeout errors spike during peak hours"
    frequency: "12 occurrences/hour"
    impact: "medium"
    recommendation: "Investigate database connection pooling"
  - type: "user_friction"
    description: "Users abandoning registration at email step"
    metric: "30% drop-off"
    impact: "high"
    recommendation: "Review email validation UX"
  - type: "performance_regression"
    description: "API response time increased 40% after deploy"
    metric: "p99: 800ms → 1120ms"
    impact: "medium"
    recommendation: "Profile recent changes"
```
7. Generate backlog items
Create stories from feedback:
```yaml
backlog_additions:
  - id: "US-015"
    title: "Optimise database queries for peak load"
    source: "monitor:error_pattern"
    description: |
      As a user,
      I want fast response times during peak hours,
      So that I can complete tasks without delays.
    priority: 2
    evidence: "Timeout errors during 9-11am, 2-4pm"
  - id: "US-016"
    title: "Improve email validation feedback"
    source: "monitor:user_friction"
    description: |
      As a new user,
      I want clear feedback when my email is invalid,
      So that I can correct it and complete registration.
    priority: 1
    evidence: "30% drop-off at email step, support tickets about 'email not accepted'"
```
Output
```yaml
monitor:
  period:
    start: "2024-01-15T10:00:00Z"
    end: "2024-01-16T10:00:00Z"
    duration: "24h"
  status: "healthy | degraded | unhealthy"
  health:
    uptime: "99.95%"
    incidents: 0
    health_checks:
      passed: 1440
      failed: 0
  performance:
    response_time_p50: "145ms"
    response_time_p99: "890ms"
    error_rate: "0.02%"
    throughput: "125 req/s"
  success_criteria:
    met: 3
    total: 3
    details:
      - criterion: "Users can register"
        status: "pass"
      - criterion: "Page loads < 2s"
        status: "pass"
      - criterion: "Zero critical errors"
        status: "pass"
  errors:
    critical: 0
    error: 12
    warning: 45
    top_errors:
      - message: "Connection timeout"
        count: 8
        first_seen: "2024-01-15T14:23:00Z"
  feedback:
    support_tickets: 3
    user_surveys: 15
    nps_score: 42
    sentiment: "positive"
  backlog_additions:
    - id: "US-015"
      title: "Optimise database queries"
      priority: 2
    - id: "US-016"
      title: "Improve email validation"
      priority: 1
  recommendations:
    - "Monitor database connection pool during peak hours"
    - "Consider adding email format hints to registration form"
    - "Review timeout thresholds for external API calls"
```
Update manifest:
```yaml
releases:
  - version: "1.2.3"
    monitoring:
      status: "healthy"
      period: "2024-01-15 to 2024-01-16"
      success_criteria_met: true
backlog:
  # New items added from monitoring
  - id: "US-015"
    source: "monitor"
  # ...
```
Monitoring duration
```yaml
monitoring_schedule:
  post_deploy:
    duration: "24h"
    intensity: "high"
    checks_interval: "1m"
  steady_state:
    duration: "ongoing"
    intensity: "normal"
    checks_interval: "5m"
  alerts:
    critical: "immediate"
    error: "within 15m"
    warning: "daily digest"
```
Alert escalation
```yaml
escalation:
  level_1:
    condition: "health_check_failed"
    wait: "5m"
    action: "slack_notification"
  level_2:
    condition: "health_check_failed for 15m"
    action: "pagerduty_alert"
  level_3:
    condition: "health_check_failed for 30m"
    action: "phone_call_oncall"
```
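The three levels reduce to a mapping from failure duration to action. A hypothetical helper sketching that mapping (the `escalation_action` name and the `"none"` sentinel are assumptions):

```python
def escalation_action(minutes_failing: int) -> str:
    # Mirrors the escalation config above: the most severe level whose
    # duration condition is met wins.
    if minutes_failing >= 30:
        return "phone_call_oncall"
    if minutes_failing >= 15:
        return "pagerduty_alert"
    if minutes_failing >= 5:
        return "slack_notification"
    return "none"
```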
Tips
- Set baselines before comparing metrics
- Alert on symptoms, not just causes
- Avoid alert fatigue with appropriate thresholds
- Correlate metrics across services
- Keep monitoring configs in version control
- Review and tune thresholds regularly
- Document what "normal" looks like