Claude-skill-registry alerting

Real-time alerting and notification system for Univers infrastructure. Use this when you need to monitor system health and service status, and to send proactive alerts when thresholds are exceeded or services fail.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/alerting" ~/.claude/skills/majiayu000-claude-skill-registry-alerting && rm -rf "$T"

manifest: skills/data/alerting/SKILL.md

Alerting Skill

This skill provides comprehensive monitoring and alerting capabilities for the Univers infrastructure ecosystem.

Capabilities

1. Real-time Monitoring

System resource monitoring (CPU, Memory, Disk, Network)
Service health checks (HTTP endpoints, ports, processes)
Application-specific metrics (response times, error rates)
Custom metric collection and aggregation

2. Alert Engine

Threshold-based alerting
Rate limiting and alert suppression
Alert escalation policies
Multi-condition alert rules
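
Rate limiting and suppression can be sketched as a per-alert cooldown. The example below is a minimal illustration, not the skill's actual engine: the class name `AlertSuppressor` and the `cooldown_seconds` parameter are assumptions made for this sketch.

```python
import time
from typing import Dict, Optional


class AlertSuppressor:
    """Suppress repeats of the same alert inside a cooldown window (hypothetical helper)."""

    def __init__(self, cooldown_seconds: float) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: Dict[str, float] = {}

    def should_send(self, alert_key: str, now: Optional[float] = None) -> bool:
        """Return True if no alert with this key was sent during the cooldown."""
        now = time.monotonic() if now is None else now
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown window: suppress
        self._last_sent[alert_key] = now
        return True
```

Keying the cooldown on the rule name means a noisy `cpu-high` rule cannot delay an unrelated `memory-critical` notification.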

3. Notification Channels

Email notifications with rich formatting
Slack/Teams integration with actionable messages
Webhook support for custom integrations
In-app notifications and banners

4. Alert Management

Alert acknowledgment and resolution
Alert history and analytics
Scheduled maintenance windows
Alert rule testing and validation
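
A scheduled maintenance window can be thought of as a time interval during which notifications are silenced. The sketch below assumes half-open windows (start inclusive, end exclusive); `MaintenanceWindow` and `is_silenced` are illustrative names, not part of the skill.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class MaintenanceWindow:
    """A half-open interval [start, end) during which alerts are silenced."""

    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        return self.start <= moment < self.end


def is_silenced(moment: datetime, windows: Iterable[MaintenanceWindow]) -> bool:
    """Return True if any maintenance window covers the given moment."""
    return any(window.covers(moment) for window in windows)
```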

5. Dashboards and Reports

Real-time alert status dashboard
Historical alert trends and analytics
Service health overview
Performance metrics visualization

Common Tasks

Basic Alert Setup

# Check system for alert conditions
alert check system

# Monitor specific services
alert monitor services

# Test notification channels
alert test channels

Alert Rule Management

# List all alert rules
alert rules list

# Add new alert rule
alert rules add cpu-high --threshold 80 --duration 5m

# Update existing rule
alert rules update memory-usage --threshold 90

# Remove alert rule
alert rules remove disk-space-low

Notification Configuration

# Configure email notifications
alert config email --smtp smtp.example.com --from alerts@example.com

# Configure Slack integration
alert config slack --webhook https://hooks.slack.com/... --channel '#alerts'

# Test notification delivery
alert test email --to admin@example.com
alert test slack --message "Test alert"
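
For readers who want to see what a Slack test message amounts to on the wire, here is a rough sketch using only the Python standard library. Slack incoming webhooks accept a JSON body with a `text` field. How the `alert` CLI builds its request is not documented here, so the function names below are assumptions.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url: str, message: str, timeout: float = 5.0) -> int:
    """Send the message and return the HTTP status code of the response."""
    request = build_slack_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending lets you inspect the payload without touching the network.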

Alert Operations

# View active alerts
alert status

# Acknowledge an alert
alert acknowledge CPU_HIGH_001

# Resolve an alert
alert resolve MEMORY_HIGH_003

# View alert history
alert history --last 24h
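
The acknowledge and resolve operations suggest a small alert lifecycle. The state names and allowed transitions below are an assumption for illustration; the skill does not document its internal states.

```python
from enum import Enum


class AlertState(Enum):
    ACTIVE = "active"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# Assumed lifecycle: an alert may be resolved directly or after acknowledgment.
_ALLOWED = {
    AlertState.ACTIVE: {AlertState.ACKNOWLEDGED, AlertState.RESOLVED},
    AlertState.ACKNOWLEDGED: {AlertState.RESOLVED},
    AlertState.RESOLVED: set(),
}


def transition(current: AlertState, target: AlertState) -> AlertState:
    """Return the new state, or raise ValueError for a disallowed transition."""
    if target not in _ALLOWED[current]:
        raise ValueError(f"cannot move alert from {current.value} to {target.value}")
    return target
```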

Alert Rule Examples

System Resource Alerts

# High CPU Usage
name: cpu-high
condition: cpu_usage > 80
duration: 5m
severity: warning
message: "CPU usage is {{cpu_usage}}% on {{hostname}}"
actions:
  - type: email
    to: ops@example.com
  - type: slack
    channel: "#alerts"

# Critical Memory Usage
name: memory-critical
condition: memory_usage > 90
duration: 2m
severity: critical
message: "Critical memory usage: {{memory_usage}}%"
actions:
  - type: webhook
    url: https://api.pagerduty.com/incidents
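
The `duration` field implies that a condition must hold continuously before the alert fires. One way this could work is sketched below, assuming durations use the `s`, `m`, and `h` suffixes seen in this document. `parse_duration` and `SustainedThreshold` are hypothetical names, and the real engine may behave differently.

```python
import re
from typing import Optional

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings such as '30s', '5m', or '24h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


class SustainedThreshold:
    """Fire only after the value has stayed above the threshold for `duration`."""

    def __init__(self, threshold: float, duration: str) -> None:
        self.threshold = threshold
        self.hold_seconds = parse_duration(duration)
        self._breach_started: Optional[float] = None

    def observe(self, value: float, timestamp: float) -> bool:
        """Record one sample (timestamp in seconds); return True if the alert should fire."""
        if value <= self.threshold:
            self._breach_started = None  # condition cleared: reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = timestamp  # first sample of this breach
        return timestamp - self._breach_started >= self.hold_seconds
```

Resetting the timer on any sample at or below the threshold keeps short spikes from triggering a `cpu-high` style rule.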

Service Health Alerts

# Service Down
name: service-down
condition: service_health == 0
duration: 1m
severity: critical
message: "{{service_name}} is down on {{hostname}}"
actions:
  - type: email
    to: devops@example.com
  - type: restart
    service: "{{service_name}}"

# High Response Time
name: slow-response
condition: response_time > 2000
duration: 3m
severity: warning
message: "{{service_name}} response time: {{response_time}}ms"
actions:
  - type: slack
    channel: "#performance"
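
The `{{name}}` placeholders in `message` can be filled in with a simple substitution pass. This is a minimal sketch under the assumption that placeholders are plain identifiers. `render_message` is an illustrative helper, and the skill's own template syntax may support more than this.

```python
import re
from typing import Mapping

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_message(template: str, context: Mapping[str, object]) -> str:
    """Replace each {{name}} with context[name]; leave unknown names untouched."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        return str(context[name]) if name in context else match.group(0)

    return _PLACEHOLDER.sub(substitute, template)
```

Leaving unknown placeholders in place, rather than raising, makes a misspelled variable visible in the delivered message instead of dropping the alert.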

Application-Specific Alerts

# High Error Rate
name: high-error-rate
condition: error_rate > 5
duration: 5m
severity: warning
message: "{{application}} error rate: {{error_rate}}%"
actions:
  - type: email
    to: dev-team@example.com

# Database Connection Issues
name: db-connection-failed
condition: db_connection_status != "healthy"
duration: 30s
severity: critical
message: "Database connection failed for {{application}}"
actions:
  - type: webhook
    url: https://hooks.slack.com/...

Integration Examples

Univers Services Integration

# Monitor Univers services
alert monitor univers-services

# Check specific Univers endpoints
alert check endpoint http://localhost:3003/health --service univers-server
alert check endpoint http://localhost:6007 --service univers-ui
alert check endpoint http://localhost:5173 --service univers-web

# Monitor tmux sessions
alert monitor tmux-sessions --alert-if-missing univers-developer
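
An endpoint check of this kind boils down to an HTTP GET plus a timing measurement. The sketch below uses only the standard library and treats HTTP 200 as healthy. `check_endpoint` and `HealthResult` are assumed names, and the real `alert check endpoint` command may apply different criteria.

```python
import time
import urllib.error
import urllib.request
from typing import NamedTuple, Optional


class HealthResult(NamedTuple):
    healthy: bool
    status: Optional[int]  # None when no HTTP response was received
    elapsed_ms: float


def check_endpoint(url: str, timeout: float = 3.0) -> HealthResult:
    """GET the URL once and report status code and elapsed time."""
    started = time.monotonic()
    status: Optional[int] = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = None  # connection refused, DNS failure, or timeout
    elapsed_ms = (time.monotonic() - started) * 1000.0
    return HealthResult(status == 200, status, elapsed_ms)
```

For example, `check_endpoint("http://localhost:3003/health")` would return an unhealthy result with `status=None` if nothing is listening on that port.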

Container Integration

# Monitor Docker containers
alert monitor containers --include 'univers-*'

# Check container health
alert check container univers-server
alert check container univers-ui

Configuration Files

Alert Rules Configuration

# ~/.config/univers/alerting/rules.yaml
rules:
  - name: system-cpu-high
    type: system
    metric: cpu_usage
    operator: ">"
    threshold: 80
    duration: 5m
    severity: warning

  - name: service-unavailable
    type: service
    check: http_status
    target: "http://localhost:3003/health"
    operator: "!="
    threshold: 200
    duration: 1m
    severity: critical
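
The `operator` and `threshold` fields map naturally onto comparison functions. The sketch below evaluates a rule that has already been loaded into a dictionary. YAML parsing is left out to stay within the standard library, and `rule_breached` is a hypothetical helper rather than the skill's API.

```python
import operator
from typing import Any, Callable, Dict, Mapping

_OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def rule_breached(rule: Mapping[str, Any], observed: Any) -> bool:
    """Return True if `observed <operator> threshold` holds for this rule."""
    try:
        compare = _OPERATORS[rule["operator"]]
    except KeyError:
        raise ValueError(f"unsupported operator: {rule.get('operator')!r}") from None
    return compare(observed, rule["threshold"])
```

With the two rules above, a CPU reading of 85 breaches `system-cpu-high`, and an HTTP status of 503 breaches `service-unavailable`.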

Notification Channels

# ~/.config/univers/alerting/channels.yaml
channels:
  email:
    smtp_host: smtp.gmail.com
    smtp_port: 587
    username: alerts@company.com
    password: ${SMTP_PASSWORD}

  slack:
    webhook_url: ${SLACK_WEBHOOK_URL}
    default_channel: "#univers-alerts"

  webhook:
    endpoint: https://api.example.com/alerts
    headers:
      Authorization: "Bearer ${API_TOKEN}"
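
The `${NAME}` placeholders keep secrets out of the file and are presumably resolved from the environment at load time. A possible expansion step is sketched below. `expand_env` is an assumed name, and failing loudly on a missing variable is a design choice of this sketch, not documented behavior.

```python
import os
import re
from typing import Any, Mapping

_VAR = re.compile(r"\$\{(\w+)\}")


def expand_env(value: Any, env: Mapping[str, str] = os.environ) -> Any:
    """Recursively replace ${NAME} in strings; raise KeyError if NAME is unset."""
    if isinstance(value, str):

        def lookup(match: "re.Match[str]") -> str:
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]

        return _VAR.sub(lookup, value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value  # numbers, booleans, and None pass through unchanged
```

Raising on an unset variable surfaces a missing `SMTP_PASSWORD` at startup rather than as a failed delivery later.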

Best Practices

Set Meaningful Thresholds: Avoid alert fatigue by choosing thresholds that indicate a real problem rather than normal load
Use Escalation Policies: Implement graduated alert escalation
Provide Context: Include relevant details in alert messages
Test Regularly: Verify alert rules and notification channels
Document Procedures: Maintain clear runbooks for common alerts
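
Graduated escalation can be expressed as a list of delays and channels. The steps and timings below are invented for illustration only, and `channels_due` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

# (minutes unacknowledged, channel to notify); example values, not defaults of the skill
ESCALATION_STEPS: Sequence[Tuple[float, str]] = (
    (0, "slack"),
    (15, "email"),
    (30, "webhook"),
)


def channels_due(
    minutes_unacknowledged: float,
    steps: Sequence[Tuple[float, str]] = ESCALATION_STEPS,
) -> List[str]:
    """Return every channel whose delay has elapsed for an unacknowledged alert."""
    return [channel for delay, channel in steps if minutes_unacknowledged >= delay]
```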

Troubleshooting

Common Issues

Missing Notifications: Check channel configurations and connectivity
False Positives: Review alert thresholds and conditions
Alert Storms: Implement rate limiting and suppression rules
Slow Performance: Optimize alert check intervals and data collection

Debug Commands

# Check alert engine status
alert status --verbose

# Test specific rule
alert test-rule cpu-high

# Check notification delivery
alert test-notification email --to test@example.com

# View alert engine logs
alert logs --tail 100

Version History

v1.0 (2025-12-16): Initial alerting system implementation with basic monitoring, email notifications, and alert rules