Skills aws-ecs-monitor

AWS ECS production health monitoring with CloudWatch log analysis — monitors ECS service health, ALB targets, SSL certificates, and provides deep CloudWatch log analysis for error categorization, restart detection, and production alerts.

install

source · Clone the upstream repo

git clone https://github.com/openclaw/skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/briancolinger/aws-ecs-monitor" ~/.claude/skills/clawdbot-skills-aws-ecs-monitor && rm -rf "$T"

manifest: skills/briancolinger/aws-ecs-monitor/SKILL.md

source content

AWS ECS Monitor

Production health monitoring and log analysis for AWS ECS services.

What It Does

Health Checks: HTTP probes against your domain, ECS service status (desired vs running), ALB target group health, SSL certificate expiry
Log Analysis: Pulls CloudWatch logs, categorizes errors (panics, fatals, OOM, timeouts, 5xx), detects container restarts, filters health check noise
Auto-Diagnosis: Reads health status and automatically investigates failing services via log analysis

Prerequisites

aws

CLI configured with appropriate IAM permissions:

```
ecs:ListServices
```
,
```
ecs:DescribeServices
```

elasticloadbalancing:DescribeTargetGroups

elasticloadbalancing:DescribeTargetHealth

logs:FilterLogEvents

logs:DescribeLogGroups

```
curl
```
for HTTP health checks
```
python3
```
for JSON processing and log analysis
```
openssl
```
for SSL certificate checks (optional)

Configuration

All configuration is via environment variables:

Variable	Required	Default	Description
`ECS_CLUSTER`	Yes	—	ECS cluster name
`ECS_REGION`	No	`us-east-1`	AWS region
`ECS_DOMAIN`	No	—	Domain for HTTP/SSL checks (skip if unset)
`ECS_SERVICES`	No	auto-detect	Comma-separated service names to monitor
`ECS_HEALTH_STATE`	No	`./data/ecs-health.json`	Path to write health state JSON
`ECS_HEALTH_OUTDIR`	No	`./data/`	Output directory for logs and alerts
`ECS_LOG_PATTERN`	No	`/ecs/{service}`	CloudWatch log group pattern ( `{service}` is replaced)
`ECS_HTTP_ENDPOINTS`	No	—	Comma-separated `name=url` pairs for HTTP probes

Directories Written

ECS_HEALTH_STATE
(default:
```
./data/ecs-health.json
```
) — Health state JSON file
ECS_HEALTH_OUTDIR
(default:
```
./data/
```
) — Output directory for logs, alerts, and analysis reports

Scripts

scripts/ecs-health.sh

— Health Monitor

# Full check
ECS_CLUSTER=my-cluster ECS_DOMAIN=example.com ./scripts/ecs-health.sh

# JSON output only
ECS_CLUSTER=my-cluster ./scripts/ecs-health.sh --json

# Quiet mode (no alerts, just status file)
ECS_CLUSTER=my-cluster ./scripts/ecs-health.sh --quiet

Exit codes:

= healthy,

= unhealthy/degraded,

= script error

scripts/cloudwatch-logs.sh

— Log Analyzer

# Pull raw logs from a service
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh pull my-api --minutes 30

# Show errors across all services
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh errors all --minutes 120

# Deep analysis with error categorization
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh diagnose --minutes 60

# Detect container restarts
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh restarts my-api

# Auto-diagnose from health state file
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh auto-diagnose

# Summary across all services
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh summary --minutes 120

Options:

--minutes N

(default: 60),

--json

--limit N

(default: 200),

--verbose

Auto-Detection

When

ECS_SERVICES

is not set, both scripts auto-detect services from the cluster:

aws ecs list-services --cluster $ECS_CLUSTER

Log groups are resolved by pattern (default

/ecs/{service}

). Override with

ECS_LOG_PATTERN

# If your log groups are /ecs/prod/my-api, /ecs/prod/my-frontend, etc.
ECS_LOG_PATTERN="/ecs/prod/{service}" ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh diagnose

Integration

The health monitor can trigger the log analyzer for auto-diagnosis when issues are detected. Set

ECS_HEALTH_OUTDIR

to a shared directory and run both scripts together:

export ECS_CLUSTER=my-cluster
export ECS_DOMAIN=example.com
export ECS_HEALTH_OUTDIR=./data

# Run health check (auto-triggers log analysis on failure)
./scripts/ecs-health.sh

# Or run log analysis independently
./scripts/cloudwatch-logs.sh auto-diagnose --minutes 30

Error Categories

The log analyzer classifies errors into:

```
panic
```
— Go panics
```
fatal
```
— Fatal errors
```
oom
```
— Out of memory
```
timeout
```
— Connection/request timeouts
```
connection_error
```
— Connection refused/reset
```
http_5xx
```
— HTTP 500-level responses
```
python_traceback
```
— Python tracebacks
```
exception
```
— Generic exceptions
```
auth_error
```
— Permission/authorization failures
```
structured_error
```
— JSON-structured error logs
```
error
```
— Generic ERROR-level messages

Health check noise (GET/HEAD

/health

from ALB) is automatically filtered from error counts and HTTP status distribution.