# cloudwatch
Debug production issues and monitor AWS infrastructure via CloudWatch. Use when the user reports errors, wants to investigate production behavior, check logs, debug OAuth, API errors, ECS tasks, database issues, WAF blocks, or any production incident. Also use when the user says "check logs", "what's failing", "why is X broken", "system status", "error report", "check alarms", or mentions CloudWatch, log groups, or alarms. Supports proactive commands: status (health check), report (error summary), alarms (alarm states), diff (error rate comparison). Run `/cloudwatch configure` to auto-discover your AWS infrastructure on first use.
Install by cloning the repository, or clone it straight into your skills directory:

```bash
git clone https://github.com/torrresagus/cloudwatch-debugger-skill
# or:
git clone --depth=1 https://github.com/torrresagus/cloudwatch-debugger-skill ~/.claude/skills/torrresagus-cloudwatch-debugger-skill-cloudwatch
```

The skill definition (`SKILL.md`) follows.

# CloudWatch Log Debugger
Query, filter, and analyze AWS CloudWatch logs for production debugging. Auto-configures to any AWS environment.
## Current State

- Current timestamp (epoch seconds): !`date +%s`
- Current time (human-readable): !`date '+%Y-%m-%d %H:%M:%S %Z'`
## First-Time Setup

If `config.json` does not exist in this skill's directory, tell the user:

> This skill needs to discover your AWS infrastructure first. Run `/cloudwatch configure` or let me auto-configure now.

Then read and follow the instructions in `scripts/configure.sh` to generate `config.json`.
## Configuration

Read `config.json` from this skill's directory for all environment-specific values. The config contains:

- `aws_cli` — path to the AWS CLI binary (e.g., `aws` or `/snap/bin/aws`)
- `region` — AWS region
- `log_groups` — discovered log groups with their purpose and stream prefixes
- `default_log_group` — which log group to query when the user doesn't specify one
- `ecs_clusters` — ECS clusters, if any
- `alarms` — CloudWatch alarms, if any
- `output_dir` — where to save log files (default: `logs/`)

Use these values in all commands instead of hardcoded strings.
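For reference, a generated `config.json` might look like this (a sketch: the top-level keys come from the list above, but the nested shape of each entry and all values are illustrative):

```json
{
  "aws_cli": "/snap/bin/aws",
  "region": "us-east-1",
  "log_groups": [
    {
      "name": "/ecs/app-backend",
      "purpose": "backend application logs",
      "stream_prefix": "ecs/app-backend",
      "priority": 1
    }
  ],
  "default_log_group": "/ecs/app-backend",
  "ecs_clusters": [
    { "cluster": "my-app-cluster", "services": ["my-app-web"] }
  ],
  "alarms": ["my-app-ECS-CPU-High"],
  "output_dir": "logs/"
}
```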
## Command Dispatch

Parse `$ARGUMENTS` to determine which command to run:

| If it starts with... | Action |
|---|---|
| `configure` | Run configuration (see First-Time Setup) |
| `status` | Jump to Status Check below |
| `report` | Jump to Report below (remaining args = time range) |
| `alarms` | Jump to Alarms below |
| `diff` | Jump to Error Rate Comparison below (remaining args = time windows) |
| anything else | Jump to Workflow below (reactive debugging) |
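For example:

```
/cloudwatch configure
/cloudwatch status
/cloudwatch report last 24 hours
/cloudwatch diff last 30m vs 2h ago
/cloudwatch 500 errors in the last hour
```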
## Status Check

Quick health dashboard. No arguments needed.

Read `config.json`, then run these queries:
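One way to load the config values into shell variables (a sketch, assuming `jq` is available):

```bash
AWS_CLI=$(jq -r '.aws_cli' config.json)
REGION=$(jq -r '.region' config.json)
OUTPUT_DIR=$(jq -r '.output_dir' config.json)
```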
### 1. Error Count (last 30 min)

Run a Logs Insights query against app log groups (priority <= 2). Use `--log-group-names` to batch:

```bash
QUERY_ID=$($AWS_CLI logs start-query \
  --log-group-names "$LOG_GROUP_1" "$LOG_GROUP_2" \
  --start-time $(date -d '30 minutes ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR|Exception|FATAL/ | stats count() as error_count by @logStream' \
  --region $REGION --output text --query 'queryId')
```
Then `sleep 3`, then `get-query-results`.
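For example (as two separate commands, never chained, per the note in Step 2 of the Workflow):

```bash
sleep 3
```

```bash
$AWS_CLI logs get-query-results \
  --query-id "$QUERY_ID" \
  --region $REGION --output json
```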
### 2. Alarm States (live)

Fetch current alarm states from the API — do NOT use cached values from config:

```bash
$AWS_CLI cloudwatch describe-alarms \
  --region $REGION --output json \
  --query 'MetricAlarms[].{name:AlarmName,state:StateValue,metric:MetricName,namespace:Namespace,threshold:Threshold}'
```
### 3. ECS Service Health

For each cluster/service in `config.ecs_clusters`:

```bash
$AWS_CLI ecs describe-services \
  --cluster $CLUSTER \
  --services $SERVICE_ARN \
  --region $REGION --output json \
  --query 'services[].{name:serviceName,desired:desiredCount,running:runningCount,pending:pendingCount}'
```
### 4. Recently Stopped Tasks

```bash
$AWS_CLI ecs list-tasks --cluster $CLUSTER --desired-status STOPPED --region $REGION --output json
```

If any stopped tasks exist, describe them for crash reasons:

```bash
$AWS_CLI ecs describe-tasks --cluster $CLUSTER --tasks $TASK_ARNS \
  --region $REGION --output json \
  --query 'tasks[].{taskArn:taskArn,stoppedReason:stoppedReason,stopCode:stopCode,stoppedAt:stoppedAt,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}'
```
### 5. CPU/Memory Utilization

```bash
$AWS_CLI cloudwatch get-metric-statistics \
  --namespace AWS/ECS --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=$CLUSTER \
  --start-time $(date -d '30 minutes ago' -u +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --statistics Average Maximum \
  --region $REGION --output json
```

Same for `MemoryUtilization`.
### Output Format

Present as a dashboard summary:

```
## System Status (as of YYYY-MM-DD HH:MM:SS)

### Errors (last 30 min)
- app-backend: 12 errors
- app-frontend: 0 errors

### Alarms
- OK: my-app-ECS-CPU-High (CPUUtilization < 80)
- **ALARM: my-app-ApplicationErrors-High** (ErrorCount > 50)

### ECS Services
- my-app-web: 2/2 running, 0 pending
- my-app-worker: 1/1 running, 0 pending

### Resource Utilization (30-min avg)
- CPU: 45% avg, 62% max
- Memory: 71% avg, 78% max
```

Save to `$OUTPUT_DIR/YYYYMMDD_HHMMSS_status.txt`.
## Report

Periodic summary over a configurable time range. Parse the time range from the remaining arguments after `report` (e.g., `last 24 hours`, `last 6h`, `today`). Default: last 1 hour.
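A sketch of the conversion, using the same GNU `date` idiom as the queries in this file:

```bash
# "last 6h" -> epoch bounds for start-query
START=$(date -d '6 hours ago' +%s)
END=$(date +%s)

# "today" -> since local midnight
START=$(date -d 'today 00:00' +%s)
```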
Run these Logs Insights queries against app log groups:
### 1. Top Errors

```
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message '"message": "*"' as error_msg
| stats count() as occurrences by error_msg
| sort occurrences desc
| limit 10
```
### 2. Error Trend

```
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errors by bin(5m)
| sort @timestamp asc
```

For time ranges > 6 hours, use `bin(30m)` instead of `bin(5m)`.
### 3. P95 Latency

```
fields @timestamp, @message
| filter @message like /request completed|duration/
| parse @message '"duration": *,' as duration_ms
| stats avg(duration_ms) as avg_ms, max(duration_ms) as max_ms, pct(duration_ms, 95) as p95_ms by bin(5m)
| sort @timestamp asc
```
### 4. Most Affected Endpoints

```
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message '"path": "*"' as endpoint
| stats count() as errors by endpoint
| sort errors desc
| limit 10
```
### Output Format

```
## Report: Last 1 Hour (HH:MM - HH:MM)

### Top Errors
| # | Error | Count |
|---|-------|-------|
| 1 | ConnectionRefused: DB pool exhausted | 23 |
| 2 | TokenExpiredError | 8 |

### Error Trend (5-min bins)
HH:00 ██████████ 23
HH:05 ████ 8
HH:10 ██ 4
...

### Latency
- Average: 120ms
- P95: 450ms
- Max: 2300ms

### Most Affected Endpoints
| Endpoint | Errors |
|----------|--------|
| /api/auth/callback | 15 |
| /api/users/profile | 8 |
```

Save to `$OUTPUT_DIR/YYYYMMDD_HHMMSS_report.txt`.
## Alarms
List all CloudWatch alarms with their current state.
### 1. Fetch Live Alarm Data

```bash
$AWS_CLI cloudwatch describe-alarms \
  --region $REGION --output json
```
### 2. Present Grouped by State
Group alarms by state. Show ALARM state first (highlighted), then OK, then INSUFFICIENT_DATA.
For each alarm, show:
- Alarm name
- Metric and namespace
- Threshold and comparison operator
- Evaluation periods and period length
- State reason (for alarms not in OK state)
### 3. Map to Log Groups

Map alarm namespaces to log group categories for investigation suggestions:

- `AWS/ApplicationELB` → ecs-app → suggest `/cloudwatch 500 errors`
- `AWS/ECS` → container-insights → suggest `/cloudwatch ECS task crashes`
- `AWS/RDS` → rds → suggest `/cloudwatch database errors`
- `AWS/Lambda` → lambda → suggest `/cloudwatch lambda errors`
### Output Format

```
## CloudWatch Alarms

### ALARM (1)
- **my-app-ApplicationErrors-High**
  Metric: AWS/ApplicationELB > ErrorCount
  Condition: ErrorCount > 50 for 1 period(s) of 300s
  Reason: Threshold crossed...
  → Investigate: /cloudwatch 500 errors in the last hour

### OK (2)
- my-app-ECS-CPU-High
  Metric: AWS/ECS > CPUUtilization
  Condition: CPUUtilization > 80 for 2 period(s) of 300s

### INSUFFICIENT_DATA (0)
None.
```

Save to `$OUTPUT_DIR/YYYYMMDD_HHMMSS_alarms.txt`.
## Error Rate Comparison
Compare error rates between two time windows to detect regressions or confirm fixes.
### 1. Parse Time Windows

From the remaining arguments after `diff`. Defaults:

- Window A (current): last 30 minutes
- Window B (baseline): 30–60 minutes ago

Support natural language like:

- `last 1h vs yesterday same time`
- `last 30m vs 2h ago`
- `post-deploy vs pre-deploy` (user should provide timestamps)
### 2. Run Error Count for Both Windows

Use `--log-group-names` to batch app log groups into one query per window:

```
fields @timestamp, @message
| filter @message like /ERROR|Exception|FATAL/
| stats count() as error_count
```
Run this query twice: once with Window A timestamps, once with Window B timestamps.
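A sketch of the two runs with the default windows (the `QUERY_A`/`QUERY_B` variable names are illustrative):

```bash
# Window A (current): last 30 minutes
QUERY_A=$($AWS_CLI logs start-query \
  --log-group-names "$LOG_GROUP_1" "$LOG_GROUP_2" \
  --start-time $(date -d '30 minutes ago' +%s) --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR|Exception|FATAL/ | stats count() as error_count' \
  --region $REGION --output text --query 'queryId')

# Window B (baseline): 30-60 minutes ago
QUERY_B=$($AWS_CLI logs start-query \
  --log-group-names "$LOG_GROUP_1" "$LOG_GROUP_2" \
  --start-time $(date -d '60 minutes ago' +%s) --end-time $(date -d '30 minutes ago' +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR|Exception|FATAL/ | stats count() as error_count' \
  --region $REGION --output text --query 'queryId')
```

Then poll each query ID with `get-query-results` as usual.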
### 3. Run Error-by-Type for Both Windows

```
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message '"message": "*"' as error_msg
| stats count() as cnt by error_msg
| sort cnt desc
| limit 15
```
Again, run this query once per window.
### 4. Compute and Present

Calculate:

- Percentage change: `((A - B) / B) * 100`
- New errors: errors in Window A that don't appear in Window B
- Resolved errors: errors in Window B that don't appear in Window A
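For example, with the numbers in the sample output below, app-backend goes from 5 baseline errors to 23 current errors: `((23 - 5) / 5) * 100 = +360%`.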
### Output Format

```
## Error Rate Comparison

**Window A (current):** HH:MM - HH:MM
**Window B (baseline):** HH:MM - HH:MM

### Summary
| Log Group | Baseline | Current | Change |
|-----------|----------|---------|--------|
| app-backend | 5 | 23 | +360% ↑ |
| app-frontend | 2 | 1 | -50% ↓ |

### New/Changed Errors
| Error | Baseline | Current | Delta |
|-------|----------|---------|-------|
| DB pool exhausted | 0 | 18 | **NEW** |
| TokenExpired | 3 | 2 | -33% |

### Assessment
**REGRESSION** — Error rate increased 360% in app-backend.
Primary cause: "DB pool exhausted" (18 new occurrences).
Recommendation: Check database connection pool settings.
```
Label the assessment as:
- REGRESSION — if current errors are significantly higher (>20% increase)
- IMPROVEMENT — if current errors are lower (>20% decrease)
- STABLE — if change is within ±20%
Save to `$OUTPUT_DIR/YYYYMMDD_HHMMSS_diff.txt`.
## Workflow

### Step 1: Understand the Problem

If `$ARGUMENTS` was provided (e.g., the user ran `/cloudwatch 500 errors in the last hour`), use it as the problem description and skip clarification.
Otherwise, determine:
- What happened? — Error message, HTTP status code, user report
- When? — Convert relative times to absolute timestamps using the Current State above
- Which log group? — Match the problem to a log group from `config.json`. If unsure, use `default_log_group`
### Step 2: Query Logs

Always use `--region` from config. Use the `aws_cli` path from config.

#### Quick Search (filter by pattern)

```bash
$AWS_CLI logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s000) \
  --region $REGION \
  --output json
```
#### Logs Insights (preferred for analysis)

```bash
QUERY_ID=$($AWS_CLI logs start-query \
  --log-group-name "$LOG_GROUP" \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50' \
  --region $REGION \
  --output text --query 'queryId')

sleep 3

$AWS_CLI logs get-query-results \
  --query-id "$QUERY_ID" \
  --region $REGION \
  --output json
```

Important: Always run `sleep` and `get-query-results` as separate commands, never chained with `&&`. This avoids allowed-tools pattern matching issues.
#### Filter by Log Stream

Use `--log-stream-name-prefix` to narrow down to a specific container or service, based on the stream prefixes in `config.json`:
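A sketch (the prefix `ecs/app-backend` is hypothetical; use the real prefixes from your config):

```bash
$AWS_CLI logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --log-stream-name-prefix "ecs/app-backend" \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s000) \
  --region $REGION --output json
```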
### Step 3: Save Results to File

MANDATORY after every query:

```bash
mkdir -p $OUTPUT_DIR
$AWS_CLI logs filter-log-events ... > "$OUTPUT_DIR/$(date +%Y%m%d_%H%M%S)_description.txt"
```

Naming convention: `YYYYMMDD_HHMMSS_<description>.txt`
After saving, read the file and present a summary. Always tell the user the file path.
### Step 4: Analyze and Diagnose
- Identify root cause — find the actual exception, stack trace, or error message
- Check correlation IDs — if the app uses correlation IDs, trace the request across log entries
- Cross-reference with code — if the error points to a file/function, read that code
- Check related log groups — DB errors → check RDS logs. Blocked requests → check WAF logs
### Step 5: Report to User
Present findings as:
- What happened — the actual error/exception
- When — timestamp(s)
- Where — which service, file, function
- Why — root cause analysis
- Fix — suggested code change or configuration fix
- File — path to the saved log file
## Additional Resources

- For common debugging scenarios (auth, 500s, DB, WAF, ECS, network, tracing), read `references/scenarios.md`
- For Logs Insights query recipes (aggregations, latency analysis, error trends), read `references/recipes.md`
- For monitoring command templates (metrics, ECS health, alarms, batching), read `references/monitoring.md`
## Important Notes
- Log retention varies — check your account settings. For older logs, check S3 archive buckets if configured
- Structured JSON logs: If the app outputs JSON, use field-based filtering on fields like `timestamp`, `level`, `message`, `correlation_id`, etc. (see the sketch after this list)
- Time format: `filter-log-events` expects epoch milliseconds for `--start-time`. Logs Insights expects epoch seconds
- Output format: Always use `--output json` for results. Use `--output text` only for extracting simple values like query IDs
- Rate limits: CloudWatch API has rate limits. If throttled, wait a few seconds and retry
- Never expose secrets in log output or saved files. Redact tokens, keys, or credentials before showing to the user
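For the structured-JSON note above, a field-based filter might look like this (a sketch; field names such as `level` depend on your app's log schema):

```bash
$AWS_CLI logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --filter-pattern '{ $.level = "ERROR" }' \
  --start-time $(date -d '1 hour ago' +%s000) \
  --region $REGION --output json
```

In Logs Insights, JSON fields are discovered automatically, so the equivalent filter is simply `filter level = "ERROR"`.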