# cloudwatch
Debug production issues and monitor AWS infrastructure via CloudWatch. Use when the user reports errors, wants to investigate production behavior, check logs, debug OAuth, API errors, ECS tasks, database issues, WAF blocks, or any production incident. Also use when the user says "check logs", "what's failing", "why is X broken", "system status", "error report", "check alarms", or mentions CloudWatch, log groups, or alarms. Supports proactive commands: status (health check), report (error summary), alarms (alarm states), diff (error rate comparison). Run `/cloudwatch configure` to auto-discover your AWS infrastructure on first use.
Install by cloning the repository, or clone it straight into your skills directory:

```bash
git clone https://github.com/torrresagus/cloudwatch-debugger-skill
# or:
git clone --depth=1 https://github.com/torrresagus/cloudwatch-debugger-skill ~/.claude/skills/torrresagus-cloudwatch-debugger-skill-cloudwatch
```

The skill definition (`SKILL.md`) follows.

# CloudWatch Log Debugger
Query, filter, and analyze AWS CloudWatch logs for production debugging. Auto-configures to any AWS environment.
## Current State

- Current timestamp (epoch seconds): !`date +%s`
- Current time (human-readable): !`date '+%Y-%m-%d %H:%M:%S %Z'`
## First-Time Setup

If `config.json` does not exist in this skill's directory, tell the user:

> This skill needs to discover your AWS infrastructure first. Run `/cloudwatch configure` or let me auto-configure now.

Then read and follow the instructions in `scripts/configure.sh` to generate `config.json`.
## Configuration

Read `config.json` from this skill's directory for all environment-specific values. The config contains:

- `aws_cli` — path to the AWS CLI binary (e.g., `aws` or `/snap/bin/aws`)
- `region` — AWS region
- `log_groups` — discovered log groups with their purpose and stream prefixes
- `default_log_group` — which log group to query when the user doesn't specify one
- `ecs_clusters` — ECS clusters, if any
- `alarms` — CloudWatch alarms, if any
- `output_dir` — where to save log files (default: `logs/`)

Use these values in all commands instead of hardcoded strings.
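For reference, a generated `config.json` might look like this (a sketch: the top-level keys come from the list above, but the nested shape of each entry and all values are illustrative):

```json
{
  "aws_cli": "/snap/bin/aws",
  "region": "us-east-1",
  "log_groups": [
    {
      "name": "/ecs/app-backend",
      "purpose": "backend application logs",
      "stream_prefix": "ecs/app-backend",
      "priority": 1
    }
  ],
  "default_log_group": "/ecs/app-backend",
  "ecs_clusters": [
    { "cluster": "my-app-cluster", "services": ["my-app-web"] }
  ],
  "alarms": ["my-app-ECS-CPU-High"],
  "output_dir": "logs/"
}
```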
## Command Dispatch

Parse `$ARGUMENTS` to determine which command to run:

| If it starts with... | Action |
|---|---|
| `configure` | Run configuration (see First-Time Setup) |
| `status` | Jump to Status Check below |
| `report` | Jump to Report below (remaining args = time range) |
| `alarms` | Jump to Alarms below |
| `diff` | Jump to Error Rate Comparison below (remaining args = time windows) |
| anything else | Jump to Workflow below (reactive debugging) |
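For example:

```
/cloudwatch configure
/cloudwatch status
/cloudwatch report last 24 hours
/cloudwatch diff last 30m vs 2h ago
/cloudwatch 500 errors in the last hour
```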
## Status Check

Quick health dashboard. No arguments needed.

Read `config.json`, then run these queries:
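One way to load the config values into shell variables (a sketch, assuming `jq` is available):

```bash
AWS_CLI=$(jq -r '.aws_cli' config.json)
REGION=$(jq -r '.region' config.json)
OUTPUT_DIR=$(jq -r '.output_dir' config.json)
```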
### 1. Error Count (last 30 min)

Run a Logs Insights query against app log groups (priority <= 2). Use `--log-group-names` to batch:

```bash
QUERY_ID=$($AWS_CLI logs start-query \
  --log-group-names "$LOG_GROUP_1" "$LOG_GROUP_2" \
  --start-time $(date -d '30 minutes ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR|Exception|FATAL/ | stats count() as error_count by @logStream' \
  --region $REGION --output text --query 'queryId')
```
Then `sleep 3`, then `get-query-results`.
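For example (as two separate commands, never chained, per the note in Step 2 of the Workflow):

```bash
sleep 3
```

```bash
$AWS_CLI logs get-query-results \
  --query-id "$QUERY_ID" \
  --region $REGION --output json
```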
### 2. Alarm States (live)

Fetch current alarm states from the API — do NOT use cached values from config:

```bash
$AWS_CLI cloudwatch describe-alarms \
  --region $REGION --output json \
  --query 'MetricAlarms[].{name:AlarmName,state:StateValue,metric:MetricName,namespace:Namespace,threshold:Threshold}'
```
### 3. ECS Service Health

For each cluster/service in `config.ecs_clusters`:

```bash
$AWS_CLI ecs describe-services \
  --cluster $CLUSTER \
  --services $SERVICE_ARN \
  --region $REGION --output json \
  --query 'services[].{name:serviceName,desired:desiredCount,running:runningCount,pending:pendingCount}'
```
### 4. Recently Stopped Tasks

```bash
$AWS_CLI ecs list-tasks --cluster $CLUSTER --desired-status STOPPED --region $REGION --output json
```

If any stopped tasks exist, describe them for crash reasons:

```bash
$AWS_CLI ecs describe-tasks --cluster $CLUSTER --tasks $TASK_ARNS \
  --region $REGION --output json \
  --query 'tasks[].{taskArn:taskArn,stoppedReason:stoppedReason,stopCode:stopCode,stoppedAt:stoppedAt,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}'
```
### 5. CPU/Memory Utilization

```bash
$AWS_CLI cloudwatch get-metric-statistics \
  --namespace AWS/ECS --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=$CLUSTER \
  --start-time $(date -d '30 minutes ago' -u +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --statistics Average Maximum \
  --region $REGION --output json
```

Same for `MemoryUtilization`.
### Output Format

Present as a dashboard summary:

```
## System Status (as of YYYY-MM-DD HH:MM:SS)

### Errors (last 30 min)
- app-backend: 12 errors
- app-frontend: 0 errors

### Alarms
- OK: my-app-ECS-CPU-High (CPUUtilization < 80)
- **ALARM: my-app-ApplicationErrors-High** (ErrorCount > 50)

### ECS Services
- my-app-web: 2/2 running, 0 pending
- my-app-worker: 1/1 running, 0 pending

### Resource Utilization (30-min avg)
- CPU: 45% avg, 62% max
- Memory: 71% avg, 78% max
```

Save to `$OUTPUT_DIR/YYYYMMDD_HHMMSS_status.txt`.
## Report

Periodic summary over a configurable time range. Parse the time range from the remaining arguments after `report` (e.g., `last 24 hours`, `last 6h`, `today`). Default: last 1 hour.
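A sketch of the conversion, using the same GNU `date` idiom as the queries in this file:

```bash
# "last 6h" -> epoch bounds for start-query
START=$(date -d '6 hours ago' +%s)
END=$(date +%s)

# "today" -> since local midnight
START=$(date -d 'today 00:00' +%s)
```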
Run these Logs Insights queries against app log groups:
### 1. Top Errors

```
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message '"message": "*"' as error_msg
| stats count() as occurrences by error_msg
| sort occurrences desc
| limit 10
```
### 2. Error Trend

```
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errors by bin(5m)
| sort @timestamp asc
```

For time ranges > 6 hours, use `bin(30m)` instead of `bin(5m)`.
### 3. P95 Latency

```
fields @timestamp, @message
| filter @message like /request completed|duration/
| parse @message '"duration": *,' as duration_ms
| stats avg(duration_ms) as avg_ms, max(duration_ms) as max_ms, pct(duration_ms, 95) as p95_ms by bin(5m)
| sort @timestamp asc
```
### 4. Most Affected Endpoints

```
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message '"path": "*"' as endpoint
| stats count() as errors by endpoint
| sort errors desc
| limit 10
```
### Output Format

```
## Report: Last 1 Hour (HH:MM - HH:MM)

### Top Errors
| # | Error | Count |
|---|-------|-------|
| 1 | ConnectionRefused: DB pool exhausted | 23 |
| 2 | TokenExpiredError | 8 |

### Error Trend (5-min bins)
HH:00 ██████████ 23
HH:05 ████ 8
HH:10 ██ 4
...

### Latency
- Average: 120ms
- P95: 450ms
- Max: 2300ms

### Most Affected Endpoints
| Endpoint | Errors |
|----------|--------|
| /api/auth/callback | 15 |
| /api/users/profile | 8 |
```

Save to `$OUTPUT_DIR/YYYYMMDD_HHMMSS_report.txt`.
## Alarms
List all CloudWatch alarms with their current state.
### 1. Fetch Live Alarm Data

```bash
$AWS_CLI cloudwatch describe-alarms \
  --region $REGION --output json
```
### 2. Present Grouped by State
Group alarms by state. Show ALARM state first (highlighted), then OK, then INSUFFICIENT_DATA.
For each alarm, show:
- Alarm name
- Metric and namespace
- Threshold and comparison operator
- Evaluation periods and period length
- State reason (for alarms not in OK state)
### 3. Map to Log Groups

Map alarm namespaces to log group categories for investigation suggestions:

- `AWS/ApplicationELB` → ecs-app → suggest `/cloudwatch 500 errors`
- `AWS/ECS` → container-insights → suggest `/cloudwatch ECS task crashes`
- `AWS/RDS` → rds → suggest `/cloudwatch database errors`
- `AWS/Lambda` → lambda → suggest `/cloudwatch lambda errors`
### Output Format

```
## CloudWatch Alarms

### ALARM (1)
- **my-app-ApplicationErrors-High**
  Metric: AWS/ApplicationELB > ErrorCount
  Condition: ErrorCount > 50 for 1 period(s) of 300s
  Reason: Threshold crossed...
  → Investigate: /cloudwatch 500 errors in the last hour

### OK (2)
- my-app-ECS-CPU-High
  Metric: AWS/ECS > CPUUtilization
  Condition: CPUUtilization > 80 for 2 period(s) of 300s

### INSUFFICIENT_DATA (0)
None.
```

Save to `$OUTPUT_DIR/YYYYMMDD_HHMMSS_alarms.txt`.
## Error Rate Comparison
Compare error rates between two time windows to detect regressions or confirm fixes.
### 1. Parse Time Windows

From the remaining arguments after `diff`. Defaults:

- Window A (current): last 30 minutes
- Window B (baseline): 30–60 minutes ago

Support natural language like:

- `last 1h vs yesterday same time`
- `last 30m vs 2h ago`
- `post-deploy vs pre-deploy` (user should provide timestamps)
### 2. Run Error Count for Both Windows

Use `--log-group-names` to batch app log groups into one query per window:

```
fields @timestamp, @message
| filter @message like /ERROR|Exception|FATAL/
| stats count() as error_count
```
Run this query twice: once with Window A timestamps, once with Window B timestamps.
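A sketch of the two runs with the default windows (the `QUERY_A`/`QUERY_B` variable names are illustrative):

```bash
# Window A (current): last 30 minutes
QUERY_A=$($AWS_CLI logs start-query \
  --log-group-names "$LOG_GROUP_1" "$LOG_GROUP_2" \
  --start-time $(date -d '30 minutes ago' +%s) --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR|Exception|FATAL/ | stats count() as error_count' \
  --region $REGION --output text --query 'queryId')

# Window B (baseline): 30-60 minutes ago
QUERY_B=$($AWS_CLI logs start-query \
  --log-group-names "$LOG_GROUP_1" "$LOG_GROUP_2" \
  --start-time $(date -d '60 minutes ago' +%s) --end-time $(date -d '30 minutes ago' +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR|Exception|FATAL/ | stats count() as error_count' \
  --region $REGION --output text --query 'queryId')
```

Then poll each query ID with `get-query-results` as usual.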
### 3. Run Error-by-Type for Both Windows

```
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message '"message": "*"' as error_msg
| stats count() as cnt by error_msg
| sort cnt desc
| limit 15
```
Again, run this query once per window.
### 4. Compute and Present

Calculate:

- Percentage change: `((A - B) / B) * 100`
- New errors: errors in Window A that don't appear in Window B
- Resolved errors: errors in Window B that don't appear in Window A
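For example, with the numbers in the sample output below, app-backend goes from 5 baseline errors to 23 current errors: `((23 - 5) / 5) * 100 = +360%`.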
### Output Format

```
## Error Rate Comparison

**Window A (current):** HH:MM - HH:MM
**Window B (baseline):** HH:MM - HH:MM

### Summary
| Log Group | Baseline | Current | Change |
|-----------|----------|---------|--------|
| app-backend | 5 | 23 | +360% ↑ |
| app-frontend | 2 | 1 | -50% ↓ |

### New/Changed Errors
| Error | Baseline | Current | Delta |
|-------|----------|---------|-------|
| DB pool exhausted | 0 | 18 | **NEW** |
| TokenExpired | 3 | 2 | -33% |

### Assessment
**REGRESSION** — Error rate increased 360% in app-backend.
Primary cause: "DB pool exhausted" (18 new occurrences).
Recommendation: Check database connection pool settings.
```
Label the assessment as:
- REGRESSION — if current errors are significantly higher (>20% increase)
- IMPROVEMENT — if current errors are lower (>20% decrease)
- STABLE — if change is within ±20%
Save to `$OUTPUT_DIR/YYYYMMDD_HHMMSS_diff.txt`.
## Workflow

### Step 1: Understand the Problem

If `$ARGUMENTS` was provided (e.g., the user ran `/cloudwatch 500 errors in the last hour`), use it as the problem description and skip clarification.
Otherwise, determine:
- What happened? — Error message, HTTP status code, user report
- When? — Convert relative times to absolute timestamps using the Current State above
- Which log group? — Match the problem to a log group from `config.json`. If unsure, use `default_log_group`
### Step 2: Query Logs

Always use `--region` from config. Use the `aws_cli` path from config.

#### Quick Search (filter by pattern)

```bash
$AWS_CLI logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s000) \
  --region $REGION \
  --output json
```
#### Logs Insights (preferred for analysis)

```bash
QUERY_ID=$($AWS_CLI logs start-query \
  --log-group-name "$LOG_GROUP" \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50' \
  --region $REGION \
  --output text --query 'queryId')

sleep 3

$AWS_CLI logs get-query-results \
  --query-id "$QUERY_ID" \
  --region $REGION \
  --output json
```

Important: Always run `sleep` and `get-query-results` as separate commands, never chained with `&&`. This avoids allowed-tools pattern matching issues.
#### Filter by Log Stream

Use `--log-stream-name-prefix` to narrow down to a specific container or service, based on the stream prefixes in `config.json`:
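A sketch (the prefix `ecs/app-backend` is hypothetical; use the real prefixes from your config):

```bash
$AWS_CLI logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --log-stream-name-prefix "ecs/app-backend" \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s000) \
  --region $REGION --output json
```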
### Step 3: Save Results to File

MANDATORY after every query:

```bash
mkdir -p $OUTPUT_DIR
$AWS_CLI logs filter-log-events ... > "$OUTPUT_DIR/$(date +%Y%m%d_%H%M%S)_description.txt"
```

Naming convention: `YYYYMMDD_HHMMSS_<description>.txt`
After saving, read the file and present a summary. Always tell the user the file path.
### Step 4: Analyze and Diagnose
- Identify root cause — find the actual exception, stack trace, or error message
- Check correlation IDs — if the app uses correlation IDs, trace the request across log entries
- Cross-reference with code — if the error points to a file/function, read that code
- Check related log groups — DB errors → check RDS logs. Blocked requests → check WAF logs
### Step 5: Report to User
Present findings as:
- What happened — the actual error/exception
- When — timestamp(s)
- Where — which service, file, function
- Why — root cause analysis
- Fix — suggested code change or configuration fix
- File — path to the saved log file
## Additional Resources

- For common debugging scenarios (auth, 500s, DB, WAF, ECS, network, tracing), read `references/scenarios.md`
- For Logs Insights query recipes (aggregations, latency analysis, error trends), read `references/recipes.md`
- For monitoring command templates (metrics, ECS health, alarms, batching), read `references/monitoring.md`
## Important Notes
- Log retention varies — check your account settings. For older logs, check S3 archive buckets if configured
- Structured JSON logs: If the app outputs JSON, use field-based filtering on fields like `timestamp`, `level`, `message`, `correlation_id`, etc. (see the sketch after this list)
- Time format: `filter-log-events` expects epoch milliseconds for `--start-time`. Logs Insights expects epoch seconds
- Output format: Always use `--output json` for results. Use `--output text` only for extracting simple values like query IDs
- Rate limits: CloudWatch API has rate limits. If throttled, wait a few seconds and retry
- Never expose secrets in log output or saved files. Redact tokens, keys, or credentials before showing to the user
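For the structured-JSON note above, a field-based filter might look like this (a sketch; field names such as `level` depend on your app's log schema):

```bash
$AWS_CLI logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --filter-pattern '{ $.level = "ERROR" }' \
  --start-time $(date -d '1 hour ago' +%s000) \
  --region $REGION --output json
```

In Logs Insights, JSON fields are discovered automatically, so the equivalent filter is simply `filter level = "ERROR"`.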