Developer-kit aws-cloudformation-cloudwatch
Provides AWS CloudFormation patterns for CloudWatch monitoring, metrics, alarms, dashboards, logs, and observability. Use when creating CloudWatch metrics, alarms, dashboards, log groups, log subscriptions, anomaly detection, Synthetics canaries, and Application Signals, and when implementing template structure with Parameters, Outputs, Mappings, Conditions, cross-stack references, and CloudWatch best practices for monitoring production infrastructure.
```bash
git clone https://github.com/giuseppe-trisciuoglio/developer-kit
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/giuseppe-trisciuoglio/developer-kit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/developer-kit-aws/skills/aws-cloudformation/aws-cloudformation-cloudwatch" ~/.claude/skills/giuseppe-trisciuoglio-developer-kit-aws-cloudformation-cloudwatch && rm -rf "$T"
```
plugins/developer-kit-aws/skills/aws-cloudformation/aws-cloudformation-cloudwatch/SKILL.md

AWS CloudFormation CloudWatch Monitoring
Overview
Creates CloudWatch monitoring infrastructure using CloudFormation templates: metrics, alarms, dashboards, log groups, anomaly detection, Synthetics canaries, and Application Signals.
When to Use
- Creating CloudWatch metrics and alarms for production infrastructure
- Building CloudWatch dashboards for multi-region visualization
- Implementing log groups with retention, encryption, and metric filters
- Configuring anomaly detection and composite alarms
- Setting up cross-stack references with Parameters and Outputs
- Validating and deploying monitoring stacks with CloudFormation
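For the cross-stack bullet above, a common pattern is for the monitoring stack to export its SNS topic so other stacks can attach alarm actions to it. A minimal sketch (the export name and `AlarmTopic` resource are illustrative):

```yaml
# In the monitoring stack
Outputs:
  AlarmTopicArn:
    Description: SNS topic for alarm notifications
    Value: !Ref AlarmTopic
    Export:
      Name: !Sub "${AWS::StackName}-alarm-topic"

# In a consuming stack, reference the export by name:
# AlarmActions:
#   - !ImportValue monitoring-stack-alarm-topic
```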
Instructions
Follow these steps to create CloudWatch monitoring infrastructure with CloudFormation:
1. Define Alarm Parameters
Specify metric namespaces, dimensions, and threshold values:
```yaml
Parameters:
  ErrorRateThreshold:
    Type: Number
    Default: 5
    Description: Error rate threshold for alarms (percentage)
  LatencyThreshold:
    Type: Number
    Default: 1000
    Description: Latency threshold in milliseconds
  CpuUtilizationThreshold:
    Type: Number
    Default: 80
    Description: CPU utilization threshold (percentage)
  LogRetentionDays:
    Type: Number
    Default: 30
    AllowedValues: [1, 3, 7, 14, 30, 60, 90, 120, 365]
    Description: Number of days to retain log events
```
2. Create CloudWatch Alarms
Set up alarms for CPU, memory, disk, and custom metrics:
```yaml
Resources:
  HighCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-high-cpu"
      AlarmDescription: Trigger when CPU utilization exceeds threshold
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Average
      Period: 60
      EvaluationPeriods: 3
      Threshold: !Ref CpuUtilizationThreshold
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlarmTopic

  ErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-error-rate"
      MetricName: ErrorRate
      Namespace: !Ref CustomNamespace
      Dimensions:
        - Name: Service
          Value: !Ref ServiceName
      Statistic: Average
      Period: 60
      EvaluationPeriods: 5
      Threshold: !Ref ErrorRateThreshold
      ComparisonOperator: GreaterThanThreshold
```
3. Configure Alarm Actions
Define SNS topics for notification delivery:
```yaml
Resources:
  AlarmTopic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: !Sub "${AWS::StackName}-alarms"
      TopicName: !Sub "${AWS::StackName}-alarms"

  AlarmTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      PolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: cloudwatch.amazonaws.com
            Action: sns:Publish
            Resource: !Ref AlarmTopic
      Topics:
        - !Ref AlarmTopic
```

The topic is named `AlarmTopic` so that the `!Ref AlarmTopic` alarm actions in the other steps resolve to this resource.
4. Create Dashboards
Build visualization widgets for metrics across resources:
```yaml
Resources:
  MonitoringDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: !Sub "${AWS::StackName}-dashboard"
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "x": 0, "y": 0, "width": 12, "height": 6,
              "properties": {
                "title": "CPU Utilization",
                "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "${InstanceId}"]],
                "period": 300,
                "stat": "Average",
                "region": "${AWS::Region}"
              }
            }
          ]
        }
```
5. Set Up Log Groups
Configure retention policies and encryption settings:
```yaml
Resources:
  ApplicationLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/applications/${Environment}/${ApplicationName}"
      RetentionInDays: !Ref LogRetentionDays
      # KmsKeyId must be the key ARN, not the key ID returned by !Ref
      KmsKeyId: !GetAtt LogEncryptionKey.Arn
```
6. Implement Metric Filters
Create metrics from log data:
```yaml
Resources:
  ErrorMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref ApplicationLogGroup
      FilterPattern: '[level="ERROR", msg]'
      MetricTransformations:
        - MetricValue: "1"
          MetricNamespace: !Sub "${AWS::StackName}/Application"
          MetricName: ErrorCount
```
7. Add Composite Alarms
Build multi-condition alarm logic:
```yaml
Resources:
  SystemHealthComposite:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-system-health"
      # AlarmRule is a string expression; reference child alarms by name with ALARM()
      AlarmRule: !Sub 'ALARM("${HighCpuAlarm}") OR ALARM("${ErrorRateAlarm}")'
      AlarmActions:
        - !Ref AlarmTopic
```

Note that `AlarmRule` is a rule expression string, not a list of references: `!Ref` on an `AWS::CloudWatch::Alarm` returns the alarm name, which `ALARM()` accepts.
8. Configure Log Insights Queries
Create saved queries for log analysis:
```yaml
Resources:
  ErrorAnalysisQuery:
    Type: AWS::Logs::QueryDefinition
    Properties:
      Name: !Sub "${AWS::StackName}-errors"
      LogGroupNames:
        - !Ref ApplicationLogGroup
      QueryString: |
        fields @timestamp, @message
        | filter @message like /ERROR/
        | sort @timestamp desc
        | limit 100
```
9. Validate Template
Before deploying, validate the CloudFormation template:
```bash
aws cloudformation validate-template --template-body file://template.yaml
```
Note that `validate-template` checks syntax only; it does not accept parameter values or a `--capabilities` flag. Instead, its output reports which capabilities (such as CAPABILITY_IAM) the template will require at deploy time:

```bash
aws cloudformation validate-template \
  --template-body file://monitoring.yaml
```
10. Deploy and Verify
Deploy the stack and verify resources:
```bash
# Deploy stack
aws cloudformation create-stack \
  --stack-name my-monitoring-stack \
  --template-body file://monitoring.yaml \
  --parameters file://parameters.json \
  --capabilities CAPABILITY_IAM

# Wait for completion
aws cloudformation wait stack-create-complete \
  --stack-name my-monitoring-stack

# Verify alarms are in OK state (describe-alarms filters by name prefix, not stack)
aws cloudwatch describe-alarms --alarm-name-prefix my-monitoring-stack

# Check dashboard accessibility
aws cloudwatch get-dashboard --dashboard-name my-monitoring-stack-dashboard
```
Test alarm actions before production:
```bash
# Force the alarm into ALARM state to exercise its actions
aws cloudwatch set-alarm-state \
  --alarm-name my-alarm \
  --state-value ALARM \
  --state-reason "Testing alarm action"
```
Best Practices
- Use composite alarms to reduce noise and avoid false positives
- Set meaningful thresholds based on baseline metrics
- Configure appropriate evaluation periods (3-5 datapoints)
- Enable anomaly detection for metrics with variable patterns
- Use metric math for derived metrics (error rates, latency percentiles)
- Set appropriate log retention based on compliance needs
- Encrypt log groups with sensitive data using KMS
- Test alarm actions with `set-alarm-state` before production
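As a sketch of the metric-math point above, an alarm can compute an error rate from two existing metrics using the `Metrics` property instead of a single metric (the ALB namespace, metric names, and threshold here are illustrative, and real ALB metrics would also need a `LoadBalancer` dimension):

```yaml
ErrorRatePercentAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub "${AWS::StackName}-error-rate-percent"
    EvaluationPeriods: 3
    Threshold: 5
    ComparisonOperator: GreaterThanThreshold
    Metrics:
      - Id: errors
        MetricStat:
          Metric:
            Namespace: AWS/ApplicationELB
            MetricName: HTTPCode_Target_5XX_Count
          Period: 300
          Stat: Sum
        ReturnData: false
      - Id: requests
        MetricStat:
          Metric:
            Namespace: AWS/ApplicationELB
            MetricName: RequestCount
          Period: 300
          Stat: Sum
        ReturnData: false
      # The alarm evaluates the expression, not the raw metrics
      - Id: errorRate
        Expression: "100 * errors / requests"
        Label: Error rate (%)
        ReturnData: true
```

When `Metrics` is used, the top-level `MetricName`, `Namespace`, `Statistic`, and `Period` properties must be omitted.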
Examples
Example 1: Complete Monitoring Stack
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Complete CloudWatch monitoring setup

Parameters:
  Environment:
    Type: String
    Default: prod
    AllowedValues: [dev, staging, prod]

Resources:
  # SNS Topic for notifications
  AlarmTopic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: !Sub "${Environment}-alarms"

  # Alarm for high CPU
  CpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-cpu-high"
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlarmTopic

  # Dashboard
  Dashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: !Ref AWS::StackName
      DashboardBody: !Sub |
        {
          "widgets": [{
            "type": "metric",
            "properties": {
              "metrics": [["AWS/EC2", "CPUUtilization"]],
              "period": 300,
              "stat": "Average"
            }
          }]
        }
```
Example 2: Log-Based Metrics with Filters
```yaml
Resources:
  AppLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/app/${Environment}"
      RetentionInDays: 30

  ErrorMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref AppLogGroup
      FilterPattern: '"ERROR"'
      MetricTransformations:
        - MetricValue: "1"
          MetricNamespace: !Sub "${AWS::StackName}/App"
          MetricName: ErrorCount

  ErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-errors"
      MetricName: ErrorCount
      # AWS::CloudWatch::Alarm uses Namespace (MetricNamespace is the filter-side property)
      Namespace: !Sub "${AWS::StackName}/App"
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
```
Constraints and Warnings
Resource Limits
- Maximum 5000 alarms per region per account (default quota)
- Maximum 50 metric filters per log group
- Dashboard body limited to 512KB
- Maximum 200 metrics per alarm expression
Security Considerations
- Enable encryption on log groups containing sensitive data
- Use IAM policies to restrict alarm action permissions
- Avoid hardcoding thresholds in templates; use Parameters
- Protect SNS topics with encryption and access policies
Operational Warnings
- High-resolution alarms (10s/30s) incur higher costs
- Missing data treatment affects alarm state; configure explicitly
- Composite alarms can mask individual alarm states
- Some metric filter changes (e.g., the filter name) require resource replacement in CloudFormation
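For the missing-data warning above, `TreatMissingData` can be set explicitly per alarm to one of `breaching`, `notBreaching`, `ignore`, or `missing`. A sketch (metric and names are illustrative):

```yaml
LowTrafficAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub "${AWS::StackName}-low-traffic"
    MetricName: RequestCount
    Namespace: AWS/ApplicationELB
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 3
    Threshold: 1
    ComparisonOperator: LessThanThreshold
    # Sparse metrics stop emitting when traffic stops; treat gaps as breaching
    TreatMissingData: breaching
```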
Cost Optimization
- Use standard resolution (5-minute period) unless necessary
- Set appropriate retention periods (shorter = cheaper)
- Delete unused dashboards and custom metrics
- Monitor CloudWatch charges with budgets and alerts
Monitoring Strategy
- Use composite alarms to reduce alarm noise
- Configure appropriate evaluation periods to avoid false positives
- Set up anomaly detection for metrics with variable patterns
- Use metric math for derived metrics (error rates, averages)
- Implement high-resolution alarms for critical metrics
- Create separate dashboards for different audiences (ops, dev, management)
Log Management
- Use appropriate retention periods based on compliance requirements
- Encrypt log groups containing sensitive data
- Implement metric filters for critical log patterns
- Set up cross-account log aggregation for centralized analysis
- Use CloudWatch Logs Insights for troubleshooting
- Configure log subscriptions to Lambda/Kinesis for real-time processing
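The subscription point above can be sketched with an `AWS::Logs::SubscriptionFilter`; this example forwards matching events to a Lambda function, where `LogProcessorFunction` and the `LogProcessorInvokePermission` granting `logs.amazonaws.com` invoke access are assumed to exist elsewhere in the template:

```yaml
ErrorSubscriptionFilter:
  Type: AWS::Logs::SubscriptionFilter
  # The Lambda permission must exist before the filter is created
  DependsOn: LogProcessorInvokePermission
  Properties:
    LogGroupName: !Ref ApplicationLogGroup
    FilterPattern: '"ERROR"'
    # DestinationArn may also be a Kinesis stream or Firehose delivery stream
    DestinationArn: !GetAtt LogProcessorFunction.Arn
```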
Alarm Configuration
- Set meaningful thresholds based on baseline metrics
- Use datapoints-to-alarm for reliability
- Configure OK actions to reset notifications
- Treat missing data appropriately (breaching, not breaching, ignore)
- Test alarm actions regularly
- Document alarm runbooks and escalation procedures
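The datapoints-to-alarm and OK-action items above can be sketched as follows (metric, threshold, and topic name are illustrative):

```yaml
LatencyAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: !Sub "${AWS::StackName}-latency-p99"
    MetricName: TargetResponseTime
    Namespace: AWS/ApplicationELB
    ExtendedStatistic: p99
    Period: 60
    EvaluationPeriods: 5
    DatapointsToAlarm: 3        # M-of-N: alarm on 3 breaching datapoints out of 5
    Threshold: 1.0
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref AlarmTopic
    OKActions:
      - !Ref AlarmTopic         # notify again when the alarm recovers
```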
Template Structure
- Use AWS-specific parameter types for resources
- Implement parameter constraints for validation
- Use Conditions for environment-specific configuration
- Leverage Mappings for region-specific settings
- Apply Metadata for parameter grouping
- Use nested stacks for large monitoring setups
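The Conditions and Mappings items above might look like this sketch, which varies retention by environment and creates a paging alarm only in production (all names and values are illustrative):

```yaml
Mappings:
  EnvConfig:
    dev:
      RetentionDays: 7
    prod:
      RetentionDays: 365

Conditions:
  IsProd: !Equals [!Ref Environment, prod]

Resources:
  AppLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      RetentionInDays: !FindInMap [EnvConfig, !Ref Environment, RetentionDays]

  # Paging alarm only exists in production stacks
  PagerAlarm:
    Type: AWS::CloudWatch::Alarm
    Condition: IsProd
    Properties:
      MetricName: ErrorCount
      Namespace: !Sub "${AWS::StackName}/App"
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
```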
Dashboard Design
- Organize dashboards by service or application tier
- Use consistent widget layouts and sizing
- Include text widgets for context and documentation
- Set appropriate time ranges for data visualization
- Use variables for dynamic dashboard filtering
- Limit metrics per dashboard to avoid performance issues
References
For detailed implementation guidance, see:
- alarms.md - CloudWatch metrics and alarms including base metric alarms, latency alarms, API Gateway errors, EC2 instance alarms, Lambda function alarms, composite alarms, anomaly detection, metric math, alarm actions (SNS, Auto Scaling, EC2), missing data treatment, custom metrics, metric filters, and high-resolution alarms
- dashboards.md - CloudWatch dashboards including base template, service-specific dashboards, widget types (metric, log, text, single value, alarm status), multi-region dashboards, stacked metrics, anomaly detection widgets, math expressions, layout patterns (grid, row, column), dynamic variables, cross-account sharing, and dashboard automation
- logs.md - CloudWatch logs including log group configurations, metric filters, subscription filters (Lambda, Kinesis Firehose), cross-account log aggregation, log insights queries, resource policies, export and archive tasks, CloudWatch agent configuration, log encryption with KMS, lifecycle management, centralized logging, and search patterns
- constraints.md - Resource limits (5000 alarms max, 500 dashboards max), operational constraints (metric resolution, evaluation periods, dashboard widgets, cross-account), security constraints (log data access, encryption, metric filters, alarm actions), cost considerations (detailed monitoring, custom metrics, log retention, dashboard queries), and data constraints (metric age, log ingestion, filter limits)