Senior Cloud Architect

install
source · Clone the upstream repo
git clone https://github.com/borghei/Claude-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/engineering/senior-cloud-architect" ~/.claude/skills/borghei-claude-skills-senior-cloud-architect && rm -rf "$T"
manifest: engineering/senior-cloud-architect/SKILL.md

Expert cloud architecture and infrastructure design across AWS, GCP, and Azure.

Keywords

cloud, aws, gcp, azure, terraform, infrastructure, vpc, eks, ecs, lambda, cost-optimization, disaster-recovery, multi-region, iam, security, migration


Quick Start

# Analyze infrastructure costs
python scripts/cost_analyzer.py --account production --period monthly

# Run DR validation
python scripts/dr_test.py --region us-west-2 --type failover

# Audit security posture
python scripts/security_audit.py --framework cis --output report.html

# Generate resource inventory
python scripts/inventory.py --accounts all --format csv

Tools

| Script | Purpose |
| --- | --- |
| scripts/cost_analyzer.py | Analyze cloud spend by service, environment, and tag |
| scripts/dr_test.py | Validate disaster recovery failover procedures |
| scripts/security_audit.py | Audit against CIS benchmarks and compliance frameworks |
| scripts/inventory.py | Inventory all resources across accounts and regions |

Cloud Platform Comparison

| Service | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Compute | EC2, ECS, EKS | GCE, GKE | VMs, AKS |
| Serverless | Lambda | Cloud Functions | Azure Functions |
| Storage | S3 | Cloud Storage | Blob Storage |
| Database | RDS, DynamoDB | Cloud SQL, Spanner | SQL DB, CosmosDB |
| ML | SageMaker | Vertex AI | Azure ML |
| CDN | CloudFront | Cloud CDN | Azure CDN |

Workflow 1: Design a Production AWS Architecture

  1. Define requirements -- Identify compute, storage, database, and networking needs. Determine RTO/RPO targets.
  2. Provision VPC with Terraform:
    module "vpc" {
      source  = "terraform-aws-modules/vpc/aws"
      version = "~> 5.0"
      name    = "${var.project}-${var.environment}"
      cidr    = var.vpc_cidr
      azs             = ["${var.region}a", "${var.region}b", "${var.region}c"]
      private_subnets = var.private_subnets
      public_subnets  = var.public_subnets
      enable_nat_gateway   = true
      single_nat_gateway   = var.environment != "production"
      enable_dns_hostnames = true
      tags = local.common_tags
    }
    
  3. Deploy compute -- ECS/EKS in private subnets behind an ALB in public subnets. Use at least 2 AZs for redundancy.
  4. Configure database -- RDS Multi-AZ for production, single-AZ for staging. Set backup retention to 30 days (production) or 7 days (non-production).
  5. Add caching layer -- ElastiCache (Redis) between application and database.
  6. Layer security -- WAF on CloudFront, NACLs on subnets, security groups on instances. Apply least-privilege IAM.
  7. Validate -- Run
    python scripts/security_audit.py --framework cis
    and resolve all high-severity findings.
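
The subnet lists passed to the VPC module in step 2 (var.private_subnets, var.public_subnets) have to be non-overlapping blocks inside var.vpc_cidr. A minimal sketch of one way to derive them with Python's standard ipaddress module; the helper name plan_subnets, the /16 VPC, and the /20 subnet size are assumptions for illustration, not part of the skill's scripts.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, az_count: int = 3, new_prefix: int = 20) -> dict:
    """Carve non-overlapping private and public subnets out of a VPC CIDR.

    Hypothetical helper: the first `az_count` blocks become private subnets,
    the next `az_count` blocks become public subnets.
    """
    network = ipaddress.ip_network(vpc_cidr)
    blocks = list(network.subnets(new_prefix=new_prefix))
    if len(blocks) < 2 * az_count:
        raise ValueError("VPC CIDR too small for the requested layout")
    return {
        "private_subnets": [str(b) for b in blocks[:az_count]],
        "public_subnets": [str(b) for b in blocks[az_count:2 * az_count]],
    }


plan = plan_subnets("10.0.0.0/16")
print(plan["private_subnets"])  # ['10.0.0.0/20', '10.0.16.0/20', '10.0.32.0/20']
print(plan["public_subnets"])   # ['10.0.48.0/20', '10.0.64.0/20', '10.0.80.0/20']
```

The remaining /20 blocks stay free for a later data tier or additional AZs.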

Reference Architecture

Route 53 (DNS) -> CloudFront + WAF -> ALB
  -> ECS/EKS Cluster (AZ-a) + ECS/EKS Cluster (AZ-b)
    -> ElastiCache (Redis)
      -> RDS Multi-AZ (Primary + Standby)

Workflow 2: Optimize Cloud Costs

  1. Audit current spend --
    python scripts/cost_analyzer.py --account production --period monthly
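  2. Right-size instances -- Flag instances with avg CPU <10% and max CPU <30% as downsize candidates, and instances with avg CPU >80% as upsize candidates:
    def recommend_size(avg_cpu: float, max_cpu: float) -> str:
        """Classify an instance from its average and peak CPU utilization (percent)."""
        # Low average and low peak: the instance is oversized.
        if avg_cpu < 10 and max_cpu < 30:
            return "downsize"
        # Sustained high average: the instance is undersized.
        if avg_cpu > 80:
            return "upsize"
        return "optimal"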
    
  3. Convert steady-state workloads to Reserved Instances or Savings Plans:
    | Type | Discount | Commitment | Use Case |
    | --- | --- | --- | --- |
    | On-Demand | 0% | None | Variable workloads |
    | Reserved | 30-72% | 1-3 years | Steady-state |
    | Savings Plans | 30-72% | 1-3 years | Flexible compute |
    | Spot | 60-90% | None | Fault-tolerant batch |
  4. Enforce cost allocation tags -- Require Environment, Project, Owner, and CostCenter on all resources. Alert on untagged resources after 24 hours.
  5. Validate -- Re-run cost analyzer and confirm savings target achieved.
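
The tagging rule in step 4 can be sketched as a small check: report which required tags are missing, and alert only once the 24-hour grace period has passed. The resource record shape (id, created_at, tags) and the function names are assumptions for illustration; the skill's own scripts may represent resources differently.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = {"Environment", "Project", "Owner", "CostCenter"}
GRACE_PERIOD = timedelta(hours=24)


def missing_tags(tags: dict) -> set:
    """Return required tag keys that are absent or empty."""
    return {key for key in REQUIRED_TAGS if not tags.get(key)}


def should_alert(resource: dict, now: datetime) -> bool:
    """Alert only when tags are missing and the grace period has elapsed."""
    age = now - resource["created_at"]
    return bool(missing_tags(resource["tags"])) and age > GRACE_PERIOD


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
resource = {
    "id": "i-0example",  # hypothetical resource record
    "created_at": now - timedelta(hours=30),
    "tags": {"Environment": "production", "Project": "checkout"},
}
print(sorted(missing_tags(resource["tags"])))  # ['CostCenter', 'Owner']
print(should_alert(resource, now))             # True
```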

Workflow 3: Plan Disaster Recovery

  1. Select DR strategy based on RTO/RPO requirements:
    | Strategy | RTO | RPO | Cost |
    | --- | --- | --- | --- |
    | Backup & Restore | Hours | Hours | $ |
    | Pilot Light | Minutes | Minutes | $$ |
    | Warm Standby | Minutes | Seconds | $$$ |
    | Multi-Site Active | Seconds | Near-zero | $$$$ |
  2. Configure cross-region replication -- Database replication to secondary region. S3 cross-region replication for object storage.
  3. Set up Route 53 failover routing -- Health checks on primary. Automatic DNS failover to secondary.
  4. Define backup policy:
    • Database: continuous replication, 35-day retention, cross-region, encrypted
    • Application data: daily, 90-day retention, lifecycle to IA at 30d, Glacier at 90d
    • Configuration: on-change via git + S3, unlimited retention
  5. Test --
    python scripts/dr_test.py --region us-west-2 --type failover
    and confirm RTO/RPO targets met.
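
Step 1 amounts to picking the cheapest strategy whose RTO and RPO tiers are at least as tight as the requirement. A minimal sketch of that lookup, using only the qualitative tiers from the strategy table; the function name select_strategy and the tier ordering as a Python list are illustrative assumptions.

```python
# Tiers ordered from tightest to loosest, as used in the strategy table.
TIERS = ["near-zero", "seconds", "minutes", "hours"]

# (name, RTO tier, RPO tier), listed from cheapest to most expensive.
STRATEGIES = [
    ("Backup & Restore", "hours", "hours"),
    ("Pilot Light", "minutes", "minutes"),
    ("Warm Standby", "minutes", "seconds"),
    ("Multi-Site Active", "seconds", "near-zero"),
]


def select_strategy(required_rto: str, required_rpo: str) -> str:
    """Return the cheapest strategy whose RTO and RPO tiers meet the requirement."""
    rto_limit = TIERS.index(required_rto)
    rpo_limit = TIERS.index(required_rpo)
    for name, rto, rpo in STRATEGIES:
        if TIERS.index(rto) <= rto_limit and TIERS.index(rpo) <= rpo_limit:
            return name
    raise ValueError("no listed strategy meets the requested targets")


print(select_strategy("hours", "hours"))      # Backup & Restore
print(select_strategy("minutes", "seconds"))  # Warm Standby
```

Because the list is ordered by cost, the first match is also the least expensive option that satisfies both targets.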

Workflow 4: Audit Security Posture

  1. Run audit --
    python scripts/security_audit.py --framework cis --output report.html
  2. Review network segmentation -- Public subnets contain only NAT GW, ALB, bastion. Private subnets contain application tier. Data subnets contain RDS, Redis, Elasticsearch.
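  3. Enforce least-privilege IAM -- Scope every policy to specific resources and conditions. The private-range IP condition uses aws:VpcSourceIp, which applies to requests that arrive through a VPC endpoint; aws:SourceIp only matches public source addresses:
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/uploads/*",
      "Condition": {
        "StringEquals": { "aws:PrincipalTag/Team": "engineering" },
        "IpAddress": { "aws:VpcSourceIp": ["10.0.0.0/8"] }
      }
    }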
    
  4. Verify encryption -- Data encrypted at rest (KMS) and in transit (TLS 1.2+).
  5. Validate -- Re-run audit and confirm all critical and high findings resolved.
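
The least-privilege rule in step 3 can be spot-checked mechanically. The sketch below flags wildcard actions, wildcard resources, and missing conditions in a single policy statement; lint_statement is a hypothetical helper, not the logic of scripts/security_audit.py, and real audits check considerably more than this.

```python
def lint_statement(statement: dict) -> list:
    """Return least-privilege findings for one IAM policy statement."""
    findings = []
    if statement.get("Effect") != "Allow":
        return findings  # only Allow statements grant access
    actions = statement.get("Action", [])
    resources = statement.get("Resource", [])
    # IAM accepts a single string or a list for both fields.
    actions = [actions] if isinstance(actions, str) else actions
    resources = [resources] if isinstance(resources, str) else resources
    if any(a == "*" or a.endswith(":*") for a in actions):
        findings.append("wildcard action")
    if any(r == "*" for r in resources):
        findings.append("wildcard resource")
    if not statement.get("Condition"):
        findings.append("no condition")
    return findings


scoped = {
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::my-bucket/uploads/*",
    "Condition": {"StringEquals": {"aws:PrincipalTag/Team": "engineering"}},
}
broad = {"Effect": "Allow", "Action": "s3:*", "Resource": "*"}

print(lint_statement(scoped))  # []
print(lint_statement(broad))   # ['wildcard action', 'wildcard resource', 'no condition']
```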

AWS Well-Architected Pillars (Decision Checklist)

  • Operational Excellence: IaC everywhere? Monitoring and alerting? Runbooks for incidents?
  • Security: Least-privilege IAM? Encryption at rest and in transit? VPC segmentation?
  • Reliability: Multi-AZ? Auto-scaling? DR tested?
  • Performance: Right-sized instances? Caching layer? CDN for static assets?
  • Cost Optimization: Reserved capacity for steady-state? Spot for batch? Unused resources cleaned?
  • Sustainability: Efficient regions? Right-sized compute? Data lifecycle policies?

Reference Materials

| Document | Path |
| --- | --- |
| AWS Patterns | references/aws_patterns.md |
| GCP Patterns | references/gcp_patterns.md |
| Multi-Cloud Strategies | references/multi_cloud.md |
| Cost Optimization Guide | references/cost_optimization.md |

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| Cross-region latency exceeds 200ms | No regional caching or CDN configured | Deploy CloudFront/Cloud CDN with edge locations closest to user base; enable regional API Gateway caches |
| Terraform state lock conflicts across teams | Shared state backend without proper locking | Use DynamoDB (AWS) or GCS (GCP) state locking with per-team state file partitioning via workspaces |
| Multi-cloud DNS failover not triggering | Health check thresholds too lenient or misconfigured endpoints | Set health check interval to 10s, failure threshold to 3, and verify endpoint returns 200 on the exact path monitored |
| IAM permission errors after cross-account migration | Trust policies not updated for new account IDs | Update AssumeRole trust policies with correct account principals and external IDs; validate with aws sts assume-role |
| Cloud costs spike unexpectedly after scaling event | Auto-scaling max limits set too high or no budget alerts | Set hard max instance counts per ASG, configure billing alerts at 80%/100%/120% thresholds, and review Spot fallback behavior |
| VPC peering routes not propagating between VPCs | Route tables missing entries for peered CIDR ranges | Add explicit route entries in both VPCs pointing peered CIDRs to the peering connection; verify no overlapping CIDRs |
| DR failover test fails with data inconsistency | Replication lag between primary and secondary regions | Switch to synchronous replication for critical databases or implement application-level consistency checks pre-failover |
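
For the DNS failover row, the health check interval and failure threshold set how long detection takes, and cached DNS answers add up to one TTL on top. A rough estimate under those assumptions (the 30-second TTL is an assumed value, and the function name is hypothetical; propagation and client-side caching beyond the TTL are ignored):

```python
def estimated_failover_seconds(interval_s: int, failure_threshold: int, dns_ttl_s: int) -> int:
    """Rough worst-case time before clients resolve to the secondary endpoint."""
    detection = interval_s * failure_threshold  # consecutive failed checks
    return detection + dns_ttl_s  # cached answers expire after one TTL


print(estimated_failover_seconds(10, 3, 30))  # 60
print(estimated_failover_seconds(30, 3, 60))  # 150
```

With a 10s interval and a threshold of 3, detection alone takes about 30 seconds, so the record TTL largely decides whether a tight failover target is reachable.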

Success Criteria

  • 99.99% availability SLA met across all production workloads with documented uptime reports
  • Cost optimization savings above 25% compared to on-demand baseline through Reserved Instances, Savings Plans, and right-sizing
  • RTO < 15 minutes and RPO < 1 minute validated through quarterly DR failover tests
  • Zero critical CIS benchmark findings in production accounts after security audit remediation
  • Infrastructure drift < 2% measured by Terraform plan diffs on scheduled compliance scans
  • Cross-region failover completes within 60 seconds with automated Route 53 health check validation
  • 100% resource tagging compliance enforced via automated policy checks with no untagged resources older than 24 hours
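
To make the availability target concrete, the arithmetic below converts a percentage into a yearly downtime budget and compares it with the 15-minute RTO. It is a plain calculation over a 365-day year; the function name is illustrative.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365) -> float:
    """Minutes of allowed downtime in the period for a given availability target."""
    minutes_in_period = period_days * 24 * 60
    return minutes_in_period * (1 - availability_pct / 100)


budget = downtime_budget_minutes(99.99)
print(round(budget, 2))   # 52.56
print(int(budget // 15))  # 3
```

At 99.99%, the yearly budget is roughly 52.6 minutes, which covers about three outages that each consume the full 15-minute RTO.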

Scope & Limitations

This skill covers:

  • Multi-cloud architecture design and comparison across AWS, GCP, and Azure
  • Infrastructure-as-Code with Terraform including VPC, compute, database, and networking
  • Disaster recovery planning, cross-region replication, and failover strategies
  • Cloud cost optimization, right-sizing, and reserved capacity planning

This skill does NOT cover:

  • Application-level code architecture or microservice design patterns (see senior-architect)
  • Kubernetes cluster internals, pod scheduling, or service mesh configuration (see senior-devops)
  • Security compliance frameworks beyond CIS benchmarks such as SOC 2, HIPAA, or GDPR (see ra-qm-team/ compliance skills)
  • CI/CD pipeline design, build automation, or deployment workflows (see senior-devops)

Integration Points

| Skill | Integration | Data Flow |
| --- | --- | --- |
| senior-devops | Infrastructure provisioning feeds into CI/CD deployment pipelines | Terraform outputs (endpoints, ARNs) → deployment configs |
| senior-secops | Security audit findings inform cloud hardening decisions | CIS benchmark results → security remediation tasks |
| senior-architect | Application architecture requirements drive cloud resource selection | Capacity requirements → compute/storage/network sizing |
| aws-solution-architect | AWS-specific deep dives complement multi-cloud strategy | Cloud platform comparison → AWS implementation details |
| ra-qm-team/soc2-compliance | Compliance requirements shape infrastructure security controls | Compliance matrices → IAM policies, encryption configs, audit logging |
| senior-fullstack | Fullstack application stacks deploy onto cloud infrastructure | Application stack definitions → ECS/EKS task definitions, RDS configs |