Claude-skill-registry infrastructure-verification

Verify AWS infrastructure configuration before deployment. Use when validating VPC endpoints, NAT Gateway capacity, security groups, or debugging network path issues that cause Lambda connection timeouts.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/other/infrastructure-verification-awannaphasch2016-jousef-landing" ~/.claude/skills/majiayu000-claude-skill-registry-infrastructure-verification && rm -rf "$T"
manifest: skills/other/infrastructure-verification-awannaphasch2016-jousef-landing/SKILL.md
source content

Infrastructure Verification Skill

Tech Stack: AWS CLI, Terraform, VPC, CloudWatch, bash

Source: Extracted from PDF S3 upload timeout investigation (2026-01-05) and Infrastructure-Application Contract principle.


When to Use This Skill

Use the infrastructure-verification skill when:

  • ✓ Before deploying Lambda-in-VPC code
  • ✓ Investigating Lambda connection timeouts
  • ✓ Debugging deterministic failure patterns (first N succeed, last M fail)
  • ✓ Validating network path to AWS services (S3, DynamoDB, RDS)
  • ✓ After adding VPC endpoints
  • ✓ Before concurrent Lambda executions

DO NOT use this skill for:

  • ✗ Application code debugging (use error-investigation)
  • ✗ Performance optimization (different focus)
  • ✗ IAM permission issues (use AWS CLI directly)

Core Verification Principles

Principle 1: Infrastructure Dependency Validation

From CLAUDE.md Principle #15:

"Before deploying code that depends on AWS infrastructure (S3, VPC endpoints, NAT Gateway), verify infrastructure exists and is correctly configured. Network path issues cause deterministic failure patterns."

When to validate:

  • Before deploying Lambda functions that make AWS service calls
  • After Terraform infrastructure changes
  • When investigating Lambda timeout patterns
  • Before increasing concurrency limits

Principle 2: Pattern Recognition

Failure Pattern Types:

PatternRoot CauseInvestigation Priority
First N succeed, last M failInfrastructure bottleneck (NAT, connection limits)HIGH - VPC endpoint missing
Random scattered failuresPerformance issue (slow API, memory)MEDIUM - Optimize code
All operations failConfiguration issue (permissions, endpoint)HIGH - Fix config
Intermittent failuresRate limiting, transient networkLOW - Add retries

Deterministic pattern (first N succeed, last M fail) is strongest signal of infrastructure bottleneck.


Verification Workflows

Workflow 1: VPC Endpoint Verification

Use when: Lambda-in-VPC needs to access S3 or DynamoDB

Steps:

# 1. Check if VPC endpoint exists
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-xxx" \
            "Name=service-name,Values=com.amazonaws.ap-southeast-1.s3" \
  --query 'VpcEndpoints[*].{ID:VpcEndpointId,State:State,Service:ServiceName}' \
  --output table

# Expected output (if endpoint exists):
# -----------------------------------------
# | DescribeVpcEndpoints                  |
# +-------+-------+------------------------+
# | ID    | State | Service                |
# +-------+-------+------------------------+
# | vpce-xxx | available | com.amazonaws.ap-southeast-1.s3 |
# +-------+-------+------------------------+

# If empty → No S3 VPC Endpoint (traffic goes through NAT Gateway)

# 2. Verify endpoint state
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-xxx \
  --query 'VpcEndpoints[0].State' \
  --output text

# Expected: "available"
# If "pending" → Wait for creation
# If "failed" → Check Terraform logs

# 3. Verify route table attachment
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-xxx \
  --query 'VpcEndpoints[0].RouteTableIds' \
  --output table

# Expected: List of route table IDs (must include Lambda subnet route tables)

# 4. Check Lambda subnet route tables
aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.SubnetIds' \
  --output text | xargs -I {} aws ec2 describe-subnets --subnet-ids {}

# Compare: Lambda subnets' route tables should be in VPC endpoint's RouteTableIds

# 5. Verify S3 prefix list in route tables
ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxx" \
  --query 'RouteTables[0].RouteTableId' \
  --output text)

aws ec2 describe-route-tables \
  --route-table-ids $ROUTE_TABLE_ID \
  --query 'RouteTables[*].Routes[?GatewayId==`vpce-xxx`]'

# Expected: Route with DestinationPrefixListId (S3 prefix list)

Verification checklist:

  • VPC endpoint exists (
    describe-vpc-endpoints
    returns result)
  • State is "available" (not "pending" or "failed")
  • Route tables attached (includes Lambda subnet route tables)
  • S3 prefix list routes created (check route tables)

Common issues:

  • Missing VPC endpoint → Create with Terraform
  • State "pending" → Wait 2-3 minutes
  • Route tables not attached → Update Terraform
    route_table_ids
  • Lambda subnets not covered → Verify subnet route table IDs

Workflow 2: NAT Gateway Diagnosis

Use when: Investigating Lambda connection timeouts with external services

Steps:

# 1. Check NAT Gateway exists
aws ec2 describe-nat-gateways \
  --filter "Name=vpc-id,Values=vpc-xxx" \
  --query 'NatGateways[*].{ID:NatGatewayId,State:State,PublicIp:NatGatewayAddresses[0].PublicIp}' \
  --output table

# Expected: State "available"

# 2. Check route tables using NAT Gateway
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxx" \
  --query 'RouteTables[*].Routes[?NatGatewayId!=`null`].[RouteTableId,DestinationCidrBlock,NatGatewayId]' \
  --output table

# Expected: Route 0.0.0.0/0 → nat-xxx (default route through NAT)

# 3. Analyze connection saturation pattern
# Run this during concurrent Lambda executions
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time $(date -d '5 minutes ago' +%s)000 \
  --filter-pattern "START RequestId" \
  --query 'events[*].timestamp' \
  --output text | xargs -n1 date -d @

# Check execution pattern:
# - All start within 1 second → Concurrent execution
# - Some timeout after 600s → NAT Gateway saturation

# 4. Check for connection timeout errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "ConnectTimeoutError" \
  --query 'events[*].message' \
  --output text

# If errors found → NAT Gateway connection limit reached

# 5. Calculate concurrent connection demand
CONCURRENT_LAMBDAS=$(aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time $(date -d '1 minute ago' +%s)000 \
  --filter-pattern "START RequestId" \
  --query 'length(events)' \
  --output text)

echo "Concurrent Lambdas: $CONCURRENT_LAMBDAS"
echo "NAT Gateway connection limit: ~55,000 (but establishment rate limited)"

NAT Gateway saturation indicators:

  • ✅ Deterministic pattern (first N succeed, last M fail)
  • ✅ ConnectTimeoutError in logs
  • ✅ Long execution times (600s = boto3 default timeout)
  • ✅ Timeline shows concurrent starts → split success/failure

Solution: Add VPC Gateway Endpoint for S3/DynamoDB to bypass NAT

Workflow 3: Network Path Validation

Use when: Verifying Lambda can reach AWS services

Steps:

# 1. Identify Lambda VPC configuration
aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.{VpcId:VpcId,SubnetIds:SubnetIds,SecurityGroupIds:SecurityGroupIds}' \
  --output json

# Save VPC ID, Subnet IDs, Security Group IDs

# 2. Check security group egress rules
aws ec2 describe-security-groups \
  --group-ids sg-xxx \
  --query 'SecurityGroups[*].IpPermissionsEgress[*].{Proto:IpProtocol,Port:FromPort,Dest:IpRanges[0].CidrIp}' \
  --output table

# Expected: 0.0.0.0/0 allowed (all egress)
# If restricted → Add rule for destination service

# 3. Check route table for Lambda subnet
SUBNET_ID=$(aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.SubnetIds[0]' \
  --output text)

ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=$SUBNET_ID" \
  --query 'RouteTables[0].RouteTableId' \
  --output text)

aws ec2 describe-route-tables \
  --route-table-ids $ROUTE_TABLE_ID \
  --query 'RouteTables[*].Routes[*].[DestinationCidrBlock,GatewayId,NatGatewayId]' \
  --output table

# Expected routes:
# - local → vpc-xxx (VPC internal)
# - 0.0.0.0/0 → nat-xxx (internet via NAT) OR vpce-xxx (S3 via endpoint)

# 4. Test actual network path (requires test Lambda invocation)
# Deploy temporary test Lambda:
# - Attempts connection to S3
# - Logs connection details
# - Reports success/failure

# 5. Analyze test results
aws logs tail /aws/lambda/network-test --since 1m

# Look for:
# - Connection established (success)
# - Connection timeout (NAT saturated)
# - Connection refused (security group blocked)
# - DNS resolution failure (VPC DNS issue)

Network path checklist:

  • Lambda in VPC (
    VpcConfig
    not empty)
  • Security group allows egress to destination
  • Route table has path to destination (NAT or VPC endpoint)
  • VPC endpoint exists for AWS service (S3, DynamoDB)
  • Test invocation confirms connectivity

Workflow 4: Post-Deployment Infrastructure Validation

Use when: After deploying infrastructure changes (VPC endpoints, security groups)

Steps:

# 1. Verify Terraform outputs
cd terraform
terraform output s3_vpc_endpoint_id        # Should return vpce-xxx
terraform output s3_vpc_endpoint_state     # Should return "available"

# 2. Run smoke test Lambda invocation
aws lambda invoke \
  --function-name my-function \
  --payload '{"test": true}' \
  /tmp/response.json

# Check response
cat /tmp/response.json | jq .

# 3. Verify CloudWatch logs show success
aws logs tail /aws/lambda/my-function --since 1m --follow

# Expected:
# - No ConnectTimeoutError
# - Operation completes in expected time (2-3s not 600s)
# - Success message logged

# 4. Test concurrent execution (simulate production load)
for i in {1..10}; do
  aws lambda invoke \
    --function-name my-function \
    --payload "{\"id\": $i}" \
    --invocation-type Event \
    /tmp/response_$i.json &
done
wait

# 5. Analyze concurrent execution results
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time $(date -d '5 minutes ago' +%s)000 \
  --filter-pattern "ConnectTimeoutError" \
  --query 'length(events)' \
  --output text

# Expected: 0 (no timeout errors)
# If > 0 → Infrastructure issue still exists

# 6. Verify 100% success rate
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time $(date -d '5 minutes ago' +%s)000 \
  --filter-pattern "✅" \
  --query 'length(events)' \
  --output text

# Expected: 10 (all concurrent executions succeeded)

Post-deployment checklist:

  • Terraform outputs confirm resource created
  • Smoke test invocation succeeds
  • CloudWatch logs show no errors
  • Concurrent execution test (10+ invocations)
  • 100% success rate (no timeouts)
  • Execution time within expected range (2-3s not 600s)

Common Infrastructure Issues

Issue 1: Missing S3 VPC Endpoint

Symptom:

  • Lambda timeout after 600s
  • Error:
    ConnectTimeoutError: Connect timeout on endpoint URL: "https://bucket.s3.region.amazonaws.com/..."
  • Pattern: First N concurrent operations succeed, last M timeout

Diagnosis:

# Check for S3 VPC endpoint
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-xxx" \
            "Name=service-name,Values=com.amazonaws.region.s3"

# If empty → No endpoint (S3 traffic goes through NAT)

Fix:

# terraform/s3_vpc_endpoint.tf
data "aws_route_tables" "vpc_route_tables" {
  vpc_id = data.aws_vpc.default.id
}

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = data.aws_vpc.default.id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"

  route_table_ids = data.aws_route_tables.vpc_route_tables.ids

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = "s3:*"
      Resource  = "*"
    }]
  })

  tags = {
    Name = "s3-endpoint"
  }
}

output "s3_vpc_endpoint_id" {
  value = aws_vpc_endpoint.s3.id
}

output "s3_vpc_endpoint_state" {
  value = aws_vpc_endpoint.s3.state
}

Verification:

cd terraform
terraform apply
terraform output s3_vpc_endpoint_state  # Should be "available"

# Test Lambda invocation
aws lambda invoke --function-name my-function --payload '{}' /tmp/response.json
aws logs tail /aws/lambda/my-function --since 1m
# Expected: No timeout, completes in 2-3s

Issue 2: NAT Gateway Connection Saturation

Symptom:

  • Deterministic failure pattern (first 5 succeed, last 5 timeout)
  • All timeouts occur after ~10 minutes (boto3 default + retries)
  • Timeline analysis shows concurrent Lambda starts

Diagnosis:

# Check timeline of Lambda executions
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time $(date -d '30 minutes ago' +%s)000 \
  --filter-pattern "START RequestId" \
  | jq -r '.events[] | .timestamp as $ts | ($ts/1000 | strftime("%H:%M:%S")) + " " + (.message | split(" ")[2])'

# Look for:
# - All start within 1 second (concurrent)
# - Check which RequestIds have errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "ConnectTimeoutError" \
  | jq -r '.events[].message' | grep -o "RequestId: [a-z0-9-]*"

# Pattern: Last N RequestIds consistently fail

Root Cause:

  • NAT Gateway has limited connection establishment rate
  • Concurrent Lambdas try to establish S3 connections simultaneously
  • First N connections succeed → Upload completes in 2-3s
  • Last M connections queued → Eventually timeout after 600s

Fix: Add S3 VPC Gateway Endpoint (see Issue 1)

Why this works:

  • VPC Gateway Endpoint bypasses NAT Gateway
  • S3 traffic routed directly within AWS network
  • No connection establishment limits
  • Free (Gateway endpoints have no hourly charge)

Issue 3: Security Group Blocking Egress

Symptom:

  • Lambda unable to connect to AWS service
  • Error: Connection refused or timeout
  • All invocations fail (not deterministic pattern)

Diagnosis:

# Check security group egress rules
aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.SecurityGroupIds[0]' \
  --output text | xargs -I {} aws ec2 describe-security-groups --group-ids {}

# Look for egress rules allowing HTTPS (port 443)
# Expected: 0.0.0.0/0 or specific AWS service prefix list

Fix:

# terraform/security_groups.tf
resource "aws_security_group_rule" "lambda_egress_https" {
  type              = "egress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.lambda.id
}

Issue 4: Route Table Not Attached to VPC Endpoint

Symptom:

  • VPC endpoint exists and is "available"
  • Lambda still times out connecting to S3
  • Deterministic or random failures

Diagnosis:

# Check VPC endpoint route table attachment
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-xxx \
  --query 'VpcEndpoints[0].RouteTableIds' \
  --output table

# Get Lambda subnet route table
aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.SubnetIds[0]' \
  --output text | xargs -I {} aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values={}" \
  --query 'RouteTables[0].RouteTableId' \
  --output text

# Compare: Lambda's route table should be in endpoint's RouteTableIds

Fix:

# terraform/s3_vpc_endpoint.tf
data "aws_route_tables" "vpc_route_tables" {
  vpc_id = data.aws_vpc.default.id
}

resource "aws_vpc_endpoint" "s3" {
  # ... other config ...

  # Attach to ALL route tables (includes Lambda subnets)
  route_table_ids = data.aws_route_tables.vpc_route_tables.ids
}

Integration with Other Skills

With error-investigation

  • Use infrastructure-verification BEFORE error-investigation when:
    • Investigating Lambda timeout patterns
    • Debugging connection failures
    • Analyzing deterministic failure patterns
  • Use error-investigation AFTER infrastructure-verification when:
    • Infrastructure confirmed correct but errors persist
    • Need to analyze application logs
    • Debugging business logic failures

With deployment skill

  • Use infrastructure-verification:
    • BEFORE deploying Lambda-in-VPC code
    • AFTER deploying infrastructure changes (Terraform apply)
    • During post-deployment validation
  • Complements deployment smoke tests with infrastructure-specific checks

With testing-workflow

  • Infrastructure verification is a form of pre-deployment testing
  • Validates infrastructure-application contract (CLAUDE.md Principle #15)
  • Catches configuration issues before code deployment

Quick Reference

VPC Endpoint Types

TypeServicesCostUse Case
GatewayS3, DynamoDBFREEHigh-throughput data access
InterfaceMost AWS services~$7.50/monthOther services (Secrets Manager, etc.)

NAT Gateway Limits

LimitValueImpact
Concurrent connections55,000Theoretical max
Connection establishment rateLimitedCauses saturation with concurrent Lambdas
Data transfer cost$0.045/GBExpensive for large transfers

Recommendation: Use VPC Gateway Endpoints for S3/DynamoDB (free, unlimited, faster)

Common AWS CLI Commands

# VPC endpoint
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids vpce-xxx

# NAT Gateway
aws ec2 describe-nat-gateways --nat-gateway-ids nat-xxx

# Security groups
aws ec2 describe-security-groups --group-ids sg-xxx

# Route tables
aws ec2 describe-route-tables --route-table-ids rtb-xxx

# Lambda VPC config
aws lambda get-function-configuration --function-name my-function --query 'VpcConfig'

File Organization

.claude/skills/infrastructure-verification/
└── SKILL.md              # This file (complete skill)

References