Claude-skill-registry infrastructure-verification
Verify AWS infrastructure configuration before deployment. Use when validating VPC endpoints, NAT Gateway capacity, security groups, or debugging network path issues that cause Lambda connection timeouts.
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/other/infrastructure-verification-awannaphasch2016-jousef-landing" ~/.claude/skills/majiayu000-claude-skill-registry-infrastructure-verification && rm -rf "$T"
Infrastructure Verification Skill
Tech Stack: AWS CLI, Terraform, VPC, CloudWatch, bash
Source: Extracted from PDF S3 upload timeout investigation (2026-01-05) and Infrastructure-Application Contract principle.
When to Use This Skill
Use the infrastructure-verification skill when:
- ✓ Before deploying Lambda-in-VPC code
- ✓ Investigating Lambda connection timeouts
- ✓ Debugging deterministic failure patterns (first N succeed, last M fail)
- ✓ Validating network path to AWS services (S3, DynamoDB, RDS)
- ✓ After adding VPC endpoints
- ✓ Before concurrent Lambda executions
DO NOT use this skill for:
- ✗ Application code debugging (use error-investigation)
- ✗ Performance optimization (different focus)
- ✗ IAM permission issues (use AWS CLI directly)
Core Verification Principles
Principle 1: Infrastructure Dependency Validation
From CLAUDE.md Principle #15:
"Before deploying code that depends on AWS infrastructure (S3, VPC endpoints, NAT Gateway), verify infrastructure exists and is correctly configured. Network path issues cause deterministic failure patterns."
When to validate:
- Before deploying Lambda functions that make AWS service calls
- After Terraform infrastructure changes
- When investigating Lambda timeout patterns
- Before increasing concurrency limits
Principle 2: Pattern Recognition
Failure Pattern Types:
| Pattern | Root Cause | Investigation Priority |
|---|---|---|
| First N succeed, last M fail | Infrastructure bottleneck (NAT, connection limits) | HIGH - VPC endpoint missing |
| Random scattered failures | Performance issue (slow API, memory) | MEDIUM - Optimize code |
| All operations fail | Configuration issue (permissions, endpoint) | HIGH - Fix config |
| Intermittent failures | Rate limiting, transient network | LOW - Add retries |
The deterministic pattern (first N succeed, last M fail) is the strongest signal of an infrastructure bottleneck.
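The pattern table can be turned into a rough triage helper. The sketch below assumes you have already collected one outcome per invocation, in start order, as the words `ok` or `fail`; the function name `classify_failure_pattern` and its output labels are hypothetical, not part of any AWS tool. A single run cannot separate "random scattered" from "intermittent", so both map to one label here.

```shell
#!/usr/bin/env bash
# classify_failure_pattern: hypothetical helper (illustration only).
# Arguments: outcomes in invocation start order, each "ok" or "fail".
classify_failure_pattern() {
  local ok_count=0 fail_count=0 seen_fail=0 ok_after_fail=0 outcome
  for outcome in "$@"; do
    if [ "$outcome" = "ok" ]; then
      ok_count=$((ok_count + 1))
      # A success after a failure means the failures are not a clean tail.
      if [ "$seen_fail" -eq 1 ]; then ok_after_fail=1; fi
    else
      fail_count=$((fail_count + 1))
      seen_fail=1
    fi
  done

  if [ "$fail_count" -eq 0 ]; then
    echo "no-failures"
  elif [ "$ok_count" -eq 0 ]; then
    echo "all-fail"                      # table row: configuration issue
  elif [ "$ok_after_fail" -eq 0 ]; then
    echo "first-n-succeed-last-m-fail"   # table row: infrastructure bottleneck
  else
    echo "scattered-or-intermittent"     # table rows: performance or transient
  fi
}

classify_failure_pattern ok ok ok ok ok fail fail fail fail fail
# prints: first-n-succeed-last-m-fail
```

A `first-n-succeed-last-m-fail` result is only a prompt to run the verification workflows, not proof of a NAT or endpoint problem.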
Verification Workflows
Workflow 1: VPC Endpoint Verification
Use when: Lambda-in-VPC needs to access S3 or DynamoDB
Steps:
    # 1. Check if VPC endpoint exists
    aws ec2 describe-vpc-endpoints \
      --filters "Name=vpc-id,Values=vpc-xxx" \
                "Name=service-name,Values=com.amazonaws.ap-southeast-1.s3" \
      --query 'VpcEndpoints[*].{ID:VpcEndpointId,State:State,Service:ServiceName}' \
      --output table

    # Expected output (if endpoint exists):
    # -------------------------------------------------------------
    # |                   DescribeVpcEndpoints                    |
    # +----------+-----------+-----------------------------------+
    # | ID       | State     | Service                           |
    # +----------+-----------+-----------------------------------+
    # | vpce-xxx | available | com.amazonaws.ap-southeast-1.s3   |
    # +----------+-----------+-----------------------------------+
    # If empty → No S3 VPC Endpoint (traffic goes through NAT Gateway)

    # 2. Verify endpoint state
    aws ec2 describe-vpc-endpoints \
      --vpc-endpoint-ids vpce-xxx \
      --query 'VpcEndpoints[0].State' \
      --output text
    # Expected: "available"
    # If "pending" → Wait for creation
    # If "failed" → Check Terraform logs

    # 3. Verify route table attachment
    aws ec2 describe-vpc-endpoints \
      --vpc-endpoint-ids vpce-xxx \
      --query 'VpcEndpoints[0].RouteTableIds' \
      --output table
    # Expected: List of route table IDs (must include Lambda subnet route tables)

    # 4. Find the route tables used by the Lambda subnets
    # (text output is tab-separated, so split it to one subnet ID per line first)
    aws lambda get-function-configuration \
      --function-name my-function \
      --query 'VpcConfig.SubnetIds' \
      --output text | tr '\t' '\n' | xargs -I {} aws ec2 describe-route-tables \
        --filters "Name=association.subnet-id,Values={}" \
        --query 'RouteTables[].RouteTableId' \
        --output text
    # Note: a subnet with no explicit association uses the VPC's main route table and returns nothing here
    # Compare: Lambda subnets' route tables should be in VPC endpoint's RouteTableIds

    # 5. Verify S3 prefix list in route tables
    ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
      --filters "Name=vpc-id,Values=vpc-xxx" \
      --query 'RouteTables[0].RouteTableId' \
      --output text)

    aws ec2 describe-route-tables \
      --route-table-ids "$ROUTE_TABLE_ID" \
      --query 'RouteTables[*].Routes[?GatewayId==`vpce-xxx`]'
    # Expected: Route with DestinationPrefixListId (S3 prefix list)
Verification checklist:
- VPC endpoint exists (describe-vpc-endpoints returns result)
- State is "available" (not "pending" or "failed")
- Route tables attached (includes Lambda subnet route tables)
- S3 prefix list routes created (check route tables)
Common issues:
- Missing VPC endpoint → Create with Terraform
- State "pending" → Wait 2-3 minutes
- Route tables not attached → Update Terraform route_table_ids
- Lambda subnets not covered → Verify subnet route table IDs
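The "compare" step in this workflow is a set difference: every route table used by a Lambda subnet must appear in the endpoint's RouteTableIds. A minimal sketch, assuming both lists have already been captured as whitespace-separated strings (for example from `--output text`); `missing_route_tables` is a hypothetical helper name and the `rtb-` IDs are placeholders.

```shell
#!/usr/bin/env bash
# missing_route_tables: hypothetical helper (illustration only).
# $1: route table IDs used by the Lambda subnets (space- or tab-separated)
# $2: RouteTableIds reported for the VPC endpoint (space- or tab-separated)
# Prints each Lambda route table ID the endpoint is NOT attached to.
missing_route_tables() {
  local lambda_rtbs="$1" endpoint_rtbs rtb
  # Normalise tabs/newlines to single spaces so the substring match is reliable.
  endpoint_rtbs=$(printf '%s' "$2" | tr -s '[:space:]' ' ')
  for rtb in $lambda_rtbs; do
    case " $endpoint_rtbs " in
      *" $rtb "*) ;;          # attached, nothing to report
      *) echo "$rtb" ;;       # not attached to the endpoint
    esac
  done
}

# Placeholder IDs: rtb-bbb is used by a Lambda subnet but missing on the endpoint.
missing_route_tables "rtb-aaa rtb-bbb" "rtb-aaa rtb-ccc"
# prints: rtb-bbb
```

Empty output means every Lambda route table is covered; any printed ID is a candidate for the Terraform route_table_ids fix.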
Workflow 2: NAT Gateway Diagnosis
Use when: Investigating Lambda connection timeouts with external services
Steps:
    # 1. Check NAT Gateway exists
    aws ec2 describe-nat-gateways \
      --filter "Name=vpc-id,Values=vpc-xxx" \
      --query 'NatGateways[*].{ID:NatGatewayId,State:State,PublicIp:NatGatewayAddresses[0].PublicIp}' \
      --output table
    # Expected: State "available"

    # 2. Check route tables using NAT Gateway
    # (RouteTableId lives on the route table, not on each route, so select it at that level)
    aws ec2 describe-route-tables \
      --filters "Name=vpc-id,Values=vpc-xxx" \
      --query 'RouteTables[*].{RouteTable:RouteTableId,NatRoutes:Routes[?NatGatewayId!=`null`].[DestinationCidrBlock,NatGatewayId]}' \
      --output json
    # Expected: Route 0.0.0.0/0 → nat-xxx (default route through NAT)

    # 3. Analyze connection saturation pattern
    # Run this during concurrent Lambda executions (date -d requires GNU date)
    aws logs filter-log-events \
      --log-group-name /aws/lambda/my-function \
      --start-time "$(date -d '5 minutes ago' +%s)000" \
      --filter-pattern "START RequestId" \
      --query 'events[*].timestamp' \
      --output text | tr '\t' '\n' | while read -r ts; do
        date -d "@$((ts / 1000))"   # timestamps are epoch milliseconds
      done
    # Check execution pattern:
    # - All start within 1 second → Concurrent execution
    # - Some timeout after 600s → NAT Gateway saturation

    # 4. Check for connection timeout errors
    aws logs filter-log-events \
      --log-group-name /aws/lambda/my-function \
      --filter-pattern "ConnectTimeoutError" \
      --query 'events[*].message' \
      --output text
    # If errors found → NAT Gateway connection limit reached

    # 5. Calculate concurrent connection demand
    CONCURRENT_LAMBDAS=$(aws logs filter-log-events \
      --log-group-name /aws/lambda/my-function \
      --start-time "$(date -d '1 minute ago' +%s)000" \
      --filter-pattern "START RequestId" \
      --query 'length(events)' \
      --output text)

    echo "Concurrent Lambdas: $CONCURRENT_LAMBDAS"
    echo "NAT Gateway connection limit: ~55,000 (but establishment rate limited)"
NAT Gateway saturation indicators:
- ✅ Deterministic pattern (first N succeed, last M fail)
- ✅ ConnectTimeoutError in logs
- ✅ Long execution times (~600s, from boto3 connect timeouts plus retries)
- ✅ Timeline shows concurrent starts → split success/failure
Solution: Add VPC Gateway Endpoint for S3/DynamoDB to bypass NAT
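To check the "concurrent starts" indicator without reading timestamps by eye, the spread between the earliest and latest START event can be computed directly. This is a sketch under the assumption that the input is the epoch-millisecond timestamps returned by `--query 'events[*].timestamp' --output text`; `start_spread_ms` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# start_spread_ms: hypothetical helper (illustration only).
# Reads whitespace-separated epoch-millisecond timestamps on stdin and
# prints (latest - earliest) in milliseconds, or "no-data" if stdin is empty.
start_spread_ms() {
  tr -s '[:space:]' '\n' | awk '
    NF == 0 { next }                          # skip blank lines
    min == "" || $1 < min { min = $1 }        # track earliest start
    max == "" || $1 > max { max = $1 }        # track latest start
    END {
      if (min == "") print "no-data"
      else printf "%d\n", max - min
    }
  '
}

# Three START events within 800 ms of each other:
printf '1736050000500\t1736050000100\t1736050000900\n' | start_spread_ms
# prints: 800
```

A spread under roughly 1000 ms matches the "all start within 1 second" criterion used in this workflow.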
Workflow 3: Network Path Validation
Use when: Verifying Lambda can reach AWS services
Steps:
    # 1. Identify Lambda VPC configuration
    aws lambda get-function-configuration \
      --function-name my-function \
      --query 'VpcConfig.{VpcId:VpcId,SubnetIds:SubnetIds,SecurityGroupIds:SecurityGroupIds}' \
      --output json
    # Save VPC ID, Subnet IDs, Security Group IDs

    # 2. Check security group egress rules
    aws ec2 describe-security-groups \
      --group-ids sg-xxx \
      --query 'SecurityGroups[*].IpPermissionsEgress[*].{Proto:IpProtocol,Port:FromPort,Dest:IpRanges[0].CidrIp}' \
      --output table
    # Expected: 0.0.0.0/0 allowed (all egress)
    # If restricted → Add rule for destination service

    # 3. Check route table for Lambda subnet
    SUBNET_ID=$(aws lambda get-function-configuration \
      --function-name my-function \
      --query 'VpcConfig.SubnetIds[0]' \
      --output text)

    ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
      --filters "Name=association.subnet-id,Values=$SUBNET_ID" \
      --query 'RouteTables[0].RouteTableId' \
      --output text)

    aws ec2 describe-route-tables \
      --route-table-ids "$ROUTE_TABLE_ID" \
      --query 'RouteTables[*].Routes[*].[DestinationCidrBlock,DestinationPrefixListId,GatewayId,NatGatewayId]' \
      --output table
    # Expected routes:
    # - <VPC CIDR> → local (VPC internal)
    # - 0.0.0.0/0 → nat-xxx (internet via NAT)
    # - pl-xxx → vpce-xxx (S3 via gateway endpoint)

    # 4. Test actual network path (requires test Lambda invocation)
    # Deploy temporary test Lambda:
    # - Attempts connection to S3
    # - Logs connection details
    # - Reports success/failure

    # 5. Analyze test results
    aws logs tail /aws/lambda/network-test --since 1m
    # Look for:
    # - Connection established (success)
    # - Connection timeout on some invocations (NAT saturated)
    # - Connection timeout on all invocations (security group blocking egress; blocked traffic is dropped, not refused)
    # - DNS resolution failure (VPC DNS issue)
Network path checklist:
- Lambda in VPC (VpcConfig not empty)
- Security group allows egress to destination
- Route table has path to destination (NAT or VPC endpoint)
- VPC endpoint exists for AWS service (S3, DynamoDB)
- Test invocation confirms connectivity
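When the test Lambda's log output is long, a small classifier can bucket each line into the outcomes this workflow looks for. The sketch below is an assumption about what such a test function logs: `ConnectTimeoutError` appears in this document, while the "Connection refused" and name-resolution strings are the usual operating-system messages and may differ in your runtime. `classify_connection_log` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# classify_connection_log: hypothetical helper (illustration only).
# $1: a single log message. Prints one of: timeout, refused, dns, unclassified.
# The match strings are assumptions about the test Lambda's log format.
classify_connection_log() {
  case "$1" in
    *ConnectTimeoutError*|*"timed out"*)
      echo "timeout" ;;
    *"Connection refused"*)
      echo "refused" ;;
    *"Name or service not known"*|*"Temporary failure in name resolution"*)
      echo "dns" ;;
    *)
      echo "unclassified" ;;
  esac
}

classify_connection_log 'ConnectTimeoutError: Connect timeout on endpoint URL: "https://bucket.s3.region.amazonaws.com/..."'
# prints: timeout
```

Piping `aws logs tail` output through a `while read -r line` loop that calls this function and then through `sort | uniq -c` gives a quick count per bucket.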
Workflow 4: Post-Deployment Infrastructure Validation
Use when: After deploying infrastructure changes (VPC endpoints, security groups)
Steps:
    # 1. Verify Terraform outputs
    cd terraform
    terraform output s3_vpc_endpoint_id     # Should return vpce-xxx
    terraform output s3_vpc_endpoint_state  # Should return "available"

    # 2. Run smoke test Lambda invocation
    # (AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload)
    aws lambda invoke \
      --function-name my-function \
      --payload '{"test": true}' \
      /tmp/response.json

    # Check response
    jq . /tmp/response.json

    # 3. Verify CloudWatch logs show success
    aws logs tail /aws/lambda/my-function --since 1m --follow
    # Expected:
    # - No ConnectTimeoutError
    # - Operation completes in expected time (2-3s not 600s)
    # - Success message logged

    # 4. Test concurrent execution (simulate production load)
    for i in {1..10}; do
      aws lambda invoke \
        --function-name my-function \
        --payload "{\"id\": $i}" \
        --invocation-type Event \
        "/tmp/response_$i.json" &
    done
    wait

    # 5. Analyze concurrent execution results
    aws logs filter-log-events \
      --log-group-name /aws/lambda/my-function \
      --start-time "$(date -d '5 minutes ago' +%s)000" \
      --filter-pattern "ConnectTimeoutError" \
      --query 'length(events)' \
      --output text
    # Expected: 0 (no timeout errors)
    # If > 0 → Infrastructure issue still exists

    # 6. Verify 100% success rate
    aws logs filter-log-events \
      --log-group-name /aws/lambda/my-function \
      --start-time "$(date -d '5 minutes ago' +%s)000" \
      --filter-pattern "✅" \
      --query 'length(events)' \
      --output text
    # Expected: 10 (all concurrent executions succeeded)
Post-deployment checklist:
- Terraform outputs confirm resource created
- Smoke test invocation succeeds
- CloudWatch logs show no errors
- Concurrent execution test (10+ invocations)
- 100% success rate (no timeouts)
- Execution time within expected range (2-3s not 600s)
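The execution-time item on this checklist can be checked from the REPORT lines Lambda writes at the end of each invocation. The sketch below assumes those lines have been fetched already (for example with `--filter-pattern "REPORT"`) and are fed on stdin; `slow_invocations` is a hypothetical helper name and the 10000 ms threshold is an arbitrary example.

```shell
#!/usr/bin/env bash
# slow_invocations: hypothetical helper (illustration only).
# $1: threshold in milliseconds.
# stdin: Lambda REPORT log lines.
# Prints "<request-id> <duration-ms>" for every invocation above the threshold.
slow_invocations() {
  local threshold_ms="$1"
  awk -v limit="$threshold_ms" '
    $1 == "REPORT" {
      id = ""; dur = ""
      for (i = 1; i < NF; i++) {
        if ($i == "RequestId:") id = $(i + 1)
        # Take only the first "Duration:"; later ones belong to
        # "Billed Duration:" and "Init Duration:".
        if ($i == "Duration:" && dur == "") dur = $(i + 1)
      }
      if (dur != "" && dur + 0 > limit + 0) print id, dur
    }
  '
}

# Two sample REPORT lines (request IDs shortened to placeholders):
printf 'REPORT RequestId: aaa\tDuration: 2534.12 ms\tBilled Duration: 2535 ms\nREPORT RequestId: bbb\tDuration: 600012.40 ms\tBilled Duration: 600013 ms\n' \
  | slow_invocations 10000
# prints: bbb 600012.40
```

Empty output means every invocation finished under the threshold.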
Common Infrastructure Issues
Issue 1: Missing S3 VPC Endpoint
Symptom:
- Lambda timeout after 600s
- Error: ConnectTimeoutError: Connect timeout on endpoint URL: "https://bucket.s3.region.amazonaws.com/..."
- Pattern: First N concurrent operations succeed, last M timeout
Diagnosis:
    # Check for S3 VPC endpoint
    aws ec2 describe-vpc-endpoints \
      --filters "Name=vpc-id,Values=vpc-xxx" \
                "Name=service-name,Values=com.amazonaws.region.s3"
    # If empty → No endpoint (S3 traffic goes through NAT)
Fix:
    # terraform/s3_vpc_endpoint.tf
    data "aws_route_tables" "vpc_route_tables" {
      vpc_id = data.aws_vpc.default.id
    }

    resource "aws_vpc_endpoint" "s3" {
      vpc_id            = data.aws_vpc.default.id
      service_name      = "com.amazonaws.${var.aws_region}.s3"
      vpc_endpoint_type = "Gateway"
      route_table_ids   = data.aws_route_tables.vpc_route_tables.ids

      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Principal = "*"
          Action    = "s3:*"
          Resource  = "*"
        }]
      })

      tags = {
        Name = "s3-endpoint"
      }
    }

    output "s3_vpc_endpoint_id" {
      value = aws_vpc_endpoint.s3.id
    }

    output "s3_vpc_endpoint_state" {
      value = aws_vpc_endpoint.s3.state
    }
Verification:
    cd terraform
    terraform apply
    terraform output s3_vpc_endpoint_state  # Should be "available"

    # Test Lambda invocation
    aws lambda invoke --function-name my-function --payload '{}' /tmp/response.json
    aws logs tail /aws/lambda/my-function --since 1m
    # Expected: No timeout, completes in 2-3s
Issue 2: NAT Gateway Connection Saturation
Symptom:
- Deterministic failure pattern (first 5 succeed, last 5 timeout)
- All timeouts occur after ~10 minutes (boto3 default + retries)
- Timeline analysis shows concurrent Lambda starts
Diagnosis:
    # Check timeline of Lambda executions
    aws logs filter-log-events \
      --log-group-name /aws/lambda/my-function \
      --start-time "$(date -d '30 minutes ago' +%s)000" \
      --filter-pattern "START RequestId" \
      | jq -r '.events[] | .timestamp as $ts | ($ts/1000 | strftime("%H:%M:%S")) + " " + (.message | split(" ")[2])'
    # Look for:
    # - All start within 1 second (concurrent)

    # Check which RequestIds have errors
    aws logs filter-log-events \
      --log-group-name /aws/lambda/my-function \
      --filter-pattern "ConnectTimeoutError" \
      | jq -r '.events[].message' | grep -o "RequestId: [a-z0-9-]*"
    # Pattern: Last N RequestIds consistently fail
Root Cause:
- NAT Gateway has limited connection establishment rate
- Concurrent Lambdas try to establish S3 connections simultaneously
- First N connections succeed → Upload completes in 2-3s
- Last M connections queued → Eventually timeout after 600s
Fix: Add S3 VPC Gateway Endpoint (see Issue 1)
Why this works:
- VPC Gateway Endpoint bypasses NAT Gateway
- S3 traffic routed directly within AWS network
- No connection establishment limits
- Free (Gateway endpoints have no hourly charge)
Issue 3: Security Group Blocking Egress
Symptom:
- Lambda unable to connect to AWS service
- Error: Connection refused or timeout
- All invocations fail (not deterministic pattern)
Diagnosis:
    # Check security group egress rules
    aws lambda get-function-configuration \
      --function-name my-function \
      --query 'VpcConfig.SecurityGroupIds[0]' \
      --output text | xargs -I {} aws ec2 describe-security-groups --group-ids {}
    # Look for egress rules allowing HTTPS (port 443)
    # Expected: 0.0.0.0/0 or specific AWS service prefix list
Fix:
    # terraform/security_groups.tf
    resource "aws_security_group_rule" "lambda_egress_https" {
      type              = "egress"
      from_port         = 443
      to_port           = 443
      protocol          = "tcp"
      cidr_blocks       = ["0.0.0.0/0"]
      security_group_id = aws_security_group.lambda.id
    }
Issue 4: Route Table Not Attached to VPC Endpoint
Symptom:
- VPC endpoint exists and is "available"
- Lambda still times out connecting to S3
- Deterministic or random failures
Diagnosis:
    # Check VPC endpoint route table attachment
    aws ec2 describe-vpc-endpoints \
      --vpc-endpoint-ids vpce-xxx \
      --query 'VpcEndpoints[0].RouteTableIds' \
      --output table

    # Get Lambda subnet route table
    aws lambda get-function-configuration \
      --function-name my-function \
      --query 'VpcConfig.SubnetIds[0]' \
      --output text | xargs -I {} aws ec2 describe-route-tables \
        --filters "Name=association.subnet-id,Values={}" \
        --query 'RouteTables[0].RouteTableId' \
        --output text
    # Compare: Lambda's route table should be in endpoint's RouteTableIds
Fix:
    # terraform/s3_vpc_endpoint.tf
    data "aws_route_tables" "vpc_route_tables" {
      vpc_id = data.aws_vpc.default.id
    }

    resource "aws_vpc_endpoint" "s3" {
      # ... other config ...

      # Attach to ALL route tables (includes Lambda subnets)
      route_table_ids = data.aws_route_tables.vpc_route_tables.ids
    }
Integration with Other Skills
With error-investigation
- Use infrastructure-verification BEFORE error-investigation when:
  - Investigating Lambda timeout patterns
  - Debugging connection failures
  - Analyzing deterministic failure patterns
- Use error-investigation AFTER infrastructure-verification when:
  - Infrastructure confirmed correct but errors persist
  - Need to analyze application logs
  - Debugging business logic failures
With deployment skill
- Use infrastructure-verification:
  - BEFORE deploying Lambda-in-VPC code
  - AFTER deploying infrastructure changes (Terraform apply)
  - During post-deployment validation
- Complements deployment smoke tests with infrastructure-specific checks
With testing-workflow
- Infrastructure verification is a form of pre-deployment testing
- Validates infrastructure-application contract (CLAUDE.md Principle #15)
- Catches configuration issues before code deployment
Quick Reference
VPC Endpoint Types
| Type | Services | Cost | Use Case |
|---|---|---|---|
| Gateway | S3, DynamoDB | FREE | High-throughput data access |
| Interface | Most AWS services | ~$7.50/month | Other services (Secrets Manager, etc.) |
NAT Gateway Limits
| Limit | Value | Impact |
|---|---|---|
| Concurrent connections | 55,000 per unique destination | Theoretical max |
| Connection establishment rate | Limited | Causes saturation with concurrent Lambdas |
| Data transfer cost | $0.045/GB | Expensive for large transfers |
Recommendation: Use VPC Gateway Endpoints for S3/DynamoDB (free, unlimited, faster)
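To put the data transfer row in concrete terms, the cost of routing S3 traffic through the NAT Gateway is volume times rate. A minimal sketch using the $0.045/GB figure from the table above as the default; `nat_data_cost_usd` is a hypothetical helper name, and the actual rate depends on region, so treat the default as an assumption to check against current AWS pricing.

```shell
#!/usr/bin/env bash
# nat_data_cost_usd: hypothetical helper (illustration only).
# $1: gigabytes sent through the NAT Gateway
# $2: optional per-GB rate in USD (default 0.045, the figure in the table above)
nat_data_cost_usd() {
  awk -v gb="$1" -v rate="${2:-0.045}" 'BEGIN { printf "%.2f\n", gb * rate }'
}

# 200 GB of S3 uploads through NAT at the table's rate:
nat_data_cost_usd 200
# prints: 9.00
```

The same traffic through an S3 Gateway endpoint avoids this charge, which is the cost side of the recommendation above.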
Common AWS CLI Commands
    # VPC endpoint
    aws ec2 describe-vpc-endpoints --vpc-endpoint-ids vpce-xxx

    # NAT Gateway
    aws ec2 describe-nat-gateways --nat-gateway-ids nat-xxx

    # Security groups
    aws ec2 describe-security-groups --group-ids sg-xxx

    # Route tables
    aws ec2 describe-route-tables --route-table-ids rtb-xxx

    # Lambda VPC config
    aws lambda get-function-configuration --function-name my-function --query 'VpcConfig'
File Organization
    .claude/skills/infrastructure-verification/
    └── SKILL.md    # This file (complete skill)
References
- VPC Endpoints
- NAT Gateway
- Lambda VPC Configuration
- Bug Hunt Report - Real-world investigation
- CLAUDE.md Principle #15 (Infrastructure-Application Contract)
- CLAUDE.md Principle #2 (Progressive Evidence Strengthening)