Claude-Skills senior-devops
git clone https://github.com/borghei/Claude-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/engineering/senior-devops" ~/.claude/skills/borghei-claude-skills-senior-devops && rm -rf "$T"
engineering/senior-devops/SKILL.md
Senior DevOps Engineer
The agent generates CI/CD pipelines, scaffolds Terraform infrastructure, and manages deployments with strategy selection, health checks, and rollback support.
Quick Start
```bash
# Generate CI/CD pipeline from project analysis
python scripts/pipeline_generator.py <project-path> --platform github-actions --verbose

# Scaffold Terraform infrastructure
python scripts/terraform_scaffolder.py <target-path> --provider aws --env production --verbose

# Manage deployment with canary strategy
python scripts/deployment_manager.py <target-path> --strategy canary --verbose
```
Tools Overview
| Tool | Input | Output |
|---|---|---|
| scripts/pipeline_generator.py | Project path | CI/CD pipeline config (GitHub Actions, GitLab CI, Jenkins, CircleCI) |
| scripts/terraform_scaffolder.py | Target path + provider | Terraform module structure with state config |
| scripts/deployment_manager.py | Target path + strategy | Deployment plan with health checks and rollback |
All tools support `--json` for machine-readable output and `--output` / `-o` for file writing.
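A minimal sketch of driving one of these tools from Python, assuming that `--json` prints a single JSON document to stdout. The helper names `build_command` and `run_tool` are hypothetical, and the output schema is not documented here, so the parsed result is treated as opaque.

```python
import json
import subprocess
import sys


def build_command(script, target, json_output=True, output_file=None, **options):
    """Build the argv list for one of the scripts shown in Quick Start.

    Keyword options become `--name value` pairs, e.g. platform="github-actions".
    """
    cmd = [sys.executable, script, target]
    for name, value in options.items():
        cmd += [f"--{name.replace('_', '-')}", str(value)]
    if json_output:
        cmd.append("--json")
    if output_file is not None:
        cmd += ["--output", output_file]
    return cmd


def run_tool(cmd):
    """Run the command and parse stdout as JSON (assumed output format)."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


cmd = build_command("scripts/pipeline_generator.py", ".", platform="github-actions")
print(" ".join(cmd[1:]))
```

Building the argument list separately from running it keeps the command easy to log and to check before anything is executed.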
Workflow 1: Containerize and Deploy
Step 1 -- Build a production Dockerfile.
The agent generates multi-stage Dockerfiles following this pattern:
```dockerfile
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
# Install all dependencies; build tooling usually lives in devDependencies
RUN npm ci
COPY . .
# Build, then prune devDependencies so only runtime packages reach stage 2
RUN npm run build && npm prune --omit=dev && npm cache clean --force

# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
# Run as a dedicated non-root user
RUN addgroup -g 1001 appgroup && \
    adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1
CMD ["node", "dist/server.js"]
```
Validation checkpoint: The image builds with `docker build -t app:test .`, and a container started with `docker run --rm app:test` reaches the `healthy` status once the health check passes.
Step 2 -- Deploy to Kubernetes.
The agent creates a Deployment with probes, resource limits, and security context:
```yaml
spec:
  containers:
    - name: app
      image: myapp:1.2.3
      resources:
        requests: { cpu: 250m, memory: 256Mi }
        limits: { cpu: "1", memory: 512Mi }
      livenessProbe:
        httpGet: { path: /healthz, port: 3000 }
        initialDelaySeconds: 15
        periodSeconds: 20
      readinessProbe:
        httpGet: { path: /ready, port: 3000 }
        initialDelaySeconds: 5
        periodSeconds: 10
      startupProbe:
        httpGet: { path: /healthz, port: 3000 }
        failureThreshold: 30
        periodSeconds: 10
```
Probe decision:
- startupProbe: Slow-starting apps (JVM, model loading). Prevents liveness from killing during startup.
- livenessProbe: Detects deadlocks. Keep simple -- do not check downstream dependencies.
- readinessProbe: Controls traffic routing. Include dependency checks here.
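The split between liveness and readiness can be sketched on the application side as two handlers. This is an illustration only, not the agent's generated code: the function names and the `database_up` / `cache_up` checks are hypothetical stand-ins for real dependency checks.

```python
def liveness():
    """Liveness: report only that the process can respond.

    No downstream calls, so a dependency outage cannot trigger restarts.
    """
    return 200, {"status": "ok"}


def readiness(checks):
    """Readiness: run each dependency check; any failure means 'not ready'.

    `checks` maps a dependency name to a zero-argument callable returning bool.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, {"ready": status == 200, "checks": results}


def database_up():
    # Hypothetical check; a real one would open a connection or run a ping query.
    return True


def cache_up():
    # Hypothetical check that simulates an unreachable cache.
    raise ConnectionError("cache unreachable")


print(liveness())
print(readiness({"database": database_up, "cache": cache_up}))
```

Wired to `/healthz` and `/ready`, a failing cache would take the pod out of rotation through the readiness probe without the liveness probe restarting it.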
Validation checkpoint: `kubectl get pods -l app=myapp` shows all pods Running and Ready.
Workflow 2: Infrastructure as Code with Terraform
Step 1 -- Scaffold the module structure.
```bash
python scripts/terraform_scaffolder.py ./infrastructure --provider aws --env production --verbose
```
The agent produces:
```
infrastructure/
  modules/
    vpc/            # main.tf, variables.tf, outputs.tf
    eks/
    rds/
  environments/
    staging/        # main.tf, terraform.tfvars, backend.tf
    production/
```
Step 2 -- Configure remote state.
```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```
Step 3 -- Run drift detection in CI.
```bash
terraform plan -detailed-exitcode -out=plan.tfplan
# Exit 0 = clean, Exit 1 = error, Exit 2 = drift detected
```
Validation checkpoint: `terraform plan` shows no unexpected changes. Drift alerts fire within 24 hours.
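A scheduled CI job could wrap the drift check roughly as follows. This is a sketch under assumptions: the function names are hypothetical, `terraform` is expected on the PATH, and the policy of alerting on errors as well as on drift is an illustrative choice rather than something the tools prescribe.

```python
import subprocess

# Meaning of the exit codes of `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "clean", 1: "error", 2: "drift"}


def classify_plan_exit(code):
    """Map a plan exit code to a status; unknown codes are treated as errors."""
    return PLAN_STATUS.get(code, "error")


def check_drift(workdir):
    """Run the plan in `workdir` and return 'clean', 'error' or 'drift'."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(proc.returncode)


def should_alert(status):
    """Alert on drift and on errors, since a failed plan can hide drift."""
    return status != "clean"
```

Running `check_drift` once per environment directory on a daily schedule would be one way to meet the 24-hour alerting target.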
Key rules:
- One state file per environment per component (blast radius control)
- Never store state locally or in git
- Run `terraform plan` in CI, `terraform apply` only after approval
- Use directories for environment separation, modules for shared logic
Workflow 3: CI/CD Pipeline Design
```bash
python scripts/pipeline_generator.py /path/to/project --platform github-actions --json
```
The agent generates pipelines following these principles:
- Fail fast -- lint and unit tests before expensive integration tests
- Cache aggressively -- node_modules, Docker layers, pip packages
- Immutable artifacts -- build once, deploy the same artifact everywhere
- Gate promotions -- manual approval or smoke tests before production
- Parallel execution -- independent test suites and security scans run concurrently
Example: GitHub Actions with matrix testing and deployment gates
```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [18, 20]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "${{ matrix.node-version }}", cache: npm }
      - run: npm ci && npm run lint && npm test -- --coverage

  # The `security` job referenced below is not shown in this excerpt.
  build:
    needs: [test, security]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - run: helm upgrade --install app charts/myapp --set image.tag=${{ github.sha }} --wait

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production  # requires manual approval
    # Deployment steps are omitted in this excerpt.
```
Validation checkpoint: Pipeline runs in under 15 minutes. All stages produce exit code 0.
Deployment Strategy Selection
| Strategy | Risk | Rollback Speed | Infra Cost | Best For |
|---|---|---|---|---|
| Rolling | Medium | Minutes | 1x | Stateless services, internal APIs |
| Blue-Green | Low | Seconds | 2x | Mission-critical, zero-downtime |
| Canary | Low | Seconds | 1.1x | User-facing, gradual validation |
| Feature Flags | Lowest | Instant | 1x | Granular control, A/B testing |
Canary promotion ladder:
- Deploy at 5% traffic. Monitor error rate and latency for 10 min.
- Promote to 25%. Monitor 10 min.
- Promote to 50%. Monitor 15 min.
- Promote to 100%.
- Automated rollback if error rate exceeds baseline by 2x at any step.
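The promotion ladder can be sketched as a small loop. This is a minimal illustration, assuming a hypothetical `observe` callback that returns the canary's error rate for a step; a real controller would also shift traffic and wait out each monitoring window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    traffic_percent: int
    monitor_minutes: int


# The ladder from the list above; the final step has no stated monitoring window.
LADDER = [Step(5, 10), Step(25, 10), Step(50, 15), Step(100, 0)]
ROLLBACK_FACTOR = 2.0  # roll back when the error rate exceeds 2x the baseline


def run_canary(baseline_error_rate, observe):
    """Walk the ladder and stop at the first step that breaches the threshold.

    `observe(step)` returns the canary error rate measured at that step.
    Returns ("promoted", 100) or ("rolled_back", percent_at_failure).
    """
    threshold = baseline_error_rate * ROLLBACK_FACTOR
    for step in LADDER:
        if observe(step) > threshold:
            return "rolled_back", step.traffic_percent
    return "promoted", LADDER[-1].traffic_percent


healthy = run_canary(0.01, lambda step: 0.012)
failing = run_canary(0.01, lambda step: 0.05 if step.traffic_percent >= 25 else 0.01)
print(healthy, failing)
```

With a 1% baseline the rollback threshold is 2%, so the first run promotes to 100% and the second rolls back at the 25% step.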
Monitoring Essentials
Every service dashboard includes the Four Golden Signals:
- Latency -- P50, P90, P99 response times
- Traffic -- Requests per second by endpoint and status code
- Errors -- 5xx rate, 4xx rate, application error codes
- Saturation -- CPU, memory, connection pool, queue depth
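As a rough illustration of how the first three signals could be derived from raw request data, the sketch below uses the nearest-rank percentile method. The function names and the input shape are assumptions, and saturation is left out because it comes from host and runtime metrics rather than request logs.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact in floating point.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, status_codes, window_seconds):
    """Summarize latency, traffic and errors for one observation window."""
    total = len(status_codes)
    errors_5xx = sum(1 for code in status_codes if 500 <= code <= 599)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p90_ms": percentile(latencies_ms, 90),
        "p99_ms": percentile(latencies_ms, 99),
        "requests_per_second": total / window_seconds,
        "error_rate_5xx": errors_5xx / total if total else 0.0,
    }


summary = summarize(list(range(1, 101)), [200] * 95 + [500] * 5, window_seconds=10)
print(summary)
```

Production dashboards usually compute percentiles from histograms in the metrics backend; the exact calculation here is only practical for small samples.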
SLO targets (example):
| Service | SLI | SLO | Error Budget |
|---|---|---|---|
| API Gateway | Successful requests / Total | 99.9% (43.8 min/month downtime) | 0.1% |
| API Latency | Requests < 500ms / Total | P99 < 500ms | 1% |
When the error budget is exhausted, the agent recommends freezing feature deployments until the budget recovers.
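The budget arithmetic behind the table and the freeze rule can be sketched as follows. The function names are hypothetical, and the 30.44-day window (the average month, 365.25 / 12) is an assumption chosen because it reproduces the 43.8 min/month figure.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(slo, window_days=30.44):
    """Downtime permitted by an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1, exclusive")
    return (1 - slo) * window_days * MINUTES_PER_DAY


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget left; negative when overspent."""
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1, exclusive")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def should_freeze(slo, total_requests, failed_requests):
    """Recommend a feature freeze once the budget is fully spent."""
    return budget_remaining(slo, total_requests, failed_requests) <= 0


print(round(allowed_downtime_minutes(0.999), 1))
print(should_freeze(0.999, 1_000_000, 1_500))
```

At 99.9% over one million requests the budget is roughly 1,000 failed requests, so 1,500 failures exhaust it and trigger the freeze recommendation.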
Anti-Patterns
- Monolithic state -- one Terraform state for everything. Split by component and environment.
- `latest` tag in production -- always use specific image tags.
- Secrets in image layers -- inject at runtime via environment or mounted secrets. Verify with `docker history --no-trunc`.
- No resource limits -- every container needs CPU/memory limits to prevent noisy-neighbor attacks.
- Manual deployments -- automate with approval gates instead.
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Terraform state lock stuck | Interrupted run left DynamoDB lock | `terraform force-unlock <LOCK_ID>` after confirming no apply running |
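| Pods stuck in a failing state such as CrashLoopBackOff | Failing health checks or missing config/secrets | Inspect events and logs with `kubectl describe pod` and `kubectl logs`, verify ConfigMaps/Secrets, increase the probe delay or failure threshold |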
| Docker builds slow (10+ min) | Layer cache invalidated by early COPY of changing files | Copy dependency manifests before source; use BuildKit cache mounts |
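| Helm upgrade fails "another operation in progress" | Previous release in pending/failed state | Inspect with `helm history <release>`, roll back with `helm rollback <release> <revision>`, then retry the upgrade |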
| Canary shows healthy but users report errors | Metrics aggregated across all pods mask canary errors | Use per-revision metric labels; configure Istio/Nginx to tag canary traffic |
References
| Guide | Path | Content |
|---|---|---|
| CI/CD Pipeline Guide | | Pipeline patterns, platform comparisons, optimization |
| Infrastructure as Code | | Terraform patterns, module design, state management |
| Deployment Strategies | | Strategy details, rollback procedures, traffic management |
See also:
- references/kubernetes_patterns.md for Helm charts, HPA/VPA/KEDA decisions, network policies, and RBAC patterns.
- references/cloud_platform_guide.md for AWS/GCP/Azure service comparison, multi-cloud strategy, and cost optimization.
Integration Points
| Skill | Integration |
|---|---|
| | Security scanning in CI/CD, container image scanning, compliance checks |
| | Infrastructure design decisions, service topology |
| | Application containerization, health endpoints, config management |
| | Terraform plan review, pipeline config review |
| | Incident escalation, postmortem, rollback procedures |
Last Updated: April 2026 Version: 2.1.0