Claude-Skills senior-devops

install
source · Clone the upstream repo
git clone https://github.com/borghei/Claude-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/engineering/senior-devops" ~/.claude/skills/borghei-claude-skills-senior-devops && rm -rf "$T"
manifest: engineering/senior-devops/SKILL.md
source content

Senior DevOps Engineer

The agent generates CI/CD pipelines, scaffolds Terraform infrastructure, and manages deployments with strategy selection, health checks, and rollback support.


Quick Start

# Generate CI/CD pipeline from project analysis
python scripts/pipeline_generator.py <project-path> --platform github-actions --verbose

# Scaffold Terraform infrastructure
python scripts/terraform_scaffolder.py <target-path> --provider aws --env production --verbose

# Manage deployment with canary strategy
python scripts/deployment_manager.py <target-path> --strategy canary --verbose

Tools Overview

ToolInputOutput
pipeline_generator.py
Project pathCI/CD pipeline config (GitHub Actions, GitLab CI, Jenkins, CircleCI)
terraform_scaffolder.py
Target path + providerTerraform module structure with state config
deployment_manager.py
Target path + strategyDeployment plan with health checks and rollback

All tools support

--json
for machine-readable output and
--output
/
-o
for file writing.


Workflow 1: Containerize and Deploy

Step 1 -- Build a production Dockerfile.

The agent generates multi-stage Dockerfiles following this pattern:

# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --only=production && npm cache clean --force
COPY . .
RUN npm run build

# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
RUN addgroup -g 1001 appgroup && \
    adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1
CMD ["node", "dist/server.js"]

Validation checkpoint: Image builds with

docker build -t app:test .
and
docker run --rm app:test
returns healthy.

Step 2 -- Deploy to Kubernetes.

The agent creates a Deployment with probes, resource limits, and security context:

spec:
  containers:
    - name: app
      image: myapp:1.2.3
      resources:
        requests: { cpu: 250m, memory: 256Mi }
        limits: { cpu: "1", memory: 512Mi }
      livenessProbe:
        httpGet: { path: /healthz, port: 3000 }
        initialDelaySeconds: 15
        periodSeconds: 20
      readinessProbe:
        httpGet: { path: /ready, port: 3000 }
        initialDelaySeconds: 5
        periodSeconds: 10
      startupProbe:
        httpGet: { path: /healthz, port: 3000 }
        failureThreshold: 30
        periodSeconds: 10

Probe decision:

  • startupProbe: Slow-starting apps (JVM, model loading). Prevents liveness from killing during startup.
  • livenessProbe: Detects deadlocks. Keep simple -- do not check downstream dependencies.
  • readinessProbe: Controls traffic routing. Include dependency checks here.

Validation checkpoint:

kubectl get pods -l app=myapp
shows all pods Running and Ready.


Workflow 2: Infrastructure as Code with Terraform

Step 1 -- Scaffold the module structure.

python scripts/terraform_scaffolder.py ./infrastructure --provider aws --env production --verbose

The agent produces:

infrastructure/
  modules/
    vpc/         # main.tf, variables.tf, outputs.tf
    eks/
    rds/
  environments/
    staging/     # main.tf, terraform.tfvars, backend.tf
    production/

Step 2 -- Configure remote state.

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

Step 3 -- Run drift detection in CI.

terraform plan -detailed-exitcode -out=plan.tfplan
# Exit 0 = clean, Exit 1 = error, Exit 2 = drift detected

Validation checkpoint:

terraform plan
shows no unexpected changes. Drift alerts fire within 24 hours.

Key rules:

  • One state file per environment per component (blast radius control)
  • Never store state locally or in git
  • Run
    terraform plan
    in CI,
    terraform apply
    only after approval
  • Use directories for environment separation, modules for shared logic

Workflow 3: CI/CD Pipeline Design

python scripts/pipeline_generator.py /path/to/project --platform github-actions --json

The agent generates pipelines following these principles:

  1. Fail fast -- lint and unit tests before expensive integration tests
  2. Cache aggressively -- node_modules, Docker layers, pip packages
  3. Immutable artifacts -- build once, deploy the same artifact everywhere
  4. Gate promotions -- manual approval or smoke tests before production
  5. Parallel execution -- independent test suites and security scans run concurrently

Example: GitHub Actions with matrix testing and deployment gates

jobs:
  test:
    strategy:
      matrix:
        node-version: [18, 20]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "${{ matrix.node-version }}", cache: npm }
      - run: npm ci && npm run lint && npm test -- --coverage

  build:
    needs: [test, security]
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build
    environment: staging
    steps:
      - run: helm upgrade --install app charts/myapp --set image.tag=${{ github.sha }} --wait

  deploy-production:
    needs: deploy-staging
    environment: production  # requires manual approval

Validation checkpoint: Pipeline runs in under 15 minutes. All stages produce exit code 0.


Deployment Strategy Selection

StrategyRiskRollback SpeedInfra CostBest For
RollingMediumMinutes1xStateless services, internal APIs
Blue-GreenLowSeconds2xMission-critical, zero-downtime
CanaryLowSeconds1.1xUser-facing, gradual validation
Feature FlagsLowestInstant1xGranular control, A/B testing

Canary promotion ladder:

  1. Deploy at 5% traffic. Monitor error rate and latency for 10 min.
  2. Promote to 25%. Monitor 10 min.
  3. Promote to 50%. Monitor 15 min.
  4. Promote to 100%.
  5. Automated rollback if error rate exceeds baseline by 2x at any step.

Monitoring Essentials

Every service dashboard includes the Four Golden Signals:

  1. Latency -- P50, P90, P99 response times
  2. Traffic -- Requests per second by endpoint and status code
  3. Errors -- 5xx rate, 4xx rate, application error codes
  4. Saturation -- CPU, memory, connection pool, queue depth

SLO targets (example):

ServiceSLISLOError Budget
API GatewaySuccessful requests / Total99.9% (43.8 min/month downtime)0.1%
API LatencyRequests < 500ms / TotalP99 < 500ms1%

When the error budget is exhausted, the agent recommends freezing feature deployments until the budget recovers.


Anti-Patterns

  1. Monolithic state -- one Terraform state for everything. Split by component and environment.
  2. latest
    tag in production
    -- always use specific image tags.
  3. Secrets in image layers -- inject at runtime via environment or mounted secrets. Verify with
    docker history --no-trunc
    .
  4. No resource limits -- every container needs CPU/memory limits to prevent noisy-neighbor attacks.
  5. Manual deployments -- automate with approval gates instead.

Troubleshooting

ProblemCauseSolution
Terraform state lock stuckInterrupted
terraform apply
left DynamoDB lock
terraform force-unlock <LOCK_ID>
after confirming no apply running
Pods in
CrashLoopBackOff
Failing health checks or missing config/secrets
kubectl logs <pod>
, verify ConfigMaps/Secrets, increase
startupProbe.failureThreshold
Docker builds slow (10+ min)Layer cache invalidated by early COPY of changing filesCopy dependency manifests before source; use BuildKit cache mounts
Helm upgrade fails "another operation in progress"Previous release in pending/failed state
helm history <release>
, then
helm rollback <release> <last-good>
Canary shows healthy but users report errorsMetrics aggregated across all pods mask canary errorsUse per-revision metric labels; configure Istio/Nginx to tag canary traffic

References

GuidePathContent
CI/CD Pipeline Guide
references/cicd_pipeline_guide.md
Pipeline patterns, platform comparisons, optimization
Infrastructure as Code
references/infrastructure_as_code.md
Terraform patterns, module design, state management
Deployment Strategies
references/deployment_strategies.md
Strategy details, rollback procedures, traffic management

See also:

references/kubernetes_patterns.md
for Helm charts, HPA/VPA/KEDA decisions, network policies, and RBAC patterns.
references/cloud_platform_guide.md
for AWS/GCP/Azure service comparison, multi-cloud strategy, and cost optimization.


Integration Points

SkillIntegration
senior-secops
Security scanning in CI/CD, container image scanning, compliance checks
senior-architect
Infrastructure design decisions, service topology
senior-backend
Application containerization, health endpoints, config management
code-reviewer
Terraform plan review, pipeline config review
incident-commander
Incident escalation, postmortem, rollback procedures

Last Updated: April 2026 Version: 2.1.0