Aiwg flow-deploy-to-production

Orchestrate production deployment with strategy selection, validation, automated rollback, and regression gates

install

source · Clone the upstream repo

git clone https://github.com/jmagly/aiwg

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agentic/code/frameworks/sdlc-complete/skills/flow-deploy-to-production" ~/.claude/skills/jmagly-aiwg-flow-deploy-to-production-e849ee && rm -rf "$T"

manifest: agentic/code/frameworks/sdlc-complete/skills/flow-deploy-to-production/SKILL.md

source content

Production Deployment Orchestration Flow

You are the Core Orchestrator for production deployment workflows.

Your Role

You orchestrate multi-agent workflows. You do NOT execute bash scripts.

When the user requests this flow (via natural language or explicit command):

Interpret the request and confirm understanding
Read this template as your orchestration guide
Extract agent assignments and workflow steps
Launch agents via Task tool in correct sequence
Synthesize results and finalize artifacts
Report completion with summary

Deployment Overview

Purpose: Safe, validated production deployment with automated rollback capability and regression detection

Key Activities:

Strategy selection (blue-green, canary, rolling)
Pre-deployment validation and gate checks
Regression detection at staging and production gates
Progressive deployment with SLO monitoring
Smoke tests and health validation
Automated rollback on regression or failure

Expected Duration: 30-90 minutes (varies by strategy), 10-15 minutes orchestration

Natural Language Triggers

Users may say:

"Deploy to production"
"Production deployment"
"Release to prod"
"Start deployment"
"Deploy version X.Y.Z"
"Go live with new release"
"Execute production rollout"

You recognize these as requests for this orchestration flow.

Parameter Handling

--guidance Parameter

Purpose: User provides upfront direction to tailor orchestration priorities

Examples:

--guidance "Zero-downtime critical, use blue-green strategy"
--guidance "High-risk release, progressive canary with 5% → 25% → 100%"
--guidance "Database migration included, need extended maintenance window"
--guidance "First production deployment, extra validation and monitoring"

How to Apply:

Parse guidance for keywords: strategy, risk level, timeline, validation depth
Adjust strategy selection (blue-green vs. canary vs. rolling)
Modify validation depth (minimal vs. comprehensive smoke tests)
Influence monitoring duration (15 min vs. 60 min observation)

--interactive Parameter

Purpose: You ask 6-8 strategic questions to understand deployment context

Questions to Ask (if --interactive):

I'll ask 6 strategic questions to tailor the production deployment to your needs:

Q1: What deployment strategy do you prefer?
    (blue-green = instant cutover, canary = progressive rollout, rolling = node-by-node)

Q2: What's the risk level of this release?
    (Low = routine updates, Medium = new features, High = architecture changes)

Q3: What are your SLO targets?
    (e.g., error rate <0.1%, latency p99 <500ms, availability >99.95%)

Q4: Is a database migration or schema change included?
    (Affects rollback complexity and strategy selection)

Q5: What's your rollback tolerance?
    (How quickly must you be able to rollback? Instant vs. 5 min vs. 30 min)

Q6: What's your monitoring observation period?
    (How long to monitor before declaring success? 15 min vs. 30 min vs. 60 min)

Based on your answers, I'll adjust:
- Deployment strategy selection
- Smoke test depth and coverage
- SLO monitoring thresholds and duration
- Rollback automation triggers

Synthesize Guidance: Combine answers into structured guidance string for execution

--regression-threshold Parameter

Purpose: Set acceptable regression rate for deployment gates

Default:

(zero tolerance - any regression blocks deployment)

Format:

--regression-threshold N

where N = percentage (0-100)

Examples:

# Zero tolerance (default)
/flow-deploy-to-production --regression-threshold 0

# Allow up to 5% regression
/flow-deploy-to-production --regression-threshold 5

# Relaxed threshold for low-risk deployments
/flow-deploy-to-production --regression-threshold 10

Application:

Staging Gate: If regression rate > threshold → BLOCK promotion to production
Production Gate: If regression rate > threshold → TRIGGER automatic rollback
Threshold applies to both test regression and metric regression

Typical Thresholds:

Risk Level	Recommended Threshold	Rationale
High	0%	Zero tolerance for critical releases
Medium	2-5%	Allow minor acceptable regression
Low	5-10%	Relaxed for routine updates

--rollback-on-regression Parameter

Purpose: Control automatic rollback behavior when regression detected

Default:

true

(automatic rollback enabled)

Format:

--rollback-on-regression

(boolean flag)

Examples:

# Automatic rollback enabled (default)
/flow-deploy-to-production --rollback-on-regression

# Manual intervention required
/flow-deploy-to-production --no-rollback-on-regression

Behavior:

If true: Production regression → Immediate automated rollback
If false: Production regression → Alert user, await manual decision
Recommended: Keep enabled for production safety

Artifacts to Generate

Primary Deliverables:

Deployment Readiness Report: Pre-flight validation →
```
.aiwg/deployment/deployment-readiness-report.md
```

Regression Gate Reports: Staging and production regression checks →

.aiwg/deployment/regression-gate-staging.md

.aiwg/deployment/regression-gate-production.md

Deployment Execution Log: Real-time progress tracking →
```
.aiwg/deployment/deployment-execution-log.md
```
SLO Monitoring Report: Metrics and breach detection →
```
.aiwg/deployment/slo-monitoring-report.md
```
Deployment Summary Report: Final outcome and lessons learned →
```
.aiwg/reports/deployment-report-{version}.md
```
Rollback Report (if needed): Rollback execution and RCA →
```
.aiwg/deployment/rollback-report-{version}.md
```

Supporting Artifacts:

Smoke test results (working doc)
SLO breach alerts (archived)
Infrastructure health snapshots (archived)
Regression analysis reports (archived)

Regression Gates

Purpose: Detect and block regressions before they reach production or impact users

Two-Gate Model:

Staging Regression Gate

When: After deployment to staging environment, before production cutover

What: Compare staging behavior against production baseline

Checks:

Test Regression: Run full regression suite, compare pass rates
Metric Regression: Compare error rates, latency, throughput
Smoke Test Regression: Validate critical paths match production behavior

Decision:

Pass (regression ≤ threshold): Proceed to production deployment
Fail (regression > threshold): BLOCK promotion, investigate, fix

Rationale: Catch regressions in staging before they impact real users

Production Regression Gate

When: After production deployment, during monitoring period

What: Compare production behavior against pre-deployment baseline

Checks:

Smoke Test Regression: Validate critical paths still pass
SLO Regression: Detect degradation in error rate, latency, availability
User Journey Regression: Validate top user flows still complete successfully

Decision:

Pass (regression ≤ threshold): Deployment successful
Fail (regression > threshold): TRIGGER automatic rollback (if
```
--rollback-on-regression
```
)

Rationale: Immediate detection and rollback on production regression

Regression Detection Methodology

Test Regression:

baseline: production test suite results
candidate: staging/production test suite results
regression_rate: (baseline_pass_count - candidate_pass_count) / baseline_pass_count * 100

threshold_check: regression_rate > threshold → FAIL

Metric Regression:

baseline: production metrics (pre-deployment)
candidate: staging/production metrics (post-deployment)

regressions_to_detect:
  - error_rate_increase > threshold
  - latency_p99_increase > threshold
  - throughput_decrease > threshold

threshold_check: any_regression > threshold → FAIL

Scope Options:

full: All tests and metrics (comprehensive)
critical: Only P0/P1 tests and critical metrics (faster)
targeted: Tests/metrics related to changed components

Integration with /regression-check

The regression gates use the

/regression-check

command:

Staging Gate:

/regression-check --baseline production --candidate staging --scope full --threshold {threshold}

Production Gate:

/regression-check --baseline production-pre-deploy --candidate production-post-deploy --scope critical --threshold {threshold}

Rollback Trigger on Regression

If production regression detected and

--rollback-on-regression

enabled:

regression_detected:
  trigger: regression_rate > threshold
  action: automated_rollback
  strategy: {blue-green | canary | rolling}
  notification:
    - alert_team
    - declare_incident
    - start_rca_process

Multi-Agent Orchestration Workflow

Step 1: Deployment Strategy Selection

Purpose: Choose deployment strategy based on risk, infrastructure, and requirements

Your Actions:

Analyze Context:

Read:
- .aiwg/intake/project-intake.md (understand project constraints)
- .aiwg/deployment/deployment-plan-*.md (existing deployment plans)
- .aiwg/architecture/software-architecture-doc.md (infrastructure capabilities)
- User guidance (--guidance parameter or interactive answers)

Launch Strategy Selection Agent:

Task(
    subagent_type="deployment-manager",
    description="Recommend deployment strategy",
    prompt="""
    Read project context and architecture documentation

    Recommend deployment strategy based on:

    **Blue-Green Strategy**:
    - When: Zero-downtime critical, instant rollback required
    - Requires: 2x infrastructure capacity (temporary)
    - Pros: Instant cutover, instant rollback, low risk
    - Cons: Higher cost, requires duplicate environments

    **Canary Strategy**:
    - When: Progressive validation needed, high-risk release
    - Requires: Traffic routing capability (Argo Rollouts, Flagger)
    - Pros: Gradual rollout, SLO-driven automation, cost-efficient
    - Cons: Slower rollout, complex automation setup

    **Rolling Strategy**:
    - When: Legacy systems, stateful services, simpler deployments
    - Requires: Basic orchestration (Kubernetes rollout)
    - Pros: No additional infrastructure, simple setup
    - Cons: Slower rollout, manual validation, harder rollback

    Analyze project requirements:
    - Risk level: {from guidance or interactive}
    - Infrastructure: {cloud provider, orchestration platform}
    - Downtime tolerance: {zero vs. minimal vs. acceptable}
    - Rollback requirements: {instant vs. fast vs. manual}

    Recommend strategy with rationale
    Save to: .aiwg/working/deployment/strategy-recommendation.md
    """
)

Confirm Strategy with User:

Read .aiwg/working/deployment/strategy-recommendation.md

Present to user:
─────────────────────────────────────────────
Deployment Strategy Recommendation
─────────────────────────────────────────────

**Recommended**: {Blue-Green | Canary | Rolling}

**Rationale**: {why this strategy fits project needs}

**Trade-offs**:
- Pros: {list benefits}
- Cons: {list drawbacks}

**Requirements**:
- Infrastructure: {what's needed}
- Duration: {expected deployment time}
- Monitoring: {observation period}

Proceed with this strategy? (yes/no)
─────────────────────────────────────────────

Communicate Progress:

✓ Analyzed deployment context
✓ Strategy recommendation: {Blue-Green | Canary | Rolling}
⏳ Awaiting user confirmation...

Step 2: Pre-Deployment Validation

Purpose: Verify all quality gates passed and environment is ready

Your Actions:

Launch Parallel Validation Agents:

# Agent 1: Quality Gate Validation
Task(
    subagent_type="project-manager",
    description="Validate all quality gates passed",
    prompt="""
    Check quality gate status for Transition phase

    Read gate criteria: $AIWG_ROOT/.../flows/gate-criteria-by-phase.md (Transition section)

    Validate gates:
    - [ ] Security Gate: No High/Critical vulnerabilities
    - [ ] Reliability Gate: SLOs met in staging
    - [ ] Test Gate: Integration tests 100% passing
    - [ ] Approval Gate: Release Manager signoff obtained

    Generate gate validation report:
    - Status: PASS | FAIL
    - Gate checklist with results
    - Decision: GO | NO-GO
    - Gaps (if NO-GO): List missing approvals or failures

    Save to: .aiwg/working/deployment/gate-validation-report.md
    """
)

# Agent 2: Environment Health Check
Task(
    subagent_type="devops-engineer",
    description="Validate production environment health",
    prompt="""
    Check production environment readiness

    Validate infrastructure health:
    - [ ] All nodes healthy (Kubernetes cluster)
    - [ ] No pods in crash loop or pending state
    - [ ] Database connections healthy
    - [ ] External integrations operational
    - [ ] Monitoring and alerting functional
    - [ ] No active P0/P1 incidents

    Check deployment artifacts:
    - [ ] Container images available and tagged
    - [ ] Checksums verified
    - [ ] Container signatures validated (if required)
    - [ ] Database migrations prepared (if applicable)

    Generate environment health report:
    - Infrastructure status: HEALTHY | DEGRADED | UNHEALTHY
    - Artifacts status: READY | MISSING | INVALID
    - Decision: GO | NO-GO
    - Issues (if NO-GO): List blockers

    Save to: .aiwg/working/deployment/environment-health-report.md
    """
)

# Agent 3: Rollback Plan Validation
Task(
    subagent_type="reliability-engineer",
    description="Validate rollback plan tested and ready",
    prompt="""
    Read rollback plan: .aiwg/deployment/rollback-plan-*.md

    Validate rollback readiness:
    - [ ] Rollback plan documented
    - [ ] Rollback tested in staging (within last 7 days)
    - [ ] Rollback automation configured (scripts, runbooks)
    - [ ] Rollback SLOs defined (how fast can we rollback?)
    - [ ] Communication plan for rollback scenario

    If database migration included:
    - [ ] Migration rollback script tested
    - [ ] Data backup verified
    - [ ] Backup restoration tested

    Generate rollback validation report:
    - Rollback readiness: READY | NOT_READY
    - Rollback strategy: {blue-green instant | canary abort | rolling undo}
    - Rollback SLO: {duration to full rollback}
    - Issues (if NOT_READY): List gaps

    Save to: .aiwg/working/deployment/rollback-validation-report.md
    """
)

Synthesize Pre-Deployment Readiness:

Task(
    subagent_type="deployment-manager",
    description="Synthesize deployment readiness report",
    prompt="""
    Read all validation reports:
    - .aiwg/working/deployment/gate-validation-report.md
    - .aiwg/working/deployment/environment-health-report.md
    - .aiwg/working/deployment/rollback-validation-report.md

    Synthesize Deployment Readiness Report:

    1. Overall Status: GO | CONDITIONAL_GO | NO-GO
    2. Gate Validation: {status and details}
    3. Environment Health: {status and details}
    4. Rollback Readiness: {status and details}
    5. Pre-Flight Checklist: {comprehensive checklist}
    6. Decision Rationale: {why GO or NO-GO}
    7. Conditions (if CONDITIONAL_GO): {what must be addressed}

    Use template: $AIWG_ROOT/.../templates/deployment/deployment-plan-card.md

    Output: .aiwg/deployment/deployment-readiness-report.md
    """
)

Decision Point:

Read .aiwg/deployment/deployment-readiness-report.md

If GO → Continue to Step 3
If CONDITIONAL_GO → Present conditions to user, wait for confirmation
If NO-GO → Report gaps, recommend remediation, STOP deployment

Communicate Progress:

⏳ Validating deployment readiness...
  ✓ Quality gates validated: {PASS | FAIL}
  ✓ Environment health checked: {HEALTHY | DEGRADED}
  ✓ Rollback plan validated: {READY | NOT_READY}
✓ Pre-deployment validation: {GO | CONDITIONAL_GO | NO-GO}

Step 2.5: Staging Regression Gate

Purpose: Detect regressions in staging before production deployment

Your Actions:

Deploy to Staging:

Task(
    subagent_type="devops-engineer",
    description="Deploy new version to staging environment",
    prompt="""
    Execute deployment to staging environment

    Actions:
    1. Deploy new version to staging
    2. Wait for all pods running and healthy
    3. Validate health checks passing

    Report:
    - Staging deployment status: SUCCESS | FAILED
    - Pods running: {count}/{total}
    - Health checks: {PASS | FAIL}

    Save to: .aiwg/working/deployment/staging-deployment-log.md

    If FAILED: Stop deployment, do NOT proceed to regression check
    """
)

Execute Staging Regression Check:

Task(
    subagent_type="regression-analyst",
    description="Run regression analysis on staging deployment",
    prompt="""
    Execute comprehensive regression check: staging vs. production baseline

    Command: /regression-check --baseline production --candidate staging --scope full --threshold {threshold}

    This checks:
    1. Test Regression: Full regression suite comparison
    2. Metric Regression: Error rate, latency, throughput comparison
    3. Smoke Test Regression: Critical path validation

    Parameters:
    - baseline: Production (current live version)
    - candidate: Staging (new version under test)
    - scope: full (comprehensive check)
    - threshold: {from --regression-threshold parameter, default 0}

    Generate regression gate report:
    - Regression rate: {percentage}
    - Test failures: {new failures not in baseline}
    - Metric degradations: {error rate, latency, throughput}
    - Decision: PASS | FAIL
    - Blockers (if FAIL): {list regression details}

    Use schema: @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/schemas/flows/regression-gate.yaml

    Output: .aiwg/deployment/regression-gate-staging.md

    If FAIL (regression > threshold): BLOCK promotion to production
    """
)

Staging Gate Decision Point:

Read .aiwg/deployment/regression-gate-staging.md

Check regression_rate vs. threshold:

If regression_rate <= threshold:
  ✓ Staging regression gate PASSED
  → Continue to Step 3 (Production Deployment)

If regression_rate > threshold:
  ❌ Staging regression gate FAILED
  → BLOCK promotion to production
  → Report regression details to user
  → Recommend investigation and fixes
  → STOP deployment

Decision logged in deployment execution log

Launch Regression Impact Assessment (if regression detected):

# Only if staging gate fails

Task(
    subagent_type="regression-analyst",
    description="Assess regression impact and root cause",
    prompt="""
    Staging regression detected: {regression_rate}% > threshold {threshold}%

    Analyze regression impact:

    1. Categorize Regressions:
       - Test regressions: {which tests now failing}
       - Metric regressions: {which metrics degraded}
       - Functional regressions: {which features broken}

    2. Assess Root Cause:
       - Code changes: {which commits introduced regression}
       - Configuration changes: {any config differences}
       - Dependency changes: {any library/service updates}

    3. Estimate Impact:
       - Severity: {CRITICAL | HIGH | MEDIUM | LOW}
       - User impact: {which user journeys affected}
       - Scope: {percentage of users affected}

    4. Recommend Remediation:
       - Immediate fix: {what to change}
       - Rollback option: {can we revert specific changes}
       - Validation: {how to verify fix}

    Use regression analysis agent tools and documentation

    Output: .aiwg/deployment/regression-impact-assessment-staging.md

    See: @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/agents/regression-analyst.md
    """
)

Communicate Progress:

⏳ Staging Regression Gate...
  ✓ Deployed to staging: {SUCCESS | FAILED}
  ✓ Regression check: {PASS | FAIL}
  {If PASS}
  ✓ Staging gate PASSED: Regression {rate}% ≤ threshold {threshold}%
  {If FAIL}
  ❌ Staging gate FAILED: Regression {rate}% > threshold {threshold}%
  ⏳ Analyzing regression impact...
  ✓ Impact assessment: .aiwg/deployment/regression-impact-assessment-staging.md
  ⚠️ Deployment BLOCKED until regressions fixed

Step 3: Execute Deployment (Strategy-Specific)

Purpose: Deploy new version using selected strategy with continuous monitoring

Blue-Green Deployment

Your Actions:

Deploy to Green Environment:

Task(
    subagent_type="devops-engineer",
    description="Deploy new version to green environment",
    prompt="""
    Execute blue-green deployment to green environment

    Actions:
    1. Deploy new version to green environment (while blue serves production)
    2. Wait for all green pods to be running and ready
    3. Validate green environment health checks passing

    Log all actions with timestamps to: .aiwg/working/deployment/execution-log-green.md

    Report:
    - Green deployment status: SUCCESS | FAILED
    - Pods running: {count}/{total}
    - Health checks: {PASS | FAIL}
    - Duration: {minutes}

    If FAILED: Stop deployment, do NOT cutover traffic
    """
)

Run Smoke Tests on Green:

Task(
    subagent_type="qa-engineer",
    description="Execute smoke tests on green environment",
    prompt="""
    Read smoke test plan: $AIWG_ROOT/.../templates/test/smoke-test-checklist.md

    Execute critical path smoke tests:
    - [ ] API health endpoints (200 OK)
    - [ ] User authentication flow
    - [ ] Critical business operations (top 5-10 journeys)
    - [ ] Database connectivity
    - [ ] External integrations (payment gateway, email, etc.)
    - [ ] Monitoring and logging operational

    Test against green environment URL (not production)

    Generate smoke test report:
    - Tests passed: {count}/{total}
    - Tests failed: {list failures with details}
    - Status: PASS | FAIL
    - Duration: {minutes}

    Save to: .aiwg/working/deployment/smoke-test-results-green.md

    If FAIL: Stop deployment, do NOT cutover traffic
    """
)

Cutover Traffic to Green:

# Only proceed if green deployment and smoke tests passed

Task(
    subagent_type="devops-engineer",
    description="Cutover production traffic to green environment",
    prompt="""
    Execute traffic cutover from blue to green

    Actions:
    1. Update load balancer or service selector to point to green
    2. Verify traffic is flowing to green (0% blue → 100% green)
    3. Confirm blue environment still running (for instant rollback)

    Log cutover actions with timestamps

    Report:
    - Cutover status: SUCCESS | FAILED
    - Traffic distribution: {blue-percentage}% blue, {green-percentage}% green
    - Timestamp: {cutover-time}

    Save to: .aiwg/working/deployment/execution-log-cutover.md
    """
)

Monitor SLOs Post-Cutover (Step 4 will handle this)

Canary Deployment

Your Actions:

Deploy Canary (1-5% Traffic):

Task(
    subagent_type="devops-engineer",
    description="Deploy canary version with 1-5% traffic",
    prompt="""
    Execute canary deployment with progressive rollout

    Stage 1: Deploy canary receiving 1-5% of production traffic

    Actions:
    1. Deploy canary version alongside stable baseline
    2. Configure traffic routing (1-5% to canary, 95-99% to baseline)
    3. Verify canary pods running and healthy

    Log all actions with timestamps

    Report:
    - Canary deployment status: SUCCESS | FAILED
    - Traffic distribution: {canary-percentage}% canary, {baseline-percentage}% baseline
    - Pods running: {count}/{total}
    - Health checks: {PASS | FAIL}

    Save to: .aiwg/working/deployment/execution-log-canary-stage1.md

    If FAILED: Stop deployment, scale down canary
    """
)

Launch SLO Monitoring Agent (runs continuously):

Task(
    subagent_type="reliability-engineer",
    description="Monitor canary SLOs at 1-5% stage",
    prompt="""
    Monitor SLOs for canary vs. baseline for 10-15 minutes

    Compare metrics:
    - Error rate: canary should not exceed baseline by >2%
    - Latency p99: canary should not exceed baseline by >20%
    - Throughput: canary should be proportional to traffic share

    SLO breach detection:
    - If error rate breach: ABORT deployment, trigger rollback
    - If latency breach: ABORT deployment, trigger rollback
    - If infrastructure failure: ABORT deployment, trigger rollback

    Generate monitoring report every 5 minutes:
    - Status: PASS | BREACH
    - Metrics comparison: {baseline vs. canary}
    - Decision: CONTINUE | ABORT

    Save to: .aiwg/working/deployment/slo-monitoring-canary-stage1.md

    If BREACH: Immediately notify orchestrator to trigger rollback
    """
)

Progressive Rollout (if SLOs pass):

# If Stage 1 SLOs pass, promote to 25%, then 50%, then 100%
# Repeat monitoring at each stage

Task(
    subagent_type="devops-engineer",
    description="Promote canary to 25% traffic",
    prompt="""
    Stage 2: Promote canary to 25% traffic

    Actions:
    1. Update traffic routing (25% canary, 75% baseline)
    2. Verify traffic distribution
    3. Monitor SLOs for 10-15 minutes (launch new monitoring agent)

    Report and save to: .aiwg/working/deployment/execution-log-canary-stage2.md

    If SLO breach at any stage: ABORT, trigger rollback
    """
)

# Repeat for 50% and 100% stages

Rolling Deployment

Your Actions:

Rolling Update Execution:

Task(
    subagent_type="devops-engineer",
    description="Execute rolling deployment node-by-node",
    prompt="""
    Execute rolling deployment strategy

    Actions:
    1. Update 1 instance/node at a time
    2. Wait for new instance to pass health checks
    3. Monitor for 5 minutes before proceeding to next instance
    4. Continue until all instances updated

    Pause conditions:
    - If health check fails: STOP, evaluate, rollback if needed
    - If error rate increases: STOP, evaluate, rollback if needed

    Log all actions with timestamps

    Report:
    - Instances updated: {count}/{total}
    - Current instance status: {status}
    - Health checks: {PASS | FAIL}
    - Error rate: {current vs. baseline}

    Save to: .aiwg/working/deployment/execution-log-rolling.md
    """
)

Communicate Progress (all strategies):

⏳ Executing {Blue-Green | Canary | Rolling} deployment...
  ✓ {Strategy-specific milestone 1}: {status}
  ✓ {Strategy-specific milestone 2}: {status}
  ⏳ {Strategy-specific milestone 3}: In progress...

Step 4: Monitor SLOs and Production Regression Gate

Purpose: Continuous validation that deployment is meeting reliability targets and no regressions introduced

Your Actions:

Capture Production Baseline (Pre-Deployment):

Task(
    subagent_type="reliability-engineer",
    description="Capture production baseline metrics before deployment",
    prompt="""
    Capture production baseline for regression comparison

    Metrics to capture:
    - Error rate (last 1 hour average)
    - Latency p50, p95, p99 (last 1 hour)
    - Throughput (requests/sec, last 1 hour)
    - Availability (uptime percentage)
    - Critical user journey completion rates

    Test suite state:
    - Smoke test pass rate (current)
    - Integration test pass rate (current)
    - Critical path test results

    Save baseline snapshot:
    - Timestamp: {pre-deployment time}
    - Version: {current production version}
    - Metrics: {captured values}

    Output: .aiwg/working/deployment/production-baseline-snapshot.md

    This baseline used for post-deployment regression comparison
    """
)

Launch Parallel Monitoring Agents:

# Agent 1: SLO Monitoring
Task(
    subagent_type="reliability-engineer",
    description="Monitor production SLOs post-deployment",
    prompt="""
    Monitor SLOs for 15-30 minutes post-deployment (or as specified in guidance)

    Key SLOs to track:
    - Error rate: {target <0.1% or custom}
    - Latency p99: {target <500ms or custom}
    - Throughput: {baseline ±10% or custom}
    - Availability: {target >99.95% or custom}

    Compare current metrics vs. baseline (pre-deployment)

    Automated breach detection:
    - Error rate >2% above baseline → TRIGGER ROLLBACK
    - Latency p99 >20% above baseline → TRIGGER ROLLBACK
    - Throughput drop >30% below baseline → TRIGGER ROLLBACK
    - Infrastructure alarms triggered → TRIGGER ROLLBACK

    Generate SLO monitoring report every 5 minutes:
    - Status: PASS | BREACH
    - Metrics: {current vs. baseline vs. target}
    - Alerts triggered: {count and details}
    - Decision: CONTINUE | ROLLBACK

    Save to: .aiwg/deployment/slo-monitoring-report.md

    If BREACH: Immediately notify orchestrator to trigger rollback (Step 5)
    """
)

# Agent 2: Smoke Tests (Production)
Task(
    subagent_type="qa-engineer",
    description="Execute smoke tests against production",
    prompt="""
    Run smoke tests against production environment post-deployment

    Execute critical path tests:
    - [ ] API health endpoints
    - [ ] User authentication and authorization
    - [ ] Top 5-10 business operations
    - [ ] Database read/write operations
    - [ ] External integrations
    - [ ] Monitoring and logging functional

    Test against production URL (real traffic)

    Generate smoke test report:
    - Tests passed: {count}/{total}
    - Tests failed: {list failures with details}
    - Status: PASS | FAIL
    - Duration: {minutes}

    Save to: .aiwg/working/deployment/smoke-test-results-production.md

    If FAIL: Immediately notify orchestrator to trigger rollback
    """
)

# Agent 3: Infrastructure Health Monitoring
Task(
    subagent_type="devops-engineer",
    description="Monitor infrastructure health continuously",
    prompt="""
    Monitor infrastructure health for 15-30 minutes post-deployment

    Track infrastructure metrics:
    - Pod restarts: >3 restarts in 5 minutes → ALERT
    - OOM kills: Any OOM kill → TRIGGER ROLLBACK
    - Node health: Any unhealthy nodes → ALERT
    - Network errors: Elevated error rates → ALERT

    Generate infrastructure health report every 5 minutes:
    - Status: HEALTHY | DEGRADED | UNHEALTHY
    - Pods: {running}/{total}, restarts: {count}
    - Nodes: {healthy}/{total}
    - Alerts: {list active alerts}

    Save to: .aiwg/working/deployment/infrastructure-health-report.md

    If UNHEALTHY: Immediately notify orchestrator to trigger rollback
    """
)

# Agent 4: Production Regression Gate
Task(
    subagent_type="regression-analyst",
    description="Execute production regression check",
    prompt="""
    Execute production regression gate check

    Command: /regression-check --baseline production-pre-deploy --candidate production-post-deploy --scope critical --threshold {threshold}

    This checks:
    1. Smoke Test Regression: Critical path validation
    2. SLO Regression: Error rate, latency, availability
    3. User Journey Regression: Top user flows completion

    Parameters:
    - baseline: Production snapshot (pre-deployment)
    - candidate: Production current (post-deployment)
    - scope: critical (P0/P1 tests and critical metrics only)
    - threshold: {from --regression-threshold parameter, default 0}

    Compare against baseline from .aiwg/working/deployment/production-baseline-snapshot.md

    Generate regression gate report:
    - Regression rate: {percentage}
    - New failures: {tests now failing}
    - Metric degradations: {SLO breaches}
    - Decision: PASS | FAIL
    - Severity: {CRITICAL | HIGH | MEDIUM | LOW}

    Use schema: @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/schemas/flows/regression-gate.yaml

    Output: .aiwg/deployment/regression-gate-production.md

    If FAIL (regression > threshold):
      - Check --rollback-on-regression flag
      - If enabled: Trigger automatic rollback
      - If disabled: Alert user, await manual decision
    """
)

Production Regression Gate Decision Point:

Read .aiwg/deployment/regression-gate-production.md

Check regression_rate vs. threshold:

If regression_rate <= threshold:
  ✓ Production regression gate PASSED
  → Continue monitoring, deployment successful

If regression_rate > threshold:
  ❌ Production regression gate FAILED
  → Check --rollback-on-regression flag:

    If --rollback-on-regression enabled (default):
      ⚠️ Automatic rollback TRIGGERED
      → Execute Step 5 (Rollback)

    If --rollback-on-regression disabled:
      ⚠️ Regression detected, awaiting manual decision
      → Alert user with regression details
      → User must decide: [rollback | accept risk | investigate]

Decision logged in deployment execution log

Aggregate Monitoring Results:

# You monitor all 4 agents continuously
# If ANY agent reports BREACH, FAIL, or UNHEALTHY → Trigger Step 5 (Rollback)
# If production regression gate FAILS and --rollback-on-regression → Trigger Step 5 (Rollback)
# If ALL agents report success for monitoring duration → Proceed to Step 6 (Success)

Communicate Progress:

⏳ Monitoring deployment health (15-30 min observation)...
  ✓ Production baseline captured: .aiwg/working/deployment/production-baseline-snapshot.md
  ✓ SLO monitoring: {PASS | BREACH} (updated every 5 min)
  ✓ Smoke tests: {PASS | FAIL}
  ✓ Infrastructure health: {HEALTHY | DEGRADED}
  ✓ Production regression gate: {PASS | FAIL} (regression {rate}% vs. threshold {threshold}%)
  {If regression detected}
  ⚠️ Regression detected: {rate}% > threshold {threshold}%
  {If --rollback-on-regression enabled}
  ⏳ Automatic rollback triggered...
  {If --rollback-on-regression disabled}
  ⚠️ Awaiting manual decision: [rollback | accept | investigate]
[If all passing for full duration]
✓ Deployment validation complete: All checks passed, no regressions

Step 5: Rollback (If Failure or Regression Detected)

Purpose: Automated rollback execution if SLO breach, regression, or failure detected

Trigger Conditions:

SLO breach detected (error rate, latency, throughput)
Smoke test failure
Infrastructure health degraded or unhealthy
Regression detected at production gate (if
```
--rollback-on-regression
```
enabled)
Manual abort by user

Your Actions:

Execute Strategy-Specific Rollback:

Task(
    subagent_type="devops-engineer",
    description="Execute automated rollback",
    prompt="""
    ROLLBACK TRIGGERED: {reason - SLO breach | smoke test failure | infrastructure failure | regression detected}

    Execute rollback strategy: {Blue-Green | Canary | Rolling}

    **Blue-Green Rollback**:
    1. Switch traffic back to blue environment (instant cutover)
    2. Verify traffic flowing to blue (100% blue, 0% green)
    3. Confirm SLOs return to baseline
    4. Scale down green environment (optional, for cost)

    **Canary Rollback**:
    1. Abort canary rollout (stop progressive promotion)
    2. Scale down canary pods to 0
    3. Verify baseline serving 100% traffic
    4. Confirm SLOs return to baseline

    **Rolling Rollback**:
    1. Trigger rollout undo (revert to previous version)
    2. Wait for all pods to rollback to previous version
    3. Verify all pods running previous version
    4. Confirm SLOs return to baseline

    Log all rollback actions with timestamps

    Report:
    - Rollback trigger: {reason}
    - Rollback strategy: {strategy}
    - Rollback status: SUCCESS | FAILED
    - Duration: {minutes}
    - Traffic distribution: {current}
    - SLOs post-rollback: {status}

    Save to: .aiwg/deployment/rollback-execution-log.md

    If rollback FAILED: CRITICAL ESCALATION to Incident Commander
    """
)

Verify Rollback Success:

Task(
    subagent_type="reliability-engineer",
    description="Validate rollback successful",
    prompt="""
    Verify rollback restored stable state

    Validate:
    - [ ] Old version serving 100% traffic
    - [ ] SLOs returned to baseline (error rate, latency)
    - [ ] Smoke tests passing on rolled-back version
    - [ ] No active alerts or incidents
    - [ ] Infrastructure health restored

    Generate rollback validation report:
    - Status: SUCCESS | PARTIAL | FAILED
    - SLOs: {current vs. baseline}
    - Smoke tests: {PASS | FAIL}
    - Stability: {STABLE | UNSTABLE}

    Save to: .aiwg/working/deployment/rollback-validation-report.md

    If rollback FAILED or PARTIAL: CRITICAL ESCALATION
    """
)

Post-Rollback Regression Verification:

# Only if rollback triggered by regression

Task(
    subagent_type="regression-analyst",
    description="Verify rollback eliminated regression",
    prompt="""
    Rollback complete, verify regression eliminated

    Execute regression check post-rollback:
    - Baseline: Production (pre-deployment snapshot)
    - Candidate: Production (post-rollback)

    Command: /regression-check --baseline production-pre-deploy --candidate production-post-rollback --scope critical

    Validate:
    - [ ] Regression eliminated (rate back to 0%)
    - [ ] All tests back to passing
    - [ ] Metrics returned to baseline

    Generate post-rollback regression report:
    - Regression eliminated: YES | NO
    - Tests restored: {count}/{total}
    - Metrics restored: {error rate, latency, throughput}

    Output: .aiwg/deployment/regression-verification-post-rollback.md

    If regression NOT eliminated: Deeper issue, escalate
    """
)

Declare Incident and Initiate RCA:

Task(
    subagent_type="incident-commander",
    description="Declare incident and start root cause analysis",
    prompt="""
    Deployment failed, rollback executed

    Declare incident:
    - Severity: P0 (if production impact) | P1 (if rollback successful)
    - Summary: Deployment rollback - {reason}
    - Impact: {describe user/business impact}
    - Timeline: {deployment start → failure detected → rollback completed}

    Include regression details if applicable:
    - Regression rate: {percentage}
    - Regression threshold: {threshold}
    - Specific regressions: {test failures, metric degradations}

    Initiate root cause analysis:
    - Assemble incident response team
    - Preserve all logs and metrics (deployment logs, SLO data, infrastructure snapshots, regression reports)
    - Schedule incident review meeting

    Use template: $AIWG_ROOT/.../templates/deployment/incident-report-template.md

    Output incident report: .aiwg/deployment/rollback-report-{version}.md
    """
)

Communicate Progress:

❌ Deployment failure detected: {reason}
{If regression triggered}
⚠️ Regression detected: {rate}% > threshold {threshold}%
⏳ Executing automated rollback...
  ✓ Rollback execution: {SUCCESS | FAILED}
  ✓ Rollback validation: {SUCCESS | PARTIAL | FAILED}
  ✓ Regression verification: {ELIMINATED | PERSISTS}
  ✓ Incident declared: {incident-id}
✓ Rollback complete: Production restored to previous version

Step 6: Generate Deployment Report (Success Path)

Purpose: Document successful deployment with metrics and lessons learned

Your Actions:

Synthesize Deployment Summary:

Task(
    subagent_type="deployment-manager",
    description="Generate comprehensive deployment report",
    prompt="""
    Read all deployment artifacts:
    - .aiwg/deployment/deployment-readiness-report.md (pre-flight)
    - .aiwg/deployment/regression-gate-staging.md (staging regression check)
    - .aiwg/working/deployment/execution-log-*.md (deployment actions)
    - .aiwg/deployment/slo-monitoring-report.md (SLO validation)
    - .aiwg/deployment/regression-gate-production.md (production regression check)
    - .aiwg/working/deployment/smoke-test-results-*.md (test results)
    - .aiwg/working/deployment/infrastructure-health-report.md (health checks)

    Generate Deployment Summary Report:

    1. Overview
       - Status: SUCCESS | ROLLBACK
       - Version: {version}
       - Strategy: {Blue-Green | Canary | Rolling}
       - Duration: {total deployment time}

    2. Deployment Timeline
       - Pre-deployment validation: {timestamp, duration}
       - Staging regression gate: {timestamp, result}
       - Deployment execution: {timestamp, duration, strategy milestones}
       - Smoke tests: {timestamp, duration}
       - SLO monitoring: {timestamp, duration}
       - Production regression gate: {timestamp, result}
       - Completion: {timestamp}

    3. Pre-Deployment Validation Results
       - Quality gates: {status}
       - Environment health: {status}
       - Rollback readiness: {status}

    4. Regression Gate Results
       - Staging gate: {PASS | FAIL, regression rate: {rate}%}
       - Production gate: {PASS | FAIL, regression rate: {rate}%}
       - Threshold: {threshold}%
       - Rollback triggered: {YES | NO}

    5. Deployment Execution Details (strategy-specific)
       - {Blue-Green: green deployment, cutover, blue retirement}
       - {Canary: 1-5%, 25%, 50%, 100% progression}
       - {Rolling: node-by-node updates}

    6. SLO Metrics
       - Error rate: {baseline vs. post-deployment vs. target}
       - Latency p99: {baseline vs. post-deployment vs. target}
       - Throughput: {baseline vs. post-deployment}
       - Availability: {uptime percentage}

    7. Smoke Test Results
       - Critical path tests: {passed/total}
       - API tests: {passed/total}
       - Database tests: {passed/total}
       - Integration tests: {passed/total}

    8. Infrastructure Health
       - Pods: {running/total, restarts}
       - Nodes: {healthy/total}
       - Alerts triggered: {count, resolved}

    9. Issues and Resolutions
       - {list any issues encountered and how resolved}

    10. Lessons Learned
       - What went well: {list successes}
       - What could improve: {list improvement opportunities}
       - Action items: {concrete actions for future deployments}

    11. Approvals and Sign-offs
       - Deployment Manager: {name, timestamp}
       - Reliability Engineer: {name, timestamp}
       - Operations Lead: {name, timestamp}

    Use template: $AIWG_ROOT/.../templates/deployment/deployment-report-template.md

    Output: .aiwg/reports/deployment-report-{version}.md
    """
)

Archive Workflow Artifacts:

# You handle archiving directly

Archive complete deployment workflow:
- Move .aiwg/working/deployment/ to .aiwg/archive/{date}/deployment-{version}/
- Create audit trail: .aiwg/archive/{date}/deployment-{version}/audit-trail.md

Audit trail includes:
- Timeline of all agent actions
- Decision points and outcomes
- Regression gate results
- SLO metrics snapshots
- Final deployment report location

Communicate Progress:

⏳ Generating deployment report...
✓ Deployment report complete: .aiwg/reports/deployment-report-{version}.md
✓ Workflow archived: .aiwg/archive/{date}/deployment-{version}/

Step 7: Notify Stakeholders and Update Documentation

Purpose: Communicate deployment outcome and update operational documentation

Your Actions:

Generate Stakeholder Notification:

Task(
    subagent_type="support-lead",
    description="Draft stakeholder notification",
    prompt="""
    Read deployment report: .aiwg/reports/deployment-report-{version}.md

    Draft stakeholder notification email:

    Subject: Production Deployment {SUCCESS | COMPLETED WITH ROLLBACK} - Version {version}

    Body:
    - Deployment status: {SUCCESS | ROLLBACK}
    - Version deployed: {version}
    - Strategy used: {Blue-Green | Canary | Rolling}
    - Duration: {total time}
    - Regression gates: {PASSED | FAILED}
    - SLO metrics: {error rate, latency, availability}
    - User impact: {None | Minimal | Moderate}
    - Next steps: {monitoring, support readiness, known issues}

    Audience:
    - Executive stakeholders
    - Product team
    - Customer support team
    - Engineering team

    Save notification draft to: .aiwg/working/deployment/stakeholder-notification-{version}.md

    Note: User must review and send notification
    """
)

Update Operational Runbooks:

Task(
    subagent_type="deployment-manager",
    description="Update deployment runbooks with lessons learned",
    prompt="""
    Read lessons learned from deployment report

    Update deployment runbooks with improvements:
    - Process improvements identified
    - New smoke tests to add
    - SLO threshold adjustments
    - Regression gate threshold tuning
    - Monitoring improvements
    - Rollback procedure refinements

    Document updates to: .aiwg/deployment/deployment-runbook-updates-{version}.md

    Note: User should review and apply to operational documentation
    """
)

Communicate Progress:

✓ Stakeholder notification drafted: .aiwg/working/deployment/stakeholder-notification-{version}.md
✓ Runbook updates documented: .aiwg/deployment/deployment-runbook-updates-{version}.md

Final Summary Report

Present to User:

─────────────────────────────────────────────
Production Deployment Complete
─────────────────────────────────────────────

**Status**: {SUCCESS | ROLLBACK}
**Version**: {version}
**Strategy**: {Blue-Green | Canary | Rolling}
**Duration**: {total deployment time}

**Timeline**:
- Pre-deployment validation: {duration}
- Staging regression gate: {PASS | FAIL, regression {rate}%}
- Deployment execution: {duration}
- SLO monitoring: {duration}
- Production regression gate: {PASS | FAIL, regression {rate}%}
- Total: {total duration}

**Regression Gates**:
✓ Staging gate: {PASS | FAIL} (threshold: {threshold}%)
✓ Production gate: {PASS | FAIL} (threshold: {threshold}%)
{If regression detected}
⚠️ Regression: {rate}% {≤ | >} threshold {threshold}%
{If rollback triggered}
✓ Automatic rollback: {SUCCESS | FAILED}

**SLO Metrics**:
✓ Error rate: {current}% (target: <{target}%)
✓ Latency p99: {current}ms (target: <{target}ms)
✓ Throughput: {current} req/s (baseline: {baseline} req/s)
✓ Availability: {percentage}%

**Smoke Tests**: {passed}/{total} passed

**Infrastructure**: {pods} pods running, {nodes} nodes healthy

**Artifacts Generated**:
- Deployment Readiness Report: .aiwg/deployment/deployment-readiness-report.md
- Staging Regression Gate: .aiwg/deployment/regression-gate-staging.md
- Production Regression Gate: .aiwg/deployment/regression-gate-production.md
- SLO Monitoring Report: .aiwg/deployment/slo-monitoring-report.md
- Deployment Summary Report: .aiwg/reports/deployment-report-{version}.md
{If rollback}
- Rollback Report: .aiwg/deployment/rollback-report-{version}.md
- Regression Impact Assessment: .aiwg/deployment/regression-impact-assessment-*.md
- Incident Report: {link to incident}

**Next Steps**:
- Review deployment report for lessons learned
- Send stakeholder notification (draft ready)
- Update operational runbooks with improvements
- Continue monitoring SLOs for 24-48 hours
{If rollback}
- Conduct incident review meeting
- Perform root cause analysis on regression
- Fix issues identified
- Plan re-deployment

─────────────────────────────────────────────

Regression-Blocked Deployment Example

Scenario: Staging regression gate detects 8% test failure rate (threshold: 5%)

─────────────────────────────────────────────
Deployment BLOCKED: Staging Regression Gate Failed
─────────────────────────────────────────────

**Status**: BLOCKED
**Version**: v2026.1.5
**Strategy**: Blue-Green
**Regression Detected**: 8.0% > threshold 5.0%

**Staging Regression Gate**:
❌ FAILED: 8.0% regression rate exceeds threshold
- Test failures: 12 new failures (150 total, 138 baseline)
- Metric degradations:
  • Error rate: 0.15% (+0.12% from baseline)
  • Latency p99: 580ms (+18% from baseline)
- Critical path: 2/10 user journeys failing

**Impact Assessment**:
- Severity: HIGH
- Affected areas: Authentication, Payment processing
- Root cause: New rate limiting logic breaking retry logic
- User impact: 15-20% of login attempts would fail

**Remediation Required**:
1. Revert rate limiting changes (commit abc123)
2. Add retry backoff logic to authentication client
3. Update tests to cover rate limiting scenarios
4. Re-deploy to staging and re-run regression check

**Artifacts**:
- Regression gate report: .aiwg/deployment/regression-gate-staging.md
- Impact assessment: .aiwg/deployment/regression-impact-assessment-staging.md

**Decision**: Deployment BLOCKED until regressions resolved

─────────────────────────────────────────────

Quality Gates

Before marking workflow complete, verify:

User Communication

At start: Confirm understanding and strategy

Understood. I'll orchestrate the production deployment.

Strategy: {Blue-Green | Canary | Rolling}
Version: {version}
Duration: {expected deployment time}
Regression Threshold: {threshold}%
Rollback on Regression: {enabled | disabled}

This will:
- Validate deployment readiness (quality gates, environment, rollback)
- Staging regression gate (block if regression > {threshold}%)
- Execute {strategy} deployment with continuous monitoring
- Production regression gate (rollback if regression > {threshold}%)
- Run smoke tests and SLO validation
- Automated rollback on failure or regression
- Generate comprehensive deployment report

Expected duration: {30-90 minutes for deployment, 10-15 minutes orchestration}

Starting orchestration...

During: Update progress with clear indicators

✓ = Complete
⏳ = In progress
❌ = Error/blocked
⚠️ = Warning/attention needed

At end: Summary report with status and artifacts (see Final Summary Report above)

Error Handling

Quality Gate Failure:

❌ Pre-deployment validation failed: {gate-name}

Gaps identified:
- {list missing approvals or failures}

Recommendation: Resolve gate failures before proceeding
- Command: /gate-check transition
- Fix: {specific actions to resolve each gap}

Deployment BLOCKED until all gates pass.

Staging Regression Gate Failure:

❌ Staging regression gate failed: {regression-rate}% > threshold {threshold}%

Regressions detected:
- Test failures: {count new failures}
- Metric degradations: {error rate, latency, throughput}
- Critical paths affected: {list user journeys}

Impact assessment: .aiwg/deployment/regression-impact-assessment-staging.md

Recommendation: Fix regressions before production deployment
- Root cause: {likely causes}
- Fix: {specific remediation actions}
- Re-test: Run /regression-check after fixes

Deployment BLOCKED until regressions resolved.

Environment Health Degraded:

❌ Production environment unhealthy: {issue-description}

Issues:
- {list infrastructure problems}

Recommendation: Resolve environment issues before deploying
- Check: {kubectl commands to diagnose}
- Fix: {specific remediation actions}

Deployment BLOCKED until environment healthy.

Deployment Execution Failure:

❌ Deployment failed during execution: {error-message}

Error: {description}

Recommended Actions:
1. Review deployment logs: {path}
2. Check {specific components}: {diagnostic commands}
3. Fix identified issue
4. Re-run deployment

No rollback needed (failure before cutover to new version).

SLO Breach Detected:

❌ SLO breach detected: {slo-name}
- Current: {current-value}
- Target: {target-value}
- Threshold: {threshold-value}

Automated rollback TRIGGERED

⏳ Executing rollback...
✓ Rollback complete: Production restored to version {previous-version}
✓ SLOs returned to baseline

Incident declared: {incident-id}
Root cause analysis initiated.

Recommended Actions:
1. Review SLO monitoring report: {path}
2. Analyze failure reason: {likely causes}
3. Fix issues identified
4. Validate in staging
5. Plan re-deployment

Production Regression Gate Failure:

❌ Production regression gate failed: {regression-rate}% > threshold {threshold}%

Regressions detected:
- Test failures: {count new failures}
- Metric degradations: {error rate, latency, availability}
- User journeys affected: {list critical paths}

{If --rollback-on-regression enabled}
Automated rollback TRIGGERED

⏳ Executing rollback...
✓ Rollback complete: Production restored to version {previous-version}
✓ Regression verification: {ELIMINATED | PERSISTS}

{If --rollback-on-regression disabled}
⚠️ Manual decision required:
- [rollback] Rollback to previous version
- [accept] Accept regression risk, continue monitoring
- [investigate] Keep deployed, investigate root cause

Regression details: .aiwg/deployment/regression-gate-production.md

Incident declared: {incident-id}
Root cause analysis initiated.

Recommended Actions:
1. Review regression gate report: {path}
2. Analyze regression root cause: {likely causes}
3. Fix issues identified
4. Validate in staging with /regression-check
5. Plan re-deployment or accept risk

Rollback Failure (CRITICAL):

❌ CRITICAL: Rollback failed - {error-message}

This is a P0 production incident.

IMMEDIATE ACTIONS:
1. Escalate to Incident Commander
2. Assemble incident response team
3. Manual intervention required: {specific actions}
4. Notify all stakeholders of service disruption

Incident declared: {incident-id}
Incident Commander: {contact info}

Do NOT attempt re-deployment until incident resolved.

Smoke Test Failure:

❌ Smoke tests failed: {test-name}

Failed tests:
- {list failures with details}

If deployed to production: Automated rollback TRIGGERED
If not yet cutover: Deployment STOPPED (no rollback needed)

Recommended Actions:
1. Review smoke test logs: {path}
2. Diagnose failure: {likely causes}
3. Fix issues identified
4. Re-run smoke tests in staging
5. Plan re-deployment

Success Criteria

This orchestration succeeds when:

Metrics to Track

During orchestration, track:

Deployment duration: {total time from start to completion}
Regression rates: {staging and production}
SLO metrics: {error rate, latency, throughput, availability}
Smoke test pass rate: {passed/total}
Infrastructure health: {pods running, nodes healthy, alerts triggered}
Rollback count: {number of rollbacks triggered}
Regression-triggered rollbacks: {count}
Mean time to rollback (MTTR): {time from failure detection to rollback completion}

References

Templates (via $AIWG_ROOT):

Deployment Plan:

templates/deployment/deployment-plan-card.md

Rollback Plan:

templates/deployment/rollback-plan-card.md

SLO Definition:

templates/deployment/slo-definition-template.md

Smoke Test Checklist:
```
templates/test/smoke-test-checklist.md
```

Deployment Report:

templates/deployment/deployment-report-template.md

Incident Report:

templates/deployment/incident-report-template.md

Gate Criteria:

```
flows/gate-criteria-by-phase.md
```
(Transition section)

Schemas:

@$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/schemas/flows/regression-gate.yaml - Regression gate format

Commands:

@$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/commands/regression-check.md - Regression detection command

Agents:

@$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/agents/regression-analyst.md - Regression analysis agent

Multi-Agent Pattern:

docs/multi-agent-documentation-pattern.md

Orchestrator Architecture:

```
docs/orchestrator-architecture.md
```

Natural Language Translations:

```
docs/simple-language-translations.md
```