Aiwg flow-hypercare-monitoring

Orchestrate a hypercare monitoring period with 24/7 support, SLO tracking, and rapid issue response

install

source · Clone the upstream repo

git clone https://github.com/jmagly/aiwg

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agentic/code/frameworks/sdlc-complete/skills/flow-hypercare-monitoring" ~/.claude/skills/jmagly-aiwg-flow-hypercare-monitoring-0717e6 && rm -rf "$T"

manifest: agentic/code/frameworks/sdlc-complete/skills/flow-hypercare-monitoring/SKILL.md

source content

Hypercare Monitoring Flow

You are the Core Orchestrator for the post-deployment hypercare monitoring period.

Your Role

You orchestrate multi-agent workflows. You do NOT execute bash scripts.

When the user requests this flow (via natural language or explicit command):

Interpret the request and confirm understanding
Read this template as your orchestration guide
Extract agent assignments and workflow steps
Launch agents via Task tool in correct sequence
Synthesize results and finalize artifacts
Report completion with summary

Hypercare Overview

Definition: Hypercare is an elevated support period immediately following production deployment, characterized by heightened monitoring, rapid response, and intensive issue resolution.

Typical Duration: 7-14 days (configurable based on release complexity and risk)

Focus Areas:

Production stability and SLO compliance
Rapid incident identification and response
User adoption and feedback collection
Support team enablement
Smooth transition to business-as-usual operations

Exit Criteria:

Zero P0 (Critical) incidents in last 48 hours
Zero P1 (High) incidents in last 24 hours
All SLOs met for 72 consecutive hours
User adoption metrics trending positive
Support team ready for standard operations
Hypercare report complete and approved
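
As a rough illustration, the exit criteria above can be expressed as a checklist function. This is a minimal sketch, not part of the flow itself: `HypercareStatus` and its field names are hypothetical, and the adoption, readiness, and report criteria are reduced to booleans that a human or agent would have to supply.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class HypercareStatus:
    """Hypothetical snapshot of the inputs the exit criteria refer to."""
    last_p0_at: Optional[datetime]  # most recent P0 incident, None if none occurred
    last_p1_at: Optional[datetime]  # most recent P1 incident, None if none occurred
    slo_met_hours: int              # consecutive hours with every SLO met
    adoption_trending_positive: bool
    support_team_ready: bool
    report_approved: bool


def unmet_exit_criteria(status: HypercareStatus, now: datetime) -> List[str]:
    """Return the exit criteria not yet satisfied (empty list = ready to exit)."""
    unmet = []
    if status.last_p0_at is not None and now - status.last_p0_at < timedelta(hours=48):
        unmet.append("P0 incident in last 48 hours")
    if status.last_p1_at is not None and now - status.last_p1_at < timedelta(hours=24):
        unmet.append("P1 incident in last 24 hours")
    if status.slo_met_hours < 72:
        unmet.append("SLOs met for fewer than 72 consecutive hours")
    if not status.adoption_trending_positive:
        unmet.append("User adoption not trending positive")
    if not status.support_team_ready:
        unmet.append("Support team not ready")
    if not status.report_approved:
        unmet.append("Hypercare report not approved")
    return unmet
```

Returning the list of unmet criteria, rather than a single boolean, keeps the gaps visible when an extension has to be justified.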

Expected Duration: 7-14 days of hypercare (typical); 20-30 minutes of orchestration setup

Natural Language Triggers

Users may say:

"Start hypercare"
"Begin hypercare period"
"Post-launch monitoring"
"24/7 support period"
"Activate hypercare monitoring"
"Launch post-deployment support"

You recognize these as requests for this orchestration flow.

Parameter Handling

Hypercare Duration Parameter

Purpose: Specify hypercare period length

Examples:

/flow-hypercare-monitoring 7 .
/flow-hypercare-monitoring 14 .

Default: 7 days (low-risk deployments), 14 days (high-risk deployments)
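
One way to resolve the duration is sketched below. It assumes the first positional argument arrives as a string and that a risk level of `"low"` or `"high"` is known from context; `resolve_duration` and its `risk` parameter are hypothetical names, not part of the command.

```python
from typing import Optional

# Defaults as stated above: 7 days for low-risk, 14 days for high-risk deployments.
DEFAULT_DAYS = {"low": 7, "high": 14}


def resolve_duration(arg: Optional[str], risk: str = "low") -> int:
    """Return the hypercare length in days, falling back to the risk-based default."""
    if arg is None:
        return DEFAULT_DAYS[risk]
    days = int(arg)  # raises ValueError for non-numeric input
    if days <= 0:
        raise ValueError("hypercare duration must be a positive number of days")
    return days
```

For example, `resolve_duration("14")` returns 14, and `resolve_duration(None, risk="high")` also returns 14.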

--guidance Parameter

Purpose: User provides upfront direction to tailor hypercare priorities

Examples:

--guidance "Focus on security monitoring, financial transaction integrity critical"
--guidance "Performance is key, sub-200ms p95 response time SLO"
--guidance "First production launch, team needs extra support and documentation"
--guidance "High-traffic deployment, anticipate 100K daily active users"

How to Apply:

Parse guidance for keywords: security, performance, compliance, scale, team experience
Adjust agent assignments (add security-gatekeeper, performance-engineer for specific focuses)
Modify monitoring depth (lightweight vs comprehensive based on complexity)
Influence priority ordering (stability vs. adoption focus)
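
The keyword step can be sketched as a plain substring match. This is an illustrative simplification: only the two agent mappings named above (security-gatekeeper, performance-engineer) are included, `parse_guidance` is a hypothetical helper, and real guidance would need looser matching than exact keywords.

```python
from typing import Dict, List

# Keywords listed in the guidance rules above.
FOCUS_KEYWORDS = ("security", "performance", "compliance", "scale", "team experience")

# Only the agent additions the text names explicitly; other focuses add no agent here.
EXTRA_AGENTS = {
    "security": "security-gatekeeper",
    "performance": "performance-engineer",
}


def parse_guidance(guidance: str) -> Dict[str, List[str]]:
    """Detect focus keywords in a --guidance string and map them to extra agents."""
    text = guidance.lower()
    focuses = [keyword for keyword in FOCUS_KEYWORDS if keyword in text]
    agents = [EXTRA_AGENTS[f] for f in focuses if f in EXTRA_AGENTS]
    return {"focuses": focuses, "extra_agents": agents}
```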

--interactive Parameter

Purpose: You ask 5-8 strategic questions to understand project context

Questions to Ask (if --interactive):

I'll ask 8 strategic questions to tailor hypercare to your needs:

Q1: What are your top priorities for hypercare?
    (e.g., stability validation, user adoption, performance monitoring)

Q2: What's the deployment risk level?
    (Helps determine monitoring intensity and duration)

Q3: What are your critical SLOs?
    (Availability, response time, error rate targets)

Q4: What's your expected user volume?
    (Helps set alert thresholds and capacity monitoring)

Q5: What's your support team's experience level?
    (Influences runbook detail and escalation paths)

Q6: What are your biggest concerns about this deployment?
    (These become focus areas for monitoring and validation)

Q7: Are there regulatory or compliance requirements?
    (e.g., HIPAA, SOC2, PCI-DSS - affects audit logging and security monitoring)

Q8: What's your incident response capability?
    (24/7 on-call? Business hours? Helps plan escalation and response)

Based on your answers, I'll adjust:
- Monitoring intensity (alert thresholds, dashboard focus)
- Agent assignments (add specialized monitoring agents)
- Exit criteria strictness (standard vs. elevated)
- Support team guidance level (detailed runbooks vs. minimal)

Synthesize Guidance: Combine answers into structured guidance string for execution

Artifacts to Generate

Primary Deliverables:

Hypercare Team Roster: Roles, on-call rotation, contacts →
```
.aiwg/deployment/hypercare-team-roster.md
```
Production Health Dashboard: Real-time monitoring config →
```
.aiwg/deployment/production-dashboard-config.md
```
Alert Escalation Matrix: Severity definitions and response SLAs →
```
.aiwg/deployment/alert-escalation-matrix.md
```
Daily Hypercare Standups: Status reports (daily) →
```
.aiwg/deployment/hypercare-standup-{YYYY-MM-DD}.md
```
Incident Response Logs: All P0/P1 incidents →
```
.aiwg/deployment/incidents/incident-{ID}.md
```
Risk Retirement Report: Validation evidence →
```
.aiwg/risks/hypercare-risk-validation.md
```
Hypercare Exit Report: Final status and transition plan →
```
.aiwg/reports/hypercare-exit-report.md
```
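
The dated and ID-based paths above follow a fixed pattern, so they can be built with small helpers. A minimal sketch; `standup_path` and `incident_path` are hypothetical names used only for illustration.

```python
from datetime import date


def standup_path(day: date) -> str:
    """Path of the daily standup report for the given day (YYYY-MM-DD in the name)."""
    return f".aiwg/deployment/hypercare-standup-{day:%Y-%m-%d}.md"


def incident_path(incident_id: str) -> str:
    """Path of the incident log for the given incident ID."""
    return f".aiwg/deployment/incidents/incident-{incident_id}.md"
```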

Supporting Artifacts:

SLO tracking logs (hourly updates)
User adoption metrics (daily updates)
Support ticket analysis (daily summary)
Post-incident reviews (PIRs) for all P0/P1
Corrective action tracker

Multi-Agent Orchestration Workflow

Step 1: Establish Hypercare Team and Schedule

Purpose: Create dedicated support structure with clear ownership and 24/7 coverage

Your Actions:

Read Deployment Context:

Read:
- .aiwg/deployment/operational-readiness-review.md (team assignments, contacts)
- .aiwg/deployment/slo-sli-definition.md (SLO targets, monitoring approach)
- .aiwg/deployment/incident-response-runbook.md (escalation paths)

Launch Hypercare Planning Agents (parallel):

# Agent 1: Operations Manager
Task(
    subagent_type="operations-manager",
    description="Create hypercare team roster and on-call rotation",
    prompt="""
    Read ORR team assignments and contacts

    Create Hypercare Team Roster:

    ## Core Team
    - Hypercare Lead: {name} (overall coordination, daily standups)
    - On-Call Engineers: {rotation-schedule} (24/7 coverage)
    - Reliability Engineer: {name} (SLO monitoring, performance analysis)
    - Support Lead: {name} (user-facing issues, ticket triage)
    - DevOps Engineer: {name} (rapid deployment, rollback authority)

    ## Extended Team
    - Product Owner: {name} (prioritization, user impact)
    - Security Gatekeeper: {name} (security incidents)
    - Component Owners: {list by component}

    Create 24/7 On-Call Rotation ({duration} days):
    - Primary on-call schedule (8-hour shifts or daily rotation)
    - Backup on-call contacts
    - Escalation path (P0/P1/P2/P3 response procedures)

    Schedule Daily Standups:
    - Time: {suggest optimal time}
    - Duration: 30 minutes
    - Attendees: Core team (mandatory), Extended team (optional)

    Save to: .aiwg/deployment/hypercare-team-roster.md
    """
)

# Agent 2: Reliability Engineer
Task(
    subagent_type="reliability-engineer",
    description="Configure production monitoring and alerting",
    prompt="""
    Read SLO/SLI definitions

    Configure Production Health Dashboard:

    ## Key Metrics (Auto-Refresh: 30s)

    **Availability**
    - Current Uptime: {percentage}% (Target: ≥99.9%)
    - Service Health: {GREEN | YELLOW | RED}
    - Failed Health Checks: {count}

    **Performance (Last 5 min)**
    - Response Time (p50/p95/p99): {value}ms
    - Throughput: {requests-per-second} req/s
    - Target: p95 < {SLA}ms

    **Errors (Last 5 min)**
    - Error Rate: {percentage}% (Target: <0.1%)
    - 4xx/5xx Errors: {count}
    - Database Errors: {count}

    **Business Metrics**
    - Active Users (Current): {count}
    - Successful Transactions: {count}
    - Transaction Success Rate: {percentage}%

    **Infrastructure**
    - CPU/Memory Utilization: {percentage}%
    - Disk I/O, Network Traffic

    Define alert thresholds for P0/P1/P2/P3 severity levels

    Save to: .aiwg/deployment/production-dashboard-config.md
    """
)

# Agent 3: Support Lead
Task(
    subagent_type="support-lead",
    description="Define alert escalation and incident response",
    prompt="""
    Read incident response runbook

    Create Alert Escalation Matrix:

    ## P0 (Critical) - Page Immediately
    - Availability <99%
    - Error rate >1%
    - All instances down
    - Security breach detected

    Action: Page on-call engineer + Hypercare Lead
    Response SLA: Immediate acknowledgment, 15 min time-to-engage

    ## P1 (High) - Alert Within 5 Minutes
    - Availability <99.5%
    - Error rate >0.5%
    - Response time p95 >2x SLA

    Action: Alert on-call engineer via Slack + SMS
    Response SLA: 30 min acknowledgment, 1 hour time-to-mitigation

    ## P2 (Medium) - Alert Within 30 Minutes
    - Availability <99.9%
    - Error rate >0.1%
    - Resource utilization >80%

    Action: Alert on-call engineer via Slack
    Response SLA: 4 hours

    ## P3 (Low) - Log and Review
    - Minor performance degradation
    - Non-critical errors

    Action: Create ticket for review
    Response SLA: 1 business day

    Document incident response workflow (5 phases plus post-incident review):
    1. Detection (Target: <5 min)
    2. Triage (Target: <15 min)
    3. Investigation (P0=30min, P1=1h)
    4. Mitigation (P0=1h, P1=4h)
    5. Resolution (P0=2h, P1=8h)
    6. Post-Incident Review (Within 48h)

    Save to: .aiwg/deployment/alert-escalation-matrix.md
    """
)
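
The numeric thresholds in the escalation matrix above can be read as a first-match classification, most severe level first. The sketch below covers only the measurable conditions; "all instances down", "security breach detected", and the qualitative P3 cases are left to human judgment. `classify_severity` and its parameter names are hypothetical.

```python
from typing import Optional


def classify_severity(
    availability_pct: float,
    error_rate_pct: float,
    p95_ms: float,
    p95_sla_ms: float,
    utilization_pct: float,
) -> Optional[str]:
    """Map current metrics to P0/P1/P2 using the matrix thresholds, or None if none is crossed."""
    if availability_pct < 99.0 or error_rate_pct > 1.0:
        return "P0"
    if availability_pct < 99.5 or error_rate_pct > 0.5 or p95_ms > 2 * p95_sla_ms:
        return "P1"
    if availability_pct < 99.9 or error_rate_pct > 0.1 or utilization_pct > 80.0:
        return "P2"
    return None
```

Checking the levels in descending severity means a metric that crosses several thresholds is reported at the highest level it reaches.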

Synthesize Hypercare Setup Plan:

# You do this directly (no agent needed)

Read all hypercare planning artifacts

Validate completeness:
- Team roster: All roles assigned?
- On-call rotation: 24/7 coverage confirmed?
- Monitoring: All SLOs tracked?
- Escalation: Response SLAs defined?

Create dedicated communication channel: #hypercare-{project-name}-{YYYY-MM}

Communicate Progress:

✓ Initialized hypercare setup
⏳ Establishing hypercare team and monitoring...
  ✓ Hypercare team roster created (Core + Extended teams)
  ✓ 24/7 on-call rotation scheduled ({duration} days)
  ✓ Production dashboard configured (5 metric categories)
  ✓ Alert escalation matrix defined (P0/P1/P2/P3)
✓ Hypercare infrastructure ready: .aiwg/deployment/

Step 2: Monitor Production Stability and SLOs (Daily)

Purpose: Continuously validate production system meets SLO targets and stability expectations

Your Actions:

Launch SLO Monitoring Agents (automated, repeat daily):

# Agent 1: Reliability Engineer (Daily SLO Report)
Task(
    subagent_type="reliability-engineer",
    description="Generate daily SLO compliance report",
    prompt="""
    Read production metrics from monitoring dashboard
    Read SLO definitions: .aiwg/deployment/slo-sli-definition.md

    Generate Daily SLO Report:

    ## SLO Tracking (Updated Hourly)

    ### Availability SLO
    - Target: ≥99.9% uptime
    - Current (24h): {percentage}%
    - Current (7d): {percentage}%
    - Error Budget Remaining: {percentage}%
    - Status: {ON TARGET | AT RISK | EXCEEDED}

    ### Performance SLO
    - Target: p95 response time <{value}ms
    - Current p95 (24h): {value}ms
    - Current p95 (7d): {value}ms
    - Status: {ON TARGET | AT RISK | EXCEEDED}

    ### Error Rate SLO
    - Target: <0.1% error rate
    - Current (24h): {percentage}%
    - Current (7d): {percentage}%
    - Status: {ON TARGET | AT RISK | EXCEEDED}

    ### Throughput SLO
    - Target: Handle {value} req/s
    - Current Peak: {value} req/s
    - Current Average: {value} req/s
    - Status: {ON TARGET | AT RISK | EXCEEDED}

    Calculate Error Budget Burn Rate:
    - Monthly error budget: {value} minutes downtime allowed
    - Hypercare period budget: {value} minutes
    - Current burn rate: {value} minutes consumed
    - Budget remaining: {percentage}%
    - Assessment: {HEALTHY | MONITOR | CRITICAL}

    If CRITICAL: Recommend change freeze, focus on stability
    If MONITOR: Recommend increased monitoring, defer risky changes

    Save to: .aiwg/deployment/slo-report-{YYYY-MM-DD}.md
    """
)
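
As a worked example of the error budget arithmetic in the report above: a 99.9% availability target over a 30-day month allows 30 × 24 × 60 × 0.001 = 43.2 minutes of downtime, and a 7-day hypercare period prorates to 10.08 minutes. The sketch below assumes a 30-day month and simple proration. The HEALTHY / MONITOR / CRITICAL cut-offs (50% and 20% remaining) are illustrative assumptions, since the flow does not define them.

```python
def error_budget_minutes(slo_pct: float, days: float) -> float:
    """Allowed downtime in minutes for an availability SLO over a period of `days`."""
    return days * 24 * 60 * (1 - slo_pct / 100)


def budget_remaining_pct(consumed_minutes: float, budget_minutes: float) -> float:
    """Percentage of the error budget still unspent."""
    return 100 * (1 - consumed_minutes / budget_minutes)


def assess(remaining_pct: float, monitor_below: float = 50.0, critical_below: float = 20.0) -> str:
    """Classify budget health; both thresholds are assumed values, not from the flow."""
    if remaining_pct < critical_below:
        return "CRITICAL"
    if remaining_pct < monitor_below:
        return "MONITOR"
    return "HEALTHY"
```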

# Agent 2: Support Lead (Daily Support Analysis)
Task(
    subagent_type="support-lead",
    description="Analyze user adoption and support tickets",
    prompt="""
    Read support ticket system
    Read user analytics

    Generate User Adoption Dashboard:

    ### Active Users
    - DAU (Daily Active Users): {count} (Target: >{target})
    - WAU/MAU: {count}
    - User Growth Rate: {+/-percentage}%

    ### Feature Adoption (New Features)
    For each new feature:
    - Total Users: {count}
    - Users Engaged: {count} ({percentage}%)
    - Adoption Rate: {percentage}% (Target: >{target}%)
    - Trend: {INCREASING | STABLE | DECREASING}

    ### Support Ticket Analysis
    - Total Tickets (24h): {count}
    - By Category: Bug Reports, How-To, Performance, etc.
    - Critical Issues: {count} (blockers)
    - Average Response Time: {value}h (Target: <{SLA}h)

    ### User Feedback Summary
    - Sentiment: {POSITIVE | NEUTRAL | NEGATIVE} ({percentage}%)
    - Top Issues: {list top 3}
    - Top Praises: {list top 3}

    Flag Critical User Blockers (if any)

    Save to: .aiwg/deployment/user-adoption-{YYYY-MM-DD}.md
    """
)

Incident Tracking (on-demand per incident):

# When incident detected:
Task(
    subagent_type="devops-engineer",
    description="Document and respond to incident {incident-ID}",
    prompt="""
    Incident detected: {incident-description}
    Severity: {P0 | P1 | P2 | P3}

    Follow Incident Response Workflow:

    1. Detection (<5 min):
       - Alert acknowledged
       - Initial severity assessment
       - Create incident channel: #incident-{YYYY-MM-DD}-{ID}

    2. Triage (<15 min):
       - Gather evidence (logs, metrics, user reports)
       - Identify affected systems/users
       - Estimate business impact
       - Engage Component Owners
       - Update severity if needed

    3. Investigation (P0=30min, P1=1h):
       - Review logs/metrics for root cause
       - Check recent deployments/changes
       - Reproduce in non-prod if possible
       - Identify probable root cause

    4. Mitigation (P0=1h, P1=4h):
       - Execute mitigation (rollback/hotfix/config change)
       - Validate effectiveness
       - Monitor for regression

    5. Resolution (P0=2h, P1=8h):
       - Confirm fully resolved
       - Validate SLOs back to normal
       - Close incident

    Document incident timeline and actions

    Save to: .aiwg/deployment/incidents/incident-{ID}.md

    If P0/P1: Schedule post-incident review within 48h
    """
)
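
The per-severity phase targets above can be kept in a lookup table to flag phases that are running late. This sketch assumes each target is measured as elapsed time since detection, which the flow does not state explicitly; `PHASE_TARGETS` and `overdue_phases` are hypothetical names.

```python
from datetime import timedelta
from typing import Dict, List

# Targets copied from the workflow above; interpreted as time elapsed since detection.
PHASE_TARGETS: Dict[str, Dict[str, timedelta]] = {
    "P0": {
        "investigation": timedelta(minutes=30),
        "mitigation": timedelta(hours=1),
        "resolution": timedelta(hours=2),
    },
    "P1": {
        "investigation": timedelta(hours=1),
        "mitigation": timedelta(hours=4),
        "resolution": timedelta(hours=8),
    },
}


def overdue_phases(severity: str, elapsed: timedelta, completed: List[str]) -> List[str]:
    """Return phases that are not completed and whose target has already passed."""
    targets = PHASE_TARGETS.get(severity, {})  # P2/P3 have no phase targets here
    return [
        phase
        for phase, limit in targets.items()
        if phase not in completed and elapsed > limit
    ]
```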

Communicate Progress (daily update):

✓ Hypercare Day {N} of {duration}
⏳ Monitoring production stability...
  ✓ SLO compliance: {percentage}% of SLOs met (target: 100%)
  ✓ Incidents (24h): {count} total (P0: {count}, P1: {count}, P2: {count})
  ✓ User adoption: {percentage}% ({trend})
  ✓ Support tickets: {count} (Trend: {↑/→/↓})
✓ Daily reports: .aiwg/deployment/slo-report-{date}.md, user-adoption-{date}.md

Step 3: Conduct Daily Hypercare Standups

Purpose: Maintain team alignment, surface issues early, coordinate rapid response

Your Actions:

Generate Daily Standup Report (automated):

Task(
    subagent_type="operations-manager",
    description="Generate daily hypercare standup report",
    prompt="""
    Read daily reports:
    - .aiwg/deployment/slo-report-{YYYY-MM-DD}.md
    - .aiwg/deployment/user-adoption-{YYYY-MM-DD}.md
    - .aiwg/deployment/incidents/* (all open/recent incidents)

    Create Daily Standup Agenda:

    ## Hypercare Daily Standup - Day {N} of {duration}

    **Date**: {YYYY-MM-DD}
    **Facilitator**: {Hypercare Lead}

    ### 1. Production Health Review (5 min)
    **Presented by**: Reliability Engineer

    - Availability: {percentage}% (Target: ≥99.9%) - {STATUS}
    - Performance: p95 {value}ms (Target: <{SLA}ms) - {STATUS}
    - Error Rate: {percentage}% (Target: <0.1%) - {STATUS}
    - Error Budget: {percentage}% remaining - {STATUS}

    Overall Health: {GREEN | YELLOW | RED}

    ### 2. Incident Summary (Last 24h) (10 min)
    **Presented by**: On-Call Engineer

    Total Incidents: {count}
    - P0 (Critical): {count} - {list titles if any}
    - P1 (High): {count} - {list titles if any}
    - P2 (Medium): {count}
    - P3 (Low): {count}

    Key Incidents:
    For each P0/P1:
    - Incident-ID: {title}
    - Status: {Open/Resolved/Closed}
    - Impact: {user-count} users, {duration} minutes
    - Root Cause: {brief description}
    - Action Items: {list}

    Patterns/Trends: {emerging issues or recurring problems}

    ### 3. User Feedback Review (5 min)
    **Presented by**: Support Lead

    - Support Tickets (24h): {count} (Trend: {↑/→/↓})
    - Critical User Issues: {count}
    - Top Complaints: {list top 3}
    - Top Praises: {list top 3}
    - Sentiment: {POSITIVE | NEUTRAL | NEGATIVE}

    Blockers for Users: {list critical issues}

    ### 4. SLO/SLI Status (5 min)
    **Presented by**: Reliability Engineer

    | SLO | Target | Current (24h) | Status |
    |-----|--------|---------------|--------|
    | Availability | ≥99.9% | {percentage}% | {✓/⚠/✗} |
    | Response Time | p95<{value}ms | {value}ms | {✓/⚠/✗} |
    | Error Rate | <0.1% | {percentage}% | {✓/⚠/✗} |
    | Throughput | >{value} req/s | {value} req/s | {✓/⚠/✗} |

    ### 5. Action Items and Blockers (5 min)

    Open Action Items:
    | Action | Owner | Due Date | Status |
    |--------|-------|----------|--------|
    {list open actions}

    New Blockers:
    {list blockers requiring escalation}

    Tomorrow's On-Call: {name} (taking over at {HH:MM})

    ---

    Overall Status: {GREEN | YELLOW | RED}

    Key Decisions Made:
    {list decisions from standup}

    New Action Items:
    {list new actions assigned}

    Save to: .aiwg/deployment/hypercare-standup-{YYYY-MM-DD}.md
    """
)

Weekly Summary (if hypercare > 7 days):

# On Day 7, 14, etc.:
Task(
    subagent_type="operations-manager",
    description="Generate weekly hypercare summary",
    prompt="""
    Read all daily standups for week: .aiwg/deployment/hypercare-standup-*.md

    Create Weekly Summary:

    ## Hypercare Week {N} Summary

    **Week**: {date-range}
    **Overall Status**: {GREEN | YELLOW | RED}

    ### Production Stability
    - Availability: {percentage}% (Target: ≥99.9%)
    - Total Incidents: {count} (P0: {count}, P1: {count})
    - MTTR: {value} min
    - SLO Compliance: {percentage}%

    ### User Adoption
    - Active Users: {count} ({+/-percentage}% vs. previous week)
    - Feature Adoption: {percentage}%
    - User Sentiment: {POSITIVE | NEUTRAL | NEGATIVE}

    ### Support Health
    - Support Tickets: {count} ({+/-percentage}% vs. previous week)
    - Critical Issues: {count}
    - Response Time: {value}h (Target: <{SLA}h)

    ### Accomplishments
    {list accomplishments}

    ### Challenges
    {list challenges}

    ### Next Week Focus
    {list focus areas}

    Save to: .aiwg/reports/hypercare-week-{N}-summary.md
    """
)

Communicate Progress:

⏳ Conducting daily standup...
✓ Daily standup report generated: .aiwg/deployment/hypercare-standup-{date}.md
  - Overall Health: {GREEN | YELLOW | RED}
  - Key Decisions: {count}
  - New Action Items: {count}
  - Escalations: {count}

Step 4: Post-Incident Reviews (For P0/P1 Incidents)

Purpose: Document root cause and corrective actions for all critical incidents

Your Actions:

For Each P0/P1 Incident (within 48h of resolution):

Task(
    subagent_type="reliability-engineer",
    description="Conduct post-incident review for {incident-ID}",
    prompt="""
    Read incident log: .aiwg/deployment/incidents/incident-{ID}.md

    Create Post-Incident Review (PIR):

    ## Post-Incident Review: {Incident-ID}

    **Date**: {YYYY-MM-DD}
    **Severity**: {P0/P1/P2/P3}
    **Duration**: {detection-to-resolution}
    **Impact**: {user-count} users, {downtime-minutes} minutes downtime

    ### Incident Summary
    {1-2 sentence description of what happened}

    ### Timeline
    | Time | Event | Actor |
    |------|-------|-------|
    {incident timeline from detection to resolution}

    ### Root Cause
    {Detailed technical root cause analysis}

    ### Contributing Factors
    1. {Factor 1 - e.g., insufficient testing}
    2. {Factor 2 - e.g., monitoring gap}
    3. {Factor 3 - e.g., unclear runbook}

    ### Corrective Actions
    | Action | Owner | Due Date | Status |
    |--------|-------|----------|--------|
    {list corrective actions to prevent recurrence}

    ### Lessons Learned
    - What went well: {list}
    - What could improve: {list}
    - Process changes needed: {list}

    Save to: .aiwg/deployment/incidents/pir-{ID}.md

    Update incident log with PIR link
    Track corrective actions in action tracker
    """
)

Communicate Progress:

⏳ Conducting post-incident reviews...
✓ PIR complete: Incident-{ID} ({title})
  - Root cause: {summary}
  - Corrective actions: {count} assigned
  - Status: Tracking to completion

Step 5: Validate Exit Criteria and Generate Hypercare Report

Purpose: Ensure production is stable and support team is ready before ending hypercare

Your Actions:

Validate Exit Criteria (on final day or when user requests):

Task(
    subagent_type="operations-manager",
    description="Validate hypercare exit criteria",
    prompt="""
    Read all hypercare artifacts

    Validate Hypercare Exit Criteria:

    ## Hypercare Exit Criteria Validation

    **Hypercare Period**: Day {N} of {duration}
    **Validation Date**: {YYYY-MM-DD}

    ### Production Stability
    - [ ] Zero P0 (Critical) incidents in last 48 hours
    - [ ] Zero P1 (High) incidents in last 24 hours
    - [ ] All SLOs met for 72 consecutive hours
      - [ ] Availability ≥99.9%
      - [ ] Response time p95 <{SLA}ms
      - [ ] Error rate <0.1%
      - [ ] Throughput >{target} req/s
    - [ ] Error budget healthy: >{percentage}% remaining
    - [ ] No open P0/P1 incidents

    ### User Adoption
    - [ ] User adoption trending positive ({percentage}% growth)
    - [ ] Feature adoption >{target}% for critical features
    - [ ] User sentiment majority positive (≥70%)
    - [ ] Support ticket volume stable or decreasing
    - [ ] No critical user blockers unresolved

    ### Support Readiness
    - [ ] Support team trained and confident
    - [ ] Runbooks validated (all common issues documented)
    - [ ] Escalation paths tested and effective
    - [ ] Knowledge base updated with hypercare learnings
    - [ ] On-call rotation transitioned to standard support

    ### Documentation Complete
    - [ ] Hypercare report completed
    - [ ] Post-incident reviews completed (all P0/P1)
    - [ ] Corrective actions tracked (assigned, due dates set)
    - [ ] Lessons learned documented
    - [ ] Runbooks updated

    Overall Exit Criteria Status: {PASS | CONDITIONAL | FAIL}
    Decision: {END HYPERCARE | EXTEND HYPERCARE | ESCALATE}

    Save to: .aiwg/reports/hypercare-exit-criteria.md
    """
)

Generate Hypercare Exit Report (comprehensive final report):

Task(
    subagent_type="operations-manager",
    description="Generate comprehensive hypercare exit report",
    prompt="""
    Read all hypercare artifacts:
    - .aiwg/deployment/hypercare-team-roster.md
    - .aiwg/deployment/slo-report-*.md (all days)
    - .aiwg/deployment/user-adoption-*.md (all days)
    - .aiwg/deployment/hypercare-standup-*.md (all days)
    - .aiwg/deployment/incidents/*.md (all incidents)
    - .aiwg/reports/hypercare-exit-criteria.md

    Generate Hypercare Exit Report:

    # Hypercare Report: {Project-Name}

    **Hypercare Period**: {start-date} to {end-date} ({duration} days)
    **Report Date**: {YYYY-MM-DD}
    **Report Author**: {Hypercare Lead}

    ## Executive Summary

    {2-3 sentence summary of hypercare outcomes}

    Overall Status: {SUCCESS | SUCCESS WITH CONDITIONS | CHALLENGES}

    Key Metrics:
    - Availability: {percentage}%
    - Total Incidents: {count} (P0: {count}, P1: {count})
    - User Adoption: {percentage}%
    - Support Tickets: {count}

    ## Production Stability Summary

    ### SLO Performance
    | SLO | Target | Achieved | Status |
    |-----|--------|----------|--------|
    {SLO compliance table}

    SLO Compliance Rate: {percentage}%

    ### Incident Summary
    Total Incidents: {count}
    - P0 (Critical): {count}
    - P1 (High): {count}
    - P2 (Medium): {count}
    - P3 (Low): {count}

    Key Metrics:
    - MTTD (Mean Time to Detect): {value} min
    - MTTA (Mean Time to Acknowledge): {value} min
    - MTTR (Mean Time to Resolve): {value} min

    Major Incidents:
    For each P0/P1:
    - Incident-ID: {title}
      - Date, Duration, Impact, Root Cause, Resolution
      - Corrective Actions: {count} assigned

    ### Performance Trends
    - Response Time: {IMPROVED | STABLE | DEGRADED} ({+/-percentage}% vs. pre-deployment)
    - Error Rate: {IMPROVED | STABLE | DEGRADED}
    - Resource Utilization: {HEALTHY | CONCERNING}

    ## User Adoption Summary

    ### Adoption Metrics
    - Active Users: {count} ({+/-percentage}% vs. pre-deployment)
    - Feature Adoption: {percentage}% (Target: >{target}%)
    - User Retention (Day 14): {percentage}%

    ### User Feedback
    - Total Feedback Items: {count}
    - Sentiment: {percentage}% positive
    - Net Promoter Score: {value}

    Top Praises: {list top 3}
    Top Complaints: {list top 3 with resolution status}

    ## Support Summary

    ### Ticket Volume
    - Total Support Tickets: {count}
    - Daily Average: {count} tickets/day
    - Trend: {DECREASING | STABLE | INCREASING}

    ### Support Performance
    - Average Response Time: {value}h (Target: <{SLA}h) - {✓/⚠/✗}
    - First Contact Resolution: {percentage}%

    ### Support Team Readiness
    - Team Confidence Level: {HIGH | MEDIUM | LOW}
    - Runbook Completeness: {percentage}%

    ## Lessons Learned

    ### What Went Well
    {list successes}

    ### What Could Improve
    {list improvements}

    ### Process Recommendations
    {list recommendations for future deployments}

    ## Corrective Actions

    Total Actions Identified: {count}

    | Action | Category | Owner | Due Date | Status |
    |--------|----------|-------|----------|--------|
    {corrective actions table}

    ## Handover to Standard Support

    ### Transition Plan
    - [ ] Standard on-call rotation activated (starting {date})
    - [ ] Support runbooks transferred
    - [ ] Knowledge base published
    - [ ] Support team training complete
    - [ ] Escalation paths updated for BAU

    ### Post-Hypercare Monitoring
    - Duration: {duration} days continued close monitoring
    - Responsible: {Support Lead}
    - Review Cadence: Weekly check-ins for {duration} weeks

    ## Conclusion

    {2-3 sentence summary and readiness for standard support}

    Recommendation: {END HYPERCARE | EXTEND HYPERCARE}

    Signoff:
    - Hypercare Lead: {name} - {date}
    - Reliability Engineer: {name} - {date}
    - Support Lead: {name} - {date}
    - Product Owner: {name} - {date}
    - Project Manager: {name} - {date}

    Save to: .aiwg/reports/hypercare-exit-report.md
    """
)

Present Exit Summary to User:

# You present this directly (not via agent)

Read .aiwg/reports/hypercare-exit-report.md

Present summary:
─────────────────────────────────────────────
Hypercare Monitoring Period Complete
─────────────────────────────────────────────

**Hypercare Period**: {start-date} to {end-date} ({duration} days)
**Overall Status**: {SUCCESS | SUCCESS WITH CONDITIONS | CHALLENGES}

**Key Metrics**:
✓ Availability: {percentage}% (Target: ≥99.9%)
✓ Total Incidents: {count} (P0: {count}, P1: {count})
✓ User Adoption: {percentage}% of target
✓ Support Readiness: Team confident and ready

**Exit Criteria Status**:
✓ Production Stability: {PASS | CONDITIONAL | FAIL}
✓ User Adoption: {PASS | CONDITIONAL | FAIL}
✓ Support Readiness: {PASS | CONDITIONAL | FAIL}
✓ Documentation: {PASS | CONDITIONAL | FAIL}

**Decision**: {END HYPERCARE | EXTEND HYPERCARE | ESCALATE}

**Artifacts Generated**:
- Hypercare Team Roster (.aiwg/deployment/hypercare-team-roster.md)
- Production Dashboard Config (.aiwg/deployment/production-dashboard-config.md)
- Alert Escalation Matrix (.aiwg/deployment/alert-escalation-matrix.md)
- Daily Standup Reports (.aiwg/deployment/hypercare-standup-*.md, {count} files)
- SLO Reports (.aiwg/deployment/slo-report-*.md, {count} files)
- User Adoption Reports (.aiwg/deployment/user-adoption-*.md, {count} files)
- Incident Logs (.aiwg/deployment/incidents/*.md, {count} files)
- Post-Incident Reviews (.aiwg/deployment/incidents/pir-*.md, {count} files)
- Hypercare Exit Report (.aiwg/reports/hypercare-exit-report.md)

**Next Steps**:
- Review hypercare exit report with stakeholders
- Obtain formal signoffs (5 required signatures)
- If END HYPERCARE: Transition to standard support (run handoff workflow)
- If EXTEND HYPERCARE: Address gaps, continue monitoring
- If ESCALATE: Executive decision required

**Transition to Standard Support**:
- Standard on-call rotation activated: {date}
- Continued monitoring period: {duration} days
- Weekly check-ins scheduled

─────────────────────────────────────────────

Communicate Progress:

⏳ Validating hypercare exit criteria...
✓ Exit criteria validated: {PASS | CONDITIONAL | FAIL}
✓ Hypercare Exit Report generated: .aiwg/reports/hypercare-exit-report.md
✓ Transition plan documented

Quality Gates

Before marking workflow complete, verify:

Hypercare team established with 24/7 coverage
Production monitoring operational (dashboards, alerts)
Daily standups conducted and documented
All P0/P1 incidents have post-incident reviews
SLO compliance tracked daily
User adoption monitored and reported
Exit criteria validated
Hypercare exit report complete and approved
Transition to standard support planned

User Communication

At start: Confirm understanding and list activities

Understood. I'll orchestrate the hypercare monitoring period.

Hypercare Duration: {duration} days
Hypercare Period: {start-date} to {estimated-end-date}

This will establish:
- Hypercare team roster and 24/7 on-call rotation
- Production health monitoring dashboards
- Alert escalation and incident response procedures
- Daily standup coordination
- SLO tracking and user adoption monitoring
- Post-incident review process
- Hypercare exit criteria validation

I'll coordinate multiple agents for comprehensive monitoring and support.
Expected setup: 20-30 minutes.

Starting orchestration...

During: Update progress with clear indicators

✓ = Complete
⏳ = In progress
❌ = Error/blocked
⚠️ = Warning/attention needed

Daily: Provide daily status summary

Hypercare Day {N} of {duration}: {GREEN | YELLOW | RED}

Production Health:
✓ Availability: {percentage}% (Target: ≥99.9%)
✓ Performance: p95 {value}ms (Target: <{SLA}ms)
{⚠️ | ✓} Error Rate: {percentage}% (Target: <0.1%)

Incidents (24h):
- P0: {count}
- P1: {count}
- P2: {count}

User Adoption: {percentage}% ({trend})

Daily reports: .aiwg/deployment/hypercare-standup-{date}.md

At end: Summary report (see "Present Exit Summary to User" in Step 5 above)

Error Handling

If P0 Incident During Hypercare:

❌ Critical incident detected - immediate response initiated

Incident: {incident-ID} - {title}
Severity: P0 (Complete outage / Data loss / Security breach)
Impact: {user-count} users affected

Actions:
1. On-call engineer + Hypercare Lead paged
2. Incident war room created: #incident-{date}-{ID}
3. Executive Sponsor notified
4. Status page updated

Response Timeline:
- Detection: {timestamp}
- Acknowledgment: {timestamp} (Target: Immediate)
- Time-to-engage: {minutes} min (Target: <15 min)

Current Status: {INVESTIGATING | MITIGATING | RESOLVED}

Impact on Exit Criteria: P0 incident resets 48h "zero critical incidents" requirement

Monitoring incident response...
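
Because each new incident restarts its quiet-period clock, the earliest possible exit time can be estimated from the most recent events. A minimal sketch, assuming the 48h (P0), 24h (P1), and 72h (SLO) windows from the exit criteria; `earliest_exit` is a hypothetical helper and ignores the non-time-based criteria.

```python
from datetime import datetime, timedelta
from typing import Optional


def earliest_exit(
    last_p0: Optional[datetime],
    last_p1: Optional[datetime],
    last_slo_breach: Optional[datetime],
) -> Optional[datetime]:
    """Earliest time all three quiet periods are satisfied, or None if nothing has occurred."""
    candidates = []
    if last_p0 is not None:
        candidates.append(last_p0 + timedelta(hours=48))
    if last_p1 is not None:
        candidates.append(last_p1 + timedelta(hours=24))
    if last_slo_breach is not None:
        candidates.append(last_slo_breach + timedelta(hours=72))
    return max(candidates) if candidates else None
```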

If SLO Breach:

⚠️ SLO breach detected - immediate investigation required

SLO Breached: {SLO-name}
- Target: {target-value}
- Current: {actual-value}
- Duration: {duration} (continuous breach)

Impact:
- Error budget consumed: {percentage}%
- User impact: {description}

Actions:
1. Reliability Engineer investigating root cause
2. Metrics and logs under review
3. Mitigation plan in progress

If breach persists >24h: Recommend extending hypercare period
If error budget critically low: Recommend change freeze

Monitoring for improvement...

If User Adoption Low:

⚠️ User adoption below target

Current Adoption: {percentage}% (Target: >{target}%)
Gap: {percentage} points

Analysis:
- Top User Issues: {list issues}
- Support Ticket Themes: {list themes}
- Potential Blockers: {list blockers}

Actions:
1. Product Owner engaged for adoption analysis
2. Support team reviewing common user issues
3. Documentation and training gaps identified

Decision Point:
- If blockers identified: Prioritize fixes, may extend hypercare
- If education needed: Launch awareness campaign
- If feature not valuable: Escalate to stakeholders

Impact on Exit Criteria: User adoption trend must improve before exit approval

If Support Team Overwhelmed:

⚠️ Support team capacity exceeded

Support Volume: {count} tickets/day (Capacity: {capacity})
Team Status: {STRESSED | OVERWHELMED}

Root Cause Analysis:
- Top Issue Categories: {list categories with counts}
- Product Bugs vs User Education: {ratio}

Immediate Relief Actions:
1. Additional support staff brought in (temp)
2. Engineering team handling overflow tickets
3. Workarounds created for top issues
4. FAQ and self-service guides published

Mitigation:
- Deploy hotfixes for high-frequency bugs
- Update documentation for common questions
- Additional training sessions scheduled

Impact on Exit Criteria: Support team must be confident and staffed before exit

If Exit Criteria Not Met:

⚠️ Hypercare exit criteria not met - extension recommended

Exit Criteria Status: {FAIL | CONDITIONAL}

Gaps Identified:
{list unmet criteria with details}

Recommendation: {EXTEND HYPERCARE | CONDITIONAL EXIT | ESCALATE}

Extension Plan:
- Additional Duration: {days} days
- Focus Areas: {list areas needing improvement}
- Re-validation Date: {date}

Escalating to user for decision...

Success Criteria

This orchestration succeeds when:

Metrics to Track

During orchestration, track:

SLO compliance rate: % of SLOs met (target: 100% for 72h before exit)
Incident frequency: # of P0/P1/P2/P3 incidents (target: zero P0 in the final 48h, zero P1 in the final 24h)
Mean time to detect (MTTD): Minutes from incident start to detection (target: <5 min)
Mean time to resolve (MTTR): Minutes from detection to resolution (target: P0 <120 min, P1 <480 min)
Error budget burn rate: % of monthly budget consumed (target: <50% during hypercare)
User adoption rate: % of target users actively engaged (target: ≥70%)
Support ticket volume: # of tickets/day (target: decreasing trend)
Support response time: Hours to first response (target: <SLA)
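
MTTD and MTTR as defined above are simple averages over incident timestamps. The sketch below assumes each incident record carries start, detection, and resolution times; the `Incident` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record with the three timestamps the metrics need."""
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime


def mttd_minutes(incidents: List[Incident]) -> float:
    """Mean minutes from incident start to detection."""
    return mean((i.detected_at - i.started_at).total_seconds() / 60 for i in incidents)


def mttr_minutes(incidents: List[Incident]) -> float:
    """Mean minutes from detection to resolution."""
    return mean((i.resolved_at - i.detected_at).total_seconds() / 60 for i in incidents)
```

Both functions raise `statistics.StatisticsError` on an empty list, so a period with no incidents needs to be handled by the caller.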

References

Templates (via $AIWG_ROOT):

Operational Readiness Review:

templates/deployment/operational-readiness-review-template.md

SLO/SLI Definition:

templates/deployment/slo-sli-template.md

Incident Response Runbook:

templates/support/incident-response-runbook-template.md

Support Plan:

templates/support/support-plan-template.md

Related Flows:

Gate Check:
```
commands/flow-gate-check.md
```
Handoff Checklist:
```
commands/flow-handoff-checklist.md
```
Deployment Workflow:
```
commands/flow-deployment-workflow.md
```

SDLC Phase Context:

Phase: Transition (Deployment → Operations)
Milestone: Hypercare Complete (transition to BAU support)