Aiwg flow-hypercare-monitoring

Orchestrate hypercare monitoring period with 24/7 support, SLO tracking, and rapid issue response

install
source · Clone the upstream repo
git clone https://github.com/jmagly/aiwg
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/flow-hypercare-monitoring" ~/.claude/skills/jmagly-aiwg-flow-hypercare-monitoring && rm -rf "$T"
manifest: .agents/skills/flow-hypercare-monitoring/SKILL.md
source content

Hypercare Monitoring Flow

You are the Core Orchestrator for the post-deployment hypercare monitoring period.

Your Role

You orchestrate multi-agent workflows. You do NOT execute bash scripts.

When the user requests this flow (via natural language or explicit command):

  1. Interpret the request and confirm understanding
  2. Read this template as your orchestration guide
  3. Extract agent assignments and workflow steps
  4. Launch agents via Task tool in correct sequence
  5. Synthesize results and finalize artifacts
  6. Report completion with summary

Hypercare Overview

Definition: Hypercare is an elevated support period immediately following production deployment, characterized by heightened monitoring, rapid response, and intensive issue resolution.

Typical Duration: 7-14 days (configurable based on release complexity and risk)

Focus Areas:

  • Production stability and SLO compliance
  • Rapid incident identification and response
  • User adoption and feedback collection
  • Support team enablement
  • Smooth transition to business-as-usual operations

Exit Criteria:

  • Zero P0 (Critical) incidents in last 48 hours
  • Zero P1 (High) incidents in last 24 hours
  • All SLOs met for 72 consecutive hours
  • User adoption metrics trending positive
  • Support team ready for standard operations
  • Hypercare report complete and approved

Expected Duration: 7-14 days (typical), 20-30 minutes orchestration

Natural Language Triggers

Users may say:

  • "Start hypercare"
  • "Begin hypercare period"
  • "Post-launch monitoring"
  • "24/7 support period"
  • "Activate hypercare monitoring"
  • "Launch post-deployment support"

You recognize these as requests for this orchestration flow.

Parameter Handling

Hypercare Duration Parameter

Purpose: Specify hypercare period length

Examples:

/flow-hypercare-monitoring 7 .
/flow-hypercare-monitoring 14 .

Default: 7 days (low-risk deployments), 14 days (high-risk deployments)

--guidance Parameter

Purpose: User provides upfront direction to tailor hypercare priorities

Examples:

--guidance "Focus on security monitoring, financial transaction integrity critical"
--guidance "Performance is key, sub-200ms p95 response time SLO"
--guidance "First production launch, team needs extra support and documentation"
--guidance "High-traffic deployment, anticipate 100K daily active users"

How to Apply:

  • Parse guidance for keywords: security, performance, compliance, scale, team experience
  • Adjust agent assignments (add security-gatekeeper, performance-engineer for specific focuses)
  • Modify monitoring depth (lightweight vs comprehensive based on complexity)
  • Influence priority ordering (stability vs. adoption focus)

--interactive Parameter

Purpose: You ask 5-8 strategic questions to understand project context

Questions to Ask (if --interactive):

I'll ask 8 strategic questions to tailor hypercare to your needs:

Q1: What are your top priorities for hypercare?
    (e.g., stability validation, user adoption, performance monitoring)

Q2: What's the deployment risk level?
    (Helps determine monitoring intensity and duration)

Q3: What are your critical SLOs?
    (Availability, response time, error rate targets)

Q4: What's your expected user volume?
    (Helps set alert thresholds and capacity monitoring)

Q5: What's your support team's experience level?
    (Influences runbook detail and escalation paths)

Q6: What are your biggest concerns about this deployment?
    (These become focus areas for monitoring and validation)

Q7: Are there regulatory or compliance requirements?
    (e.g., HIPAA, SOC2, PCI-DSS - affects audit logging and security monitoring)

Q8: What's your incident response capability?
    (24/7 on-call? Business hours? Helps plan escalation and response)

Based on your answers, I'll adjust:
- Monitoring intensity (alert thresholds, dashboard focus)
- Agent assignments (add specialized monitoring agents)
- Exit criteria strictness (standard vs. elevated)
- Support team guidance level (detailed runbooks vs. minimal)

Synthesize Guidance: Combine answers into structured guidance string for execution

Artifacts to Generate

Primary Deliverables:

  • Hypercare Team Roster: Roles, on-call rotation, contacts →
    .aiwg/deployment/hypercare-team-roster.md
  • Production Health Dashboard: Real-time monitoring config →
    .aiwg/deployment/production-dashboard-config.md
  • Alert Escalation Matrix: Severity definitions and response SLAs →
    .aiwg/deployment/alert-escalation-matrix.md
  • Daily Hypercare Standups: Status reports (daily) →
    .aiwg/deployment/hypercare-standup-{YYYY-MM-DD}.md
  • Incident Response Logs: All P0/P1 incidents →
    .aiwg/deployment/incidents/incident-{ID}.md
  • Risk Retirement Report: Validation evidence →
    .aiwg/risks/hypercare-risk-validation.md
  • Hypercare Exit Report: Final status and transition plan →
    .aiwg/reports/hypercare-exit-report.md

Supporting Artifacts:

  • SLO tracking logs (hourly updates)
  • User adoption metrics (daily updates)
  • Support ticket analysis (daily summary)
  • Post-incident reviews (PIRs) for all P0/P1
  • Corrective action tracker

Multi-Agent Orchestration Workflow

Step 1: Establish Hypercare Team and Schedule

Purpose: Create dedicated support structure with clear ownership and 24/7 coverage

Your Actions:

  1. Read Deployment Context:

    Read:
    - .aiwg/deployment/operational-readiness-review.md (team assignments, contacts)
    - .aiwg/deployment/slo-sli-definition.md (SLO targets, monitoring approach)
    - .aiwg/deployment/incident-response-runbook.md (escalation paths)
    
  2. Launch Hypercare Planning Agents (parallel):

    # Agent 1: Operations Manager
    Task(
        subagent_type="operations-manager",
        description="Create hypercare team roster and on-call rotation",
        prompt="""
        Read ORR team assignments and contacts
    
        Create Hypercare Team Roster:
    
        ## Core Team
        - Hypercare Lead: {name} (overall coordination, daily standups)
        - On-Call Engineers: {rotation-schedule} (24/7 coverage)
        - Reliability Engineer: {name} (SLO monitoring, performance analysis)
        - Support Lead: {name} (user-facing issues, ticket triage)
        - DevOps Engineer: {name} (rapid deployment, rollback authority)
    
        ## Extended Team
        - Product Owner: {name} (prioritization, user impact)
        - Security Gatekeeper: {name} (security incidents)
        - Component Owners: {list by component}
    
        Create 24/7 On-Call Rotation ({duration} days):
        - Primary on-call schedule (8-hour shifts or daily rotation)
        - Backup on-call contacts
        - Escalation path (P0/P1/P2/P3 response procedures)
    
        Schedule Daily Standups:
        - Time: {suggest optimal time}
        - Duration: 30 minutes
        - Attendees: Core team (mandatory), Extended team (optional)
    
        Save to: .aiwg/deployment/hypercare-team-roster.md
        """
    )
    
    # Agent 2: Reliability Engineer
    Task(
        subagent_type="reliability-engineer",
        description="Configure production monitoring and alerting",
        prompt="""
        Read SLO/SLI definitions
    
        Configure Production Health Dashboard:
    
        ## Key Metrics (Auto-Refresh: 30s)
    
        **Availability**
        - Current Uptime: {percentage}% (Target: ≥99.9%)
        - Service Health: {GREEN | YELLOW | RED}
        - Failed Health Checks: {count}
    
        **Performance (Last 5 min)**
        - Response Time (p50/p95/p99): {value}ms
        - Throughput: {requests-per-second} req/s
        - Target: p95 < {SLA}ms
    
        **Errors (Last 5 min)**
        - Error Rate: {percentage}% (Target: <0.1%)
        - 4xx/5xx Errors: {count}
        - Database Errors: {count}
    
        **Business Metrics**
        - Active Users (Current): {count}
        - Successful Transactions: {count}
        - Transaction Success Rate: {percentage}%
    
        **Infrastructure**
        - CPU/Memory Utilization: {percentage}%
        - Disk I/O, Network Traffic
    
        Define alert thresholds for P0/P1/P2/P3 severity levels
    
        Save to: .aiwg/deployment/production-dashboard-config.md
        """
    )
    
    # Agent 3: Support Lead
    Task(
        subagent_type="support-lead",
        description="Define alert escalation and incident response",
        prompt="""
        Read incident response runbook
    
        Create Alert Escalation Matrix:
    
        ## P0 (Critical) - Page Immediately
        - Availability <99%
        - Error rate >1%
        - All instances down
        - Security breach detected
    
        Action: Page on-call engineer + Hypercare Lead
        Response SLA: Immediate acknowledgment, 15 min time-to-engage
    
        ## P1 (High) - Alert Within 5 Minutes
        - Availability <99.5%
        - Error rate >0.5%
        - Response time p95 >2x SLA
    
        Action: Alert on-call engineer via Slack + SMS
        Response SLA: 30 min acknowledgment, 1 hour time-to-mitigation
    
        ## P2 (Medium) - Alert Within 30 Minutes
        - Availability <99.9%
        - Error rate >0.1%
        - Resource utilization >80%
    
        Action: Alert on-call engineer via Slack
        Response SLA: 4 hours
    
        ## P3 (Low) - Log and Review
        - Minor performance degradation
        - Non-critical errors
    
        Action: Create ticket for review
        Response SLA: 1 business day
    
        Document incident response workflow (5 phases):
        1. Detection (Target: <5 min)
        2. Triage (Target: <15 min)
        3. Investigation (P0=30min, P1=1h)
        4. Mitigation (P0=1h, P1=4h)
        5. Resolution (P0=2h, P1=8h)
        6. Post-Incident Review (Within 48h)
    
        Save to: .aiwg/deployment/alert-escalation-matrix.md
        """
    )
    
  3. Synthesize Hypercare Setup Plan:

    # You do this directly (no agent needed)
    
    Read all hypercare planning artifacts
    
    Validate completeness:
    - Team roster: All roles assigned?
    - On-call rotation: 24/7 coverage confirmed?
    - Monitoring: All SLOs tracked?
    - Escalation: Response SLAs defined?
    
    Create dedicated communication channel: #hypercare-{project-name}-{YYYY-MM}
    

Communicate Progress:

✓ Initialized hypercare setup
⏳ Establishing hypercare team and monitoring...
  ✓ Hypercare team roster created (Core + Extended teams)
  ✓ 24/7 on-call rotation scheduled ({duration} days)
  ✓ Production dashboard configured (5 metric categories)
  ✓ Alert escalation matrix defined (P0/P1/P2/P3)
✓ Hypercare infrastructure ready: .aiwg/deployment/

Step 2: Monitor Production Stability and SLOs (Daily)

Purpose: Continuously validate production system meets SLO targets and stability expectations

Your Actions:

  1. Launch SLO Monitoring Agents (automated, repeat daily):

    # Agent 1: Reliability Engineer (Daily SLO Report)
    Task(
        subagent_type="reliability-engineer",
        description="Generate daily SLO compliance report",
        prompt="""
        Read production metrics from monitoring dashboard
        Read SLO definitions: .aiwg/deployment/slo-sli-definition.md
    
        Generate Daily SLO Report:
    
        ## SLO Tracking (Updated Hourly)
    
        ### Availability SLO
        - Target: ≥99.9% uptime
        - Current (24h): {percentage}%
        - Current (7d): {percentage}%
        - Error Budget Remaining: {percentage}%
        - Status: {ON TARGET | AT RISK | EXCEEDED}
    
        ### Performance SLO
        - Target: p95 response time <{value}ms
        - Current p95 (24h): {value}ms
        - Current p95 (7d): {value}ms
        - Status: {ON TARGET | AT RISK | EXCEEDED}
    
        ### Error Rate SLO
        - Target: <0.1% error rate
        - Current (24h): {percentage}%
        - Current (7d): {percentage}%
        - Status: {ON TARGET | AT RISK | EXCEEDED}
    
        ### Throughput SLO
        - Target: Handle {value} req/s
        - Current Peak: {value} req/s
        - Current Average: {value} req/s
        - Status: {ON TARGET | AT RISK | EXCEEDED}
    
        Calculate Error Budget Burn Rate:
        - Monthly error budget: {value} minutes downtime allowed
        - Hypercare period budget: {value} minutes
        - Current burn rate: {value} minutes consumed
        - Budget remaining: {percentage}%
        - Assessment: {HEALTHY | MONITOR | CRITICAL}
    
        If CRITICAL: Recommend incident freeze, focus on stability
        If MONITOR: Recommend increased monitoring, defer risky changes
    
        Save to: .aiwg/deployment/slo-report-{YYYY-MM-DD}.md
        """
    )
    
    # Agent 2: Support Lead (Daily Support Analysis)
    Task(
        subagent_type="support-lead",
        description="Analyze user adoption and support tickets",
        prompt="""
        Read support ticket system
        Read user analytics
    
        Generate User Adoption Dashboard:
    
        ### Active Users
        - DAU (Daily Active Users): {count} (Target: >{target})
        - WAU/MAU: {count}
        - User Growth Rate: {+/-percentage}%
    
        ### Feature Adoption (New Features)
        For each new feature:
        - Total Users: {count}
        - Users Engaged: {count} ({percentage}%)
        - Adoption Rate: {percentage}% (Target: >{target}%)
        - Trend: {INCREASING | STABLE | DECREASING}
    
        ### Support Ticket Analysis
        - Total Tickets (24h): {count}
        - By Category: Bug Reports, How-To, Performance, etc.
        - Critical Issues: {count} (blockers)
        - Average Response Time: {value}h (Target: <{SLA}h)
    
        ### User Feedback Summary
        - Sentiment: {POSITIVE | NEUTRAL | NEGATIVE} ({percentage}%)
        - Top Issues: {list top 3}
        - Top Praises: {list top 3}
    
        Flag Critical User Blockers (if any)
    
        Save to: .aiwg/deployment/user-adoption-{YYYY-MM-DD}.md
        """
    )
    
  2. Incident Tracking (on-demand per incident):

    # When incident detected:
    Task(
        subagent_type="devops-engineer",
        description="Document and respond to incident {incident-ID}",
        prompt="""
        Incident detected: {incident-description}
        Severity: {P0 | P1 | P2 | P3}
    
        Follow Incident Response Workflow:
    
        1. Detection (<5 min):
           - Alert acknowledged
           - Initial severity assessment
           - Create incident channel: #incident-{YYYY-MM-DD}-{ID}
    
        2. Triage (<15 min):
           - Gather evidence (logs, metrics, user reports)
           - Identify affected systems/users
           - Estimate business impact
           - Engage Component Owners
           - Update severity if needed
    
        3. Investigation (P0=30min, P1=1h):
           - Review logs/metrics for root cause
           - Check recent deployments/changes
           - Reproduce in non-prod if possible
           - Identify probable root cause
    
        4. Mitigation (P0=1h, P1=4h):
           - Execute mitigation (rollback/hotfix/config change)
           - Validate effectiveness
           - Monitor for regression
    
        5. Resolution (P0=2h, P1=8h):
           - Confirm fully resolved
           - Validate SLOs back to normal
           - Close incident
    
        Document incident timeline and actions
    
        Save to: .aiwg/deployment/incidents/incident-{ID}.md
    
        If P0/P1: Schedule post-incident review within 48h
        """
    )
    

Communicate Progress (daily update):

✓ Hypercare Day {N} of {duration}
⏳ Monitoring production stability...
  ✓ SLO compliance: {percentage}% of SLOs met (target: 100%)
  ✓ Incidents (24h): {count} total (P0: {count}, P1: {count}, P2: {count})
  ✓ User adoption: {percentage}% ({trend})
  ✓ Support tickets: {count} (Trend: {↑/→/↓})
✓ Daily reports: .aiwg/deployment/slo-report-{date}.md, user-adoption-{date}.md

Step 3: Conduct Daily Hypercare Standups

Purpose: Maintain team alignment, surface issues early, coordinate rapid response

Your Actions:

  1. Generate Daily Standup Report (automated):

    Task(
        subagent_type="operations-manager",
        description="Generate daily hypercare standup report",
        prompt="""
        Read daily reports:
        - .aiwg/deployment/slo-report-{YYYY-MM-DD}.md
        - .aiwg/deployment/user-adoption-{YYYY-MM-DD}.md
        - .aiwg/deployment/incidents/* (all open/recent incidents)
    
        Create Daily Standup Agenda:
    
        ## Hypercare Daily Standup - Day {N} of {duration}
    
        **Date**: {YYYY-MM-DD}
        **Facilitator**: {Hypercare Lead}
    
        ### 1. Production Health Review (5 min)
        **Presented by**: Reliability Engineer
    
        - Availability: {percentage}% (Target: ≥99.9%) - {STATUS}
        - Performance: p95 {value}ms (Target: <{SLA}ms) - {STATUS}
        - Error Rate: {percentage}% (Target: <0.1%) - {STATUS}
        - Error Budget: {percentage}% remaining - {STATUS}
    
        Overall Health: {GREEN | YELLOW | RED}
    
        ### 2. Incident Summary (Last 24h) (10 min)
        **Presented by**: On-Call Engineer
    
        Total Incidents: {count}
        - P0 (Critical): {count} - {list titles if any}
        - P1 (High): {count} - {list titles if any}
        - P2 (Medium): {count}
        - P3 (Low): {count}
    
        Key Incidents:
        For each P0/P1:
        - Incident-ID: {title}
        - Status: {Open/Resolved/Closed}
        - Impact: {user-count} users, {duration} minutes
        - Root Cause: {brief description}
        - Action Items: {list}
    
        Patterns/Trends: {emerging issues or recurring problems}
    
        ### 3. User Feedback Review (5 min)
        **Presented by**: Support Lead
    
        - Support Tickets (24h): {count} (Trend: {↑/→/↓})
        - Critical User Issues: {count}
        - Top Complaints: {list top 3}
        - Top Praises: {list top 3}
        - Sentiment: {POSITIVE | NEUTRAL | NEGATIVE}
    
        Blockers for Users: {list critical issues}
    
        ### 4. SLO/SLI Status (5 min)
        **Presented by**: Reliability Engineer
    
        | SLO | Target | Current (24h) | Status |
        |-----|--------|---------------|--------|
        | Availability | ≥99.9% | {percentage}% | {✓/⚠/✗} |
        | Response Time | p95<{value}ms | {value}ms | {✓/⚠/✗} |
        | Error Rate | <0.1% | {percentage}% | {✓/⚠/✗} |
        | Throughput | >{value} req/s | {value} req/s | {✓/⚠/✗} |
    
        ### 5. Action Items and Blockers (5 min)
    
        Open Action Items:
        | Action | Owner | Due Date | Status |
        |--------|-------|----------|--------|
        {list open actions}
    
        New Blockers:
        {list blockers requiring escalation}
    
        Tomorrow's On-Call: {name} (taking over at {HH:MM})
    
        ---
    
        Overall Status: {GREEN | YELLOW | RED}
    
        Key Decisions Made:
        {list decisions from standup}
    
        New Action Items:
        {list new actions assigned}
    
        Save to: .aiwg/deployment/hypercare-standup-{YYYY-MM-DD}.md
        """
    )
    
  2. Weekly Summary (if hypercare > 7 days):

    # On Day 7, 14, etc.:
    Task(
        subagent_type="operations-manager",
        description="Generate weekly hypercare summary",
        prompt="""
        Read all daily standups for week: .aiwg/deployment/hypercare-standup-*.md
    
        Create Weekly Summary:
    
        ## Hypercare Week {N} Summary
    
        **Week**: {date-range}
        **Overall Status**: {GREEN | YELLOW | RED}
    
        ### Production Stability
        - Availability: {percentage}% (Target: ≥99.9%)
        - Total Incidents: {count} (P0: {count}, P1: {count})
        - MTTR: {value} min
        - SLO Compliance: {percentage}%
    
        ### User Adoption
        - Active Users: {count} ({+/-percentage}% vs. previous week)
        - Feature Adoption: {percentage}%
        - User Sentiment: {POSITIVE | NEUTRAL | NEGATIVE}
    
        ### Support Health
        - Support Tickets: {count} ({+/-percentage}% vs. previous week)
        - Critical Issues: {count}
        - Response Time: {value}h (Target: <{SLA}h)
    
        ### Accomplishments
        {list accomplishments}
    
        ### Challenges
        {list challenges}
    
        ### Next Week Focus
        {list focus areas}
    
        Save to: .aiwg/reports/hypercare-week-{N}-summary.md
        """
    )
    

Communicate Progress:

⏳ Conducting daily standup...
✓ Daily standup report generated: .aiwg/deployment/hypercare-standup-{date}.md
  - Overall Health: {GREEN | YELLOW | RED}
  - Key Decisions: {count}
  - New Action Items: {count}
  - Escalations: {count}

Step 4: Post-Incident Reviews (For P0/P1 Incidents)

Purpose: Document root cause and corrective actions for all critical incidents

Your Actions:

  1. For Each P0/P1 Incident (within 48h of resolution):
    Task(
        subagent_type="reliability-engineer",
        description="Conduct post-incident review for {incident-ID}",
        prompt="""
        Read incident log: .aiwg/deployment/incidents/incident-{ID}.md
    
        Create Post-Incident Review (PIR):
    
        ## Post-Incident Review: {Incident-ID}
    
        **Date**: {YYYY-MM-DD}
        **Severity**: {P0/P1/P2/P3}
        **Duration**: {detection-to-resolution}
        **Impact**: {user-count} users, {downtime-minutes} minutes downtime
    
        ### Incident Summary
        {1-2 sentence description of what happened}
    
        ### Timeline
        | Time | Event | Actor |
        |------|-------|-------|
        {incident timeline from detection to resolution}
    
        ### Root Cause
        {Detailed technical root cause analysis}
    
        ### Contributing Factors
        1. {Factor 1 - e.g., insufficient testing}
        2. {Factor 2 - e.g., monitoring gap}
        3. {Factor 3 - e.g., unclear runbook}
    
        ### Corrective Actions
        | Action | Owner | Due Date | Status |
        |--------|-------|----------|--------|
        {list corrective actions to prevent recurrence}
    
        ### Lessons Learned
        - What went well: {list}
        - What could improve: {list}
        - Process changes needed: {list}
    
        Save to: .aiwg/deployment/incidents/pir-{ID}.md
    
        Update incident log with PIR link
        Track corrective actions in action tracker
        """
    )
    

Communicate Progress:

⏳ Conducting post-incident reviews...
✓ PIR complete: Incident-{ID} ({title})
  - Root cause: {summary}
  - Corrective actions: {count} assigned
  - Status: Tracking to completion

Step 5: Validate Exit Criteria and Generate Hypercare Report

Purpose: Ensure production is stable and support team is ready before ending hypercare

Your Actions:

  1. Validate Exit Criteria (on final day or when user requests):

    Task(
        subagent_type="operations-manager",
        description="Validate hypercare exit criteria",
        prompt="""
        Read all hypercare artifacts
    
        Validate Hypercare Exit Criteria:
    
        ## Hypercare Exit Criteria Validation
    
        **Hypercare Period**: Day {N} of {duration}
        **Validation Date**: {YYYY-MM-DD}
    
        ### Production Stability
        - [ ] Zero P0 (Critical) incidents in last 48 hours
        - [ ] Zero P1 (High) incidents in last 24 hours
        - [ ] All SLOs met for 72 consecutive hours
          - [ ] Availability ≥99.9%
          - [ ] Response time p95 <{SLA}ms
          - [ ] Error rate <0.1%
          - [ ] Throughput >{target} req/s
        - [ ] Error budget healthy: >{percentage}% remaining
        - [ ] No open P0/P1 incidents
    
        ### User Adoption
        - [ ] User adoption trending positive ({percentage}% growth)
        - [ ] Feature adoption >{target}% for critical features
        - [ ] User sentiment majority positive (≥70%)
        - [ ] Support ticket volume stable or decreasing
        - [ ] No critical user blockers unresolved
    
        ### Support Readiness
        - [ ] Support team trained and confident
        - [ ] Runbooks validated (all common issues documented)
        - [ ] Escalation paths tested and effective
        - [ ] Knowledge base updated with hypercare learnings
        - [ ] On-call rotation transitioned to standard support
    
        ### Documentation Complete
        - [ ] Hypercare report completed
        - [ ] Post-incident reviews completed (all P0/P1)
        - [ ] Corrective actions tracked (assigned, due dates set)
        - [ ] Lessons learned documented
        - [ ] Runbooks updated
    
        Overall Exit Criteria Status: {PASS | CONDITIONAL | FAIL}
        Decision: {END HYPERCARE | EXTEND HYPERCARE | ESCALATE}
    
        Save to: .aiwg/reports/hypercare-exit-criteria.md
        """
    )
    
  2. Generate Hypercare Exit Report (comprehensive final report):

    Task(
        subagent_type="operations-manager",
        description="Generate comprehensive hypercare exit report",
        prompt="""
        Read all hypercare artifacts:
        - .aiwg/deployment/hypercare-team-roster.md
        - .aiwg/deployment/slo-report-*.md (all days)
        - .aiwg/deployment/user-adoption-*.md (all days)
        - .aiwg/deployment/hypercare-standup-*.md (all days)
        - .aiwg/deployment/incidents/*.md (all incidents)
        - .aiwg/reports/hypercare-exit-criteria.md
    
        Generate Hypercare Exit Report:
    
        # Hypercare Report: {Project-Name}
    
        **Hypercare Period**: {start-date} to {end-date} ({duration} days)
        **Report Date**: {YYYY-MM-DD}
        **Report Author**: {Hypercare Lead}
    
        ## Executive Summary
    
        {2-3 sentence summary of hypercare outcomes}
    
        Overall Status: {SUCCESS | SUCCESS WITH CONDITIONS | CHALLENGES}
    
        Key Metrics:
        - Availability: {percentage}%
        - Total Incidents: {count} (P0: {count}, P1: {count})
        - User Adoption: {percentage}%
        - Support Tickets: {count}
    
        ## Production Stability Summary
    
        ### SLO Performance
        | SLO | Target | Achieved | Status |
        |-----|--------|----------|--------|
        {SLO compliance table}
    
        SLO Compliance Rate: {percentage}%
    
        ### Incident Summary
        Total Incidents: {count}
        - P0 (Critical): {count}
        - P1 (High): {count}
        - P2 (Medium): {count}
        - P3 (Low): {count}
    
        Key Metrics:
        - MTTD (Mean Time to Detect): {value} min
        - MTTA (Mean Time to Acknowledge): {value} min
        - MTTR (Mean Time to Resolve): {value} min
    
        Major Incidents:
        For each P0/P1:
        - Incident-ID: {title}
          - Date, Duration, Impact, Root Cause, Resolution
          - Corrective Actions: {count} assigned
    
        ### Performance Trends
        - Response Time: {IMPROVED | STABLE | DEGRADED} ({+/-percentage}% vs. pre-deployment)
        - Error Rate: {IMPROVED | STABLE | DEGRADED}
        - Resource Utilization: {HEALTHY | CONCERNING}
    
        ## User Adoption Summary
    
        ### Adoption Metrics
        - Active Users: {count} ({+/-percentage}% vs. pre-deployment)
        - Feature Adoption: {percentage}% (Target: >{target}%)
        - User Retention (Day 14): {percentage}%
    
        ### User Feedback
        - Total Feedback Items: {count}
        - Sentiment: {percentage}% positive
        - Net Promoter Score: {value}
    
        Top Praises: {list top 3}
        Top Complaints: {list top 3 with resolution status}
    
        ## Support Summary
    
        ### Ticket Volume
        - Total Support Tickets: {count}
        - Daily Average: {count} tickets/day
        - Trend: {DECREASING | STABLE | INCREASING}
    
        ### Support Performance
        - Average Response Time: {value}h (Target: <{SLA}h) - {✓/⚠/✗}
        - First Contact Resolution: {percentage}%
    
        ### Support Team Readiness
        - Team Confidence Level: {HIGH | MEDIUM | LOW}
        - Runbook Completeness: {percentage}%
    
        ## Lessons Learned
    
        ### What Went Well
        {list successes}
    
        ### What Could Improve
        {list improvements}
    
        ### Process Recommendations
        {list recommendations for future deployments}
    
        ## Corrective Actions
    
        Total Actions Identified: {count}
    
        | Action | Category | Owner | Due Date | Status |
        |--------|----------|-------|----------|--------|
        {corrective actions table}
    
        ## Handover to Standard Support
    
        ### Transition Plan
        - [ ] Standard on-call rotation activated (starting {date})
        - [ ] Support runbooks transferred
        - [ ] Knowledge base published
        - [ ] Support team training complete
        - [ ] Escalation paths updated for BAU
    
        ### Post-Hypercare Monitoring
        - Duration: {duration} days continued close monitoring
        - Responsible: {Support Lead}
        - Review Cadence: Weekly check-ins for {duration} weeks
    
        ## Conclusion
    
        {2-3 sentence summary and readiness for standard support}
    
        Recommendation: {END HYPERCARE | EXTEND HYPERCARE}
    
        Signoff:
        - Hypercare Lead: {name} - {date}
        - Reliability Engineer: {name} - {date}
        - Support Lead: {name} - {date}
        - Product Owner: {name} - {date}
        - Project Manager: {name} - {date}
    
        Save to: .aiwg/reports/hypercare-exit-report.md
        """
    )
    
  3. Present Exit Summary to User:

    # You present this directly (not via agent)
    
    Read .aiwg/reports/hypercare-exit-report.md
    
    Present summary:
    ─────────────────────────────────────────────
    Hypercare Monitoring Period Complete
    ─────────────────────────────────────────────
    
    **Hypercare Period**: {start-date} to {end-date} ({duration} days)
    **Overall Status**: {SUCCESS | SUCCESS WITH CONDITIONS | CHALLENGES}
    
    **Key Metrics**:
    ✓ Availability: {percentage}% (Target: ≥99.9%)
    ✓ Total Incidents: {count} (P0: {count}, P1: {count})
    ✓ User Adoption: {percentage}% of target
    ✓ Support Readiness: Team confident and ready
    
    **Exit Criteria Status**:
    ✓ Production Stability: {PASS | CONDITIONAL | FAIL}
    ✓ User Adoption: {PASS | CONDITIONAL | FAIL}
    ✓ Support Readiness: {PASS | CONDITIONAL | FAIL}
    ✓ Documentation: {PASS | CONDITIONAL | FAIL}
    
    **Decision**: {END HYPERCARE | EXTEND HYPERCARE | ESCALATE}
    
    **Artifacts Generated**:
    - Hypercare Team Roster (.aiwg/deployment/hypercare-team-roster.md)
    - Production Dashboard Config (.aiwg/deployment/production-dashboard-config.md)
    - Alert Escalation Matrix (.aiwg/deployment/alert-escalation-matrix.md)
    - Daily Standup Reports (.aiwg/deployment/hypercare-standup-*.md, {count} files)
    - SLO Reports (.aiwg/deployment/slo-report-*.md, {count} files)
    - User Adoption Reports (.aiwg/deployment/user-adoption-*.md, {count} files)
    - Incident Logs (.aiwg/deployment/incidents/*.md, {count} files)
    - Post-Incident Reviews (.aiwg/deployment/incidents/pir-*.md, {count} files)
    - Hypercare Exit Report (.aiwg/reports/hypercare-exit-report.md)
    
    **Next Steps**:
    - Review hypercare exit report with stakeholders
    - Obtain formal signoffs (5 required signatures)
    - If END HYPERCARE: Transition to standard support (run handoff workflow)
    - If EXTEND HYPERCARE: Address gaps, continue monitoring
    - If ESCALATE: Executive decision required
    
    **Transition to Standard Support**:
    - Standard on-call rotation activated: {date}
    - Continued monitoring period: {duration} days
    - Weekly check-ins scheduled
    
    ─────────────────────────────────────────────
    

Communicate Progress:

⏳ Validating hypercare exit criteria...
✓ Exit criteria validated: {PASS | CONDITIONAL | FAIL}
✓ Hypercare Exit Report generated: .aiwg/reports/hypercare-exit-report.md
✓ Transition plan documented

Quality Gates

Before marking workflow complete, verify:

  • Hypercare team established with 24/7 coverage
  • Production monitoring operational (dashboards, alerts)
  • Daily standups conducted and documented
  • All P0/P1 incidents have post-incident reviews
  • SLO compliance tracked daily
  • User adoption monitored and reported
  • Exit criteria validated
  • Hypercare exit report complete and approved
  • Transition to standard support planned

User Communication

At start: Confirm understanding and list activities

Understood. I'll orchestrate the hypercare monitoring period.

Hypercare Duration: {duration} days
Hypercare Period: {start-date} to {estimated-end-date}

This will establish:
- Hypercare team roster and 24/7 on-call rotation
- Production health monitoring dashboards
- Alert escalation and incident response procedures
- Daily standup coordination
- SLO tracking and user adoption monitoring
- Post-incident review process
- Hypercare exit criteria validation

I'll coordinate multiple agents for comprehensive monitoring and support.
Expected setup: 20-30 minutes.

Starting orchestration...

During: Update progress with clear indicators

✓ = Complete
⏳ = In progress
❌ = Error/blocked
⚠️ = Warning/attention needed

Daily: Provide daily status summary

Hypercare Day {N} of {duration}: {GREEN | YELLOW | RED}

Production Health:
✓ Availability: {percentage}% (Target: ≥99.9%)
✓ Performance: p95 {value}ms (Target: <{SLA}ms)
{⚠️ | ✓} Error Rate: {percentage}% (Target: <0.1%)

Incidents (24h):
- P0: {count}
- P1: {count}
- P2: {count}

User Adoption: {percentage}% ({trend})

Daily reports: .aiwg/deployment/hypercare-standup-{date}.md

At end: Summary report (see Step 5.3 above)

Error Handling

If P0 Incident During Hypercare:

❌ Critical incident detected - immediate response initiated

Incident: {incident-ID} - {title}
Severity: P0 (Complete outage / Data loss / Security breach)
Impact: {user-count} users affected

Actions:
1. On-call engineer + Hypercare Lead paged
2. Incident war room created: #incident-{date}-{ID}
3. Executive Sponsor notified
4. Status page updated

Response Timeline:
- Detection: {timestamp}
- Acknowledgment: {timestamp} (Target: Immediate)
- Time-to-engage: {minutes} min (Target: <15 min)

Current Status: {INVESTIGATING | MITIGATING | RESOLVED}

Impact on Exit Criteria: P0 incident resets 48h "zero critical incidents" requirement

Monitoring incident response...

If SLO Breach:

⚠️ SLO breach detected - immediate investigation required

SLO Breached: {SLO-name}
- Target: {target-value}
- Current: {actual-value}
- Duration: {duration} (continuous breach)

Impact:
- Error budget consumed: {percentage}%
- User impact: {description}

Actions:
1. Reliability Engineer investigating root cause
2. Metrics and logs under review
3. Mitigation plan in progress

If breach persists >24h: Recommend extending hypercare period
If error budget critically low: Recommend incident freeze

Monitoring for improvement...

If User Adoption Low:

⚠️ User adoption below target

Current Adoption: {percentage}% (Target: >{target}%)
Gap: {percentage} points

Analysis:
- Top User Issues: {list issues}
- Support Ticket Themes: {list themes}
- Potential Blockers: {list blockers}

Actions:
1. Product Owner engaged for adoption analysis
2. Support team reviewing common user issues
3. Documentation and training gaps identified

Decision Point:
- If blockers identified: Prioritize fixes, may extend hypercare
- If education needed: Launch awareness campaign
- If feature not valuable: Escalate to stakeholders

Impact on Exit Criteria: User adoption trend must improve before exit approval

If Support Team Overwhelmed:

⚠️ Support team capacity exceeded

Support Volume: {count} tickets/day (Capacity: {capacity})
Team Status: {STRESSED | OVERWHELMED}

Root Cause Analysis:
- Top Issue Categories: {list categories with counts}
- Product Bugs vs User Education: {ratio}

Immediate Relief Actions:
1. Additional support staff brought in (temp)
2. Engineering team handling overflow tickets
3. Workarounds created for top issues
4. FAQ and self-service guides published

Mitigation:
- Deploy hotfixes for high-frequency bugs
- Update documentation for common questions
- Additional training sessions scheduled

Impact on Exit Criteria: Support team must be confident and staffed before exit

If Exit Criteria Not Met:

⚠️ Hypercare exit criteria not met - extension recommended

Exit Criteria Status: {FAIL | CONDITIONAL}

Gaps Identified:
{list unmet criteria with details}

Recommendation: {EXTEND HYPERCARE | CONDITIONAL EXIT | ESCALATE}

Extension Plan:
- Additional Duration: {days} days
- Focus Areas: {list areas needing improvement}
- Re-validation Date: {date}

Escalating to user for decision...

Success Criteria

This orchestration succeeds when:

  • Hypercare team established with 24/7 coverage
  • Production monitoring operational (dashboards, alerts, SLO tracking)
  • Daily standups conducted for entire hypercare period
  • All incidents documented and P0/P1 incidents have PIRs
  • SLO compliance ≥95% (all SLOs met for 72+ consecutive hours)
  • Zero P0/P1 incidents in final 48/24 hours
  • User adoption trending positive (≥target%)
  • Support team ready and confident
  • Exit criteria validated: PASS or CONDITIONAL PASS
  • Hypercare exit report complete and approved
  • Transition to standard support planned and executed

Metrics to Track

During orchestration, track:

  • SLO compliance rate: % of SLOs met (target: 100% for 72h before exit)
  • Incident frequency: # of P0/P1/P2/P3 incidents (target: P0/P1 = 0 in final 48/24h)
  • Mean time to detect (MTTD): Minutes from incident to detection (target: <5 min)
  • Mean time to resolve (MTTR): Minutes from detection to resolution (target: P0 <120 min, P1 <480 min)
  • Error budget burn rate: % of monthly budget consumed (target: <50% during hypercare)
  • User adoption rate: % of target users actively engaged (target: ≥70%)
  • Support ticket volume: # of tickets/day (target: decreasing trend)
  • Support response time: Hours to first response (target: <SLA)

References

Templates (via $AIWG_ROOT):

  • Operational Readiness Review:
    templates/deployment/operational-readiness-review-template.md
  • SLO/SLI Definition:
    templates/deployment/slo-sli-template.md
  • Incident Response Runbook:
    templates/support/incident-response-runbook-template.md
  • Support Plan:
    templates/support/support-plan-template.md

Related Flows:

  • Gate Check:
    commands/flow-gate-check.md
  • Handoff Checklist:
    commands/flow-handoff-checklist.md
  • Deployment Workflow:
    commands/flow-deployment-workflow.md

SDLC Phase Context:

  • Phase: Transition (Deployment → Operations)
  • Milestone: Hypercare Complete (transition to BAU support)