Awesome-omni-skill Incident Retrospective
install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/backend/incident-retrospective" ~/.claude/skills/diegosouzapw-awesome-omni-skill-incident-retrospective && rm -rf "$T"
manifest: skills/backend/incident-retrospective/SKILL.md
safety · automated scan (low risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
- references .env files
- references API keys
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content
Incident Retrospective
Skill Profile
(Select at least one profile to enable specific modules)
- DevOps
- Backend
- Frontend
- AI-RAG
- Security Critical
Overview
A postmortem (also called an incident review or retrospective) is a structured process for analyzing an incident to understand what happened, why it happened, and how to prevent similar incidents in the future. The goal is learning, not blaming.
Core Principle: "Blame the system, not the person. Every incident is an opportunity to learn and improve."
Why This Matters
- Psychological Safety: Engineers feel safe reporting issues
- Honest Analysis: Root causes are identified without blame
- Organizational Learning: Knowledge is shared and documented
- System Improvement: Action items prevent recurrence
- Cultural Shift: Failures become learning opportunities
- Reduced MTTR: Better response procedures over time
Core Concepts & Rules
1. Core Principles
- Follow established patterns and conventions
- Maintain consistency across codebase
- Document decisions and trade-offs
2. Implementation Guidelines
- Start with the simplest viable solution
- Iterate based on feedback and requirements
- Test thoroughly before deployment
Inputs / Outputs / Contracts
- Inputs:
- Incident timeline and logs
- Monitoring data and metrics
- System architecture and configuration
- Entry Conditions:
- Incident is resolved and stable
- Root cause investigation is complete
- Team has time allocated for analysis
- Outputs:
- Postmortem document with findings
- Action items with owners and deadlines
- Updated runbooks and documentation
- Artifacts Required (Deliverables):
- Postmortem template library
- Incident analysis report
- Action item tracking
- Acceptance Evidence:
- Completed postmortem with all sections filled
- Action items assigned to owners
- Stakeholder review completed
- Documentation updated
- Success Criteria:
- Root cause identified (Five Whys completed)
- Action items created with owners and due dates
- Postmortem reviewed and approved
- Learnings shared with team
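The success criteria above can be checked mechanically before a postmortem is closed. The following is a minimal sketch, not part of the skill itself: the `Postmortem` and `ActionItem` classes and the `unmet_criteria` helper are hypothetical names, and treating exactly five answers as "Five Whys completed" is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """One follow-up task (hypothetical structure for illustration)."""
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


@dataclass
class Postmortem:
    """Minimal postmortem record (hypothetical structure for illustration)."""
    title: str
    whys: List[str] = field(default_factory=list)  # one answer per "why"
    action_items: List[ActionItem] = field(default_factory=list)
    approved: bool = False
    shared_with_team: bool = False


def unmet_criteria(pm: Postmortem) -> List[str]:
    """Return the success criteria that are not yet satisfied."""
    problems = []
    # Assumption: the Five Whys chain counts as complete at five answers.
    if len(pm.whys) < 5:
        problems.append("Five Whys incomplete")
    if not pm.action_items:
        problems.append("no action items")
    for item in pm.action_items:
        if item.owner is None or item.due is None:
            problems.append(
                f"action item missing owner or due date: {item.description}"
            )
    if not pm.approved:
        problems.append("not reviewed and approved")
    if not pm.shared_with_team:
        problems.append("learnings not shared")
    return problems
```

An empty list from `unmet_criteria` means every listed criterion is met; any other result names what is still open.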
Skill Composition
- Depends on: Failure Modes Analysis, Incident Triage
- Compatible with: Communication Templates, Escalation and Ownership
- Conflicts with: Systems without time for learning
- Related Skills:
- 40-system-resilience/failure-modes - Understanding what to analyze
- 41-incident-management/communication-templates - Communication during incidents
- 41-incident-management/escalation-and-ownership - Ownership during incidents
Quick Start / Implementation Example
- Review requirements and constraints
- Set up development environment
- Implement core functionality following patterns
- Write tests for critical paths
- Run tests and fix issues
- Document any deviations or decisions
# Example implementation following best practices
def example_function():
    # Your implementation here
    pass
Assumptions / Constraints / Non-goals
- Assumptions:
- Development environment is properly configured
- Required dependencies are available
- Team has basic understanding of domain
- Constraints:
- Must follow existing codebase conventions
- Time and resource limitations
- Compatibility requirements
- Non-goals:
- This skill does not cover edge cases outside scope
- Not a replacement for formal training
Compatibility & Prerequisites
- Supported Versions:
- Python 3.8+
- Node.js 16+
- Modern browsers (Chrome, Firefox, Safari, Edge)
- Required AI Tools:
- Code editor (VS Code recommended)
- Testing framework appropriate for language
- Version control (Git)
- Dependencies:
- Language-specific package manager
- Build tools
- Testing libraries
- Environment Setup:
- .env.example keys: API_KEY, DATABASE_URL (no values)
Test Scenario Matrix (QA Strategy)
| Type | Focus Area | Required Scenarios / Mocks |
|---|---|---|
| Unit | Core Logic | Must cover primary logic and at least 3 edge/error cases. Target minimum 80% coverage |
| Integration | DB / API | All external API calls or database connections must be mocked during unit tests |
| E2E | User Journey | Critical user flows to test |
| Performance | Latency / Load | Benchmark requirements |
| Security | Vuln / Auth | SAST/DAST or dependency audit |
| Frontend | UX / A11y | Accessibility checklist (WCAG), Performance Budget (Lighthouse score) |
Technical Guardrails & Security Threat Model
1. Security & Privacy (Threat Model)
- Top Threats: Injection attacks, authentication bypass, data exposure
- Data Handling: Sanitize all user inputs to prevent Injection attacks. Never log raw PII
- Secrets Management: No hardcoded API keys. Use Env Vars/Secrets Manager
- Authorization: Validate user permissions before state changes
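The secrets-management rule above might look like this in practice. This is a sketch under stated assumptions: `load_api_key` and `redact` are hypothetical helpers, and `API_KEY` is the variable name taken from the environment setup list.

```python
import os


def load_api_key(env=None):
    """Read API_KEY from the environment and fail loudly if it is missing."""
    # Assumption: the key lives in an environment variable named API_KEY.
    env = os.environ if env is None else env
    key = env.get("API_KEY")
    if not key:
        raise RuntimeError("API_KEY is not set")
    return key


def redact(secret):
    """Return a log-safe form of a secret, showing at most the last 4 characters."""
    if len(secret) <= 4:
        return "*" * len(secret)
    return "*" * (len(secret) - 4) + secret[-4:]
```

Passing a plain dict as `env` keeps the function testable without touching the real process environment.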
2. Performance & Resources
- Execution Efficiency: Consider time complexity for algorithms
- Memory Management: Use streams/pagination for large data
- Resource Cleanup: Close DB connections/file handlers in finally blocks
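The memory and cleanup guidance above can be illustrated with the standard-library `sqlite3` module. This is only a sketch: `iter_rows` is a hypothetical helper, and the choice of SQLite and the default page size are assumptions made for the example.

```python
import sqlite3


def iter_rows(db_path, query, page_size=100):
    """Yield rows one page at a time and always close the connection."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(query)
        while True:
            # fetchmany keeps at most page_size rows in memory at once.
            page = cursor.fetchmany(page_size)
            if not page:
                break
            for row in page:
                yield row
    finally:
        # Runs when the generator is exhausted, closed, or raises.
        conn.close()
```

Because the function is a generator, callers that stop early should call `.close()` on it (or let it go out of scope) so the `finally` block runs promptly.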
3. Architecture & Scalability
- Design Pattern: Follow SOLID principles, use Dependency Injection
- Modularity: Decouple logic from UI/Frameworks
4. Observability & Reliability
- Logging Standards: Structured JSON, include trace IDs (request_id)
- Metrics: Track error_rate, latency, queue_depth
- Error Handling: Standardized error codes, no bare except
- Observability Artifacts:
- Log Fields: timestamp, level, message, request_id
- Metrics: request_count, error_count, response_time
- Dashboards/Alerts: High Error Rate > 5%
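The log fields and alert threshold listed above can be sketched with the standard `logging` module. `JsonFormatter` and `error_rate_alert` are hypothetical names, and reading "High Error Rate > 5%" as `error_count / request_count > 0.05` is an assumption.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object with the listed fields."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id is supplied by the caller via extra={"request_id": ...}
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def error_rate_alert(request_count, error_count, threshold=0.05):
    """Return True when the error rate exceeds the threshold (5% by default)."""
    if request_count == 0:
        return False
    return error_count / request_count > threshold
```

A handler configured with `JsonFormatter()` then accepts calls such as `logger.info("payment failed", extra={"request_id": "abc123"})`, where the ID value is illustrative.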
Agent Directives & Error Recovery
(Requirements for how the AI agent should reason and resolve problems when errors occur)
- Thinking Process: Analyze root cause before fixing. Do not brute-force.
- Fallback Strategy: Stop after 3 failed test attempts. Output root cause and ask for human intervention/clarification.
- Self-Review: Check against Guardrails & Anti-patterns before finalizing.
- Output Constraints: Output ONLY the modified code block. Do not explain unless asked.
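The fallback strategy above amounts to a bounded retry loop that escalates instead of brute-forcing. A minimal sketch follows, assuming hypothetical callables `attempt_fix` (applies one candidate fix) and `run_tests` (returns a `(passed, detail)` tuple); neither is defined by the skill.

```python
def run_with_fallback(attempt_fix, run_tests, max_attempts=3):
    """Try at most max_attempts fixes, then stop and hand off to a human."""
    failures = []
    for attempt in range(1, max_attempts + 1):
        attempt_fix(attempt)
        passed, detail = run_tests()
        if passed:
            return {"status": "passed", "attempts": attempt}
        # Keep each failure detail so the root cause can be reported.
        failures.append(detail)
    return {
        "status": "needs_human",
        "attempts": max_attempts,
        "failures": failures,
    }
```

The `needs_human` result carries the collected failure details, which is the material the directive asks the agent to output before requesting clarification.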
Definition of Done (DoD) Checklist
- Tests passed + coverage met
- Lint/Typecheck passed
- Logging/Metrics/Trace implemented
- Security checks passed
- Documentation/Changelog updated
- Accessibility/Performance requirements met (if frontend)
Anti-patterns / Pitfalls
- ⛔ Don't: Log PII, use catch-all exception handlers, or issue N+1 queries
- ⚠️ Watch out for: Common symptoms and quick fixes
- 💡 Instead: Use proper error handling, pagination, and logging
Reference Links & Examples
- Internal documentation and examples
- Official documentation and best practices
- Community resources and discussions
Versioning & Changelog
- Version: 1.0.0
- Changelog:
- 2026-02-22: Initial version with complete template structure