Awesome-omni-skill Incident Retrospective


install

source · Clone the upstream repo

git clone https://github.com/diegosouzapw/awesome-omni-skill

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/backend/incident-retrospective" ~/.claude/skills/diegosouzapw-awesome-omni-skill-incident-retrospective && rm -rf "$T"

manifest: skills/backend/incident-retrospective/SKILL.md

safety · automated scan (low risk)

This is a pattern-based risk scan, not a security review. Our crawler flagged:

references .env files
references API keys

Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.

source content

Incident Retrospective

Skill Profile

(Select at least one profile to enable specific modules)

Overview

A postmortem (also called an incident review or retrospective) is a structured process for analyzing an incident to understand what happened, why it happened, and how to prevent similar incidents in the future. The goal is learning, not blaming.

Core Principle: "Blame the system, not the person. Every incident is an opportunity to learn and improve."

Why This Matters

Psychological Safety: Engineers feel safe reporting issues
Honest Analysis: Root causes are identified without blame
Organizational Learning: Knowledge is shared and documented
System Improvement: Action items prevent recurrence
Cultural Shift: Failures become learning opportunities
Reduced MTTR: Better response procedures over time

Core Concepts & Rules

1. Core Principles

Follow established patterns and conventions
Maintain consistency across codebase
Document decisions and trade-offs

2. Implementation Guidelines

Start with the simplest viable solution
Iterate based on feedback and requirements
Test thoroughly before deployment

Inputs / Outputs / Contracts

Inputs:
- Incident timeline and logs
- Monitoring data and metrics
- System architecture and configuration
Entry Conditions:
- Incident is resolved and stable
- Root cause investigation is complete
- Team has time allocated for analysis
Outputs:
- Postmortem document with findings
- Action items with owners and deadlines
- Updated runbooks and documentation
Artifacts Required (Deliverables):
- Postmortem template library
- Incident analysis report
- Action item tracking
Acceptance Evidence:
- Completed postmortem with all sections filled
- Action items assigned to owners
- Stakeholder review completed
- Documentation updated
Success Criteria:
- Root cause identified (Five Whys completed)
- Action items created with owners and due dates
- Postmortem reviewed and approved
- Learnings shared with team
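
The success criteria above can be checked mechanically before a postmortem is closed. The following is a minimal sketch, not part of the skill itself: the `Postmortem` and `ActionItem` types and the `unmet_criteria` helper are hypothetical names, and treating "Five Whys completed" as "at least five recorded answers" is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None      # person accountable for the item
    due_date: Optional[date] = None  # deadline for completion


@dataclass
class Postmortem:
    title: str
    five_whys: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)
    reviewed: bool = False
    shared_with_team: bool = False


def unmet_criteria(pm: Postmortem) -> List[str]:
    """Return the success criteria that this postmortem does not yet meet."""
    problems: List[str] = []
    # Assumption: "completed" means five answers were recorded.
    if len(pm.five_whys) < 5:
        problems.append("Five Whys incomplete")
    if not pm.action_items:
        problems.append("no action items")
    for item in pm.action_items:
        if item.owner is None or item.due_date is None:
            problems.append(
                f"action item lacks owner or due date: {item.description}"
            )
    if not pm.reviewed:
        problems.append("not reviewed and approved")
    if not pm.shared_with_team:
        problems.append("learnings not shared")
    return problems
```

An empty list from `unmet_criteria` would indicate that the document is ready to close; any other result names what is still missing.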

Skill Composition

Depends on: Failure Modes Analysis, Incident Triage
Compatible with: Communication Templates, Escalation and Ownership
Conflicts with: Systems without time for learning
Related Skills:
- 40-system-resilience/failure-modes - Understanding what to analyze
- 41-incident-management/communication-templates - Communication during incidents
- 41-incident-management/escalation-and-ownership - Ownership during incidents

Quick Start / Implementation Example

Review requirements and constraints
Set up development environment
Implement core functionality following patterns
Write tests for critical paths
Run tests and fix issues
Document any deviations or decisions

# Example implementation following best practices
def example_function():
    # Your implementation here
    pass

Assumptions / Constraints / Non-goals

Assumptions:
- Development environment is properly configured
- Required dependencies are available
- Team has basic understanding of domain
Constraints:
- Must follow existing codebase conventions
- Time and resource limitations
- Compatibility requirements
Non-goals:
- This skill does not cover edge cases outside scope
- Not a replacement for formal training

Compatibility & Prerequisites

Supported Versions:
- Python 3.8+
- Node.js 16+
- Modern browsers (Chrome, Firefox, Safari, Edge)
Required AI Tools:
- Code editor (VS Code recommended)
- Testing framework appropriate for language
- Version control (Git)
Dependencies:
- Language-specific package manager
- Build tools
- Testing libraries
Environment Setup:
- `.env.example` keys: `API_KEY`, `DATABASE_URL` (no values)

Test Scenario Matrix (QA Strategy)

Type	Focus Area	Required Scenarios / Mocks
Unit	Core Logic	Must cover primary logic and at least 3 edge/error cases. Target minimum 80% coverage
Integration	DB / API	All external API calls or database connections must be mocked during unit tests
E2E	User Journey	Critical user flows to test
Performance	Latency / Load	Benchmark requirements
Security	Vuln / Auth	SAST/DAST or dependency audit
Frontend	UX / A11y	Accessibility checklist (WCAG), Performance Budget (Lighthouse score)

Technical Guardrails & Security Threat Model

1. Security & Privacy (Threat Model)

Top Threats: Injection attacks, authentication bypass, data exposure

Data Handling: Sanitize all user inputs to prevent Injection attacks. Never log raw PII
Secrets Management: No hardcoded API keys. Use Env Vars/Secrets Manager
Authorization: Validate user permissions before state changes
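
As an illustration of the secrets-management rule, the sketch below reads a key from the environment instead of hardcoding it. The helper names `load_secret` and `redact` are hypothetical, and the choice to show only the last four characters in logs is an assumption, not something the skill prescribes.

```python
import os


def load_secret(name: str) -> str:
    """Read a required secret from the environment, failing fast if unset."""
    value = os.environ.get(name)
    if not value:
        # Report only the variable name; never echo secret values.
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


def redact(secret: str, visible: int = 4) -> str:
    """Mask a secret for log output, keeping only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

A caller might use `redact(load_secret("API_KEY"))` when a log line needs to confirm which key is in use without exposing it.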

2. Performance & Resources

Execution Efficiency: Consider time complexity for algorithms
Memory Management: Use streams/pagination for large data
Resource Cleanup: Close DB connections/file handlers in finally blocks
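
The memory-management and cleanup rules can be combined in one small sketch: rows are fetched a page at a time, and the connection is closed in a `finally` block even if iteration fails. This assumes a SQLite database with a hypothetical `events` table; `iter_rows` is an illustrative name.

```python
import sqlite3
from typing import Iterator, Tuple


def iter_rows(db_path: str, page_size: int = 100) -> Iterator[Tuple]:
    """Yield rows from the hypothetical `events` table one page at a time."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute("SELECT id, message FROM events ORDER BY id")
        while True:
            # fetchmany keeps at most page_size rows in memory per step.
            page = cursor.fetchmany(page_size)
            if not page:
                break
            yield from page
    finally:
        # Runs when the generator is exhausted, closed, or raises.
        conn.close()
```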

3. Architecture & Scalability

Design Pattern: Follow SOLID principles, use Dependency Injection
Modularity: Decouple logic from UI/Frameworks

4. Observability & Reliability

Logging Standards: Structured JSON, include trace IDs (`request_id`)
Metrics: Track `error_rate`, `latency`, `queue_depth`
Error Handling: Standardized error codes, no bare except
Observability Artifacts:
- Log Fields: timestamp, level, message, request_id
- Metrics: request_count, error_count, response_time
- Dashboards/Alerts: High Error Rate > 5%

Agent Directives & Error Recovery

(Requirements for the AI agent on how to reason about and resolve problems when errors occur)

Thinking Process: Analyze root cause before fixing. Do not brute-force.
Fallback Strategy: Stop after 3 failed test attempts. Output root cause and ask for human intervention/clarification.
Self-Review: Check against Guardrails & Anti-patterns before finalizing.
Output Constraints: Output ONLY the modified code block. Do not explain unless asked.
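
The fallback strategy can be read as a bounded retry loop. The sketch below assumes a hypothetical `attempt_fix` callback that applies a candidate fix, runs the tests, and returns whether they passed along with a short diagnosis; how an agent actually produces fixes is outside its scope.

```python
from typing import Callable, List, Tuple

MAX_ATTEMPTS = 3  # limit taken from the fallback strategy above


def run_with_fallback(
    attempt_fix: Callable[[int], Tuple[bool, str]],
) -> Tuple[bool, List[str]]:
    """Call attempt_fix up to MAX_ATTEMPTS times.

    attempt_fix is a hypothetical callback returning (passed, diagnosis).
    Returns (passed, diagnoses); a False result means the caller should
    stop, report the collected diagnoses, and ask for human input.
    """
    diagnoses: List[str] = []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        passed, diagnosis = attempt_fix(attempt)
        diagnoses.append(diagnosis)
        if passed:
            return True, diagnoses
    return False, diagnoses
```

Returning the diagnoses from every attempt, rather than only the last one, gives the human reviewer the trail of what was tried.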

Definition of Done (DoD) Checklist

Tests passed + coverage met
Lint/Typecheck passed
Logging/Metrics/Trace implemented
Security checks passed
Documentation/Changelog updated
Accessibility/Performance requirements met (if frontend)

Anti-patterns / Pitfalls

⛔ Don't: Log PII, use catch-all exception handlers, or issue N+1 queries
⚠️ Watch out for: Common symptoms and quick fixes
💡 Instead: Use proper error handling, pagination, and logging

Reference Links & Examples

Internal documentation and examples
Official documentation and best practices
Community resources and discussions

Versioning & Changelog

Version: 1.0.0
Changelog:
- 2026-02-22: Initial version with complete template structure