Claude-skill-registry root-cause-analysis

Find the true source, not symptoms — systematic debugging from observation to permanent fix

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/other/other/root-cause-analysis-fabioc-aloha-youtube-mcp-vscode" ~/.claude/skills/majiayu000-claude-skill-registry-root-cause-analysis && rm -rf "$T"
manifest: skills/other/other/root-cause-analysis-fabioc-aloha-youtube-mcp-vscode/SKILL.md

Root Cause Analysis Skill

If you fixed it but it came back, you fixed a symptom.

Core Principle

Every symptom has a cause. Every cause has a deeper cause. Keep digging until you reach something you can prevent, not just fix.

5 Whys — Extended Example

| # | Question | Answer |
| --- | --- | --- |
| 1 | Why did the page crash? | JavaScript threw a TypeError on null |
| 2 | Why was the value null? | The API returned an empty response |
| 3 | Why did the API return empty? | The database query timed out |
| 4 | Why did the query time out? | Missing index on a 10M-row table |
| 5 | Why was the index missing? | No performance review in the PR process |

Root cause: a process gap (no performance review), not the missing index. Fix the system: add a performance checklist to the PR template; don't just add the index.

5 Whys Traps

| Trap | Example | How to Avoid |
| --- | --- | --- |
| Stopping at human error | "Dev forgot to add the index" | Ask "Why was it possible to forget?" |
| Single chain only | Only following one branch | Branch at each Why when there are multiple causes |
| Speculation without evidence | "Probably because of..." | Each answer must have evidence |
| Going too deep | Why #12: "Because physics" | Stop when you reach an actionable system change |
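
The branching and evidence rules above can be sketched as a small tree check. This is a minimal illustration, not part of the skill itself: the `Why` class, the `unsupported_answers` helper, and the evidence strings are all hypothetical names and placeholders chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    """One level of a 5 Whys chain (hypothetical structure for illustration)."""
    question: str
    answer: str
    evidence: str = ""  # e.g. a log line, metric, or query plan; empty means speculation
    children: list["Why"] = field(default_factory=list)  # one child per contributing cause


def unsupported_answers(node: Why) -> list[str]:
    """Return every answer in the tree that has no evidence attached."""
    missing = [] if node.evidence.strip() else [node.answer]
    for child in node.children:
        missing.extend(unsupported_answers(child))
    return missing


# Evidence strings below are placeholders, not real incident data.
chain = Why(
    "Why did the page crash?", "JavaScript threw a TypeError on null",
    evidence="browser console stack trace",
    children=[
        Why("Why was the value null?", "The API returned an empty response",
            evidence="network capture of the response",
            children=[
                Why("Why did the API return empty?", "The database query timed out"),
            ]),
    ],
)

print(unsupported_answers(chain))  # ['The database query timed out']
```

Any answer the check returns is still a guess; gather evidence for it before asking the next Why.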

Cause Categories

| Category | Common Patterns | Investigation Tools |
| --- | --- | --- |
| Code | Null reference, off-by-one, race condition, type mismatch | Debugger, unit tests, static analysis |
| Data | Corrupt input, unexpected format, encoding issues | Query logs, data validation, sample inspection |
| Infrastructure | Disk full, memory exhaustion, network partition | Metrics dashboards, health endpoints, `top` / `df` |
| Dependencies | Breaking change, version mismatch, transitive conflict | Lockfile diff, changelog review, `npm ls` |
| Configuration | Wrong env var, feature flag state, missing secret | Config diff, environment comparison |
| Process | Missing review, unclear ownership, no runbook | Post-mortem patterns, team interviews |

Investigation Techniques

Binary Search Debugging

When you don't know where the bug is, halve the search space:

  1. Identify the last known good state (commit, deploy, timestamp)
  2. Run `git bisect` between the good and bad states
  3. Each step: does the bug exist? Yes → go earlier. No → go later.
  4. Result: the exact commit that introduced the bug.
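
The halving logic in the steps above can be sketched in a few lines. This is an illustration of the search itself, not of how `git bisect` is implemented; `first_bad`, `is_bad`, and the commit names are hypothetical, and the sketch assumes a linear history in which the bug, once introduced, stays present.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Find the first bad commit in a history ordered oldest to newest.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the culprit is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # bug present: the culprit is mid or earlier
        else:
            lo = mid  # bug absent: the culprit is later
    return commits[hi]


history = [f"c{i}" for i in range(10)]  # c0 (oldest) .. c9 (newest)
checked = []


def is_bad(commit: str) -> bool:
    """Stand-in for a real test run; pretends c6 introduced the bug."""
    checked.append(commit)
    return int(commit[1:]) >= 6


print(first_bad(history, is_bad), len(checked))  # c6 3
```

Ten commits need only three checks here, which is why bisecting beats reading every diff in the range.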

Timeline Reconstruction

| Time | Event | Source |
| --- | --- | --- |
| T-24h | Deploy v2.3.1 | CI/CD logs |
| T-12h | Config change: cache TTL 60→30s | Config audit log |
| T-2h | First user report | Support tickets |
| T-0 | Alert fired | Monitoring |

Key question: What changed between "working" and "broken"?
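
One way to answer that question is to merge events from every source into a single ordered list and keep only those inside the "working" to "broken" window. The sketch below reuses the example events from the table; the `Event` type, the `changes_between` helper, the concrete timestamp, and the "last confirmed healthy 18 hours before the alert" figure are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    time: datetime
    description: str
    source: str


def changes_between(events, last_good, first_bad):
    """Return events in the window (last_good, first_bad], oldest first."""
    window = [e for e in events if last_good < e.time <= first_bad]
    return sorted(window, key=lambda e: e.time)


alert = datetime(2024, 1, 2, 12, 0)  # arbitrary T-0 chosen for the example
events = [  # deliberately unordered, as if collected from separate systems
    Event(alert, "Alert fired", "Monitoring"),
    Event(alert - timedelta(hours=12), "Config change: cache TTL 60→30s", "Config audit log"),
    Event(alert - timedelta(hours=2), "First user report", "Support tickets"),
    Event(alert - timedelta(hours=24), "Deploy v2.3.1", "CI/CD logs"),
]

# Assumption: the system was last confirmed healthy 18 hours before the alert,
# and the first user report marks the earliest known "broken" moment.
suspects = changes_between(events, alert - timedelta(hours=18), alert - timedelta(hours=2))
for e in suspects:
    print(e.time.isoformat(), e.description, f"({e.source})")
```

Under those assumptions the deploy falls outside the window, leaving the cache TTL change as the only change to investigate before the first report.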

Correlation vs Causation

| Evidence Type | Confidence | Example |
| --- | --- | --- |
| Reproduces on demand | High | "Every time I submit this form..." |
| Correlates with a deploy | Medium | "Started after we deployed" |
| Timing coincidence | Low | "Started Monday" (traffic patterns?) |
| "It's never done this before" | Very Low | Memory is unreliable; check the logs |

Fix + Prevent Pattern

| Phase | Purpose | Example | Deadline |
| --- | --- | --- | --- |
| Immediate | Stop the bleeding | Rollback, disable feature, redirect traffic | Now |
| Permanent | Fix root cause | Add missing index, fix validation, patch dependency | This sprint |
| Prevention | Stop recurrence | Add CI check, monitoring alert, runbook, PR checklist | Next sprint |

Test the fix: The permanent fix should make the immediate fix unnecessary. If you remove the band-aid and the symptom returns, you haven't found the root cause.

Common Symptom → Root Cause Patterns

| Symptom | Obvious Cause | Deeper Root Cause |
| --- | --- | --- |
| Memory leak | Unclosed resource | No resource cleanup pattern in codebase |
| N+1 queries | Missing join | ORM hides query count, no query logging |
| Intermittent test failure | Timing-dependent | Shared mutable state between tests |
| "Works on my machine" | Different environment | No environment parity tooling (Docker, etc.) |
| Data corruption | Missing validation | Validation in UI only, not at API boundary |
| Slow deploys | Large artifact | No build caching, monorepo without selective builds |

Post-Mortem Integration

The RCA section of a post-mortem should include:

  1. The 5 Whys chain (with evidence for each level)
  2. Contributing factors (things that made it worse, not the direct cause)
  3. What we were lucky about (things that could have made it much worse)
  4. Action items with owners and dates for permanent fix + prevention
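
Item 4 is the easiest to check mechanically. The sketch below flags action items that lack an owner or a date; the `ActionItem` structure, the `incomplete_items` helper, the owner name, and the due date are hypothetical examples, not a prescribed post-mortem format.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    phase: str  # "permanent" or "prevention"
    owner: Optional[str] = None
    due: Optional[date] = None


def incomplete_items(items):
    """Return titles of action items missing an owner or a due date."""
    return [i.title for i in items if not i.owner or i.due is None]


items = [
    # Owner and date are made-up placeholders.
    ActionItem("Add missing index", "permanent", owner="alice", due=date(2024, 1, 12)),
    ActionItem("Add performance checklist to PR template", "prevention"),
]

print(incomplete_items(items))  # ['Add performance checklist to PR template']
```

An item that shows up in this list is a wish rather than a commitment; assign it before the post-mortem is closed.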

Synapses

See synapses.json for connections.