Claude-skill-registry agent-ops-debugging

Systematic debugging approaches for isolating and fixing software defects. Use when something isn't working and the cause is unclear.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/agent-ops-debugging" ~/.claude/skills/majiayu000-claude-skill-registry-agent-ops-debugging && rm -rf "$T"
manifest: skills/data/agent-ops-debugging/SKILL.md

Skill: agent-ops-debugging


Purpose

Systematic problem isolation, root cause analysis, and defect resolution. Use when something isn't working and the cause is unclear.


Core Principles

1. Understand Before Acting

  • Reproduce the issue: Can you consistently trigger the problem?
  • Define expected vs actual: What should happen vs what is happening?
  • Gather context: When does this occur? Under what conditions?
  • Recent changes: What changed before this appeared?

2. Isolate the Problem

  • Binary search: Comment out half the code, test, repeat
  • Minimize reproduction: Create minimal test case
  • Control variables: Change one thing at a time
  • Eliminate noise: Remove unrelated factors
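The binary-search idea applies to any ordered list of suspects (commits, config changes, input records), not only to lines of code. The sketch below assumes the failure is monotonic, meaning that once a prefix of the changes fails, every longer prefix fails too. The names `first_failing` and `fails_with` are hypothetical, chosen for this illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_failing(changes: Sequence[T], fails_with: Callable[[Sequence[T]], bool]) -> T:
    """Return the first change whose inclusion makes the failure appear.

    Assumes the empty prefix passes and that the failure is monotonic:
    once a prefix of `changes` fails, every longer prefix fails too.
    """
    if not fails_with(changes):
        raise ValueError("the full set of changes does not reproduce the failure")
    lo, hi = 0, len(changes)  # prefix of length lo passes, prefix of length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails_with(changes[:mid]):
            hi = mid  # culprit is within the first `mid` changes
        else:
            lo = mid  # culprit is after the first `mid` changes
    return changes[hi - 1]


# Usage: the predicate would normally apply the changes and rerun the reproduction.
changes = ["upgrade-lib", "rename-field", "new-cache", "tweak-timeout"]
culprit = first_failing(changes, lambda applied: "new-cache" in applied)
print(culprit)  # new-cache
```

Each round halves the search space, so roughly log2(n) test runs are enough for n changes.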

3. Form Hypotheses

  • State your assumption: "I believe X is causing Y because..."
  • Make predictions: "If my hypothesis is true, then Z should happen"
  • Test predictions: Verify or refute each hypothesis
  • Iterate: Refine hypothesis based on evidence

4. Fix and Verify

  • Address root cause: Not just symptoms
  • Minimize changes: Smallest fix that resolves the issue
  • Add tests: Prevent regression
  • Verify fix: Test the specific scenario and related scenarios

Systematic Debugging Process

Phase 1: Problem Definition

  1. Describe the bug in one sentence
  2. List reproduction steps (minimal set)
  3. Specify expected behavior
  4. Capture actual behavior (screenshots, logs, error messages)
  5. Identify scope: How widespread is this?

Phase 2: Information Gathering

  1. Check logs: Application logs, system logs, crash reports
  2. Inspect state: Database records, cache contents, file system
  3. Review code: Recent changes, related code paths
  4. Compare environments: Dev vs staging vs production differences
  5. Monitor resources: CPU, memory, disk, network during issue

Phase 3: Hypothesis Formation

Common failure patterns:

| Pattern | Symptoms | Where to Look |
|---|---|---|
| Timing issues | Intermittent, "works sometimes" | Race conditions, deadlocks, timeouts |
| State corruption | Wrong data, unexpected mutations | Shared state, caches, global variables |
| Resource exhaustion | Slows down, eventually fails | Memory leaks, connection pools |
| Configuration | Works elsewhere, fails here | Environment variables, settings files |
| Dependencies | Broke after update | Library versions, API changes |
| Assumption violations | Edge case failures | Code assumes something that isn't true |

Phase 4: Hypothesis Testing

  1. Add logging: Instrument code to verify assumptions
  2. Use debugger: Set breakpoints, inspect variables, step through
  3. Write tests: Create failing test that reproduces bug
  4. Simplify: Remove complexity while preserving failure
  5. Verify: Confirm hypothesis explains all symptoms

Phase 5: Resolution

  1. Implement fix: Address root cause, not symptoms
  2. Add regression test: Ensure bug doesn't return
  3. Review similar code: Check for same issue elsewhere
  4. Document: Add comments, update docs if behavior changed
  5. Verify: Test fix works and doesn't break other things

Debugging by Symptom

"It Works on My Machine"

| Check | Action |
|---|---|
| Environment differences | Python versions, OS, dependencies |
| Uncommitted config | Local settings, .env files |
| Race conditions | Timing-dependent issues |
| Data differences | Test with production data subset |
| Resource constraints | Production may have different limits |
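One way to make environment differences concrete is to capture a snapshot on each machine and diff the two. The sketch below uses only the standard library; `environment_snapshot` and `diff_snapshots` are hypothetical helper names, and the snapshot deliberately covers only interpreter version, platform string, and installed package versions.

```python
import platform
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect interpreter, OS, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            packages[name.lower()] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": packages,
    }


def diff_snapshots(local: dict, remote: dict) -> dict:
    """Return only the entries that differ, as (local, remote) pairs."""
    diff = {}
    for key in ("python", "platform"):
        if local[key] != remote[key]:
            diff[key] = (local[key], remote[key])
    names = set(local["packages"]) | set(remote["packages"])
    for name in sorted(names):
        a = local["packages"].get(name)
        b = remote["packages"].get(name)
        if a != b:
            diff[f"package:{name}"] = (a, b)
    return diff
```

In practice each snapshot could be dumped with `json.dumps` on its own machine and the two files compared afterwards.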

Intermittent Failures

| Check | Action |
|---|---|
| Shared state | Global variables, singletons, caches |
| Timing | Race conditions, timeouts, async issues |
| Randomness | Random seeds, shuffling, sampling |
| Resource cleanup | Are resources properly released? |
| External dependencies | Network calls, third-party services |
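When randomness is the suspect, pinning the seed usually turns an intermittent failure into a repeatable one. The sketch below searches for a seed that triggers the failure; `flaky_shuffle_check` is a hypothetical stand-in for the real check, and it assumes the code under test accepts an injected `random.Random` instance rather than using the global generator.

```python
import random


def flaky_shuffle_check(rng: random.Random) -> bool:
    """Hypothetical flaky check: passes unless the shuffle puts 0 first."""
    items = list(range(5))
    rng.shuffle(items)
    return items[0] != 0


def find_failing_seed(check, max_seeds: int = 1000):
    """Try seeds in order and return the first one that makes `check` fail."""
    for seed in range(max_seeds):
        if not check(random.Random(seed)):
            return seed
    return None  # no failure seen within the seed budget
```

Once a failing seed is known, `check(random.Random(seed))` replays the same failure on every run, so it can be stepped through in a debugger or turned into a test.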

Performance Degradation

| Check | Action |
|---|---|
| Profile first | Measure before optimizing |
| O(n²) | Nested loops, repeated work |
| I/O | Database queries, file reads, network |
| Memory | Leaks, large objects, excessive allocations |
| Caching | Repeated expensive operations |
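For "profile first", Python's built-in `cProfile` and `pstats` modules are usually enough to find the hotspot. The sketch below wraps them in a hypothetical `profile_call` helper; `slow_pairs` is an invented quadratic function used only to give the profiler something to measure.

```python
import cProfile
import io
import pstats


def slow_pairs(n: int) -> int:
    """Hypothetical hotspot: O(n^2) pair counting."""
    count = 0
    for i in range(n):
        for j in range(n):
            if i < j:
                count += 1
    return count


def profile_call(func, *args, top: int = 5) -> str:
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

Calling `print(profile_call(slow_pairs, 500))` prints a table in which `slow_pairs` should dominate the cumulative time column.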

Memory Leaks

| Check | Action |
|---|---|
| Profile memory | Track allocations over time |
| Circular references | Reference counting alone can't free cycles |
| Event listeners | Handlers that were never detached keep objects alive |
| Caches | Growing without bounds |
| Static collections | Accumulating entries |
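Tracking allocations over time can be done with the standard `tracemalloc` module by comparing two snapshots. In the sketch below, `_cache` and `leaky_operation` are hypothetical stand-ins for an unbounded cache, and `top_growth` is an illustrative helper name.

```python
import tracemalloc

_cache = []  # hypothetical unbounded cache standing in for a leak


def leaky_operation() -> None:
    _cache.append(bytearray(100_000))


def top_growth(action, repeats: int = 50, limit: int = 3):
    """Return the source lines whose allocated memory grew most across repeats."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        for _ in range(repeats):
            action()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Entries are sorted with the largest absolute size difference first.
    return after.compare_to(before, "lineno")[:limit]
```

Printing each returned entry shows the file, line number, and size difference, which should point at the `_cache.append` line here.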

Deadlocks

| Check | Action |
|---|---|
| Lock order | Identify held locks, acquisition order |
| Cycles | A waits for B, B waits for A |
| Timeouts | Are operations waiting indefinitely? |
| Hold-and-wait | Holding one lock while waiting for another |
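Two of the checks above, lock order and timeouts, can be combined in one small helper. The sketch below sorts locks by `id()` to impose a single acquisition order and uses `Lock.acquire(timeout=...)` so a stuck acquisition raises instead of hanging. `acquire_in_order` and `run_demo` are hypothetical names, and ordering by `id()` is one possible convention, not the only one.

```python
import threading
from contextlib import contextmanager


@contextmanager
def acquire_in_order(*locks, timeout: float = 5.0):
    """Acquire distinct locks in a fixed global order (by id) to avoid cycles.

    Raises TimeoutError instead of waiting forever, so a stuck thread
    surfaces as an error rather than a silent hang.
    """
    held = []
    try:
        for lock in sorted(locks, key=id):
            if not lock.acquire(timeout=timeout):
                raise TimeoutError("could not acquire lock within timeout")
            held.append(lock)
        yield
    finally:
        for lock in reversed(held):
            lock.release()


def run_demo(iterations: int = 1000) -> int:
    """Two threads request the same two locks in opposite argument order."""
    a, b = threading.Lock(), threading.Lock()
    counter = [0]

    def worker(first, second):
        for _ in range(iterations):
            with acquire_in_order(first, second):
                counter[0] += 1

    threads = [
        threading.Thread(target=worker, args=(a, b)),
        threading.Thread(target=worker, args=(b, a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Without the sorting step, the two workers would take the locks in opposite orders, which is the classic setup for the A-waits-for-B, B-waits-for-A cycle.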

Tool-Specific Guidance

Print/Log Statements

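import logging

logger = logging.getLogger(__name__)

# Strategic placement with unique markers so each line is easy to grep
print(f"[DEBUG-001] user_id={user_id}, state={state}")

# Include enough context; lazy %-style args skip formatting when DEBUG is off
logger.debug("Processing item %d/%d: %s", i, total, item.id)

# Remove after debugging!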

Debugger

  • Set breakpoints at suspicious locations, not everywhere
  • Watch expressions for specific variables
  • Check call stack to understand how you got here
  • Step carefully through suspicious code
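To avoid stopping on every iteration, a breakpoint can be made conditional in code. The sketch below uses the built-in `breakpoint()` function; `debug_break_if`, `process`, and the `DEBUG_BREAK` environment variable are hypothetical names invented for this illustration.

```python
import os


def debug_break_if(condition: bool) -> None:
    """Enter the debugger only when `condition` holds and DEBUG_BREAK=1.

    DEBUG_BREAK is a hypothetical environment variable chosen for this sketch.
    """
    if condition and os.environ.get("DEBUG_BREAK") == "1":
        breakpoint()  # in pdb, type 'up' to reach the caller's frame


def process(items):
    """Sum the items, pausing only on the suspicious (negative) ones."""
    total = 0
    for item in items:
        debug_break_if(item < 0)
        total += item
    return total
```

With the variable unset, `process` runs normally, so the hook can stay in place for the duration of an investigation without interrupting other runs.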

Tests for Debugging

  • Write failing test that captures bug reproduction
  • Use git bisect to find when bug was introduced
  • Mock external dependencies to isolate
  • Property-based testing finds edge cases
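`git bisect run` can automate the search when given a script whose exit code classifies each commit: 0 marks it good, 125 tells bisect to skip it, and other codes from 1 to 127 mark it bad. The sketch below shows one way to write such a script in Python; `classify`, `check_commit`, and any build or test commands passed to them are hypothetical and would need to match the project.

```python
import subprocess

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def classify(build_ok: bool, test_ok: bool) -> int:
    """Map one commit's outcome to a `git bisect run` exit code."""
    if not build_ok:
        return SKIP  # commit cannot be tested; let bisect pick another
    return GOOD if test_ok else BAD


def check_commit(build_cmd, test_cmd) -> int:
    """Run the build and test commands for the currently checked-out commit."""
    build = subprocess.run(build_cmd)
    if build.returncode != 0:
        return classify(False, False)
    test = subprocess.run(test_cmd)
    return classify(True, test.returncode == 0)
```

A wrapper script would end with `sys.exit(check_commit(...))` and be started with `git bisect start <bad> <good>` followed by `git bisect run python <script>`.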

Anti-Patterns to Avoid

| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Shotgun debugging | Random changes hoping something works | Form hypothesis, test, refine |
| Symptom treatment | Adding error handling to hide failures | Fix underlying cause |
| Assuming | "This variable can't be null" | Add assertion to verify |
| Overcomplicating | Complex debugging infrastructure | Start simple, add tools as needed |
| Ignoring evidence | Dismissing data that doesn't fit | Revise hypothesis to explain all |
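As an illustration of replacing an assumption with an assertion, the sketch below turns "this variable can't be null" into a check that fails loudly at the point where the assumption breaks. `display_name` and the shape of the `user` record are hypothetical.

```python
from typing import Optional


def display_name(user: Optional[dict]) -> str:
    """Format a user's name for display."""
    # Assumption made explicit: callers must pass a loaded user record.
    assert user is not None, "display_name() called before the user was loaded"
    return user["name"].strip().title()
```

Note that Python strips `assert` statements when run with `-O`, so checks that must hold in production are better written as an explicit `if` that raises.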

Debugging Checklist

Before declaring "debugged":

  • Root cause identified, not just symptom treated
  • Fix is minimal and targeted
  • Regression test added
  • Related code checked for same issue
  • Documentation updated if needed
  • Fix verified in realistic scenario
  • No new issues introduced

When to Escalate

Consider asking for help if:

  • You have made no progress after 2 hours
  • The issue is in an unfamiliar technology stack
  • The problem involves complex distributed systems
  • There are security implications
  • There is a production outage
  • You are going in circles (revisiting the same hypotheses)

Recording Debug Sessions

Track in .agent/focus.md:

## Debugging: [Issue Description]

**Symptom**: [What's happening]
**Expected**: [What should happen]
**Reproduction**: [Steps to trigger]

### Hypotheses
1. [Hypothesis] → [TESTED: result]
2. [Hypothesis] → [PENDING]

### Evidence Gathered
- Log at X showed Y
- Variable Z had value W

### Resolution
[Root cause and fix applied]
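If sessions are recorded often, the template can be filled in programmatically. The sketch below renders an entry with the same headings and appends it to a file; `render_session` and `append_session` are hypothetical helpers, and using "[PENDING]" for an empty resolution is an assumption of this sketch rather than part of the template.

```python
from pathlib import Path


def render_session(issue, symptom, expected, reproduction,
                   hypotheses, evidence, resolution=""):
    """Render one debugging session using the template's headings.

    `hypotheses` is a list of (text, status) pairs, e.g. ("X is stale", "PENDING").
    """
    lines = [
        f"## Debugging: {issue}",
        "",
        f"**Symptom**: {symptom}",
        f"**Expected**: {expected}",
        f"**Reproduction**: {reproduction}",
        "",
        "### Hypotheses",
    ]
    for n, (text, status) in enumerate(hypotheses, start=1):
        lines.append(f"{n}. {text} → [{status}]")
    lines += ["", "### Evidence Gathered"]
    lines += [f"- {item}" for item in evidence]
    lines += ["", "### Resolution", resolution or "[PENDING]"]
    return "\n".join(lines) + "\n"


def append_session(path: Path, entry: str) -> None:
    """Append an entry to the focus file, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(entry + "\n")
```

A call such as `append_session(Path(".agent/focus.md"), render_session(...))` keeps earlier sessions intact, since the file is opened in append mode.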