claude-skill-registry · agent-ops-debugging
Systematic debugging approaches for isolating and fixing software defects. Use when something isn't working and the cause is unclear.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/agent-ops-debugging" ~/.claude/skills/majiayu000-claude-skill-registry-agent-ops-debugging && rm -rf "$T"
manifest:
skills/data/agent-ops-debugging/SKILL.md
safety · automated scan (low risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
- references .env files
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content
Skill: agent-ops-debugging
Systematic debugging approaches for isolating and fixing software defects
Purpose
Systematic problem isolation, root cause analysis, and defect resolution. Use when something isn't working and the cause is unclear.
Core Principles
1. Understand Before Acting
- Reproduce the issue: Can you consistently trigger the problem?
- Define expected vs actual: What should happen vs what is happening?
- Gather context: When does this occur? Under what conditions?
- Recent changes: What changed before this appeared?
2. Isolate the Problem
- Binary search: Comment out half the code, test, repeat
- Minimize reproduction: Create minimal test case
- Control variables: Change one thing at a time
- Eliminate noise: Remove unrelated factors
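The binary-search and minimal-reproduction tactics above can be mechanized when the trigger is a large input. This is a minimal sketch, assuming the caller supplies a `still_fails` predicate that re-runs the reproduction against a candidate input; the function name is illustrative, not part of this skill.

```python
def shrink_failing_input(items, still_fails):
    """Binary-search style shrinking: keep halving the input while the failure persists."""
    current = list(items)
    while len(current) > 1:
        half = len(current) // 2
        first, second = current[:half], current[half:]
        if still_fails(first):
            current = first          # failure survives in the first half
        elif still_fails(second):
            current = second         # failure survives in the second half
        else:
            break                    # failure needs elements from both halves; stop here
    return current                   # smallest input found that still reproduces the bug
```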
3. Form Hypotheses
- State your assumption: "I believe X is causing Y because..."
- Make predictions: "If my hypothesis is true, then Z should happen"
- Test predictions: Verify or refute each hypothesis
- Iterate: Refine hypothesis based on evidence
4. Fix and Verify
- Address root cause: Not just symptoms
- Minimize changes: Smallest fix that resolves the issue
- Add tests: Prevent regression
- Verify fix: Test the specific scenario and related scenarios
Systematic Debugging Process
Phase 1: Problem Definition
- Describe the bug in one sentence
- List reproduction steps (minimal set)
- Specify expected behavior
- Capture actual behavior (screenshots, logs, error messages)
- Identify scope: How widespread is this?
Phase 2: Information Gathering
- Check logs: Application logs, system logs, crash reports
- Inspect state: Database records, cache contents, file system
- Review code: Recent changes, related code paths
- Compare environments: Dev vs staging vs production differences
- Monitor resources: CPU, memory, disk, network during issue
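For the resource-monitoring step, a before-and-after snapshot is often enough to spot growth. A minimal standard-library sketch (the `resource` module is Unix-only, and `ru_maxrss` units differ by platform):

```python
import resource

def resource_snapshot() -> dict:
    """One-shot view of this process's resource usage."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "max_rss": usage.ru_maxrss,   # kilobytes on Linux, bytes on macOS
        "user_cpu_s": usage.ru_utime,
        "system_cpu_s": usage.ru_stime,
    }

# Take one snapshot before reproducing the issue and one after, then compare.
```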
Phase 3: Hypothesis Formation
Common failure patterns:
| Pattern | Symptoms | Where to Look |
|---|---|---|
| Timing issues | Intermittent, "works sometimes" | Race conditions, deadlocks, timeouts |
| State corruption | Wrong data, unexpected mutations | Shared state, caches, global variables |
| Resource exhaustion | Slows down, eventually fails | Memory leaks, connection pools |
| Configuration | Works elsewhere, fails here | Environment variables, settings files |
| Dependencies | Broke after update | Library versions, API changes |
| Assumption violations | Edge case failures | Code assumes something that isn't true |
Phase 4: Hypothesis Testing
- Add logging: Instrument code to verify assumptions
- Use debugger: Set breakpoints, inspect variables, step through
- Write tests: Create failing test that reproduces bug
- Simplify: Remove complexity while preserving failure
- Verify: Confirm hypothesis explains all symptoms
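A failing test that captures the reproduction can double as the hypothesis check. This sketch assumes a hypothetical `parse_price` function suspected of assuming a dot decimal separator; the name and behavior are illustrative only.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

def parse_price(text: str) -> float:
    """Function under suspicion: assumes '.' is the decimal separator."""
    logger.debug("[DEBUG-002] parse_price input=%r", text)  # instrument the assumption
    return float(text.replace("$", ""))

def test_parse_price_handles_comma_decimal():
    # Hypothesis: European-style "1.234,56" violates the separator assumption.
    # This test fails until the root cause is fixed, then becomes the regression test.
    assert parse_price("$1.234,56") == 1234.56
```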
Phase 5: Resolution
- Implement fix: Address root cause, not symptoms
- Add regression test: Ensure bug doesn't return
- Review similar code: Check for same issue elsewhere
- Document: Add comments, update docs if behavior changed
- Verify: Test fix works and doesn't break other things
Debugging by Symptom
"It Works on My Machine"
| Check | Action |
|---|---|
| Environment differences | Python versions, OS, dependencies |
| Uncommitted config | Local settings, .env files |
| Race conditions | Timing-dependent issues |
| Data differences | Test with production data subset |
| Resource constraints | Production may have different limits |
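To surface environment differences, capture a snapshot on both machines and diff the output. A minimal sketch using only the standard library (Python 3.8+ for `importlib.metadata`):

```python
import platform
import sys
from importlib import metadata

def environment_snapshot() -> str:
    """Interpreter, OS, and installed-package versions, one per line, ready to diff."""
    lines = [
        f"python={sys.version.split()[0]}",
        f"platform={platform.platform()}",
    ]
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}" for dist in metadata.distributions()
    )
    return "\n".join(lines + packages)

if __name__ == "__main__":
    print(environment_snapshot())  # run on both machines, then diff the two outputs
```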
Intermittent Failures
| Check | Action |
|---|---|
| Shared state | Global variables, singletons, caches |
| Timing | Race conditions, timeouts, async issues |
| Randomness | Random seeds, shuffling, sampling |
| Resource cleanup | Are resources properly released? |
| External dependencies | Network calls, third-party services |
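For intermittent failures, pinning the random seed and re-running the suspect path many times often turns "works sometimes" into a reliable reproduction. A sketch where `operation` is any callable that exercises the flaky path:

```python
import random

def rerun_until_failure(operation, runs: int = 500, seed: int = 1234) -> None:
    """Repeat a suspect operation under a fixed seed so failures become reproducible."""
    random.seed(seed)  # remove randomness as a variable
    for attempt in range(1, runs + 1):
        try:
            operation()
        except Exception as exc:
            print(f"Failed on attempt {attempt}: {exc!r}")
            raise
    print(f"No failure in {runs} runs; the hypothesis is not confirmed")
```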
Performance Degradation
| Check | Action |
|---|---|
| Profile first | Measure before optimizing |
| O(n²) | Nested loops, repeated work |
| I/O | Database queries, file reads, network |
| Memory | Leaks, large objects, excessive allocations |
| Caching | Repeated expensive operations |
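"Profile first" can be as lightweight as wrapping the slow call with the standard-library profiler. A sketch; `handle_request` in the usage comment is a placeholder for whatever code path slowed down:

```python
import cProfile
import pstats

def profile_slow_path(func, *args, **kwargs):
    """Profile one call and print the ten most expensive entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        return func(*args, **kwargs)
    finally:
        profiler.disable()
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)

# Usage: profile_slow_path(handle_request, request)
```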
Memory Leaks
| Check | Action |
|---|---|
| Profile memory | Track allocations over time |
| Circular references | GC can't collect cycles |
| Event listeners | Detached handlers keeping objects alive |
| Caches | Growing without bounds |
| Static collections | Accumulating entries |
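Tracking allocations over time is built into the standard library: compare `tracemalloc` snapshots taken around the suspect workload. A minimal sketch (`workload` is a placeholder callable):

```python
import tracemalloc

def report_allocation_growth(workload, top: int = 10) -> None:
    """Compare heap snapshots around a workload to see which lines accumulate memory."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    workload()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    for stat in after.compare_to(before, "lineno")[:top]:
        print(stat)  # e.g. "app.py:42: size=1024 KiB (+1024 KiB), count=8192 (+8192)"
```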
Deadlocks
| Check | Action |
|---|---|
| Lock order | Identify held locks, acquisition order |
| Cycles | A waits for B, B waits for A |
| Timeouts | Are operations waiting indefinitely? |
| Hold-and-wait | Holding one lock while waiting for another |
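The usual fix for lock-order deadlocks is to acquire every lock in one global order, regardless of which code path needs them. A sketch with Python's threading locks, using `id()` as an arbitrary but consistent ordering key:

```python
import threading
from contextlib import ExitStack

def acquire_in_order(*locks):
    """Acquire several locks in a single global order to avoid A-waits-for-B cycles."""
    stack = ExitStack()
    for lock in sorted(locks, key=id):  # same order for every thread
        stack.enter_context(lock)
    return stack

lock_a, lock_b = threading.Lock(), threading.Lock()

def transfer():
    # Every caller acquires in the same order, so no thread holds B while waiting for A.
    with acquire_in_order(lock_b, lock_a):
        pass  # critical section touching both resources
```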
Tool-Specific Guidance
Print/Log Statements
```python
# Strategic placement with unique markers
print(f"[DEBUG-001] user_id={user_id}, state={state}")

# Include enough context
logger.debug(f"Processing item {i}/{total}: {item.id}")

# Remove after debugging!
```
Debugger
- Set breakpoints at suspicious locations, not everywhere
- Watch expressions for specific variables
- Check call stack to understand how you got here
- Step carefully through suspicious code
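In Python, dropping into the debugger at a suspicious location is one line, and a conditional guard keeps it from firing on every call (the `order.total` check is illustrative):

```python
def apply_discount(order):
    if order.total < 0:   # only stop when the suspicious state actually appears
        breakpoint()      # opens pdb here; inspect locals and walk the call stack
    return order.total * 0.9
```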
Tests for Debugging
- Write failing test that captures bug reproduction
- Use `git bisect` to find when the bug was introduced
- Mock external dependencies to isolate (see the sketch after this list)
- Property-based testing finds edge cases
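Mocking the external dependency keeps the reproduction deterministic and local. A self-contained sketch with the standard library's `unittest.mock`; the exchange-rate functions are hypothetical stand-ins for a real network call:

```python
from unittest import mock

def get_exchange_rate(currency: str) -> float:
    """Stand-in for a real network call to a rates service."""
    raise RuntimeError("network unavailable in tests")

def convert(amount: float, currency: str) -> float:
    rate = get_exchange_rate(currency)
    if rate <= 0:
        raise ValueError("rate must be positive")
    return amount * rate

def test_convert_rejects_zero_rate():
    # Patch the external call so the bug is isolated from the live service.
    with mock.patch(f"{__name__}.get_exchange_rate", return_value=0.0):
        try:
            convert(100, "EUR")
        except ValueError:
            return
        raise AssertionError("expected ValueError for a zero rate")
```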
Anti-Patterns to Avoid
| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Shotgun debugging | Random changes hoping something works | Form hypothesis, test, refine |
| Symptom treatment | Adding error handling to hide failures | Fix underlying cause |
| Assuming | "This variable can't be null" | Add assertion to verify |
| Overcomplicating | Complex debugging infrastructure | Start simple, add tools as needed |
| Ignoring evidence | Dismissing data that doesn't fit | Revise hypothesis to explain all |
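The "assuming" row is the cheapest to fix: turn the assumption into a check that fails loudly. A short sketch (the `user` fields are illustrative):

```python
def send_receipt(user):
    # Turn "this can't be None" from an assumption into evidence.
    assert user is not None, "send_receipt called with user=None"
    assert user.email, f"user {user.id} has no email address"
    ...
```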
Debugging Checklist
Before declaring "debugged":
- Root cause identified, not just symptom treated
- Fix is minimal and targeted
- Regression test added
- Related code checked for same issue
- Documentation updated if needed
- Fix verified in realistic scenario
- No new issues introduced
When to Escalate
Consider asking for help if:
- You've spent more than two hours without progress
- The issue is in an unfamiliar technology stack
- The problem involves complex distributed systems
- There are security implications
- There's a production outage
- You're going in circles (revisiting the same hypotheses)
Recording Debug Sessions
Track in `.agent/focus.md`:

```markdown
## Debugging: [Issue Description]

**Symptom**: [What's happening]
**Expected**: [What should happen]
**Reproduction**: [Steps to trigger]

### Hypotheses
1. [Hypothesis] → [TESTED: result]
2. [Hypothesis] → [PENDING]

### Evidence Gathered
- Log at X showed Y
- Variable Z had value W

### Resolution
[Root cause and fix applied]
```