Spartan-ai-toolkit service-debugging
Structured debugging runbook for backend services. Use when investigating production issues, API errors, performance problems, or when something broke and you need to find why.
install
source · Clone the upstream repo
git clone https://github.com/c0x12c/ai-toolkit
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/c0x12c/ai-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/toolkit/skills/service-debugging" ~/.claude/skills/spartan-stratos-spartan-ai-toolkit-service-debugging && rm -rf "$T"
manifest:
toolkit/skills/service-debugging/SKILL.md
Service Debugging
Structured approach to investigating and fixing service issues. Symptoms in, root cause out.
When to Use
- API endpoint returning errors (4xx, 5xx)
- Performance degradation or slow queries
- Service not starting or crashing
- Data inconsistency between services
- After a deploy when something broke
- User reports "something is broken"
Process
1. Gather Symptoms
Before touching code, collect:
- What's broken? (specific endpoint, feature, or behavior)
- When did it start? (after a deploy? gradually? suddenly?)
- Who's affected? (all users, specific users, specific data?)
- Error messages? (logs, HTTP responses, stack traces)
2. Check the Obvious
Run these first — they catch 80% of issues:
# Recent deploys (did someone push something?)
git log --oneline -10

# Service health
curl -s http://localhost:8080/health | jq .

# Recent errors in logs
grep -i "error\|exception\|fatal" logs/app.log | tail -20

# Database connectivity
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "SELECT 1"

# Environment variables (missing or wrong?)
env | grep -i "DB_\|API_\|SECRET_" | sort
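When the log is large, a per-keyword count can show at a glance whether errors are present before you read individual lines. The following is a minimal sketch, not part of the original runbook: `summarize_errors` is a hypothetical helper name, and it assumes a plain-text log such as `logs/app.log`.

```shell
# summarize_errors LOGFILE
# Prints how many lines mention each severity keyword (case-insensitive).
# Hypothetical helper: the keyword list mirrors the grep pattern used above.
summarize_errors() (
  log="$1"
  [ -r "$log" ] || { echo "cannot read $log" >&2; exit 1; }
  for level in error exception fatal; do
    # grep -c prints the number of matching lines; it exits 1 when that
    # number is 0, so "|| true" keeps the loop going under "set -e".
    count=$(grep -ci -- "$level" "$log" || true)
    printf '%s=%s\n' "$level" "$count"
  done
)
```

Usage: `summarize_errors logs/app.log`. A line that contains two keywords is counted once under each.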
3. Narrow Down
| Symptom | Check First |
|---|---|
| 500 errors | Stack trace in logs → find the throwing line |
| 404 errors | Route registration → is the controller loaded? |
| 401/403 errors | Auth config → is @Secured correct? Token valid? |
| Slow response | Database → run EXPLAIN on the slow query |
| Timeout | External service → is the downstream API responding? |
| Data missing | Soft delete → is deleted_at set? Wrong query filter? |
| Service won't start | Bean creation → check @Factory and @Singleton wiring |
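To decide which row of the table applies, it can help to see which status codes dominate recent traffic. This is a rough sketch under stated assumptions: `status_counts` is a hypothetical helper, and it assumes an access log in the common/combined format, where the status code is the ninth whitespace-separated field. Pass a different field number if your log layout differs.

```shell
# status_counts LOGFILE [FIELD]
# Counts requests per HTTP status code.
# FIELD is the whitespace-separated column holding the status
# (assumed to be 9, as in the common/combined access log format).
status_counts() (
  log="$1"
  field="${2:-9}"
  # Tally each distinct value of the chosen column, then sort by status code
  awk -v f="$field" '{ n[$f]++ } END { for (s in n) print s, n[s] }' "$log" | sort
)
```

Each output line is a status code followed by its count, so a spike in 500s points at the stack-trace row and a spike in 401/403 at the auth row.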
4. Reproduce
- Can you trigger the bug locally?
- What's the minimal request that fails?
- Does it fail consistently or intermittently?
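To answer the "consistently or intermittently" question with a number, one option is to repeat the minimal failing request and count the failures. A small sketch, where `failure_rate` is a hypothetical helper name and any command that exits non-zero on failure can be plugged in:

```shell
# failure_rate RUNS CMD [ARGS...]
# Runs CMD RUNS times and prints "failed/total".
# A run counts as failed when CMD exits with a non-zero status.
failure_rate() (
  runs="$1"; shift
  failed=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1 || failed=$((failed + 1))
    i=$((i + 1))
  done
  echo "$failed/$runs"
)
```

For example, `failure_rate 20 curl -fsS --max-time 5 http://localhost:8080/health` reuses the health URL from step 2; `-f` makes curl exit non-zero on HTTP errors, so error responses are counted as failures.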
5. Find Root Cause
Use git bisect if it's a regression:
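git bisect start
git bisect bad HEAD
git bisect good <last-known-good-commit>
# Git checks out a commit between good and bad. Test it, then mark it:
#   git bisect good    (the bug is absent)
#   git bisect bad     (the bug is present)
# Repeat until git reports the first bad commit, then clean up:
git bisect reset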
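If the bug can be detected by a command, `git bisect run` can do the marking for you. The wrapper below is a sketch rather than a prescribed workflow: `bisect_with_test` is a hypothetical name, and it assumes you run it from the repository root with a test command that exits 0 on a good commit and non-zero on a bad one.

```shell
# bisect_with_test GOOD_COMMIT CMD [ARGS...]
# Bisects between GOOD_COMMIT and HEAD, running CMD at each step.
# git bisect run treats exit 0 as good, 125 as "skip this commit",
# and any other code from 1 to 127 as bad.
bisect_with_test() (
  good="$1"; shift
  git bisect start || exit 1
  git bisect bad HEAD
  git bisect good "$good"
  git bisect run "$@"
  status=$?
  # Return to the commit that was checked out before bisecting
  git bisect reset
  exit "$status"
)
```

Usage might look like `bisect_with_test <last-known-good-commit> ./gradlew test`, where `./gradlew test` is an assumed test command; substitute whatever reproduces the failure in your project. The output includes a line of the form `<sha> is the first bad commit`.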
Use grep to find related code:
# Find where the error message comes from
grep -r "error message text" --include="*.kt" src/

# Find all callers of a broken function
grep -r "functionName" --include="*.kt" src/
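Error messages often contain characters such as `.`, `[` or `*` that grep would otherwise treat as a pattern. A possible variant of the search above, with `find_message` as a hypothetical helper name, adds `-F` to match the text literally and `-n` to print line numbers:

```shell
# find_message TEXT [DIR]
# Searches Kotlin sources under DIR (default: src/) for the literal TEXT.
#   -F  treat TEXT as a fixed string, not a regular expression
#   -n  prefix each match with its line number
find_message() {
  grep -rn -F --include='*.kt' -- "$1" "${2:-src/}"
}
```

Usage: `find_message 'error message text'`. Each match is printed as `path:line:content`, which most editors can jump to directly.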
6. Fix and Verify
- Write a test that reproduces the bug (red)
- Fix the code (green)
- Run full test suite
- Test manually if it's a user-facing issue
See common-issues.md for a catalog of frequently seen bugs and their fixes.
Gotchas
- Don't fix the symptom, fix the cause. Adding a null check that hides a data issue means the data issue will bite you later.
- Check the deploy log before blaming the code. Config changes, environment variable updates, and infra changes cause more outages than code bugs.
- "It works on my machine" usually means environment difference. Compare local env vars, database state, and service versions with the target environment.
- Intermittent failures usually point to race conditions or resource contention. If it fails 1 in 10 times, look for concurrent access, shared mutable state, or connection pool exhaustion.
- Don't restart the service as your first debugging step. You'll lose the state that helps you diagnose. Read logs first, then restart if needed.
- Soft-deleted records are the #1 "data missing" cause. Always check deleted_at IS NULL in your queries.
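For the "works on my machine" case, comparing variable names between two environments is a quick first pass. The sketch below assumes you have saved each environment to a file of `KEY=value` lines (for instance with `env > local.env`); `env_keys` and `missing_env_keys` are hypothetical helper names. Only names are printed, never values, so secrets stay out of the terminal.

```shell
# env_keys FILE
# Prints the sorted, de-duplicated variable names from KEY=value lines,
# skipping comment lines. Multi-line values are not handled.
env_keys() {
  grep -v '^[[:space:]]*#' "$1" | grep '=' | cut -d= -f1 | LC_ALL=C sort -u
}

# missing_env_keys REFERENCE_FILE TARGET_FILE
# Prints names defined in REFERENCE_FILE but absent from TARGET_FILE.
missing_env_keys() (
  a=$(mktemp); b=$(mktemp)
  env_keys "$1" > "$a"
  env_keys "$2" > "$b"
  # comm -23 keeps lines that appear only in the first (sorted) file
  LC_ALL=C comm -23 "$a" "$b"
  rm -f "$a" "$b"
)
```

Usage: `missing_env_keys local.env target.env` lists variables you rely on locally that the target environment does not define. Swap the arguments to check the other direction.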
Rules
- Always gather symptoms before changing code
- Write a failing test before fixing
- Check recent git history — most bugs are regressions
- Don't deploy a fix without understanding the root cause
- Document the incident if it affected users