Claude-skill-registry intermittent-issue-debugging
Debug issues that occur sporadically and are hard to reproduce. Use monitoring and systematic investigation to identify root causes of flaky behavior.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/intermittent-issue-debugging" ~/.claude/skills/majiayu000-claude-skill-registry-intermittent-issue-debugging && rm -rf "$T"
manifest: skills/data/intermittent-issue-debugging/SKILL.md
Intermittent Issue Debugging
Overview
Intermittent issues are among the hardest to debug because they don't occur consistently. A systematic approach and comprehensive monitoring are essential.
When to Use
- Sporadic errors in logs
- Users report occasional issues
- Flaky tests
- Race conditions suspected
- Timing-dependent bugs
- Resource exhaustion issues
Instructions
1. Capturing Intermittent Issues
```javascript
// Strategy 1: Comprehensive Logging
// Add detailed logging around suspected code
function processPayment(orderId) {
  const startTime = Date.now();
  console.log(`[${startTime}] Payment start: order=${orderId}`);
  try {
    const result = chargeCard(orderId);
    console.log(`[${Date.now()}] Payment success: ${orderId}`);
    return result;
  } catch (error) {
    const duration = Date.now() - startTime;
    console.error(`[${Date.now()}] Payment FAILED:`, {
      order: orderId,
      error: error.message,
      duration_ms: duration,
      error_type: error.constructor.name,
      stack: error.stack
    });
    throw error;
  }
}

// Strategy 2: Correlation IDs
// Track requests across systems
const correlationId = generateId();
logger.info({ correlationId, action: 'payment_start', orderId: 123 });
chargeCard(orderId, { headers: { correlationId } });
logger.info({ correlationId, action: 'payment_end', status: 'success' });
// Later, grep logs by correlationId to see the full trace

// Strategy 3: Error Sampling
// Capture full error context when an error occurs
window.addEventListener('error', (event) => {
  const errorData = {
    message: event.message,
    url: event.filename,
    line: event.lineno,
    col: event.colno,
    stack: event.error?.stack,
    userAgent: navigator.userAgent,
    memory: performance.memory?.usedJSHeapSize,
    timestamp: new Date().toISOString()
  };
  sendToMonitoring(errorData); // Send to error tracking
});
```
2. Common Intermittent Issues
Issue: Race Condition

Symptom: Inconsistent behavior depending on timing

Example:
- Thread 1: Read count (5)
- Thread 2: Read count (5), increment to 6, write
- Thread 1: Increment to 6, write (overwrites Thread 2's update)
- Result: Should be 7, but is 6

Debug:
1. Add detailed timestamps
2. Log all operations
3. Look for overlapping operations
4. Check whether order matters

Solution:
- Use locks/mutexes
- Use atomic operations
- Use message queues
- Ensure a single writer

---

Issue: Timing-Dependent Bug

Symptom: Test passes sometimes, fails other times

Example (test_user_creation):
1. Create user (sometimes slow)
2. Check user exists
3. Fails if create took too long

Debug:
- Add timeout logging
- Increase wait time
- Add explicit waits
- Mock slow operations

Solution:
- Explicitly wait for the condition
- Remove time-dependent assertions
- Use proper test fixtures

---

Issue: Resource Exhaustion

Symptom: Works fine at first, then fails over time

Example:
- Memory grows over time
- Connection pool exhausted
- Disk space fills up
- Max open files reached

Debug:
- Monitor resources continuously
- Check for leaks (memory growth)
- Monitor connection count
- Check long-running processes

Solution:
- Fix the memory leak
- Increase resource limits
- Implement cleanup
- Add monitoring/alerts

---

Issue: Intermittent Network Failure

Symptom: API calls occasionally fail

Debug:
- Check network logs
- Identify timeout patterns
- Check if time-of-day dependent
- Check if load dependent

Solution:
- Implement retry with exponential backoff
- Add a circuit breaker
- Increase timeouts
- Add redundancy
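The lost-update race above can be sketched in plain JavaScript, along with one of the suggested fixes (a lock that ensures a single writer). The names here (`withLock`, the 10 ms delay) are illustrative, not part of the skill:

```javascript
let count = 5;

// Racy: both writers read 5 before either writes, so one increment is lost.
async function racyIncrement() {
  const current = count;                      // read
  await new Promise(r => setTimeout(r, 10));  // simulated work between read and write
  count = current + 1;                        // write (may clobber a concurrent writer)
}

// Fix: a tiny promise-based mutex serializes writers.
let queue = Promise.resolve();
function withLock(fn) {
  const run = queue.then(fn);
  queue = run.catch(() => {}); // keep the chain alive even if fn throws
  return run;
}

async function safeIncrement() {
  await withLock(async () => {
    const current = count;
    await new Promise(r => setTimeout(r, 10));
    count = current + 1;
  });
}

async function main() {
  count = 5;
  await Promise.all([racyIncrement(), racyIncrement()]);
  console.log(count); // 6 — one increment was lost

  count = 5;
  await Promise.all([safeIncrement(), safeIncrement()]);
  console.log(count); // 7 — updates serialized through the lock
}
main();
```

The same idea carries over to any single-process async code; across processes or threads, use the database's atomic operations or a distributed lock instead.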
3. Systematic Investigation Process
Step 1: Understand the Pattern

Questions:
- How often does it occur? (1 in 100? 1 in 1000?)
- When does it occur? (time of day, load, specific user?)
- What are the conditions? (network, memory, load?)
- Is it reproducible? (deterministic or random?)
- Any recent changes?

Analysis:
- Review error logs
- Check error rate trends
- Identify patterns
- Correlate with changes

Step 2: Reproduce Reliably

Methods:
- Increase test frequency (run 1000 times)
- Stress test (heavy load)
- Simulate poor conditions (network, memory)
- Run on different machines
- Run in a production-like environment

Goal: Make the issue consistent enough to analyze

Step 3: Add Instrumentation
- Add detailed logging
- Add monitoring metrics
- Add trace IDs
- Capture errors fully
- Log system state

Step 4: Capture the Issue
- Recreate the scenario
- Capture full context
- Note system state
- Document conditions
- Obtain a reproduction case

Step 5: Analyze Data
- Review logs
- Look for patterns
- Compare normal vs. error cases
- Check timing correlations
- Identify the root cause

Step 6: Implement Fix
- Base the fix on the root cause
- Verify with the reproduction case
- Test extensively
- Add a regression test
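Step 2 (reproduce reliably) can be sketched as a small harness that runs the suspect code many times and records each failure with its context. `flakyOperation` is a placeholder for the code under suspicion:

```javascript
// Placeholder for the operation being investigated.
async function flakyOperation() {
  if (Math.random() < 0.01) throw new Error('intermittent failure');
}

// Run the operation `runs` times, collecting failure details for analysis.
async function stressRepro(operation, runs = 1000) {
  const failures = [];
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    try {
      await operation();
    } catch (error) {
      failures.push({
        run: i,
        error: error.message,
        duration_ms: Date.now() - start
      });
    }
  }
  console.log(`${failures.length}/${runs} runs failed`);
  return failures;
}

stressRepro(flakyOperation).then(failures => {
  // Compare failing runs against passing ones: timing, inputs, system state.
  if (failures.length > 0) console.log('First failure:', failures[0]);
});
```

The recorded durations and messages feed directly into Step 5: compare the failing runs against the passing ones and look for timing correlations.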
4. Monitoring & Prevention
Monitoring Strategy:

Real User Monitoring (RUM):
- Error rates by feature
- Latency percentiles
- User impact
- Trend analysis

Application Performance Monitoring (APM):
- Request traces
- Database query performance
- External service calls
- Resource usage

Synthetic Monitoring:
- Regular test execution
- Simulated user flows
- Alerts on failures
- Trend tracking

---

Alerting:

Set up alerts for:
- Error rate spikes
- Response time above threshold
- Memory growth trend
- Failed transactions

---

Prevention Checklist:
[ ] Comprehensive logging in place
[ ] Error tracking configured
[ ] Performance monitoring active
[ ] Resource monitoring enabled
[ ] Correlation IDs used
[ ] Failed requests captured
[ ] Timeout values appropriate
[ ] Retry logic implemented
[ ] Circuit breakers in place
[ ] Load testing performed
[ ] Stress testing performed
[ ] Race conditions reviewed
[ ] Timing dependencies checked

---

Tools:

Monitoring:
- New Relic / Datadog
- Prometheus / Grafana
- Sentry / Rollbar
- Custom logging

Testing:
- Load testing (k6, JMeter)
- Chaos engineering (Gremlin)
- Property-based testing (Hypothesis)
- Fuzz testing

Debugging:
- Distributed tracing (Jaeger)
- Correlation IDs
- Detailed logging
- Debuggers
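The retry and circuit-breaker items from the checklist can be sketched together; the thresholds and names (`makeBreaker`, `callWithRetry`) are illustrative, not from a specific library:

```javascript
// A simple failure-counting circuit breaker: after `failureThreshold`
// consecutive failures it fails fast, then allows a retry after `resetMs`.
function makeBreaker({ failureThreshold = 5, resetMs = 30000 } = {}) {
  let failures = 0;
  let openedAt = 0;
  return {
    canRequest() {
      if (failures < failureThreshold) return true;
      return Date.now() - openedAt >= resetMs; // half-open after cooldown
    },
    recordSuccess() { failures = 0; },
    recordFailure() {
      failures++;
      if (failures === failureThreshold) openedAt = Date.now();
    }
  };
}

const breaker = makeBreaker();

// Retry with exponential backoff and jitter, consulting the breaker first.
async function callWithRetry(fn, { retries = 4, baseMs = 100 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    if (!breaker.canRequest()) throw new Error('circuit open; failing fast');
    try {
      const result = await fn();
      breaker.recordSuccess();
      return result;
    } catch (error) {
      breaker.recordFailure();
      if (attempt === retries) throw error;
      // Backoff doubles each attempt: ~100ms, ~200ms, ~400ms, ...
      const delay = baseMs * 2 ** attempt + Math.random() * baseMs;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
```

Usage would look like `callWithRetry(() => fetch(url))`: transient failures are retried with increasing delays, while a persistently failing dependency trips the breaker so callers fail fast instead of piling up.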
Key Points
- Comprehensive logging is essential
- Add correlation IDs for tracing
- Monitor for patterns and trends
- Stress test to reproduce
- Use detailed error context
- Implement exponential backoff for retries
- Monitor resource exhaustion
- Add circuit breakers for external services
- Log system state with errors
- Implement proper monitoring/alerting