Learn-skills.dev multi-ai-debugging
Systematic debugging using Claude, Gemini, and Codex as specialized agents. Multi-agent root cause analysis, log analysis, error classification, and auto-fix generation. Use when debugging production issues, analyzing error logs, performing root cause analysis, troubleshooting complex systems, or implementing self-healing patterns.
git clone https://github.com/NeverSight/learn-skills.dev
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/adaptationio/skrillz/multi-ai-debugging" ~/.claude/skills/neversight-learn-skills-dev-multi-ai-debugging && rm -rf "$T"
data/skills-md/adaptationio/skrillz/multi-ai-debugging/SKILL.mdMulti-AI Debugging
Overview
multi-ai-debugging provides systematic debugging workflows using multiple AI models as specialized agents. Based on 2024-2025 best practices for AI-assisted debugging with multi-agent architectures.
Purpose: Systematic root cause analysis and fix generation using AI ensemble
Pattern: Task-based (6 independent debugging operations)
Key Principles (validated by tri-AI research):
- Multi-Agent Council - Specialized agents debate root causes before consensus
- Evaluator/Critic Loops - Fix agent + critic agent verify solutions
- Trace-Aware Analysis - Full execution context, not just error messages
- Semantic Log Analysis - LLM understanding beyond regex matching
- Cross-Stack Correlation - Connect frontend, backend, infra issues
- Auto-Remediation - Self-healing patterns where safe
Quality Targets:
- Root Cause Identification: >80% accuracy
- Time to Diagnosis: <30 minutes for common issues
- Fix Generation Success: >60% for known patterns
- False Positive Rate: <20% on error classification
When to Use
Use multi-ai-debugging when:
- Debugging production incidents
- Analyzing error logs and stack traces
- Performing root cause analysis (RCA)
- Troubleshooting complex multi-service systems
- Investigating performance degradation
- Understanding cascading failures
- Writing post-mortem reports
When NOT to Use:
- Simple syntax errors (IDE handles these)
- Clear compile-time errors
- Well-documented known issues
Prerequisites
Required
- Error information (logs, stack trace, error message)
- Access to relevant code
Recommended
- Execution traces (if available)
- System metrics/observability data
- Recent change history (git log)
- Gemini CLI for web research
- Codex CLI for deep code analysis
Integration
- OpenTelemetry traces (ideal)
- Log aggregation (CloudWatch, Datadog, etc.)
- APM tools (optional)
Operations
Operation 1: Quick Error Diagnosis
Time: 2-5 minutes Automation: 70% Purpose: Fast initial diagnosis for common errors
Process:
- Analyze Error:
Diagnose this error: Error: [PASTE ERROR MESSAGE] Stack trace: [PASTE STACK TRACE] Provide: 1. What type of error is this? 2. Most likely root cause (1-2 sentences) 3. Immediate fix suggestion 4. Prevention recommendation
- Verify with Gemini (for web research):
gemini -p "Search for solutions to this error: [ERROR MESSAGE] Find: - Common causes - Stack Overflow solutions - GitHub issues with fixes"
- Output: Quick diagnosis with fix suggestion
Operation 2: Root Cause Analysis (RCA)
Time: 15-45 minutes Automation: 50% Purpose: Deep root cause analysis for complex issues
Process:
Step 1: Gather Context (Context Agent)
# Recent changes git log --oneline -20 git diff HEAD~5..HEAD --stat # Related logs grep -r "ERROR\|Exception\|WARN" logs/ | tail -100 # System state # Check for relevant metrics, traces, etc.
Step 2: Hypothesis Generation (Analysis Agent)
Perform root cause analysis: **Error/Symptom**: [DESCRIPTION] **Context**: - Recent changes: [GIT LOG] - Logs: [RELEVANT LOG ENTRIES] - System state: [METRICS/OBSERVATIONS] - When started: [TIMESTAMP] - Affected scope: [USERS/SERVICES] **Tasks**: 1. List 3-5 probable root causes ranked by likelihood 2. For each hypothesis: - Evidence supporting it - Evidence against it - Confidence level (High/Medium/Low) 3. Recommend investigation steps for top hypothesis
Step 3: Cross-Validate (Verification Agent)
Verify this root cause hypothesis: Hypothesis: [TOP HYPOTHESIS] Evidence: [SUPPORTING EVIDENCE] Tasks: 1. What would we expect to see if this is correct? 2. What would disprove this hypothesis? 3. Design a reproduction test 4. Confidence assessment (0-100)
Step 4: Generate RCA Report
## Root Cause Analysis Report **Incident**: [DESCRIPTION] **Date**: [DATE] **Duration**: [DURATION] **Impact**: [USERS/SERVICES AFFECTED] ### Timeline - HH:MM - First error observed - HH:MM - Investigation began - HH:MM - Root cause identified - HH:MM - Fix deployed ### Root Cause [DETAILED EXPLANATION] ### Contributing Factors 1. [FACTOR 1] 2. [FACTOR 2] ### Resolution [FIX APPLIED] ### Prevention 1. [ACTION ITEM 1] 2. [ACTION ITEM 2]
Operation 3: Log Analysis & Classification
Time: 5-15 minutes Automation: 80% Purpose: Analyze and classify error logs
Process:
- Cluster Log Patterns:
Analyze these log entries: [PASTE 50-100 LOG LINES] Tasks: 1. Identify unique error patterns (cluster similar errors) 2. Classify each pattern: - Type: (Bug/Config/Network/Resource/Security/User Error) - Severity: (Critical/High/Medium/Low) - Impact: (Data Loss/Service Down/Degraded/Minor) 3. Count occurrences per pattern 4. Identify the root pattern (original error vs cascading) 5. Recommend priority order for investigation
- Semantic Analysis:
Perform semantic analysis on these logs: [LOG ENTRIES] Looking for: - Anomalies in timing/sequence - Correlation between events - Hidden dependencies - Patterns human might miss
- Output: Classified and prioritized error report
Operation 4: Multi-Agent Council Debugging
Time: 20-60 minutes Automation: 40% Purpose: Complex issues requiring multiple perspectives
Process:
Launch Parallel Agents:
Launch 4 debugging agents for this issue: Issue: [DESCRIPTION] Code: [RELEVANT CODE] Logs: [RELEVANT LOGS] Agent 1 (Code Reviewer): "Analyze the code for bugs. Focus on: - Logic errors - Edge cases - Race conditions - Resource leaks" Agent 2 (Log Analyzer): "Analyze the logs for clues. Focus on: - Error sequences - Timing patterns - State changes - External dependencies" Agent 3 (System Analyst): "Analyze system context. Focus on: - Resource constraints - Configuration issues - Dependency problems - Infrastructure state" Agent 4 (Historical Analyst): "Analyze history. Focus on: - Recent changes that could cause this - Similar past incidents - Regression indicators - Pattern matching to known issues"
Council Deliberation:
Synthesize findings from all debugging agents: Agent 1 (Code): [FINDINGS] Agent 2 (Logs): [FINDINGS] Agent 3 (System): [FINDINGS] Agent 4 (History): [FINDINGS] Tasks: 1. Find consensus root cause (where 2+ agents agree) 2. Resolve conflicting hypotheses 3. Combine evidence for strongest theory 4. Rate overall confidence (0-100) 5. Propose fix with verification steps
Operation 5: Auto-Fix Generation
Time: 10-30 minutes Automation: 60% Purpose: Generate and verify fixes
Process:
Step 1: Generate Fix (Fixer Agent)
Generate a fix for this issue: Issue: [ROOT CAUSE] Code: [AFFECTED CODE] Requirements: 1. Minimal change (fix only the issue) 2. Include error handling 3. Add comments explaining the fix 4. Suggest test cases to verify Output format: - File: [path] - Before: [original code] - After: [fixed code] - Explanation: [why this fixes it]
Step 2: Critique Fix (Critic Agent)
Critique this proposed fix: Issue: [ORIGINAL ISSUE] Proposed Fix: [FIX CODE] Evaluate: 1. Does it actually fix the root cause? 2. Could it introduce new bugs? 3. Edge cases not handled? 4. Performance implications? 5. Security implications? Verdict: APPROVE / NEEDS_REVISION / REJECT
Step 3: Generate Regression Test
Generate a regression test for this fix: Original Bug: [DESCRIPTION] Fix Applied: [FIX CODE] Create test that: 1. Would have caught the original bug 2. Verifies the fix works 3. Tests edge cases 4. Can run in CI/CD
Operation 6: Self-Healing Patterns
Time: Variable Automation: 90% Purpose: Automated detection and remediation
Process:
Define Remediation Playbooks:
# Example: Auto-remediation patterns PLAYBOOKS = { "disk_space_low": { "detection": "disk_usage > 90%", "actions": [ "compress_old_logs", "clear_temp_files", "alert_if_still_high" ] }, "memory_leak_detected": { "detection": "memory_growth > 10%/hour", "actions": [ "capture_heap_dump", "graceful_restart", "alert_team" ] }, "error_rate_spike": { "detection": "error_rate > 5%", "actions": [ "check_recent_deploys", "consider_rollback", "alert_on_call" ] } }
Configure Circuit Breakers:
# Intelligent circuit breaking class AICircuitBreaker: def should_open(self, metrics): """AI predicts cascading failure risk.""" prompt = f""" Given these metrics: - Error rate: {metrics['error_rate']} - Latency p99: {metrics['latency_p99']} - Dependencies health: {metrics['deps']} Should we open the circuit breaker? Risk of cascade: (Low/Medium/High) Recommendation: (OPEN/CLOSED/HALF_OPEN) """ return analyze(prompt)
Multi-AI Coordination
Agent Assignment Strategy
| Task | Primary | Verification | Strength |
|---|---|---|---|
| Log analysis | Gemini | Claude | Fast, large context |
| Code analysis | Claude | Codex | Deep understanding |
| Root cause | Claude | Gemini | Reasoning + search |
| Fix generation | Claude | Codex | Code + review |
| Research | Gemini | Claude | Web search |
Coordination Commands
Gemini for Log Search:
gemini -p "Analyze these logs and identify anomalies: [LOGS]"
Claude for Root Cause:
Given this debugging context, what's the root cause? [CONTEXT]
Codex for Fix Validation:
codex "Review this fix for correctness and edge cases: [FIX]"
Decision Trees
Error Type Classification
Error Type Decision Tree: 1. Is there a stack trace? ├── Yes → Go to Code Error Analysis └── No → Go to System Error Analysis 2. Code Error Analysis: ├── NullPointer/TypeError → Missing null check ├── IndexOutOfBounds → Boundary condition ├── Timeout → Resource/network issue ├── Permission denied → Auth/authz issue └── Unknown → Multi-agent analysis 3. System Error Analysis: ├── Connection refused → Service down ├── Disk full → Resource exhaustion ├── Out of memory → Memory leak/sizing ├── CPU spike → Performance issue └── Unknown → Multi-agent analysis
Severity Assessment
Severity Decision: CRITICAL (P1): - Data loss occurring - Security breach active - Service completely down - Revenue impact immediate HIGH (P2): - Service degraded significantly - Errors affecting >10% users - Potential data integrity issues MEDIUM (P3): - Errors affecting <10% users - Workaround available - Non-critical feature broken LOW (P4): - Cosmetic issues - Edge case errors - No user impact
Integration with Observability
OpenTelemetry Pattern
from opentelemetry import trace tracer = trace.get_tracer(__name__) @tracer.start_as_current_span("debug_session") def debug_with_tracing(error): """Debug with full trace context.""" span = trace.get_current_span() # Add debugging context to trace span.set_attribute("error.type", type(error).__name__) span.set_attribute("error.message", str(error)) # Gather trace context for AI context = { "trace_id": span.get_span_context().trace_id, "parent_spans": get_parent_spans(), "related_logs": get_correlated_logs() } return multi_agent_debug(context)
Log Correlation
def correlate_logs_for_debugging(trace_id, time_window="5m"): """Gather all logs related to an incident.""" # Query logs with trace correlation logs = query_logs(f""" trace_id:{trace_id} OR (level:ERROR AND timestamp:[now-{time_window} TO now]) """) # Feed to AI for analysis return analyze_logs_with_ai(logs)
Example Debugging Session
User: My API is returning 500 errors intermittently Claude: Let me run multi-agent debugging to diagnose this. [Launches Context, Analysis, and Verification agents] ## Debugging Report: Intermittent 500 Errors ### Agent Findings **Log Analyzer**: - Pattern: Errors spike every 5 minutes - Correlation: Occurs during background job runs - Log cluster: "Connection pool exhausted" **Code Reviewer**: - Found: Database connections not returned to pool in /api/users.py:145 - Missing: try/finally block for connection cleanup **System Analyst**: - Connection pool max: 20 - Concurrent requests during spike: 25+ - Background job opens 10 connections **Historical**: - Started after commit abc123 (2 days ago) - Commit added new background job without connection limit ### Consensus Root Cause (Confidence: 92%) Connection pool exhaustion due to: 1. New background job consuming connections 2. Missing connection cleanup in API handler ### Proposed Fix ```python # /api/users.py:145 # Before def get_user(id): conn = pool.get_connection() result = conn.execute("SELECT * FROM users WHERE id = ?", id) return result # Connection never returned! # After def get_user(id): conn = pool.get_connection() try: result = conn.execute("SELECT * FROM users WHERE id = ?", id) return result finally: pool.return_connection(conn) # Always return
Verification Steps
- Add connection tracking metric
- Run load test
- Monitor pool usage during background job
Prevention
- Add connection timeout
- Implement connection leak detection
- Review all pool usage
--- ## Related Skills - **multi-ai-code-review**: Review fixes before deployment - **multi-ai-verification**: Verify fix effectiveness - **multi-ai-testing**: Generate regression tests - **ecs-troubleshooting**: Container-specific debugging - **railway-troubleshooting**: Railway platform debugging --- ## References - `references/log-analysis-patterns.md` - Log analysis techniques - `references/self-healing-playbooks.md` - Auto-remediation patterns