Aiwg troubleshooting-guide

Generate troubleshooting documentation

install
source · Clone the upstream repo
git clone https://github.com/jmagly/aiwg
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agentic/code/frameworks/sdlc-complete/skills/troubleshooting-guide" ~/.claude/skills/jmagly-aiwg-troubleshooting-guide-e92ec9 && rm -rf "$T"
manifest: agentic/code/frameworks/sdlc-complete/skills/troubleshooting-guide/SKILL.md
source content

Troubleshooting Guide Generator Command

Generate troubleshooting documentation

Instructions

Follow this systematic approach to create troubleshooting guides: $ARGUMENTS

  1. System Overview and Architecture

    • Document the system architecture and components
    • Map out dependencies and integrations
    • Identify critical paths and failure points
    • Create system topology diagrams
    • Document data flow and communication patterns
  2. Common Issues Identification

    • Collect historical support tickets and issues
    • Interview team members about frequent problems
    • Analyze error logs and monitoring data
    • Review user feedback and complaints
    • Identify patterns in system failures
  3. Troubleshooting Framework

    • Establish systematic diagnostic procedures
    • Create problem isolation methodologies
    • Document escalation paths and procedures
    • Set up logging and monitoring checkpoints
    • Define severity levels and response times
  4. Diagnostic Tools and Commands

    ## Essential Diagnostic Commands
    
    ### System Health
    ```bash
    # Check system resources
    top                    # CPU and memory usage
    df -h                 # Disk space
    free -m               # Memory usage
    netstat -tuln         # Network connections
    
    # Application logs
    tail -f /var/log/app.log
    journalctl -u service-name -f
    
    # Database connectivity
    mysql -u user -p -e "SELECT 1"
    psql -h host -U user -d db -c "SELECT 1"
    
  5. Issue Categories and Solutions

    Performance Issues:

    ### Slow Response Times
    
    **Symptoms:**
    - API responses > 5 seconds
    - User interface freezing
    - Database timeouts
    
    **Diagnostic Steps:**
    1. Check system resources (CPU, memory, disk)
    2. Review application logs for errors
    3. Analyze database query performance
    4. Check network connectivity and latency
    
    **Common Causes:**
    - Database connection pool exhaustion
    - Inefficient database queries
    - Memory leaks in application
    - Network bandwidth limitations
    
    **Solutions:**
    - Restart application services
    - Optimize database queries
    - Increase connection pool size
    - Scale infrastructure resources
    
  6. Error Code Documentation

    ## Error Code Reference
    
    ### HTTP Status Codes
    - **500 Internal Server Error**
      - Check application logs for stack traces
      - Verify database connectivity
      - Check environment variables
    
    - **404 Not Found**
      - Verify URL routing configuration
      - Check if resources exist
      - Review API endpoint documentation
    
    - **503 Service Unavailable**
      - Check service health status
      - Verify load balancer configuration
      - Check for maintenance mode
    
  7. Environment-Specific Issues

    • Document development environment problems
    • Address staging/testing environment issues
    • Cover production-specific troubleshooting
    • Include local development setup problems
  8. Database Troubleshooting

    ### Database Connection Issues
    
    **Symptoms:**
    - "Connection refused" errors
    - "Too many connections" errors
    - Slow query performance
    
    **Diagnostic Commands:**
    ```sql
    -- Check active connections
    SHOW PROCESSLIST;
    
    -- Check database size
    SELECT table_schema, 
           ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS 'DB Size in MB' 
    FROM information_schema.tables 
    GROUP BY table_schema;
    
    -- Check slow queries
    SHOW VARIABLES LIKE 'slow_query_log';
    
  9. Network and Connectivity Issues

    ### Network Troubleshooting
    
    **Basic Connectivity:**
    ```bash
    # Test basic connectivity
    ping example.com
    telnet host port
    curl -v https://api.example.com/health
    
    # DNS resolution
    nslookup example.com
    dig example.com
    
    # Network routing
    traceroute example.com
    

    SSL/TLS Issues:

    # Check SSL certificate
    openssl s_client -connect example.com:443
    curl -vI https://example.com
    
  10. Application-Specific Troubleshooting

    Memory Issues:

    ### Out of Memory Errors
    
    **Java Applications:**
    ```bash
    # Check heap usage
    jstat -gc [PID]
    jmap -dump:format=b,file=heapdump.hprof [PID]
    
    # Analyze heap dump
    jhat heapdump.hprof
    

    Node.js Applications:

    # Monitor memory usage
    node --inspect app.js
    # Use Chrome DevTools for memory profiling
    
  11. Security and Authentication Issues

    ### Authentication Failures
    
    **Symptoms:**
    - 401 Unauthorized responses
    - Token validation errors
    - Session timeout issues
    
    **Diagnostic Steps:**
    1. Verify credentials and tokens
    2. Check token expiration
    3. Validate authentication service
    4. Review CORS configuration
    
    **Common Solutions:**
    - Refresh authentication tokens
    - Clear browser cookies/cache
    - Verify CORS headers
    - Check API key permissions
    
  12. Deployment and Configuration Issues

    ### Deployment Failures
    
    **Container Issues:**
    ```bash
    # Check container status
    docker ps -a
    docker logs container-name
    
    # Check resource limits
    docker stats
    
    # Debug container
    docker exec -it container-name /bin/bash
    

    Kubernetes Issues:

    # Check pod status
    kubectl get pods
    kubectl describe pod pod-name
    kubectl logs pod-name
    
    # Check service connectivity
    kubectl get svc
    kubectl port-forward pod-name 8080:8080
    
  13. Monitoring and Alerting Setup

    • Configure health checks and monitoring
    • Set up log aggregation and analysis
    • Implement alerting for critical issues
    • Create dashboards for system metrics
    • Document monitoring thresholds
  14. Escalation Procedures

    ## Escalation Matrix
    
    ### Severity Levels
    
    **Critical (P1):** System down, data loss
    - Immediate response required
    - Escalate to on-call engineer
    - Notify management within 30 minutes
    
    **High (P2):** Major functionality impaired
    - Response within 2 hours
    - Escalate to senior engineer
    - Provide hourly updates
    
    **Medium (P3):** Minor functionality issues
    - Response within 8 hours
    - Assign to appropriate team member
    - Provide daily updates
    
  15. Recovery Procedures

    • Document system recovery steps
    • Create data backup and restore procedures
    • Establish rollback procedures for deployments
    • Document disaster recovery processes
    • Test recovery procedures regularly
  16. Preventive Measures

    • Implement monitoring and alerting
    • Set up automated health checks
    • Create deployment validation procedures
    • Establish code review processes
    • Document maintenance procedures
  17. Knowledge Base Integration

    • Link to relevant documentation
    • Reference API documentation
    • Include links to monitoring dashboards
    • Connect to team communication channels
    • Integrate with ticketing systems
  18. Team Communication

    ## Communication Channels
    
    ### Immediate Response
    - Slack: #incidents channel
    - Phone: On-call rotation
    - Email: alerts@company.com
    
    ### Status Updates
    - Status page: status.company.com
    - Twitter: @company_status
    - Internal wiki: troubleshooting section
    
  19. Documentation Maintenance

    • Regular review and updates
    • Version control for troubleshooting guides
    • Feedback collection from users
    • Integration with incident post-mortems
    • Continuous improvement processes
  20. Self-Service Tools

    • Create diagnostic scripts and tools
    • Build automated recovery procedures
    • Implement self-healing systems
    • Provide user-friendly diagnostic interfaces
    • Create chatbot integration for common issues

Advanced Troubleshooting Techniques:

Log Analysis:

# Search for specific errors
grep -i "error" /var/log/app.log | tail -50

# Analyze log patterns
awk '{print $1}' access.log | sort | uniq -c | sort -nr

# Monitor logs in real-time
tail -f /var/log/app.log | grep -i "exception"

Performance Profiling:

# System performance
iostat -x 1
sar -u 1 10
vmstat 1 10

# Application profiling
strace -p [PID]
perf record -p [PID]

Remember to:

  • Keep troubleshooting guides up-to-date
  • Test all documented procedures regularly
  • Collect feedback from users and improve guides
  • Include screenshots and visual aids where helpful
  • Make guides searchable and well-organized

References

  • @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC framework context and documentation standards
  • @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/research-before-decision.md — Research-first for root cause analysis
  • @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/diagram-generation.md — System topology and architecture diagram standards
  • @$AIWG_ROOT/docs/cli-reference.md — CLI reference