git clone https://github.com/jmagly/aiwg
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/troubleshooting-guide" ~/.claude/skills/jmagly-aiwg-troubleshooting-guide && rm -rf "$T"
.agents/skills/troubleshooting-guide/SKILL.mdTroubleshooting Guide Generator Command
Generate troubleshooting documentation
Instructions
Follow this systematic approach to create troubleshooting guides: $ARGUMENTS
-
System Overview and Architecture
- Document the system architecture and components
- Map out dependencies and integrations
- Identify critical paths and failure points
- Create system topology diagrams
- Document data flow and communication patterns
-
Common Issues Identification
- Collect historical support tickets and issues
- Interview team members about frequent problems
- Analyze error logs and monitoring data
- Review user feedback and complaints
- Identify patterns in system failures
-
Troubleshooting Framework
- Establish systematic diagnostic procedures
- Create problem isolation methodologies
- Document escalation paths and procedures
- Set up logging and monitoring checkpoints
- Define severity levels and response times
-
Diagnostic Tools and Commands
## Essential Diagnostic Commands ### System Health ```bash # Check system resources top # CPU and memory usage df -h # Disk space free -m # Memory usage netstat -tuln # Network connections # Application logs tail -f /var/log/app.log journalctl -u service-name -f # Database connectivity mysql -u user -p -e "SELECT 1" psql -h host -U user -d db -c "SELECT 1" -
Issue Categories and Solutions
Performance Issues:
### Slow Response Times **Symptoms:** - API responses > 5 seconds - User interface freezing - Database timeouts **Diagnostic Steps:** 1. Check system resources (CPU, memory, disk) 2. Review application logs for errors 3. Analyze database query performance 4. Check network connectivity and latency **Common Causes:** - Database connection pool exhaustion - Inefficient database queries - Memory leaks in application - Network bandwidth limitations **Solutions:** - Restart application services - Optimize database queries - Increase connection pool size - Scale infrastructure resources -
Error Code Documentation
## Error Code Reference ### HTTP Status Codes - **500 Internal Server Error** - Check application logs for stack traces - Verify database connectivity - Check environment variables - **404 Not Found** - Verify URL routing configuration - Check if resources exist - Review API endpoint documentation - **503 Service Unavailable** - Check service health status - Verify load balancer configuration - Check for maintenance mode -
Environment-Specific Issues
- Document development environment problems
- Address staging/testing environment issues
- Cover production-specific troubleshooting
- Include local development setup problems
-
Database Troubleshooting
### Database Connection Issues **Symptoms:** - "Connection refused" errors - "Too many connections" errors - Slow query performance **Diagnostic Commands:** ```sql -- Check active connections SHOW PROCESSLIST; -- Check database size SELECT table_schema, ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS 'DB Size in MB' FROM information_schema.tables GROUP BY table_schema; -- Check slow queries SHOW VARIABLES LIKE 'slow_query_log'; -
Network and Connectivity Issues
### Network Troubleshooting **Basic Connectivity:** ```bash # Test basic connectivity ping example.com telnet host port curl -v https://api.example.com/health # DNS resolution nslookup example.com dig example.com # Network routing traceroute example.comSSL/TLS Issues:
# Check SSL certificate openssl s_client -connect example.com:443 curl -vI https://example.com -
Application-Specific Troubleshooting
Memory Issues:
### Out of Memory Errors **Java Applications:** ```bash # Check heap usage jstat -gc [PID] jmap -dump:format=b,file=heapdump.hprof [PID] # Analyze heap dump jhat heapdump.hprofNode.js Applications:
# Monitor memory usage node --inspect app.js # Use Chrome DevTools for memory profiling -
Security and Authentication Issues
### Authentication Failures **Symptoms:** - 401 Unauthorized responses - Token validation errors - Session timeout issues **Diagnostic Steps:** 1. Verify credentials and tokens 2. Check token expiration 3. Validate authentication service 4. Review CORS configuration **Common Solutions:** - Refresh authentication tokens - Clear browser cookies/cache - Verify CORS headers - Check API key permissions -
Deployment and Configuration Issues
### Deployment Failures **Container Issues:** ```bash # Check container status docker ps -a docker logs container-name # Check resource limits docker stats # Debug container docker exec -it container-name /bin/bashKubernetes Issues:
# Check pod status kubectl get pods kubectl describe pod pod-name kubectl logs pod-name # Check service connectivity kubectl get svc kubectl port-forward pod-name 8080:8080 -
Monitoring and Alerting Setup
- Configure health checks and monitoring
- Set up log aggregation and analysis
- Implement alerting for critical issues
- Create dashboards for system metrics
- Document monitoring thresholds
-
Escalation Procedures
## Escalation Matrix ### Severity Levels **Critical (P1):** System down, data loss - Immediate response required - Escalate to on-call engineer - Notify management within 30 minutes **High (P2):** Major functionality impaired - Response within 2 hours - Escalate to senior engineer - Provide hourly updates **Medium (P3):** Minor functionality issues - Response within 8 hours - Assign to appropriate team member - Provide daily updates -
Recovery Procedures
- Document system recovery steps
- Create data backup and restore procedures
- Establish rollback procedures for deployments
- Document disaster recovery processes
- Test recovery procedures regularly
-
Preventive Measures
- Implement monitoring and alerting
- Set up automated health checks
- Create deployment validation procedures
- Establish code review processes
- Document maintenance procedures
-
Knowledge Base Integration
- Link to relevant documentation
- Reference API documentation
- Include links to monitoring dashboards
- Connect to team communication channels
- Integrate with ticketing systems
-
Team Communication
## Communication Channels ### Immediate Response - Slack: #incidents channel - Phone: On-call rotation - Email: alerts@company.com ### Status Updates - Status page: status.company.com - Twitter: @company_status - Internal wiki: troubleshooting section -
Documentation Maintenance
- Regular review and updates
- Version control for troubleshooting guides
- Feedback collection from users
- Integration with incident post-mortems
- Continuous improvement processes
-
Self-Service Tools
- Create diagnostic scripts and tools
- Build automated recovery procedures
- Implement self-healing systems
- Provide user-friendly diagnostic interfaces
- Create chatbot integration for common issues
Advanced Troubleshooting Techniques:
Log Analysis:
# Search for specific errors grep -i "error" /var/log/app.log | tail -50 # Analyze log patterns awk '{print $1}' access.log | sort | uniq -c | sort -nr # Monitor logs in real-time tail -f /var/log/app.log | grep -i "exception"
Performance Profiling:
# System performance iostat -x 1 sar -u 1 10 vmstat 1 10 # Application profiling strace -p [PID] perf record -p [PID]
Remember to:
- Keep troubleshooting guides up-to-date
- Test all documented procedures regularly
- Collect feedback from users and improve guides
- Include screenshots and visual aids where helpful
- Make guides searchable and well-organized
References
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC framework context and documentation standards
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/research-before-decision.md — Research-first for root cause analysis
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/diagram-generation.md — System topology and architecture diagram standards
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference