AIWG troubleshooting-guide

Generate troubleshooting documentation

install

source · Clone the upstream repo

git clone https://github.com/jmagly/aiwg

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/troubleshooting-guide" ~/.claude/skills/jmagly-aiwg-troubleshooting-guide && rm -rf "$T"

manifest: .agents/skills/troubleshooting-guide/SKILL.md

source content

Troubleshooting Guide Generator Command

Instructions

Follow this systematic approach to create troubleshooting guides: $ARGUMENTS

System Overview and Architecture
- Document the system architecture and components
- Map out dependencies and integrations
- Identify critical paths and failure points
- Create system topology diagrams
- Document data flow and communication patterns

Common Issues Identification
- Collect historical support tickets and issues
- Interview team members about frequent problems
- Analyze error logs and monitoring data
- Review user feedback and complaints
- Identify patterns in system failures

Troubleshooting Framework
- Establish systematic diagnostic procedures
- Create problem isolation methodologies
- Document escalation paths and procedures
- Set up logging and monitoring checkpoints
- Define severity levels and response times

Diagnostic Tools and Commands

## Essential Diagnostic Commands

### System Health
```bash
# Check system resources
top                    # CPU and memory usage (interactive; press q to quit)
df -h                  # Disk space per filesystem
free -m                # Memory usage in MiB
netstat -tuln          # Listening TCP/UDP sockets (use `ss -tuln` where netstat is not installed)

# Application logs
tail -f /var/log/app.log
journalctl -u service-name -f

# Database connectivity
mysql -u user -p -e "SELECT 1"
psql -h host -U user -d db -c "SELECT 1"
```
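
The disk check above can be scripted so it reports only the filesystems that need attention. This is a minimal sketch, not part of the original guide: the function name `disks_over_threshold` and the 90% default are assumptions chosen for illustration, and mount points containing spaces are not handled.

```bash
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` formatted output on stdin so it can be run on saved output too.
# The 90% default is an assumed value, not a recommendation from this guide.
disks_over_threshold() {
  threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {                       # skip the header row
      use = $5
      sub(/%/, "", use)            # capacity column, e.g. "93%"
      if (use + 0 >= limit + 0) print $6 " " use "%"
    }'
}

# Usage: df -P | disks_over_threshold 90
```

Reading from stdin rather than calling `df` inside the function keeps the check usable on output captured from another host.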

Issue Categories and Solutions

Performance Issues:

### Slow Response Times

**Symptoms:**
- API responses taking longer than 5 seconds
- User interface freezing
- Database timeouts

**Diagnostic Steps:**
1. Check system resources (CPU, memory, disk)
2. Review application logs for errors
3. Analyze database query performance
4. Check network connectivity and latency

**Common Causes:**
- Database connection pool exhaustion
- Inefficient database queries
- Memory leaks in application
- Network bandwidth limitations

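**Solutions:**
- Restart application services to clear exhausted pools or leaked memory (short-term relief only)
- Optimize slow database queries, for example by adding missing indexes
- Increase the connection pool size if the database can accept more connections
- Scale infrastructure resources when CPU, memory, or bandwidth is the bottleneck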
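
To make the response-time symptom measurable, a single request can be timed and compared against a threshold. The sketch below is illustrative only: `classify_latency` is a hypothetical helper, the 5-second default simply mirrors the symptom listed above, and the URL in the usage comment is a placeholder.

```bash
# Classify a measured response time against a threshold (both in seconds).
# Prints "slow" when the time exceeds the threshold, otherwise "ok".
classify_latency() {
  awk -v t="$1" -v limit="${2:-5}" 'BEGIN {
    if (t + 0 > limit + 0) print "slow"; else print "ok"
  }'
}

# Measure one request with curl and classify it (placeholder URL):
#   t=$(curl -o /dev/null -s -w '%{time_total}' https://api.example.com/health)
#   classify_latency "$t"
```

A single sample is noisy, so repeating the measurement a few times before drawing conclusions is usually worthwhile.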

Error Code Documentation

## Error Code Reference

### HTTP Status Codes
- **500 Internal Server Error**
  - Check application logs for stack traces
  - Verify database connectivity
  - Check environment variables

- **404 Not Found**
  - Verify URL routing configuration
  - Check if resources exist
  - Review API endpoint documentation

- **503 Service Unavailable**
  - Check service health status
  - Verify load balancer configuration
  - Check for maintenance mode
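
Before working through the checks for a given status code, it can help to see how often each code actually occurs. The following sketch assumes an access log in the common or combined log format, where the status code is the ninth whitespace-separated field; `status_counts` is a hypothetical name.

```bash
# Count responses per HTTP status code in an access log.
# Assumes the common/combined log format (status code in field 9).
status_counts() {
  awk '{ counts[$9]++ } END { for (code in counts) print code, counts[code] }' "$1" | sort
}

# Usage: status_counts access.log
```

If the log uses a custom format, the field number needs adjusting to match.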

Environment-Specific Issues
- Document development environment problems
- Address staging/testing environment issues
- Cover production-specific troubleshooting
- Include local development setup problems

Database Troubleshooting

### Database Connection Issues

**Symptoms:**
- "Connection refused" errors
- "Too many connections" errors
- Slow query performance

**Diagnostic Commands:**
```sql
-- MySQL / MariaDB syntax

-- Check active connections
SHOW PROCESSLIST;

-- Check database size
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS 'DB Size in MB'
FROM information_schema.tables
GROUP BY table_schema;

-- Check whether the slow query log is enabled
SHOW VARIABLES LIKE 'slow_query_log';
```
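
"Too many connections" errors appear when the server reaches its `max_connections` limit, so it is useful to know how close current usage is to that limit. This is a rough sketch assuming MySQL; `connection_usage_pct` is a hypothetical helper, and the usage comment assumes the `mysql` client can log in non-interactively.

```bash
# Report connection usage as a whole-number percentage of the configured maximum.
# Inputs: Threads_connected (SHOW STATUS) and max_connections (SHOW VARIABLES).
connection_usage_pct() {
  current="$1"
  max="$2"
  [ "$max" -gt 0 ] || { echo "max_connections must be positive" >&2; return 1; }
  echo $(( current * 100 / max ))
}

# Usage:
#   current=$(mysql -N -B -e "SHOW STATUS LIKE 'Threads_connected'" | awk '{print $2}')
#   max=$(mysql -N -B -e "SHOW VARIABLES LIKE 'max_connections'" | awk '{print $2}')
#   connection_usage_pct "$current" "$max"
```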

Network and Connectivity Issues

### Network Troubleshooting

**Basic Connectivity:**
```bash
# Test basic connectivity
ping -c 4 example.com
telnet host port
curl -v https://api.example.com/health

# DNS resolution
nslookup example.com
dig example.com

# Network routing
traceroute example.com
```

**SSL/TLS Issues:**

```bash
# Check SSL certificate (-servername sends SNI; </dev/null closes the session after the handshake)
openssl s_client -connect example.com:443 -servername example.com </dev/null
curl -vI https://example.com
```

Application-Specific Troubleshooting

Memory Issues:

### Out of Memory Errors

**Java Applications:**
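```bash
# Check heap usage
jstat -gc [PID]
jmap -dump:format=b,file=heapdump.hprof [PID]

# Analyze heap dump
# jhat was removed in JDK 9; on newer JDKs open the dump in a heap analyzer
# such as Eclipse MAT or VisualVM instead.
jhat heapdump.hprof
```

**Node.js Applications:**

```bash
# Start the app with the inspector enabled
node --inspect app.js
# Then attach Chrome DevTools (chrome://inspect) for memory profiling
```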
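
Regardless of runtime, a steadily growing resident set size is a quick first signal of a leak before reaching for a profiler. The sketch below is an assumption-laden helper rather than part of the original guide: `rss_kb` is a hypothetical name, it reads `/proc` on Linux and falls back to `ps` elsewhere, and the PID in the usage comment is made up.

```bash
# Print the resident set size (RSS) of a process in kilobytes.
# Prefers /proc (Linux); falls back to ps where /proc is unavailable.
rss_kb() {
  pid="$1"
  if [ -r "/proc/$pid/status" ]; then
    awk '/^VmRSS:/ {print $2}' "/proc/$pid/status"
  else
    ps -o rss= -p "$pid" | tr -d ' '
  fi
}

# Sample every 5 seconds to see whether memory keeps climbing (12345 is a placeholder PID):
#   while sleep 5; do echo "$(date +%T) $(rss_kb 12345)"; done
```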

Security and Authentication Issues

### Authentication Failures

**Symptoms:**
- 401 Unauthorized responses
- Token validation errors
- Session timeout issues

**Diagnostic Steps:**
1. Verify credentials and tokens
2. Check token expiration
3. Validate authentication service
4. Review CORS configuration

**Common Solutions:**
- Refresh authentication tokens
- Clear browser cookies/cache
- Verify CORS headers
- Check API key permissions
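
The "check token expiration" step can be done from the command line when the token is a JWT. That is an assumption: the guide does not say which token format is in use. The helpers below (`jwt_exp`, `jwt_is_expired`, both hypothetical names) decode the payload without verifying the signature, so they are suitable for diagnosis only.

```bash
# Print the "exp" claim (Unix seconds) from a JWT without verifying it.
# Assumes a JWT with a numeric exp claim; for diagnosis only.
jwt_exp() {
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 can decode the payload.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeeds (exit status 0) when the token has already expired.
jwt_is_expired() {
  exp=$(jwt_exp "$1")
  [ -n "$exp" ] && [ "$(date +%s)" -ge "$exp" ]
}

# Usage: jwt_is_expired "$TOKEN" && echo "token expired, refresh it"
```

Avoid pasting production tokens into shared terminals or logs while debugging.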

Deployment and Configuration Issues

### Deployment Failures

**Container Issues:**
```bash
# Check container status
docker ps -a
docker logs container-name

# Check live resource usage against limits
docker stats

# Debug container (use /bin/sh if the image has no bash)
docker exec -it container-name /bin/bash
```

**Kubernetes Issues:**

```bash
# Check pod status
kubectl get pods
kubectl describe pod pod-name
kubectl logs pod-name

# Check service connectivity
kubectl get svc
kubectl port-forward pod-name 8080:8080
```
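
In a namespace with many pods, filtering the status table down to the unhealthy ones saves scrolling. A minimal sketch, assuming the default `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE); `unhealthy_pods` is a hypothetical helper name.

```bash
# List pods that are neither Running nor Completed, with their status.
# Reads the default `kubectl get pods` table on stdin.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 " " $3 }'
}

# Usage: kubectl get pods | unhealthy_pods
```

Each pod it prints is a candidate for `kubectl describe pod` and `kubectl logs`.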

Monitoring and Alerting Setup
- Configure health checks and monitoring
- Set up log aggregation and analysis
- Implement alerting for critical issues
- Create dashboards for system metrics
- Document monitoring thresholds

Escalation Procedures

## Escalation Matrix

### Severity Levels

**Critical (P1):** System down, data loss
- Immediate response required
- Escalate to on-call engineer
- Notify management within 30 minutes

**High (P2):** Major functionality impaired
- Response within 2 hours
- Escalate to senior engineer
- Provide hourly updates

**Medium (P3):** Minor functionality issues
- Response within 8 hours
- Assign to appropriate team member
- Provide daily updates
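
If alerting or ticketing scripts need the matrix, it can be encoded as a small lookup so the values live in one place. This sketch copies the response targets from the matrix above; the function name `escalation_for` and the pipe-separated output format are assumptions made for illustration.

```bash
# Print "response target|escalation contact|update cadence" for a severity level,
# using the values from the escalation matrix above.
escalation_for() {
  case "$1" in
    P1) echo "immediate|on-call engineer|notify management within 30 minutes" ;;
    P2) echo "2 hours|senior engineer|hourly updates" ;;
    P3) echo "8 hours|appropriate team member|daily updates" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Usage: escalation_for P2 | cut -d'|' -f1    # prints the response target
```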

Recovery Procedures
- Document system recovery steps
- Create data backup and restore procedures
- Establish rollback procedures for deployments
- Document disaster recovery processes
- Test recovery procedures regularly

Preventive Measures
- Implement monitoring and alerting
- Set up automated health checks
- Create deployment validation procedures
- Establish code review processes
- Document maintenance procedures

Knowledge Base Integration
- Link to relevant documentation
- Reference API documentation
- Include links to monitoring dashboards
- Connect to team communication channels
- Integrate with ticketing systems

Team Communication

## Communication Channels

### Immediate Response
- Slack: #incidents channel
- Phone: On-call rotation
- Email: alerts@company.com

### Status Updates
- Status page: status.company.com
- Twitter: @company_status
- Internal wiki: troubleshooting section

Documentation Maintenance
- Regular review and updates
- Version control for troubleshooting guides
- Feedback collection from users
- Integration with incident post-mortems
- Continuous improvement processes

Self-Service Tools
- Create diagnostic scripts and tools
- Build automated recovery procedures
- Implement self-healing systems
- Provide user-friendly diagnostic interfaces
- Create chatbot integration for common issues
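
As one example of a diagnostic script, the sketch below gathers a basic system snapshot into a single file that can be attached to a ticket. The function name `collect_diagnostics`, the report file name, and the choice of commands are assumptions; a real script would add application-specific checks.

```bash
# Collect a basic diagnostic snapshot into one timestamped text file.
# Commands that fail or are missing on the host are noted and skipped.
collect_diagnostics() {
  out_dir="${1:-.}"
  report="$out_dir/diagnostics-$(date +%Y%m%d-%H%M%S).txt"
  for cmd in "uname -a" "uptime" "df -h" "free -m"; do
    echo "== $cmd =="
    $cmd 2>&1 || echo "(command failed or not available)"
    echo
  done > "$report"
  echo "$report"   # print the report path so callers can pick it up
}

# Usage: collect_diagnostics /tmp
```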

Advanced Troubleshooting Techniques:

Log Analysis:

```bash
# Search for specific errors
grep -i "error" /var/log/app.log | tail -n 50

# Count requests per client (first field of the access log)
awk '{print $1}' access.log | sort | uniq -c | sort -nr

# Monitor logs in real-time
tail -f /var/log/app.log | grep -i "exception"
```
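
Raw `grep` output often hides the fact that one error repeats with different IDs or timestamps. A rough sketch of one way to group such lines: every run of digits is replaced with `N` before counting, which is a heuristic and will also merge messages that differ only in a meaningful number. `top_errors` is a hypothetical name.

```bash
# Group error lines by message, ignoring numeric detail such as IDs and
# timestamps, and print the most frequent groups first.
# $1 = log file, $2 = how many groups to show (default 10).
top_errors() {
  grep -i "error" "$1" |
    sed 's/[0-9][0-9]*/N/g' |
    sort | uniq -c | sort -nr | head -n "${2:-10}"
}

# Usage: top_errors /var/log/app.log 5
```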

Performance Profiling:

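```bash
# System performance
iostat -x 1            # Extended per-device I/O statistics, every second
sar -u 1 10            # CPU utilization, 10 samples at 1-second intervals
vmstat 1 10            # Memory, swap, and CPU summary, 10 samples

# Application profiling
strace -p [PID]        # Trace system calls of a running process
perf record -p [PID]   # Sample the process until Ctrl-C; inspect with `perf report`
```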

Remember to:

- Keep troubleshooting guides up-to-date
- Test all documented procedures regularly
- Collect feedback from users and improve guides
- Include screenshots and visual aids where helpful
- Make guides searchable and well-organized

References

@$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC framework context and documentation standards
@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/research-before-decision.md — Research-first for root cause analysis
@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/diagram-generation.md — System topology and architecture diagram standards
@$AIWG_ROOT/docs/cli-reference.md — CLI reference