diagnose-with-must-gather (claude-skill-registry)
Collect and analyze OADP diagnostic data using oadp-must-gather to troubleshoot backup, restore, and deployment issues.
Install the full registry:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Or copy only this skill into your local skills directory:

```bash
T=$(mktemp -d) && \
  git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && \
  mkdir -p ~/.claude/skills && \
  cp -r "$T/skills/data/diagnose-with-must-gather" ~/.claude/skills/majiayu000-claude-skill-registry-diagnose-with-must-gather && \
  rm -rf "$T"
```

Source: skills/data/diagnose-with-must-gather/SKILL.md

Diagnose with OADP Must-Gather
This skill provides comprehensive guidance for using the oadp-must-gather tool to collect diagnostic information and troubleshoot OADP operator issues, backup failures, restore problems, and deployment configuration errors.
When to Use This Skill
- Backup Failures: When backups are failing and you need comprehensive diagnostic data
- Restore Issues: Troubleshooting restore operations that aren't completing
- OADP Deployment Problems: Investigating OADP operator or component deployment failures
- Performance Issues: Diagnosing slow backup/restore operations
- Before Opening Support Cases: Collecting required diagnostic data for Red Hat support
- Configuration Validation: Verifying DPA and OADP configuration correctness
- Integration Problems: Debugging issues with cloud providers, CSI, or storage backends
What This Skill Does
- Runs must-gather Collection: Executes oadp-must-gather to capture diagnostic data
- Analyzes Collected Data: Examines logs, resources, and configurations
- Identifies Common Issues: Detects known failure patterns
- Provides Remediation: Suggests fixes for identified problems
- Extracts Key Information: Highlights critical errors and warnings
- Generates Reports: Summarizes findings for troubleshooting or support cases
How to Use
Basic Usage
- "Run oadp must-gather to diagnose backup failure"
- "Analyze must-gather output for OADP deployment issues"
Targeted Diagnosis
- "Collect must-gather for a specific backup problem"
- "Use must-gather to troubleshoot a BSL unavailable issue"
Prerequisites
- OpenShift cluster with oadp-must-gather image available
- oc CLI installed and logged in with cluster-admin or appropriate permissions
- Sufficient local disk space for must-gather output (typically 100MB-1GB)
- Knowledge of the issue symptoms and timeline
- Network connectivity to cluster API
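
A quick pre-flight check before collecting (a minimal sketch; the permission check assumes cluster-admin is the target role):

```bash
# Confirm you are logged in and who you are
oc whoami

# Rough cluster-admin check (must-gather generally needs broad read access)
oc auth can-i '*' '*' --all-namespaces

# Confirm local disk space for the output directory
df -h .
```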
Examples
Example 1: Basic Must-Gather Collection
User: "Run OADP must-gather to diagnose issues"
Skill Actions:
- Run must-gather:

```bash
# Use the latest oadp-must-gather image
oc adm must-gather --image=quay.io/konveyor/oadp-must-gather:latest

# Output will be in ./must-gather.local.<timestamp> directory
```

- Wait for collection:

```bash
# Must-gather will:
# - Create a collection pod
# - Gather OADP operator logs
# - Collect Velero deployment logs
# - Capture BackupStorageLocation status
# - Extract backup/restore resources
# - Gather DPA configuration
# - Collect node-agent (Kopia) or Restic daemonset logs
# - Save CSI snapshot information
# - Capture cluster version and platform details
# Collection typically takes 2-5 minutes
```

- Verify collection completed:

```bash
# Check output directory
ls -lh must-gather.local.*/

# Typical structure:
# must-gather.local.XXXX/
# ├── cluster-scoped-resources/
# ├── namespaces/
# │   ├── openshift-adp/
# │   │   ├── pods/
# │   │   ├── deployments/
# │   │   ├── daemonsets/
# │   │   └── logs/
# │   └── ...
# └── timestamp
```

- Initial analysis:

```bash
# Use the most recently created must-gather directory
MUST_GATHER_DIR=$(ls -td must-gather.local.* | head -1)

# Check OADP operator logs for errors
grep -i error "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/*/logs/*.log

# Check Velero logs
grep -i "error\|failed\|unable" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/velero-*/logs/*.log
```
Success Indicators:
- ✅ Must-gather pod completed successfully
- ✅ Output directory created with timestamp
- ✅ Logs and resources collected from openshift-adp namespace
- ✅ No collection errors in must-gather output
Example 2: Diagnose Backup Failure
User: "My backup is failing, use must-gather to find the problem"
Skill Actions:
- Collect must-gather with backup context:

```bash
# Run must-gather
oc adm must-gather --image=quay.io/konveyor/oadp-must-gather:latest

# Note the backup name that's failing
FAILED_BACKUP="myapp-backup-20250315"
```

- Analyze backup-specific data:

```bash
MUST_GATHER_DIR=$(ls -td must-gather.local.* | head -1)

# Find backup resource
find "$MUST_GATHER_DIR" -name "backups.yaml" -exec grep -A20 "name: $FAILED_BACKUP" {} \;

# Check backup phase and errors
find "$MUST_GATHER_DIR" -name "backups.yaml" -exec grep -B2 -A10 "$FAILED_BACKUP" {} \; | \
  grep -E "phase:|failureReason:|errors:"
```

- Check Velero logs for the backup:

```bash
# Search Velero logs for the specific backup
grep -r "$FAILED_BACKUP" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/velero-*/logs/ | grep -i "error\|failed"

# Look for BSL connectivity issues
grep -r "BackupStorageLocation" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/velero-*/logs/ | grep -i "unavailable\|error"
```

- Check node-agent/Kopia logs (if using file-level backups):

```bash
# For OADP 1.4+ (Kopia)
find "$MUST_GATHER_DIR" -path "*/node-agent*/logs/*" -exec grep -l "$FAILED_BACKUP" {} \; | \
  xargs grep -i "error\|failed"

# For legacy Restic (OADP 1.3)
find "$MUST_GATHER_DIR" -path "*/restic*/logs/*" -exec grep -l "$FAILED_BACKUP" {} \; | \
  xargs grep -i "error\|failed"
```

- Examine BSL status:

```bash
# Check BackupStorageLocation configuration and status
find "$MUST_GATHER_DIR" -name "backupstoragelocations.yaml" -exec cat {} \;

# Look for BSL phase
find "$MUST_GATHER_DIR" -name "backupstoragelocations.yaml" -exec grep -A5 "status:" {} \; | grep "phase:"
```

- Check DPA configuration:

```bash
# Examine DataProtectionApplication
find "$MUST_GATHER_DIR" -name "dataprotectionapplications.yaml" -exec cat {} \;

# Verify Kopia/Restic enabled
find "$MUST_GATHER_DIR" -name "dataprotectionapplications.yaml" -exec grep -A3 "kopia:\|restic:" {} \;
```
Common Findings and Fixes:
Issue: BSL Unavailable

```yaml
# Finding in backupstoragelocations.yaml:
status:
  phase: Unavailable
  message: "NoSuchBucket: The specified bucket does not exist"
```
Fix: Verify S3 bucket exists, check credentials, confirm region setting
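
To confirm on a live cluster (a hedged sketch; `<bsl-name>`, the bucket, and the region are placeholders for your environment):

```bash
# Check BSL status live on the cluster
oc get backupstoragelocation -n openshift-adp
oc describe backupstoragelocation <bsl-name> -n openshift-adp

# For AWS-compatible S3, verify the bucket exists and is reachable
aws s3 ls s3://<bucket-name> --region <region>
```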
Issue: Volume Snapshot Timeout
Error in Velero logs:

```
error backing up item: error executing backup action: rpc error: code = DeadlineExceeded
```
Fix: Increase timeout in DPA or check CSI driver health
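
One timeout knob, shown as a hedged sketch: upstream Velero exposes a per-backup CSI snapshot timeout (field availability depends on your OADP/Velero version; the backup name and namespace are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: myapp-backup
  namespace: openshift-adp
spec:
  includedNamespaces:
  - myapp
  csiSnapshotTimeout: 30m  # upstream default is 10m; raise for slow storage backends
```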
Issue: Kopia/Restic Pod Not Running

```bash
# Finding: no node-agent pods
$ find must-gather... -path "*/node-agent*/logs/*"
(no results)
```
Fix: Enable Kopia in DPA configuration
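
A minimal DPA sketch enabling the node-agent with Kopia (field names per recent OADP releases; verify against your operator version):

```yaml
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa
  namespace: openshift-adp
spec:
  configuration:
    nodeAgent:
      enable: true         # deploys the node-agent daemonset
      uploaderType: kopia  # file-level backup engine
```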
Example 3: Diagnose Restore Failure
User: "Restore is stuck, help me troubleshoot with must-gather"
Skill Actions:
- Collect must-gather:

```bash
oc adm must-gather --image=quay.io/konveyor/oadp-must-gather:latest
FAILED_RESTORE="myapp-restore-20250315"
```

- Analyze restore status:

```bash
MUST_GATHER_DIR=$(ls -td must-gather.local.* | head -1)

# Find restore resource
find "$MUST_GATHER_DIR" -name "restores.yaml" -exec grep -A30 "name: $FAILED_RESTORE" {} \;

# Check restore phase
find "$MUST_GATHER_DIR" -name "restores.yaml" -exec grep -B2 -A15 "$FAILED_RESTORE" {} \; | \
  grep -E "phase:|warnings:|errors:|failureReason:"
```

- Check Velero restore logs:

```bash
# Search for restore-specific errors
grep -r "$FAILED_RESTORE" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/velero-*/logs/ | \
  grep -i "error\|warning\|failed"

# Look for resource restore issues
grep -r "error restoring" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/velero-*/logs/
```

- Check PVC restore status:

```bash
# Find PVC-related issues
grep -r "persistentvolumeclaims" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/velero-*/logs/ | \
  grep -i "error\|pending"

# Check VolumeSnapshot issues
find "$MUST_GATHER_DIR" -name "volumesnapshots.yaml" -exec cat {} \;
```

- Examine namespace mapping issues:

```bash
# Check if the namespace exists
find "$MUST_GATHER_DIR" -name "namespaces.yaml" -exec grep -A3 "metadata:" {} \;

# Look for namespace conflicts
grep -r "AlreadyExists" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/velero-*/logs/
```
Common Restore Issues:
Issue: StorageClass Not Available
```
Error: persistentvolumeclaims/my-pvc: error restoring: StorageClass "gp2" not found
```

Fix: Set spec.restorePVs: false on the restore, or map the missing storage class to an available one
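
Storage class mapping is done with a Velero plugin ConfigMap rather than in the restore spec itself; a hedged sketch mapping the missing class to an available one (class names are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: openshift-adp
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  gp2: gp3-csi  # restore PVs that requested "gp2" using "gp3-csi"
```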
Issue: Existing Resources Conflict
```
Error: namespaces/myapp already exists
```

Fix: Delete the existing namespace, or set existingResourcePolicy: update on the restore
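
A minimal restore sketch using that policy (names are illustrative; upstream Velero accepts `none` or `update` for this field):

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: myapp-restore
  namespace: openshift-adp
spec:
  backupName: myapp-backup
  existingResourcePolicy: update  # patch existing resources instead of skipping them
```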
Issue: VolumeSnapshot Not Ready
```
Warning: VolumeSnapshot "snap-123" is not ready
```
Fix: Check CSI driver status, verify snapshot completion
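
Live-cluster checks for the snapshot path (a sketch; the snapshot name and namespace come from the warning above and your workload):

```bash
# Verify the CSI driver and snapshot support are present
oc get csidrivers
oc get volumesnapshotclasses

# Inspect the stuck snapshot and its content object
oc describe volumesnapshot snap-123 -n myapp
oc get volumesnapshotcontents | grep snap-123
```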
Example 4: Diagnose OADP Deployment Issues
User: "OADP operator isn't deploying properly, collect must-gather"
Skill Actions:
- Run must-gather:

```bash
oc adm must-gather --image=quay.io/konveyor/oadp-must-gather:latest
```

- Check OADP operator status:

```bash
MUST_GATHER_DIR=$(ls -td must-gather.local.* | head -1)

# Check operator deployment
find "$MUST_GATHER_DIR" -name "deployments.yaml" -path "*/openshift-adp/*" -exec cat {} \;

# Check operator pod logs
find "$MUST_GATHER_DIR" -path "*/openshift-adp/pods/oadp-operator-*/logs/*" -exec cat {} \; | \
  grep -i "error\|failed\|unable"
```

- Examine DPA reconciliation:

```bash
# Check DPA status
find "$MUST_GATHER_DIR" -name "dataprotectionapplications.yaml" -exec grep -A20 "status:" {} \;

# Look for reconciliation errors
find "$MUST_GATHER_DIR" -path "*/oadp-operator-*/logs/*" -exec grep -i "reconcile.*error" {} \;
```

- Check Velero deployment:

```bash
# Verify Velero pods
find "$MUST_GATHER_DIR" -name "deployments.yaml" -exec grep -A10 "name: velero" {} \;

# Check for image pull errors
find "$MUST_GATHER_DIR" -path "*/openshift-adp/pods/*/pods.yaml" -exec grep -i "ImagePullBackOff\|ErrImagePull" {} \;
```

- Verify secrets and credentials:

```bash
# Check that cloud credentials exist (without showing values)
find "$MUST_GATHER_DIR" -name "secrets.yaml" -path "*/openshift-adp/*" -exec grep "name: cloud-credentials" {} \;

# Verify BSL credentials referenced
find "$MUST_GATHER_DIR" -name "dataprotectionapplications.yaml" -exec grep -A5 "credential:" {} \;
```
Common Deployment Issues:
Issue: Operator CrashLoopBackOff
Error in operator logs:

```
panic: runtime error: invalid memory address or nil pointer dereference
```
Fix: Check for misconfigured DPA, upgrade to latest operator version
Issue: Velero Not Deploying
DPA status shows:

```yaml
conditions:
- type: Reconciled
  status: "False"
  reason: Error
  message: "BSL not available"
```
Fix: Correct the BSL configuration and verify credentials
Issue: Image Pull Errors
Pod status:

```yaml
containerStatuses:
- image: quay.io/konveyor/velero:latest
  state:
    waiting:
      reason: ImagePullBackOff
```
Fix: Check network connectivity, verify image exists, check pull secrets
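
A couple of live checks for the pull path (hedged; the image reference is the one from the pod status above):

```bash
# Verify the image reference resolves from your workstation
oc image info quay.io/konveyor/velero:latest

# Confirm the cluster pull secret exists (don't print its contents into shared output)
oc get secret pull-secret -n openshift-config -o name
```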
Example 5: Analyze Must-Gather for Performance Issues
User: "Backups are very slow, analyze must-gather for performance bottlenecks"
Skill Actions:
- Collect must-gather with timing context:

```bash
# Note the current time, then run collection
date
oc adm must-gather --image=quay.io/konveyor/oadp-must-gather:latest
```

- Check resource allocation:

```bash
MUST_GATHER_DIR=$(ls -td must-gather.local.* | head -1)

# Check Velero pod resources
find "$MUST_GATHER_DIR" -name "deployments.yaml" -path "*/openshift-adp/*" -exec grep -A10 "resources:" {} \;

# Check node-agent/Kopia resource limits
find "$MUST_GATHER_DIR" -name "daemonsets.yaml" -exec grep -A10 "resources:" {} \;
```

- Examine DPA configuration for performance settings:

```bash
# Check parallel upload settings
find "$MUST_GATHER_DIR" -name "dataprotectionapplications.yaml" -exec grep -E "uploaderConfig|parallelFilesUpload" {} \;

# Check resource timeout settings
find "$MUST_GATHER_DIR" -name "dataprotectionapplications.yaml" -exec grep -i "timeout" {} \;
```

- Analyze backup size and duration:

```bash
# Get backup timing details
find "$MUST_GATHER_DIR" -name "backups.yaml" -exec grep -E "startTimestamp|completionTimestamp|progress" {} \;

# Calculate backup durations (manual inspection)
```

- Check for throttling or rate limiting:

```bash
# Look for S3 throttling
grep -r "RequestLimitExceeded\|SlowDown\|503" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/velero-*/logs/

# Check for CSI snapshot delays
grep -r "snapshot.*timeout\|snapshot.*slow" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/velero-*/logs/
```
Performance Tuning Recommendations:
Finding: Low Velero CPU/Memory

```yaml
# Current limits:
resources:
  limits:
    cpu: 500m
    memory: 512Mi
```
Recommendation: Increase to 1-2 CPU, 1-2Gi memory for large clusters
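
In OADP this is set through the DPA rather than by editing the Velero deployment directly; a hedged sketch using `podConfig.resourceAllocations` (verify the field against your OADP version):

```yaml
spec:
  configuration:
    velero:
      podConfig:
        resourceAllocations:
          requests:
            cpu: "1"
            memory: 1Gi
          limits:
            cpu: "2"
            memory: 2Gi
```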
Finding: Sequential File Uploads (no parallel upload configuration in the DPA)

Recommendation: Add to the DPA:

```yaml
spec:
  configuration:
    velero:
      args:
      - "--uploader-parallel-files-upload=4"
```
Finding: Large Data Volumes
Backup includes 500GB+ of data via file-level backup
Recommendation: Use CSI snapshots instead of file-level backup for large volumes
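
A hedged sketch of the relevant DPA plugin list; `csi` enables CSI snapshot support, and the cloud plugin shown (`aws`) is illustrative:

```yaml
spec:
  configuration:
    velero:
      defaultPlugins:
      - openshift
      - csi   # take CSI snapshots instead of streaming file data
      - aws   # replace with the plugin for your provider
```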
Must-Gather Analysis Checklist
After collecting must-gather data, systematically review:
- OADP Operator Logs: Check for controller reconciliation errors
- Velero Deployment Status: Verify pods running and ready
- DPA Configuration: Validate all settings (BSL, VSL, plugins, Kopia/Restic)
- BSL Status: Confirm all BackupStorageLocations are Available
- VSL Configuration: Verify VolumeSnapshotLocations configured correctly
- Backup Resources: Examine failed/stuck backups for errors
- Restore Resources: Check restore status and warnings
- Node-Agent/Kopia Logs: Look for file-level backup errors
- CSI Snapshot Status: Verify VolumeSnapshots completing
- Cluster Platform: Note OpenShift version and infrastructure type
- Network Connectivity: Check for BSL connection failures
- Resource Constraints: Verify adequate CPU/memory allocated
Common Error Patterns
Pattern 1: BSL Connectivity Issues
Symptoms in must-gather:
```
BackupStorageLocation phase: Unavailable
Velero logs: "error getting backup store"
```
Root Causes:
- Invalid credentials
- Wrong S3 endpoint or region
- Network policy blocking egress
- Bucket doesn't exist or wrong name
Diagnostic Commands:
```bash
# From must-gather output
find "$MUST_GATHER_DIR" -name "backupstoragelocations.yaml" -exec cat {} \;
grep -r "backup store" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/velero-*/logs/
```
Pattern 2: CSI Snapshot Failures
Symptoms in must-gather:
```
VolumeSnapshot status: Error
Backup logs: "error creating snapshot"
```
Root Causes:
- CSI driver not installed or not ready
- VolumeSnapshotClass missing or incorrect
- Storage backend doesn't support snapshots
- Snapshot quota exceeded
Diagnostic Commands:
find $MUST_GATHER_DIR -name "volumesnapshotclasses.yaml" -exec cat {} \; find $MUST_GATHER_DIR -name "volumesnapshots.yaml" -exec grep -A10 "status:" {} \;
Pattern 3: File-Level Backup Hangs
Symptoms in must-gather:
```
Backup phase: InProgress (stuck for hours)
Node-agent logs show no recent activity
```
Root Causes:
- Node-agent pod not running on backup source node
- Very large files causing timeouts
- Insufficient resources (CPU/memory)
- Network issues to BSL
Diagnostic Commands:
```bash
# Check node-agent pod distribution
find "$MUST_GATHER_DIR" -name "daemonsets.yaml" -exec grep -A5 "numberReady" {} \;

# Check for timeout errors
grep -r "timeout\|deadline exceeded" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/node-agent-*/logs/
```
Pattern 4: DPA Reconciliation Failures
Symptoms in must-gather:
```
DPA status: Not reconciled
Operator logs: "reconcile error"
```
Root Causes:
- Invalid DPA configuration
- Missing required fields
- Plugin compatibility issues
- Operator bug
Diagnostic Commands:
find $MUST_GATHER_DIR -name "dataprotectionapplications.yaml" -exec grep -A20 "status:" {} \; find $MUST_GATHER_DIR -path "*/oadp-operator-*/logs/*" -exec grep "reconcile" {} \; | grep -i error
Advanced Must-Gather Analysis
Extracting Specific Time Ranges
```bash
# Find logs from a specific time period
MUST_GATHER_DIR="must-gather.local.XXXX"

# Example: errors between 14:00 and 15:00 UTC
grep -r "2025-03-15T14:\|2025-03-15T15:" "$MUST_GATHER_DIR"/namespaces/openshift-adp/pods/*/logs/ | \
  grep -i error
```
Comparing DPA vs Actual Deployment
```bash
# Extract DPA desired configuration
find "$MUST_GATHER_DIR" -name "dataprotectionapplications.yaml" -exec cat {} \; > dpa-config.yaml

# Extract actual Velero deployment
find "$MUST_GATHER_DIR" -name "deployments.yaml" -path "*/openshift-adp/*" -exec cat {} \; > actual-deployment.yaml

# Compare (manual review)
diff -u dpa-config.yaml actual-deployment.yaml
```
Correlating Events Across Components
```bash
# Create a rough timeline of events
grep -r "timestamp\|time" "$MUST_GATHER_DIR"/namespaces/openshift-adp/ | sort
```
Best Practices
- Collect Immediately After Failure
  - Run must-gather as soon as the issue occurs
  - Logs may rotate, losing critical information
  - Capture state while the problem is still present
- Provide Context in Support Cases
  - Include symptoms and timeline
  - Note what changed recently
  - Specify OADP version and platform
  - Attach the entire must-gather archive
- Organize Multiple Collections

```bash
# Rename must-gather directories meaningfully
mv must-gather.local.12345678 must-gather-backup-failure-2025-03-15

# Keep collections for comparison
diff -r must-gather-before/ must-gather-after/
```
- Redact Sensitive Information Before Sharing

```bash
# Review credentials in collected data before any public sharing
# Note: Red Hat support needs the unredacted must-gather

# Find and review secrets
find must-gather.local.XXXX -name "secrets.yaml" -exec cat {} \;

# Consider: don't share must-gather publicly, only with Red Hat support
```
- Automate Analysis

```bash
# Create analysis script
cat << 'EOF' > analyze-oadp-must-gather.sh
#!/bin/bash
MUST_GATHER_DIR=$1

echo "=== OADP Operator Status ==="
find "$MUST_GATHER_DIR" -path "*/oadp-operator-*/logs/*" -type f | xargs grep -i "error\|failed" | head -20

echo -e "\n=== BSL Status ==="
find "$MUST_GATHER_DIR" -name "backupstoragelocations.yaml" -exec grep "phase:" {} \;

echo -e "\n=== Recent Backup Failures ==="
find "$MUST_GATHER_DIR" -name "backups.yaml" -exec grep -B2 "phase: Failed" {} \;

echo -e "\n=== Velero Errors ==="
find "$MUST_GATHER_DIR" -path "*/velero-*/logs/*" -type f | xargs grep -i "error" | head -20
EOF

chmod +x analyze-oadp-must-gather.sh
./analyze-oadp-must-gather.sh must-gather.local.XXXX
```
Troubleshooting Must-Gather Collection
Must-Gather Pod Fails
Symptoms: Must-gather pod errors or doesn't start
Diagnosis:
```bash
# Check must-gather pod status
oc get pods -A | grep must-gather

# View must-gather pod logs
oc logs -n openshift-must-gather-<random> must-gather-<id>
```
Common Fixes:
- Ensure sufficient permissions (cluster-admin or must-gather role)
- Check node resources availability
- Verify must-gather image accessibility
- Check for network policies blocking pod creation
Incomplete Data Collection
Symptoms: Must-gather completes but missing expected logs
Possible Causes:
- Pods were not running during collection
- Namespace permissions issues
- Collection timeout
Solution:
```bash
# Run with an extended timeout (oc adm must-gather accepts --timeout; default is 10m)
oc adm must-gather --timeout=30m --image=quay.io/konveyor/oadp-must-gather:latest
```
Integration with Support Cases
When opening a Red Hat support case:
- Collect must-gather using latest image
- Archive and compress:

```bash
tar czf oadp-must-gather-$(date +%Y%m%d).tar.gz must-gather.local.*/
```

- Attach to the case via the Red Hat Customer Portal
- Include:
- Description of issue
- Steps to reproduce
- OADP version
- OpenShift version
- Cloud provider/platform
- Timeline of issue
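
Commands that capture the version details above (hedged; assumes the default openshift-adp install namespace):

```bash
# OADP operator version
oc get csv -n openshift-adp

# OpenShift version
oc get clusterversion

# Platform / infrastructure type
oc get infrastructure cluster -o jsonpath='{.status.platformStatus.type}{"\n"}'
```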
Next Steps
After analyzing must-gather:
- Apply Fixes: Implement identified remediation steps
- Retest: Verify issue resolved
- Collect New Must-Gather: Confirm fix worked
- Update Documentation: Record solution for future reference
- Open Support Case: If issue persists or is unclear
Related Skills:
- diagnose-backup-issues - Additional backup troubleshooting techniques
- install-oadp - Proper OADP installation to avoid deployment issues
- create-backup - Backup creation best practices
Resources
- OpenShift Must-Gather Documentation: https://docs.openshift.com/container-platform/latest/support/gathering-cluster-data.html
- OADP Must-Gather GitHub: https://github.com/openshift/oadp-must-gather
- Red Hat Support: https://access.redhat.com/support
Version: 1.0
Last Updated: 2025-11-17
Compatibility: OADP 1.3+, OpenShift 4.12+

CRITICAL: Must-gather is the primary diagnostic tool for OADP issues. Use it early and often.