Claude-skill-registry diagnose-with-must-gather

Collect and analyze OADP diagnostic data using oadp-must-gather to troubleshoot backup, restore, and deployment issues.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/diagnose-with-must-gather" ~/.claude/skills/majiayu000-claude-skill-registry-diagnose-with-must-gather && rm -rf "$T"

manifest: skills/data/diagnose-with-must-gather/SKILL.md

Diagnose with OADP Must-Gather

This skill provides comprehensive guidance for using the oadp-must-gather tool to collect diagnostic information and troubleshoot OADP operator issues, backup failures, restore problems, and deployment configuration errors.

When to Use This Skill

Backup Failures: When backups are failing and you need comprehensive diagnostic data
Restore Issues: Troubleshooting restore operations that aren't completing
OADP Deployment Problems: Investigating OADP operator or component deployment failures
Performance Issues: Diagnosing slow backup/restore operations
Before Opening Support Cases: Collecting required diagnostic data for Red Hat support
Configuration Validation: Verifying DPA and OADP configuration correctness
Integration Problems: Debugging issues with cloud providers, CSI, or storage backends

What This Skill Does

Runs must-gather Collection: Executes oadp-must-gather to capture diagnostic data
Analyzes Collected Data: Examines logs, resources, and configurations
Identifies Common Issues: Detects known failure patterns
Provides Remediation: Suggests fixes for identified problems
Extracts Key Information: Highlights critical errors and warnings
Generates Reports: Summarizes findings for troubleshooting or support cases

How to Use

Basic Usage

Run oadp must-gather to diagnose backup failure

Analyze must-gather output for OADP deployment issues

Targeted Diagnosis

Collect must-gather for specific backup problem

Use must-gather to troubleshoot BSL unavailable issue

Prerequisites

OpenShift cluster with oadp-must-gather image available
oc CLI installed and logged in with cluster-admin or appropriate permissions
Sufficient local disk space for must-gather output (typically 100MB-1GB)
Knowledge of the issue symptoms and timeline
Network connectivity to cluster API

Examples

Example 1: Basic Must-Gather Collection

User: "Run OADP must-gather to diagnose issues"

Skill Actions:

Run must-gather:

# Use latest oadp-must-gather image
oc adm must-gather --image=quay.io/konveyor/oadp-must-gather:latest

# Output will be in ./must-gather.local.<timestamp> directory

Wait for collection:

# Must-gather will:
# - Create collection pod
# - Gather OADP operator logs
# - Collect Velero deployment logs
# - Capture BackupStorageLocation status
# - Extract backup/restore resources
# - Gather DPA configuration
# - Collect node-agent (Kopia) or Restic daemonset logs
# - Save CSI snapshot information
# - Capture cluster version and platform details

# Collection typically takes 2-5 minutes

Verify collection completed:

# Check output directory
ls -lh must-gather.local.*/

# Typical structure:
# must-gather.local.XXXX/
# ├── cluster-scoped-resources/
# ├── namespaces/
# │   ├── openshift-adp/
# │   │   ├── pods/
# │   │   ├── deployments/
# │   │   ├── daemonsets/
# │   │   └── logs/
# │   └── ...
# └── timestamp

Initial analysis:

# Pick the newest collection directory (-d lists the directory itself, not its contents)
MUST_GATHER_DIR="$(ls -td must-gather.local.*/ | head -1)"

# Check OADP operator logs for errors
grep -i error $MUST_GATHER_DIR/namespaces/openshift-adp/pods/*/logs/*.log

# Check Velero logs
grep -i "error\|failed\|unable" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/velero-*/logs/*.log
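The newest-directory lookup used in these steps can be wrapped in a small helper that fails loudly when no collection exists. This is a minimal sketch: the function name latest_must_gather is hypothetical and is not part of oc or oadp-must-gather, and it assumes collections sit directly under the given base directory.

```shell
# Hypothetical helper: print the most recently modified must-gather.local.* directory.
# Assumption: collections live directly under the base directory (default: current directory).
latest_must_gather() {
  local base="${1:-.}" newest
  # -d lists the directories themselves rather than their contents; -t sorts newest first.
  # "|| true" tolerates the no-match case so the explicit check below reports it.
  newest="$(ls -td "$base"/must-gather.local.*/ 2>/dev/null | head -n 1)" || true
  if [ -z "$newest" ]; then
    echo "no must-gather.local.* directory under $base" >&2
    return 1
  fi
  printf '%s\n' "${newest%/}"   # drop the trailing slash added by the glob
}
```

A typical call would assign the result to MUST_GATHER_DIR and stop if the helper returns non-zero.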

Success Indicators:

✅ Must-gather pod completed successfully
✅ Output directory created with timestamp
✅ Logs and resources collected from openshift-adp namespace
✅ No collection errors in must-gather output

Example 2: Diagnose Backup Failure

User: "My backup is failing, use must-gather to find the problem"

Skill Actions:

Collect must-gather with backup context:

# Run must-gather
oc adm must-gather --image=quay.io/konveyor/oadp-must-gather:latest

# Note the backup name that's failing
FAILED_BACKUP="myapp-backup-20250315"

Analyze backup-specific data:

# Pick the newest collection directory (-d lists the directory itself, not its contents)
MUST_GATHER_DIR="$(ls -td must-gather.local.*/ | head -1)"

# Find backup resource
find $MUST_GATHER_DIR -name "backups.yaml" -exec grep -A20 "name: $FAILED_BACKUP" {} \;

# Check backup phase and errors
find $MUST_GATHER_DIR -name "backups.yaml" -exec grep -B2 -A10 "$FAILED_BACKUP" {} \; | grep -E "phase:|failureReason:|errors:"
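The grep context windows above can miss the phase when the Backup spec is long. A rough alternative is to scan forward from the backup's name to the next phase line. This is only a sketch: backup_phase is a hypothetical helper, and it assumes each Backup in backups.yaml lists metadata.name before status.phase. If the named backup has no phase at all, it would wrongly report the next backup's phase.

```shell
# Hypothetical helper: print the status phase of one named Backup from a backups.yaml file.
# Assumption: within each item, "name: <backup>" appears before "phase: <value>".
backup_phase() {
  local file="$1" target="$2"
  awk -v target="$target" '
    $1 == "name:" && $2 == target { found = 1; next }   # reached the backup we want
    found && $1 == "phase:"        { print $2; exit }    # first phase after that name
  ' "$file"
}
```

For example, backup_phase path/to/backups.yaml "$FAILED_BACKUP" prints a single word such as Failed or PartiallyFailed.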

Check Velero logs for backup:

# Search Velero logs for the specific backup
grep -r "$FAILED_BACKUP" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/velero-*/logs/ | grep -i "error\|failed"

# Look for BSL connectivity issues
grep -r "BackupStorageLocation" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/velero-*/logs/ | grep -i "unavailable\|error"

Check node-agent/Kopia logs (if using file-level backups):

# For OADP 1.4+ (Kopia)
find $MUST_GATHER_DIR -path "*/node-agent*/logs/*" -exec grep -l "$FAILED_BACKUP" {} \; | \
  xargs -r grep -i "error\|failed"   # -r: skip grep when no log mentions the backup

# For legacy Restic (OADP 1.3)
find $MUST_GATHER_DIR -path "*/restic*/logs/*" -exec grep -l "$FAILED_BACKUP" {} \; | \
  xargs -r grep -i "error\|failed"   # -r: skip grep when no log mentions the backup

Examine BSL status:

# Check BackupStorageLocation configuration and status
find $MUST_GATHER_DIR -name "backupstoragelocations.yaml" -exec cat {} \;

# Look for BSL phase
find $MUST_GATHER_DIR -name "backupstoragelocations.yaml" -exec grep -A5 "status:" {} \; | grep "phase:"

Check DPA configuration:

# Examine DataProtectionApplication
find $MUST_GATHER_DIR -name "dataprotectionapplications.yaml" -exec cat {} \;

# Verify the node agent (Kopia or Restic uploader) is enabled
find $MUST_GATHER_DIR -name "dataprotectionapplications.yaml" -exec grep -A3 "nodeAgent:\|restic:" {} \;

Common Findings and Fixes:

Issue: BSL Unavailable

# Finding in backupstoragelocations.yaml:
status:
  phase: Unavailable
  message: "NoSuchBucket: The specified bucket does not exist"

Fix: Verify S3 bucket exists, check credentials, confirm region setting

Issue: Volume Snapshot Timeout

Error in Velero logs:
error backing up item: error executing backup action: rpc error: code = DeadlineExceeded

Fix: Increase timeout in DPA or check CSI driver health

Issue: Kopia/Restic Pod Not Running

# Finding: No node-agent pods
$ find must-gather... -path "*/node-agent*/logs/*"
(no results)

Fix: Enable the node agent in the DPA (spec.configuration.nodeAgent.enable: true, with uploaderType: kopia)
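The log signatures quoted in these findings can be counted in one pass to see which failure dominates. The sketch below is illustrative only: triage_velero_logs is a hypothetical name, and the signature list contains just the two error strings shown in this section, so extend it for your environment.

```shell
# Hypothetical helper: count lines containing each known failure signature under a directory.
# The signatures are the error strings quoted in this section, not an exhaustive list.
triage_velero_logs() {
  local dir="$1" sig count
  for sig in "NoSuchBucket" "DeadlineExceeded"; do
    # -F treats the signature as a fixed string; "|| true" keeps a zero-match grep from aborting
    count="$({ grep -rF -- "$sig" "$dir" 2>/dev/null || true; } | wc -l | tr -d ' ')"
    printf '%s %s\n' "$sig" "$count"
  done
}
```

Pointing it at "$MUST_GATHER_DIR/namespaces/openshift-adp" prints one line per signature with its match count.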

Example 3: Diagnose Restore Failure

User: "Restore is stuck, help me troubleshoot with must-gather"

Skill Actions:

Collect must-gather:

oc adm must-gather --image=quay.io/konveyor/oadp-must-gather:latest

FAILED_RESTORE="myapp-restore-20250315"

Analyze restore status:

# Pick the newest collection directory (-d lists the directory itself, not its contents)
MUST_GATHER_DIR="$(ls -td must-gather.local.*/ | head -1)"

# Find restore resource
find $MUST_GATHER_DIR -name "restores.yaml" -exec grep -A30 "name: $FAILED_RESTORE" {} \;

# Check restore phase
find $MUST_GATHER_DIR -name "restores.yaml" -exec grep -B2 -A15 "$FAILED_RESTORE" {} \; | \
  grep -E "phase:|warnings:|errors:|failureReason:"

Check Velero restore logs:

# Search for restore-specific errors
grep -r "$FAILED_RESTORE" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/velero-*/logs/ | \
  grep -i "error\|warning\|failed"

# Look for resource restore issues
grep -r "error restoring" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/velero-*/logs/

Check PVC restore status:

# Find PVC-related issues
grep -r "persistentvolumeclaims" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/velero-*/logs/ | \
  grep -i "error\|pending"

# Check VolumeSnapshot issues
find $MUST_GATHER_DIR -name "volumesnapshots.yaml" -exec cat {} \;

Examine namespace mapping issues:

# Check if namespace exists
find $MUST_GATHER_DIR -name "namespaces.yaml" -exec grep -A3 "metadata:" {} \;

# Look for namespace conflicts
grep -r "AlreadyExists" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/velero-*/logs/

Common Restore Issues:

Issue: StorageClass Not Available

Error: persistentvolumeclaims/my-pvc: error restoring: StorageClass "gp2" not found

Fix: Set spec.restorePVs: false on the Restore, or remap storage classes with Velero's change-storage-class ConfigMap

Issue: Existing Resources Conflict

Error: namespaces/myapp already exists

Fix: Delete existing namespace or use existingResourcePolicy: update in the restore spec

Issue: VolumeSnapshot Not Ready

Warning: VolumeSnapshot "snap-123" is not ready

Fix: Check CSI driver status, verify snapshot completion

Example 4: Diagnose OADP Deployment Issues

User: "OADP operator isn't deploying properly, collect must-gather"

Skill Actions:

Run must-gather:

oc adm must-gather --image=quay.io/konveyor/oadp-must-gather:latest

Check OADP operator status:

# Pick the newest collection directory (-d lists the directory itself, not its contents)
MUST_GATHER_DIR="$(ls -td must-gather.local.*/ | head -1)"

# Check operator deployment
find $MUST_GATHER_DIR -name "deployments.yaml" -path "*/openshift-adp/*" -exec cat {} \;

# Check operator pod status
find $MUST_GATHER_DIR -path "*/openshift-adp/pods/oadp-operator-*/logs/*" -exec cat {} \; | \
  grep -i "error\|failed\|unable"

Examine DPA reconciliation:

# Check DPA status
find $MUST_GATHER_DIR -name "dataprotectionapplications.yaml" -exec grep -A20 "status:" {} \;

# Look for reconciliation errors
find $MUST_GATHER_DIR -path "*/oadp-operator-*/logs/*" -exec grep -i "reconcile.*error" {} \;

Check Velero deployment:

# Verify Velero pods
find $MUST_GATHER_DIR -name "deployments.yaml" -exec grep -A10 "name: velero" {} \;

# Check for image pull errors
find $MUST_GATHER_DIR -path "*/openshift-adp/pods/*/pods.yaml" -exec grep -i "ImagePullBackOff\|ErrImagePull" {} \;

Verify secrets and credentials:

# Check if cloud credentials exist (without showing values)
find $MUST_GATHER_DIR -name "secrets.yaml" -path "*/openshift-adp/*" -exec grep "name: cloud-credentials" {} \;

# Verify BSL credentials referenced
find $MUST_GATHER_DIR -name "dataprotectionapplications.yaml" -exec grep -A5 "credential:" {} \;
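The two checks above can be combined: take every secret name the DPA references under a credential: key and confirm that a secret with that name was collected. This is a sketch under assumptions: check_dpa_credentials is a hypothetical helper, the DPA is assumed to use the usual credential block with a name field, and the collection is assumed to include a secrets.yaml as described above.

```shell
# Hypothetical helper: report whether each secret referenced by the DPA appears in secrets.yaml.
# Assumption: references look like "credential:" followed by a "name: <secret>" line.
check_dpa_credentials() {
  local dpa="$1" secrets="$2" name status=0
  for name in $(awk '
      $1 == "credential:"      { in_cred = 1; next }
      in_cred && $1 == "name:" { print $2; in_cred = 0 }
    ' "$dpa" | sort -u); do
    # Match a whole "name: <secret>" line so that prefixes of longer names do not count
    if grep -q -E "^[[:space:]]*name:[[:space:]]*${name}[[:space:]]*\$" "$secrets"; then
      echo "$name present"
    else
      echo "$name MISSING"
      status=1
    fi
  done
  return "$status"
}
```

The function returns non-zero when any referenced secret is missing, so it can gate later analysis steps.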

Common Deployment Issues:

Issue: Operator CrashLoopBackOff

Error in operator logs:
panic: runtime error: invalid memory address or nil pointer dereference

Fix: Check for misconfigured DPA, upgrade to latest operator version

Issue: Velero Not Deploying

DPA status shows:
conditions:
  - type: Reconciled
    status: "False"
    reason: Error
    message: "BSL not available"

Fix: Fix BSL configuration, verify credentials

Issue: Image Pull Errors

Pod status:
containerStatuses:
  - image: quay.io/konveyor/velero:latest
    state:
      waiting:
        reason: ImagePullBackOff

Fix: Check network connectivity, verify image exists, check pull secrets

Example 5: Analyze Must-Gather for Performance Issues

User: "Backups are very slow, analyze must-gather for performance bottlenecks"

Skill Actions:

Collect must-gather with timing context:

# Note current time and run collection
date
oc adm must-gather --image=quay.io/konveyor/oadp-must-gather:latest

Check resource allocation:

# Pick the newest collection directory (-d lists the directory itself, not its contents)
MUST_GATHER_DIR="$(ls -td must-gather.local.*/ | head -1)"

# Check Velero pod resources
find $MUST_GATHER_DIR -name "deployments.yaml" -path "*/openshift-adp/*" -exec grep -A10 "resources:" {} \;

# Check node-agent/Kopia resource limits
find $MUST_GATHER_DIR -name "daemonsets.yaml" -exec grep -A10 "resources:" {} \;

Examine DPA configuration for performance settings:

# Check parallel upload settings
find $MUST_GATHER_DIR -name "dataprotectionapplications.yaml" -exec grep -E "uploaderConfig|parallelFilesUpload" {} \;

# Check resource timeout settings
find $MUST_GATHER_DIR -name "dataprotectionapplications.yaml" -exec grep -i "timeout" {} \;

Analyze backup size and duration:

# Get backup details
find $MUST_GATHER_DIR -name "backups.yaml" -exec grep -E "startTimestamp|completionTimestamp|progress" {} \;

# Calculate backup durations (manual inspection)
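The manual duration calculation can be scripted once you have a startTimestamp and completionTimestamp pair. The sketch below assumes GNU date (the -d option) and RFC 3339 timestamps such as 2025-03-15T14:00:00Z. The function name backup_duration_seconds is hypothetical.

```shell
# Hypothetical helper: seconds elapsed between two RFC 3339 timestamps.
# Assumption: GNU date is available; BSD/macOS date uses different options.
backup_duration_seconds() {
  local start end
  start="$(date -u -d "$1" +%s)" || return 1   # startTimestamp as epoch seconds
  end="$(date -u -d "$2" +%s)" || return 1     # completionTimestamp as epoch seconds
  echo $(( end - start ))
}
```

For instance, a backup that started at 2025-03-15T14:00:00Z and completed at 2025-03-15T14:05:30Z yields 330.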

Check for throttling or rate limiting:

# Look for S3 throttling
grep -r "RequestLimitExceeded\|SlowDown\|503" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/velero-*/logs/

# Check for CSI snapshot delays
grep -r "snapshot.*timeout\|snapshot.*slow" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/velero-*/logs/

Performance Tuning Recommendations:

Finding: Low Velero CPU/Memory

# Current limits:
resources:
  limits:
    cpu: 500m
    memory: 512Mi

Recommendation: Increase to 1-2 CPU, 1-2Gi memory for large clusters

Finding: Sequential File Uploads

# Missing parallel upload config

Recommendation: Add to DPA:

spec:
  configuration:
    velero:
      args:
        - "--uploader-parallel-files-upload=4"

Finding: Large Data Volumes

Backup includes 500GB+ of data via file-level backup

Recommendation: Use CSI snapshots instead of file-level backup for large volumes

Must-Gather Analysis Checklist

After collecting must-gather data, systematically review:

Common Error Patterns

Pattern 1: BSL Connectivity Issues

Symptoms in must-gather:

BackupStorageLocation phase: Unavailable
Velero logs: "error getting backup store"

Root Causes:

Invalid credentials
Wrong S3 endpoint or region
Network policy blocking egress
Bucket doesn't exist or wrong name

Diagnostic Commands:

# From must-gather output
find $MUST_GATHER_DIR -name "backupstoragelocations.yaml" -exec cat {} \;
grep -r "backup store" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/velero-*/logs/

Pattern 2: CSI Snapshot Failures

Symptoms in must-gather:

VolumeSnapshot status: Error
Backup logs: "error creating snapshot"

Root Causes:

CSI driver not installed or not ready
VolumeSnapshotClass missing or incorrect
Storage backend doesn't support snapshots
Snapshot quota exceeded

Diagnostic Commands:

find $MUST_GATHER_DIR -name "volumesnapshotclasses.yaml" -exec cat {} \;
find $MUST_GATHER_DIR -name "volumesnapshots.yaml" -exec grep -A10 "status:" {} \;

Pattern 3: File-Level Backup Hangs

Symptoms in must-gather:

Backup phase: InProgress (stuck for hours)
Node-agent logs show no recent activity

Root Causes:

Node-agent pod not running on backup source node
Very large files causing timeouts
Insufficient resources (CPU/memory)
Network issues to BSL

Diagnostic Commands:

# Check node-agent pod distribution
find $MUST_GATHER_DIR -name "daemonsets.yaml" -exec grep -A5 "numberReady" {} \;

# Check for timeout errors
grep -r "timeout\|deadline exceeded" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/node-agent-*/logs/
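To turn the numberReady check into a direct answer, the desired and ready counts can be compared. This is a minimal sketch: daemonset_ready_gap is a hypothetical helper, and it assumes the file holds the status of a single DaemonSet. With several DaemonSets in one file it reports only the last one.

```shell
# Hypothetical helper: compare desiredNumberScheduled and numberReady from a daemonsets.yaml file.
# Assumption: the file describes one DaemonSet; with several, the last status wins.
daemonset_ready_gap() {
  awk '
    $1 == "desiredNumberScheduled:" { desired = $2 }
    $1 == "numberReady:"            { ready = $2 }
    END {
      if (desired == "") { print "no daemonset status found"; exit 1 }
      printf "desired=%d ready=%d missing=%d\n", desired, ready, desired - ready
    }
  ' "$1"
}
```

A non-zero missing value suggests that some nodes have no ready node-agent pod, which matches the first root cause listed above.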

Pattern 4: DPA Reconciliation Failures

Symptoms in must-gather:

DPA status: Not reconciled
Operator logs: "reconcile error"

Root Causes:

Invalid DPA configuration
Missing required fields
Plugin compatibility issues
Operator bug

Diagnostic Commands:

find $MUST_GATHER_DIR -name "dataprotectionapplications.yaml" -exec grep -A20 "status:" {} \;
find $MUST_GATHER_DIR -path "*/oadp-operator-*/logs/*" -exec grep "reconcile" {} \; | grep -i error

Advanced Must-Gather Analysis

Extracting Specific Time Ranges

# Find logs from specific time period
MUST_GATHER_DIR="must-gather.local.XXXX"

# Example: Errors between 14:00 and 15:00 UTC (timestamps starting with T14:)
grep -r "2025-03-15T14:" $MUST_GATHER_DIR/namespaces/openshift-adp/pods/*/logs/ | \
  grep -i error
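Matching on an hour prefix only works for whole-hour windows. For arbitrary ranges, one option is to compare timestamps as strings, since RFC 3339 timestamps in the same zone sort lexicographically. The sketch below assumes each relevant log line contains a timestamp of the form YYYY-MM-DDTHH:MM:SS. The function name logs_between is hypothetical.

```shell
# Hypothetical filter: keep stdin lines whose first timestamp falls in [from, to).
# Assumption: timestamps look like 2025-03-15T14:30:00 and share one time zone.
logs_between() {
  awk -v from="$1" -v to="$2" '
    match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/) {
      ts = substr($0, RSTART, RLENGTH)        # first timestamp on the line
      if (ts >= from && ts < to) print        # string comparison; the end bound is exclusive
    }
  '
}
```

It reads from standard input, so it can sit after grep -r or cat in a pipeline, for example with the bounds 2025-03-15T14:10:00 and 2025-03-15T14:45:00.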

Comparing DPA vs Actual Deployment

# Extract DPA desired configuration
find $MUST_GATHER_DIR -name "dataprotectionapplications.yaml" -exec cat {} \; > dpa-config.yaml

# Extract actual Velero deployment
find $MUST_GATHER_DIR -name "deployments.yaml" -path "*/openshift-adp/*" -exec cat {} \; > actual-deployment.yaml

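# Compare (manual review): the files hold different resource kinds, so most lines will differ;
# use the output to eyeball settings such as images, resources, and args
diff -u dpa-config.yaml actual-deployment.yaml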

Correlating Events Across Components

# Create timeline of events
grep -r "timestamp\|time" $MUST_GATHER_DIR/namespaces/openshift-adp/ | sort
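A plain sort orders the grep output by file path rather than by time. One way to get a cross-component timeline is to prefix each line with its own timestamp and sort on that prefix. This is a sketch: build_timeline is a hypothetical helper, and it assumes the collected logs carry timestamps of the form YYYY-MM-DDTHH:MM:SS in a single time zone.

```shell
# Hypothetical helper: merge all timestamped lines under a directory into one time-ordered stream.
# Output format: <timestamp><TAB><file>:<original line>
build_timeline() {
  grep -rH -E '[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}' "$1" |
    awk '{
      if (match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/))
        print substr($0, RSTART, RLENGTH) "\t" $0
    }' |
    LC_ALL=C sort -s -k1,1   # stable sort on the timestamp column only
}
```

Piping the result through grep -i error narrows the timeline to failures while keeping the ordering.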

Best Practices

Collect Immediately After Failure
- Run must-gather as soon as issue occurs
- Logs may rotate, losing critical information
- Capture state while problem is still present

Provide Context in Support Cases
- Include symptoms and timeline
- Note what changed recently
- Specify OADP version and platform
- Attach entire must-gather archive

Organize Multiple Collections

# Rename must-gather directories meaningfully
mv must-gather.local.12345678 must-gather-backup-failure-2025-03-15

# Keep collections for comparison
diff -r must-gather-before/ must-gather-after/

Redact Sensitive Information Before Sharing

# Remove credentials from collected data (for public sharing)
# Note: Red Hat support needs unredacted must-gather

# Find and review secrets
find must-gather.local.XXXX -name "secrets.yaml" -exec cat {} \;

# Consider: Don't share must-gather publicly, only with Red Hat support
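If a collection does have to leave the support channel, secret values can be masked in a copy first. The sketch below is a rough illustration, not a complete redaction tool: redact_secret_data is a hypothetical helper that only masks entries nested under data: or stringData: keys. Credentials embedded in logs or in other resources are not touched.

```shell
# Hypothetical helper: print a secrets.yaml with values under data:/stringData: masked.
# Assumption: values are single-line entries indented deeper than their data: key.
redact_secret_data() {
  awk '
    function indent(s) { sub(/[^ \t].*$/, "", s); return length(s) }
    /^[ \t]*(data|stringData):[ \t]*$/ {
      depth = indent($0); in_data = 1; print; next       # entering a data block
    }
    in_data {
      if (indent($0) > depth && index($0, ":") > 0) {
        sub(/:.*/, ": <REDACTED>"); print; next           # mask the value, keep the key
      }
      in_data = 0                                         # left the data block
    }
    { print }
  ' "$1"
}
```

Write the output to a new file and share that copy. Keep the original collection intact for Red Hat support.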

Automate Analysis

# Create analysis script
cat << 'EOF' > analyze-oadp-must-gather.sh
#!/bin/bash
# Usage: ./analyze-oadp-must-gather.sh <must-gather-directory>
MUST_GATHER_DIR="${1:?usage: $0 <must-gather-directory>}"

echo "=== OADP Operator Status ==="
# -exec ... {} + avoids running grep with no files when nothing matches
find "$MUST_GATHER_DIR" -path "*/oadp-operator-*/logs/*" -type f -exec grep -Hi "error\|failed" {} + | head -20

echo -e "\n=== BSL Status ==="
find "$MUST_GATHER_DIR" -name "backupstoragelocations.yaml" -exec grep "phase:" {} \;

echo -e "\n=== Recent Backup Failures ==="
find "$MUST_GATHER_DIR" -name "backups.yaml" -exec grep -B2 "phase: Failed" {} \;

echo -e "\n=== Velero Errors ==="
find "$MUST_GATHER_DIR" -path "*/velero-*/logs/*" -type f -exec grep -Hi "error" {} + | head -20
EOF

chmod +x analyze-oadp-must-gather.sh
./analyze-oadp-must-gather.sh must-gather.local.XXXX

Troubleshooting Must-Gather Collection

Must-Gather Pod Fails

Symptoms: Must-gather pod errors or doesn't start

Diagnosis:

# Check must-gather pod status
oc get pods -A | grep must-gather

# View must-gather pod logs
oc logs -n openshift-must-gather-<random> must-gather-<id>

Common Fixes:

Ensure sufficient permissions (cluster-admin or must-gather role)
Check node resources availability
Verify must-gather image accessibility
Check for network policies blocking pod creation

Incomplete Data Collection

Symptoms: Must-gather completes but missing expected logs

Possible Causes:

Pods were not running during collection
Namespace permissions issues
Collection timeout

Solution:

# Run with an extended timeout (--timeout is an option of oc adm must-gather itself)
oc adm must-gather --image=quay.io/konveyor/oadp-must-gather:latest --timeout=30m

Integration with Support Cases

When opening a Red Hat support case:

Collect must-gather using latest image

Archive and compress:

tar czf oadp-must-gather-$(date +%Y%m%d).tar.gz must-gather.local.*/

Attach to case via Red Hat Customer Portal
Include:
- Description of issue
- Steps to reproduce
- OADP version
- OpenShift version
- Cloud provider/platform
- Timeline of issue

Next Steps

After analyzing must-gather:

Apply Fixes: Implement identified remediation steps
Retest: Verify issue resolved
Collect New Must-Gather: Confirm fix worked
Update Documentation: Record solution for future reference
Open Support Case: If issue persists or is unclear

Related Skills:

diagnose-backup-issues - Additional backup troubleshooting techniques
install-oadp - Proper OADP installation to avoid deployment issues
create-backup - Backup creation best practices

Resources

OpenShift Must-Gather Documentation: https://docs.openshift.com/container-platform/latest/support/gathering-cluster-data.html
OADP Must-Gather GitHub: https://github.com/openshift/oadp-must-gather
Red Hat Support: https://access.redhat.com/support

Version: 1.0
Last Updated: 2025-11-17
Compatibility: OADP 1.3+, OpenShift 4.12+

CRITICAL: Must-gather is the primary diagnostic tool for OADP issues - use it early and often