# Debug Buttercup

Source: `plugins/debug-buttercup/skills/debug-buttercup/SKILL.md` in https://github.com/trailofbits/skills. To install the skill locally:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/trailofbits/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/debug-buttercup/skills/debug-buttercup" ~/.claude/skills/trailofbits-skills-debug-buttercup && rm -rf "$T"
```
## When to Use

- Pods in the `crs` namespace are in CrashLoopBackOff, OOMKilled, or restarting
- Multiple services restart simultaneously (cascade failure)
- Redis is unresponsive or showing AOF warnings
- Queues are growing but tasks are not progressing
- Nodes show DiskPressure, MemoryPressure, or PID pressure
- Build-bot cannot reach the Docker daemon (DinD failures)
- Scheduler is stuck and not advancing task state
- Health check probes are failing unexpectedly
- Deployed Helm values don't match actual pod configuration
## When NOT to Use
- Deploying or upgrading Buttercup (use Helm and deployment guides)
- Debugging issues outside the `crs` Kubernetes namespace
- Performance tuning that doesn't involve a failure symptom
## Namespace and Services

All pods run in namespace `crs`. Key services:
| Layer | Services |
|---|---|
| Infra | redis, dind, litellm, registry-cache |
| Orchestration | scheduler, task-server, task-downloader, scratch-cleaner |
| Fuzzing | build-bot, fuzzer-bot, coverage-bot, tracer-bot, merger-bot |
| Analysis | patcher, seed-gen, program-model, pov-reproducer |
| Interface | competition-api, ui |
## Triage Workflow
Always start with triage. Run these three commands first:
```sh
# 1. Pod status - look for restarts, CrashLoopBackOff, OOMKilled
kubectl get pods -n crs -o wide

# 2. Events - the timeline of what went wrong
kubectl get events -n crs --sort-by='.lastTimestamp'

# 3. Warnings only - filter the noise
kubectl get events -n crs --field-selector type=Warning --sort-by='.lastTimestamp'
```
Then narrow down:
```sh
# Why did a specific pod restart? Check Last State Reason (OOMKilled, Error, Completed)
kubectl describe pod -n crs <pod-name> | grep -A8 'Last State:'

# Check actual resource limits vs intended
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'

# Crashed container's logs (--previous = the container that died)
kubectl logs -n crs <pod-name> --previous --tail=200

# Current logs
kubectl logs -n crs <pod-name> --tail=200
```
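To cut straight to the problem pods, you can filter captured `kubectl get pods` output by restart count. A minimal sketch -- the heredoc stands in for live cluster output, and the pod names are made up:

```sh
# Save the pod listing once, then slice it however you like:
#   kubectl get pods -n crs > pods.txt
# Fabricated sample stands in for a live cluster here:
cat > pods.txt <<'EOF'
NAME                    READY   STATUS             RESTARTS   AGE
redis-0                 1/1     Running            0          3d
fuzzer-bot-7f9c-abcde   0/1     CrashLoopBackOff   12         3d
build-bot-5d8b-fghij    1/1     Running            2          3d
EOF

# Column 4 is RESTARTS; skip the header and keep pods above a threshold.
awk -v t=5 'NR > 1 && $4+0 > t { print $1, "restarts=" $4, "status=" $3 }' pods.txt
```

On newer kubectl versions the RESTARTS column can read `12 (5m ago)`; `$4+0` coerces it to the leading number either way.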
## Historical vs Ongoing Issues
High restart counts don't necessarily mean an issue is ongoing -- restarts accumulate over a pod's lifetime. Always distinguish:

- `--tail` shows the end of the log buffer, which may contain old messages. Use `--since=300s` to confirm issues are actively happening now.
- `--timestamps` on log output helps correlate events across services.
- Check `Last State` timestamps in `describe pod` output to see when the most recent crash actually occurred.
## Cascade Detection
When many pods restart around the same time, check for a shared-dependency failure before investigating individual pods. The most common cascade: Redis goes down -> every service gets `ConnectionError`/`ConnectionRefusedError` -> mass restarts. Look for the same error across multiple `--previous` logs -- if they all say `redis.exceptions.ConnectionError`, debug Redis, not the individual services.
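One way to automate that check: save each pod's `--previous` logs to a directory, then count how many share the Redis error. A sketch -- the sample log files below are fabricated, and the commented loop shows how you might capture the real ones:

```sh
# Capture last-crash logs per pod, e.g.:
#   mkdir -p logs
#   for p in $(kubectl get pods -n crs -o name); do
#     kubectl logs -n crs "${p#pod/}" --previous --tail=200 > "logs/${p#pod/}.log" 2>/dev/null
#   done
# Fabricated samples stand in here:
mkdir -p logs
printf 'redis.exceptions.ConnectionError: Error 111\n' > logs/fuzzer-bot-a.log
printf 'redis.exceptions.ConnectionError: Error 111\n' > logs/build-bot-b.log
printf 'compile step killed (OOM)\n'                   > logs/patcher-c.log

# If most crashed pods show the same shared-dependency error, debug that
# dependency (here: Redis), not each pod.
hits=$(grep -l 'redis.exceptions.ConnectionError' logs/*.log | wc -l)
total=$(ls logs/*.log | wc -l)
echo "$hits of $total crashed pods hit the Redis connection error"
```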
## Log Analysis
```sh
# All replicas of a service at once
kubectl logs -n crs -l app=fuzzer-bot --tail=100 --prefix

# Stream live
kubectl logs -n crs -l app.kubernetes.io/name=redis -f

# Collect all logs to disk (existing script)
bash deployment/collect-logs.sh
```
## Resource Pressure
```sh
# Per-pod CPU/memory
kubectl top pods -n crs

# Node-level
kubectl top nodes

# Node conditions (disk pressure, memory pressure, PID pressure)
kubectl describe node <node> | grep -A5 Conditions

# Disk usage inside a pod
kubectl exec -n crs <pod> -- df -h

# What's eating disk
kubectl exec -n crs <pod> -- sh -c 'du -sh /corpus/* 2>/dev/null'
kubectl exec -n crs <pod> -- sh -c 'du -sh /scratch/* 2>/dev/null'
```
## Redis Debugging
Redis is the backbone. When it goes down, everything cascades.
```sh
# Redis pod status
kubectl get pods -n crs -l app.kubernetes.io/name=redis

# Redis logs (AOF warnings, OOM, connection issues)
kubectl logs -n crs -l app.kubernetes.io/name=redis --tail=200

# Connect to Redis CLI
kubectl exec -n crs <redis-pod> -- redis-cli

# Inside redis-cli: key diagnostics
INFO memory       # used_memory_human, maxmemory
INFO persistence  # aof_enabled, aof_last_bgrewrite_status, aof_delayed_fsync
INFO clients      # connected_clients, blocked_clients
INFO stats        # total_connections_received, rejected_connections
CLIENT LIST       # see who's connected
DBSIZE            # total keys

# AOF configuration
CONFIG GET appendonly   # is AOF enabled?
CONFIG GET appendfsync  # fsync policy: everysec, always, or no

# What is /data mounted on? (disk vs tmpfs matters for AOF performance)
kubectl exec -n crs <redis-pod> -- mount | grep /data
kubectl exec -n crs <redis-pod> -- du -sh /data/
```
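As a quick memory-pressure check, you can parse `INFO memory` output instead of eyeballing it. A sketch with made-up numbers in place of real `redis-cli` output (note that `INFO` lines end in CRLF, hence the `tr -d '\r'`):

```sh
# Capture the section once, e.g.:
#   kubectl exec -n crs <redis-pod> -- redis-cli INFO memory > info.txt
# Made-up values stand in here:
cat > info.txt <<'EOF'
# Memory
used_memory:900000000
maxmemory:1073741824
EOF

used=$(grep '^used_memory:' info.txt | cut -d: -f2 | tr -d '\r')
max=$(grep '^maxmemory:' info.txt | cut -d: -f2 | tr -d '\r')
# Warn when usage crosses 80% of maxmemory.
if [ "$max" -gt 0 ] && [ $((used * 100 / max)) -ge 80 ]; then
  echo "WARNING: Redis at $((used * 100 / max))% of maxmemory -- expect evictions or OOM"
fi
```

`maxmemory:0` means unlimited, in which case the container's memory limit is what matters instead.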
## Queue Inspection
Buttercup uses Redis streams with consumer groups. Queue names:
| Queue | Stream Key |
|---|---|
| Build | fuzzer_build_queue |
| Build Output | fuzzer_build_output_queue |
| Crash | fuzzer_crash_queue |
| Confirmed Vulns | confirmed_vulnerabilities_queue |
| Download Tasks | orchestrator_download_tasks_queue |
| Ready Tasks | tasks_ready_queue |
| Patches | patches_queue |
| Index | index_queue |
| Index Output | index_output_queue |
| Traced Vulns | traced_vulnerabilities_queue |
| POV Requests | pov_reproducer_requests_queue |
| POV Responses | pov_reproducer_responses_queue |
| Delete Task | orchestrator_delete_task_queue |
```sh
# Check stream length (pending messages)
kubectl exec -n crs <redis-pod> -- redis-cli XLEN fuzzer_build_queue

# Check consumer group lag
kubectl exec -n crs <redis-pod> -- redis-cli XINFO GROUPS fuzzer_build_queue

# Check pending messages per consumer
kubectl exec -n crs <redis-pod> -- redis-cli XPENDING fuzzer_build_queue build_bot_consumers - + 10

# Task registry size
kubectl exec -n crs <redis-pod> -- redis-cli HLEN tasks_registry

# Task state counts
kubectl exec -n crs <redis-pod> -- redis-cli SCARD cancelled_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD succeeded_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD errored_tasks
```
Consumer groups: `build_bot_consumers`, `orchestrator_group`, `patcher_group`, `index_group`, `tracer_bot_group`.
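Queue length alone doesn't tell you whether consumers are keeping up; compare two `XLEN` snapshots taken some interval apart. A sketch with made-up snapshot values:

```sh
# Take two snapshots a minute apart, e.g.:
#   kubectl exec -n crs <redis-pod> -- redis-cli XLEN fuzzer_build_queue
# Made-up values stand in for the snapshots:
len_t0=1400
len_t1=1650

delta=$((len_t1 - len_t0))
if [ "$delta" -gt 0 ]; then
  echo "growing (+$delta) -- consumers not keeping up; check the group's workers"
elif [ "$delta" -lt 0 ]; then
  echo "draining ($delta) -- backlog is being worked off"
else
  echo "flat -- check XPENDING for messages stuck on a dead consumer"
fi
```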
## Health Checks
Pods write timestamps to `/tmp/health_check_alive`. The liveness probe checks file freshness.
```sh
# Check health file freshness
kubectl exec -n crs <pod> -- stat /tmp/health_check_alive
kubectl exec -n crs <pod> -- cat /tmp/health_check_alive
```
If a pod is restart-looping, the health check file is likely going stale because the main process is blocked (e.g. waiting on Redis, stuck on I/O).
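The staleness logic itself is just "mtime versus now". A local simulation of the check, with a temp file aged ten minutes standing in for `/tmp/health_check_alive` (uses GNU `touch -d` and `stat -c`):

```sh
# In the pod you'd read the epoch mtime with:
#   kubectl exec -n crs <pod> -- stat -c %Y /tmp/health_check_alive
# Simulate locally with a file whose mtime is 600s in the past:
f=$(mktemp)
touch -d "@$(( $(date +%s) - 600 ))" "$f"

mtime=$(stat -c %Y "$f")
age=$(( $(date +%s) - mtime ))
if [ "$age" -gt 120 ]; then
  echo "health file stale: last touched ${age}s ago -- main loop is likely blocked"
fi
```

The 120s threshold here is illustrative; the real liveness threshold is whatever the probe in the pod spec defines.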
## Telemetry (OpenTelemetry / Signoz)
All services export traces and metrics via OpenTelemetry. If Signoz is deployed (`global.signoz.deployed: true`), use its UI for distributed tracing across services.
```sh
# Check if OTEL is configured
kubectl exec -n crs <pod> -- env | grep OTEL

# Verify Signoz pods are running (if deployed)
kubectl get pods -n platform -l app.kubernetes.io/name=signoz
```
Traces are especially useful for diagnosing slow task processing, identifying which service in a pipeline is the bottleneck, and correlating events across the scheduler -> build-bot -> fuzzer-bot chain.
## Volume and Storage
```sh
# PVC status
kubectl get pvc -n crs

# Check if corpus tmpfs is mounted, its size, and backing type
kubectl exec -n crs <pod> -- mount | grep corpus_tmpfs
kubectl exec -n crs <pod> -- df -h /corpus_tmpfs 2>/dev/null

# Check if CORPUS_TMPFS_PATH is set
kubectl exec -n crs <pod> -- env | grep CORPUS

# Full disk layout - what's on real disk vs tmpfs
kubectl exec -n crs <pod> -- df -h
```
`CORPUS_TMPFS_PATH` is set when `global.volumes.corpusTmpfs.enabled: true`. This affects fuzzer-bot, coverage-bot, seed-gen, and merger-bot.
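Since tmpfs is RAM-backed, a full tmpfs mount is memory pressure, not disk pressure. To spot RAM-backed mounts at a glance, filter a captured `df` listing. A sketch over fabricated `df -h` output:

```sh
# Capture once, e.g.:
#   kubectl exec -n crs <pod> -- df -h > df.txt
# Fabricated sample stands in here:
cat > df.txt <<'EOF'
Filesystem      Size  Used Avail Use% Mounted on
overlay         100G   60G   40G  60% /
tmpfs            16G   15G    1G  94% /corpus_tmpfs
tmpfs            64M     0   64M   0% /dev
EOF

# Anything whose filesystem is "tmpfs" lives in RAM, not on disk.
awk '$1 == "tmpfs" { print $6 " is RAM-backed, " $5 " used" }' df.txt
```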
## Deployment Config Verification
When behavior doesn't match expectations, verify Helm values actually took effect:
```sh
# Check a pod's actual resource limits
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'

# Check a pod's actual volume definitions
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.volumes}'
```
Typos in the Helm values template (e.g. wrong key names) silently fall back to chart defaults. If deployed resources don't match the values template, check for key-name mismatches.
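A lightweight way to catch such silent fallbacks is to compare the deployed value against what you intended. A sketch using a fabricated resources JSON in place of the real `jsonpath` output; the `sed` extraction is a shortcut, and `jq` would be more robust if available:

```sh
# Capture the deployed resources, e.g.:
#   kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}' > res.json
# Fabricated output stands in here:
echo '{"limits":{"cpu":"2","memory":"2Gi"},"requests":{"cpu":"1","memory":"1Gi"}}' > res.json

intended="4Gi"  # what your values file says it should be
actual=$(sed -n 's/.*"limits":{[^}]*"memory":"\([^"]*\)".*/\1/p' res.json)
if [ "$actual" != "$intended" ]; then
  echo "MISMATCH: deployed memory limit is $actual, intended $intended -- check values key names"
fi
```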
## Service-Specific Debugging
For detailed per-service symptoms, root causes, and fixes, see `references/failure-patterns.md`.
Quick reference:
- DinD: `kubectl logs -n crs -l app=dind --tail=100` -- look for docker daemon crashes, storage driver errors
- Build-bot: check build queue depth, DinD connectivity, OOM during compilation
- Fuzzer-bot: corpus disk usage, CPU throttling, crash queue backlog
- Patcher: LiteLLM connectivity, LLM timeout, patch queue depth
- Scheduler: the central brain -- `kubectl logs -n crs -l app=scheduler --tail=-1 --prefix | grep "WAIT_PATCH_PASS\|ERROR\|SUBMIT"`
## Diagnostic Script
Run the automated triage snapshot:
```sh
bash {baseDir}/scripts/diagnose.sh
```
Pass `--full` to also dump recent logs from all pods:
```sh
bash {baseDir}/scripts/diagnose.sh --full
```
This collects pod status, events, resource usage, Redis health, and queue depths in one pass.