# debug-stuck-eval
Debug stuck Hawk/Inspect AI evaluations. Use when user mentions "stuck eval", "eval not progressing", "eval hanging", "samples not completing", "eval set frozen", "runner stuck", "500 errors in eval", "retry loop", "eval timeout", or asks why an evaluation isn't finishing.
## Install

Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code: install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/debug-stuck-eval" ~/.claude/skills/majiayu000-claude-skill-registry-debug-stuck-eval && rm -rf "$T"
```

Manifest: `skills/data/debug-stuck-eval/SKILL.md`
## Quick Checklist

- Verify auth: `hawk auth access-token > /dev/null || echo "Run 'hawk login' first"`
- Get the eval-set-id from the user
- Check status: `hawk status <eval-set-id>` - JSON report with pod state, logs, metrics
- View logs: `hawk logs <eval-set-id>`, or `hawk logs -f` for follow mode
- List samples: `hawk list samples <eval-set-id>` - see completion status
- Look for error patterns (see below)
- Test API directly if logs show retries without clear errors
## Error Patterns

| Log Pattern | Meaning | Resolution |
|---|---|---|
| Repeated "Retrying request" with no visible error | OpenAI SDK hiding actual error | Test API directly with curl to see real error |
| … | API issue | Download buffer, find failing request, test through middleman AND directly to provider |
| … | Token/context limit exceeded | Check message count and model context window |
| … | Sandbox pod was killed and restarted | No fix needed—sample errored out, Inspect will retry |
| Empty output, … | API returned malformed response | Restart eval (buffer resumes) |
| OOMKilled in pod status | Memory exhaustion | Increase pod memory limits |
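
A quick triage aid: a minimal sketch that counts occurrences of these patterns in saved log output (assumes you have piped `hawk logs <eval-set-id>` to a file; only "Retrying request" and "OOMKilled" come from this skill's notes, so adjust the pattern strings to match your actual logs):

```python
import sys
from collections import Counter

# Pattern strings are illustrative -- "Retrying request" and "OOMKilled" come
# from the table and notes above; exact log text may differ in your setup.
PATTERNS = {
    "Retrying request": "SDK hiding the real error -- test the API directly with curl",
    "OOMKilled": "memory exhaustion -- increase pod memory limits",
}

def scan(path: str) -> None:
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            for pattern in PATTERNS:
                if pattern in line:
                    counts[pattern] += 1
    for pattern, n in counts.most_common():
        print(f"{n:6d}x {pattern!r}: {PATTERNS[pattern]}")

if __name__ == "__main__":
    scan(sys.argv[1])  # e.g. python scan_logs.py eval.log
```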
## Key Techniques
- SDK hides errors by design - The OpenAI SDK hides transient errors during retry backoff. "Retrying request" logs don't show the actual error. Use curl to see real errors.
- FAIL-OK patterns are fine - Alternating failures and successes mean the eval IS progressing. Only worry about consistent FAIL-FAIL-FAIL patterns.
- Use S3 for buffer access - Download `.buffer/` from S3 rather than accessing the runner pod directly.
- Read .eval files with inspect_ai - Use `from inspect_ai.log import read_eval_log` instead of manually extracting zips (see the sketch below).
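
For example, a minimal sketch of the `read_eval_log` approach (the `.eval` path is a placeholder; the fields used here follow inspect_ai's `EvalLog` / `EvalSample` models):

```python
from inspect_ai.log import read_eval_log

# Placeholder path -- point this at a downloaded .eval file.
log = read_eval_log("logs/task.eval")

print("status:", log.status)  # e.g. started | success | error | cancelled
for sample in log.samples or []:
    if sample.error is not None:
        print(f"sample {sample.id} (epoch {sample.epoch}): {sample.error.message}")
```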
## Test API Directly
Middleman is the auth proxy. If middleman fails but direct provider calls work, it's a middleman issue.
```bash
TOKEN=$(hawk auth access-token)

# Test through middleman
curl --max-time 300 -X POST https://middleman.internal.metr.org/anthropic/v1/messages \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 100, "messages": [{"role": "user", "content": "Say hello"}]}'

# Test OpenAI-compatible
curl --max-time 300 -X POST https://middleman.internal.metr.org/openai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 100}'
```
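
The same two checks can be scripted when you need to repeat them; a minimal sketch using the endpoints and payloads above (assumes `hawk` is on PATH and the third-party `requests` package is installed):

```python
import subprocess

import requests

# Same token the curl examples use.
token = subprocess.run(
    ["hawk", "auth", "access-token"], capture_output=True, text=True, check=True
).stdout.strip()

headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
checks = {
    "anthropic": (
        "https://middleman.internal.metr.org/anthropic/v1/messages",
        {"model": "claude-sonnet-4-20250514", "max_tokens": 100,
         "messages": [{"role": "user", "content": "Say hello"}]},
    ),
    "openai": (
        "https://middleman.internal.metr.org/openai/v1/chat/completions",
        {"model": "gpt-4o", "max_tokens": 100,
         "messages": [{"role": "user", "content": "Say hello"}]},
    ),
}

for name, (url, payload) in checks.items():
    resp = requests.post(url, headers=headers, json=payload, timeout=300)
    # A non-2xx status with a readable body here is the error the SDK hides.
    print(name, resp.status_code, resp.text[:200])
```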
## Recovery

```bash
# Delete stuck eval and restart
hawk delete <eval-set-id>
hawk eval-set <config.yaml>
```

The sample buffer in S3 allows Inspect to resume from where it left off (unless you use `--no-resume`).
## HTTP Retry Count

Task progress logs include "HTTP retries: X". A high retry count indicates API instability even while tasks complete.

Severity: retry count × wait time per retry ≈ total stuck duration. E.g., 45 retries × 1800s = 22+ hours stuck.
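
In code, the same back-of-envelope estimate (the 1800 s wait mirrors the example above; substitute whatever per-retry wait your logs show):

```python
def stuck_hours(retries: int, wait_seconds: float) -> float:
    """Worst-case wait, assuming a flat wait time per retry."""
    return retries * wait_seconds / 3600

print(stuck_hours(45, 1800))  # 22.5 -- the "22+ hours" in the example
```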
## More Details

See docs/debugging-stuck-evals.md for:

- Sample buffer SQL queries (a generic starting point is sketched after this list)
- Detailed API testing examples
- Escalation checklist
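
Until you have those docs handy, a generic starting point for poking at the buffer (an assumption-laden sketch: it presumes the downloaded `.buffer/` contains SQLite files, the filename is a placeholder, and no schema is assumed):

```python
import sqlite3

# Placeholder path -- point at a database file downloaded from .buffer/ in S3.
con = sqlite3.connect("buffer/samples.db")
tables = [name for (name,) in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
)]
for name in tables:
    (count,) = con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()
    print(f"{name}: {count} rows")
con.close()
```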
## References
- Inspect AI Model Providers - Model configuration
- Inspect AI Eval Logs - .eval file format
## Filing Issues
- Middleman: https://github.com/metr-middleman/middleman-server/issues
- Hawk: Linear issue on Evals Execution team
- Inspect AI: https://github.com/UKGovernmentBEIS/inspect_ai/issues