OpenHarness harness-eval
This skill should be used when the user asks to "test the harness", "run integration tests", "validate features with real API", "test with real model calls", "run agent loop tests", "verify end-to-end", or needs to verify OpenHarness features on a real codebase with actual LLM calls.
```bash
git clone https://github.com/HKUDS/OpenHarness
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenHarness "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/harness-eval" ~/.claude/skills/hkuds-openharness-harness-eval && rm -rf "$T"
```
`.claude/skills/harness-eval/SKILL.md`

Harness Eval — End-to-End Feature Validation
Validate OpenHarness features by running real agent loops against an unfamiliar codebase with actual LLM API calls. Every test exercises the full stack: API client → model → tool calls → execution → result.
Core Principles
- Test on an unfamiliar project — never test on OpenHarness itself (the agent modifies its own code). Clone a real project as the workspace.
- Use real API calls — no mocks. Configure a real LLM endpoint.
- Multi-turn conversations — always test 2+ turns where the model needs prior context.
- Combine features — test hooks+skills+agent loop together, not in isolation.
- Verify tool execution — inspect tool call lists and output files, not just model text.
Workflow
1. Prepare Workspace
Clone an unfamiliar project (do not use OpenHarness):
```bash
git clone https://github.com/HKUDS/AutoAgent /tmp/eval-workspace
```
2. Configure Environment
```bash
export ANTHROPIC_API_KEY=sk-xxx
export ANTHROPIC_BASE_URL=https://api.moonshot.cn/anthropic  # or any provider
export ANTHROPIC_MODEL=kimi-k2.5
```
For long-running real evals, do not artificially lower `max_turns`. Use the product default (200) unless the user explicitly wants a tighter bound.
3. Prepare Real Sandbox Runtime When Relevant
If the task is validating sandbox behavior, install and verify the actual runtime before running agent loops:
```bash
npm install -g @anthropic-ai/sandbox-runtime
sudo apt-get update
sudo apt-get install -y bubblewrap ripgrep
which srt
which bwrap
which rg
srt --version
```
Then run a minimal smoke check through OpenHarness, not just raw `srt`, so you verify the real adapter path:
```python
from pathlib import Path

from openharness.config.settings import Settings, SandboxSettings, save_settings
from openharness.tools.bash_tool import BashTool

cfg = Path("/tmp/openharness-sandbox-settings.json")
save_settings(Settings(sandbox=SandboxSettings(enabled=True, fail_if_unavailable=True)), cfg)
# Point the config loader at this file, then run BashTool on a tiny command such as `pwd`.
```
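A minimal sketch of that smoke check follows, assuming `BashTool` can be constructed directly and exposes an async `run(command, cwd)` coroutine; both names are assumptions, so adapt them to the real tool interface:

```python
import asyncio

from openharness.tools.bash_tool import BashTool

async def sandbox_smoke_check() -> None:
    # Assumption: BashTool is directly constructible and exposes an async
    # run(command, cwd) coroutine; substitute the real constructor and method.
    tool = BashTool()
    result = await tool.run("pwd", cwd="/tmp/eval-workspace")
    # Expect the sandboxed working directory, not a "sandbox unavailable" error.
    print(result)

asyncio.run(sandbox_smoke_check())
```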
If sandbox dependencies are missing, treat that as an environment/setup failure, not a feature regression.
4. Design Tests
Each test follows this pattern:
```python
engine = make_engine(system_prompt="...", cwd=UNFAMILIAR_PROJECT)

evs1 = [ev async for ev in engine.submit_message("Read X, analyze Y")]
r1 = collect(evs1)  # text, tools, turns, tokens

evs2 = [ev async for ev in engine.submit_message("Based on what you found...")]
r2 = collect(evs2)

assert "grep" in r1["tools"]  # verify tools ran
```
For detailed code templates and the `make_engine`/`collect` helpers, consult `references/test-patterns.md`.
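For orientation, a rough sketch of what such a `collect` helper might look like is below; the canonical version lives in `references/test-patterns.md`, and the event attribute names used here (`type`, `name`, `text`, `usage`) are assumptions about the event schema, not the real one.

```python
def collect(events: list) -> dict:
    """Aggregate an event stream into text, tool names, turn count, and token usage.

    Illustrative only: the attribute names below are assumptions about the
    event objects; the canonical helper is in references/test-patterns.md.
    """
    summary = {"text": "", "tools": [], "turns": 0, "tokens": 0}
    for ev in events:
        kind = getattr(ev, "type", "")
        if kind == "text":
            summary["text"] += getattr(ev, "text", "")
        elif kind == "tool_use":
            summary["tools"].append(getattr(ev, "name", "unknown"))
        elif kind == "turn_end":
            summary["turns"] += 1
            summary["tokens"] += getattr(ev, "usage", 0)
    return summary
```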
5. Prefer Long-Horizon, Real Agent Loops
For meaningful end-to-end validation, prefer unfamiliar-repo tasks that force multiple turns, context reuse, and mixed tool usage.
Recommended pattern:
- Use a real external workspace such as `AutoAgent`
- Use real provider credentials and the actual target model
- Keep `max_turns=200`
- Use per-prompt timeouts large enough for real exploration, such as 240-600s
- Require at least 2 turns per scenario
- Verify both text quality and tool traces
- Keep polling long-running sessions until they finish; do not abandon a run after the first long pause
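A minimal driver sketch that wires these constraints together, assuming the `make_engine` and `collect` helpers from the previous step are available; the 400-second budget is simply one value inside the 240-600s band.

```python
import asyncio

PROMPTS = [
    "Map the architecture and test entrypoints of this repository.",
    "Based on what you found, identify the top risks and propose refactors.",
]

async def run_long_horizon(workspace: str) -> list[dict]:
    engine = make_engine(system_prompt="You are auditing an unfamiliar repo.", cwd=workspace)
    results = []
    for prompt in PROMPTS:
        # Per-prompt wall-clock budget; leave max_turns at the product default (200).
        async def drain() -> list:
            return [ev async for ev in engine.submit_message(prompt)]
        events = await asyncio.wait_for(drain(), timeout=400)
        results.append(collect(events))
    return results

results = asyncio.run(run_long_horizon("/tmp/eval-workspace"))
assert len(results) >= 2 and results[0]["tools"], "expected multi-turn run with tool calls"
```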
Recommended long-horizon scenarios:
- `architecture_multiturn`
  - Turn 1: map architecture, shell/subprocess surfaces, and test entrypoints
  - Turn 2: identify top risks and propose refactors
  - Turn 3: condense into onboarding or remediation actions
  - Success: `bash`, `glob`, `grep`, `read_file` all appear; no timeout; no `MaxTurnsExceeded`
- `hook_block_and_recover`
  - Force the model to try `bash`
  - Block it with a real pre-tool hook
  - Verify the model adapts with `read_file`/`glob`/`grep`
- `sandbox_multiturn`
  - Enable real sandbox settings with `fail_if_unavailable=true`
  - First prompt must start with exactly one shell command such as `pwd && ls -la`
  - Second prompt must explicitly reuse the prior shell findings
  - Success: `bash` executes via sandbox, non-shell tools continue the task, and the agent recovers from incidental repo errors
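For `hook_block_and_recover`, the blocking hook itself can stay tiny. A minimal sketch follows; the hook signature and return shape are assumptions, so match them to the actual OpenHarness hooks contract before use.

```python
def deny_bash_pre_tool_use(tool_name: str, tool_input: dict) -> dict:
    """Hypothetical pre_tool_use hook: block bash, allow everything else.

    Assumption: hooks receive the tool name and input and return a decision
    dict; the real contract is defined by the OpenHarness hooks module.
    """
    if tool_name == "bash":
        return {
            "decision": "block",
            "reason": "bash is disabled for this eval; use read_file/glob/grep instead",
        }
    return {"decision": "allow"}
```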
When a scenario fails, classify it before changing code:
- `MaxTurnsExceeded`: likely eval harness misconfiguration if `max_turns` was manually lowered
- timeout: task is too broad or per-prompt timeout is too small
- sandbox unavailable: environment missing `srt`, `bwrap`, or `rg`
- tool error with task still completed: feature may still be healthy; inspect recovery behavior
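This triage can be captured in a small helper over the per-scenario summary; the keys used here (`error`, `completed`, `tool_errors`) are placeholders for whatever your driver actually records.

```python
def classify_failure(summary: dict) -> str:
    """Map a failed scenario summary to one of the triage buckets above (illustrative)."""
    error = summary.get("error", "")
    if "MaxTurnsExceeded" in error:
        return "check whether max_turns was manually lowered (harness misconfiguration)"
    if "timeout" in error.lower():
        return "task too broad or per-prompt timeout too small"
    if any(dep in error for dep in ("srt", "bwrap", "rg")):
        return "sandbox unavailable: eval environment issue, not a regression"
    if summary.get("completed") and summary.get("tool_errors"):
        return "tool error but task completed: inspect recovery behavior"
    return "unclassified: read the traceback before changing code"
```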
6. Run Tests
```bash
python tests/test_merged_prs_on_autoagent.py     # PR feature tests
python tests/test_real_large_tasks.py            # large multi-step tasks
python tests/test_hooks_skills_plugins_real.py   # hooks/skills/plugins
python -m pytest tests/ -q -k "not autoagent"    # unit tests (no API)
```
For ad hoc long-horizon validation, it is acceptable to run a temporary Python driver script as long as it:
- uses real OpenHarness engine/tool objects
- targets an unfamiliar repository
- prints per-scenario JSON summaries
- records tools, errors, turns, and token usage
- stays attached until completion
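A sketch of such a driver is below; only the record shape is meant literally, and `run_scenario` is a hypothetical callable standing in for real engine code run against the unfamiliar repo.

```python
import json
import time

def report(name: str, run_scenario) -> dict:
    """Run one real scenario and print a JSON summary with tools, errors, turns, and tokens."""
    record = {"scenario": name, "tools": [], "errors": [], "turns": 0, "tokens": 0}
    started = time.time()
    try:
        summary = run_scenario()   # real OpenHarness engine/tool objects, unfamiliar repo
        record.update(summary)     # expects the collect()-style keys used above
    except Exception as exc:       # stay attached; record the failure rather than abandon it
        record["errors"].append(repr(exc))
    record["elapsed_s"] = round(time.time() - started, 1)
    print(json.dumps(record, indent=2))
    return record
```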
7. Interpret Results
| Result | Meaning | Action |
|---|---|---|
| PASS with tool calls | Feature works end-to-end | Done |
| PASS without tool calls | Model answered from knowledge | Rewrite prompt to force tool use |
| FAIL with exception | Code bug | Read traceback |
| FAIL with wrong output | Model behavior issue | Check system prompt and tool schemas |
| Timeout | Task too complex | Increase the timeout or simplify the prompt |
For long-running real evals, refine the timeout guidance:
- First check whether `max_turns` was manually set too low
- If `max_turns=200` and the run still fails, the next suspect is wall-clock timeout, not turn count
- Distinguish environment failures from product failures
  - Example: missing dependency in the unfamiliar target repo is not automatically an OpenHarness regression
  - Example: missing `srt`/`bwrap`/`rg` is an eval environment issue
Feature Coverage Checklist
- Engine: multi-turn memory, tool chaining, parallel tools, error recovery, auto-compaction
- Swarm: InProcessBackend lifecycle, concurrent teammates, coordinator+notifications
- Hooks: pre_tool_use blocking → model adapts, post_tool_use firing
- Skills: skill tool invocation → model follows instructions
- Plugins: plugin-provided skill loaded and used in agent loop
- Memory: YAML frontmatter parsing, body content search, context injection
- Session: save → load → resume with context preserved
- Providers: Anthropic client, OpenAI client (with reasoning_content), multi-turn
- Cost: token accumulation across turns
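As one concrete instance of the checklist, multi-turn memory and cost accumulation can be asserted directly on the `collect()`-style summaries `r1` and `r2` from step 4, assuming the same `tools`/`text`/`tokens` keys:

```python
# Multi-turn memory: turn 2 was phrased to require turn 1's findings,
# so it should still produce tool calls and non-trivial text.
assert r2["tools"], "turn 2 should verify prior findings with tools, not answer purely from memory"
assert len(r2["text"]) > 0, "turn 2 produced no text"

# Cost: token usage accumulates across turns.
total_tokens = r1["tokens"] + r2["tokens"]
assert total_tokens > r1["tokens"] > 0, "expected real token usage on both turns"
print(f"accumulated tokens across turns: {total_tokens}")
```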
Common Pitfalls
- Testing on OpenHarness itself — agent modifies its own running code
- Using mocks — misses serialization and API compatibility bugs
- Single-turn only — misses context accumulation and compaction bugs
- Artificially lowering `max_turns` during real evals — can create false failures that do not reflect product defaults
- Not checking tool call list — model may claim tool use without calling it
- Hardcoding paths — use `WORKSPACE` variable, skip in CI with `pytest.mark.skipif`
- Declaring sandbox “tested” after only checking raw `srt` — verify the OpenHarness adapter path too
- Abandoning long tasks too early — some real tasks pause for minutes before the next event arrives
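The path-handling and CI pitfalls above usually reduce to a standard pytest pattern like this; only the environment variable names are conventions chosen for illustration:

```python
import os

import pytest

WORKSPACE = os.environ.get("WORKSPACE", "/tmp/eval-workspace")

# Real-API tests are skipped when credentials or the workspace are absent (e.g. in CI).
requires_real_api = pytest.mark.skipif(
    not os.environ.get("ANTHROPIC_API_KEY") or not os.path.isdir(WORKSPACE),
    reason="real-API eval needs ANTHROPIC_API_KEY and a cloned WORKSPACE",
)

@requires_real_api
def test_multiturn_architecture_scan():
    ...  # real agent loop against WORKSPACE, as in the Design Tests step
```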
Additional Resources
Reference Files
- `references/test-patterns.md` — Complete code templates for `make_engine`, `collect`, and each feature category
- `references/feature-matrix.md` — Detailed test cases for every OpenHarness module
Existing Test Files
Working test suites in the repo:
- `tests/test_merged_prs_on_autoagent.py` — PR feature validation
- `tests/test_real_large_tasks.py` — Large multi-step tasks
- `tests/test_hooks_skills_plugins_real.py` — Hooks/skills/plugins in agent loops
- `tests/test_untested_features.py` — Module-level integration tests