# Awesome-omni-skill acp-harness

Unified ACP client and evaluation harness. Connect to ACP-compatible agents programmatically, capture full trajectories (tools, thoughts, plans), and pipe them to downstream analysis tools.

```bash
git clone https://github.com/diegosouzapw/awesome-omni-skill
```

Or install the skill directly:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/tools/acp-harness-majiayu000" ~/.claude/skills/diegosouzapw-awesome-omni-skill-acp-harness && rm -rf "$T"
```

`skills/tools/acp-harness-majiayu000/SKILL.md`

# ACP Harness
## Purpose

This skill provides a unified toolkit for ACP client usage and agent evaluation, optimized for TypeScript/JavaScript projects using Bun.

- **ACP Client API** - Headless programmatic access to ACP-compatible agents
- **Evaluation Harness** - Run prompts against agents and capture full trajectories

Use this when:

- Comparing skills across different agents (Claude Code, Cursor, OpenCode, Amp, Goose, Factory)
- Evaluating built-in tools vs MCP servers vs skills for the same task
- Generating training data with full trajectory capture
- Running regression tests in CI/CD pipelines
- Building multi-agent applications on a headless transport layer
## Foundation Use Cases

The harness is a foundation layer: it captures trajectories; scoring happens downstream.

```mermaid
flowchart LR
    Harness["ACP Harness"] -->|"trajectories"| Scoring["Braintrust / Custom Script"]
    Scoring -->|"scores"| Decision["Informed Choices"]
```
| Use Case | Harness Provides | You Build |
|---|---|---|
| Cross-agent skill eval | Same prompts → multiple agents → trajectories | Scoring pipeline (Braintrust, custom) |
| Tool comparison | Trajectory with tool/skill attribution | Diff analysis, preference data |
| Training data | Structured I/O with tool calls, plans, thoughts | SFT/DPO formatting for world-agent |
| Regression testing | Deterministic prompt → trajectory capture | CI integration, golden file comparison (see the sketch below) |
| Multi-agent apps | Headless transport layer | Session management, UI, agent switching |
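For the regression-testing row above, a golden-file comparison can be a short script. This is a minimal sketch, assuming summary-format output (see Output Formats below); `compare-golden.ts` and `golden.jsonl` are illustrative names, not part of the harness:

```ts
// compare-golden.ts - golden-file check over summary-format JSONL (illustrative)
// Usage: bun compare-golden.ts results.jsonl golden.jsonl

type SummaryRecord = {
  id: string
  toolCalls: string[]
  status: string
  duration: number // ignored below: nondeterministic across runs
}

const parse = async (path: string): Promise<Map<string, SummaryRecord>> => {
  const text = await Bun.file(path).text()
  const records = text.trim().split('\n').map((line) => JSON.parse(line) as SummaryRecord)
  return new Map(records.map((r) => [r.id, r]))
}

const [resultsPath, goldenPath] = Bun.argv.slice(2)
const results = await parse(resultsPath)
const golden = await parse(goldenPath)

let failures = 0
for (const [id, expected] of golden) {
  const actual = results.get(id)
  // Compare stable fields only; duration and free-text output drift between runs
  if (
    !actual ||
    actual.status !== expected.status ||
    actual.toolCalls.join(',') !== expected.toolCalls.join(',')
  ) {
    console.error(`MISMATCH ${id}`)
    failures++
  }
}
process.exit(failures === 0 ? 0 : 1)
```

A nonzero exit code makes this drop straight into a CI step.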
## Agents Supporting Skills

Skills can be installed across multiple agents, enabling cross-agent comparison:

| Agent | Skills Directory | Install Command |
|---|---|---|
| Claude Code | `~/.claude/skills` | |
| Cursor | | |
| OpenCode | | |
| Amp | | |
| Goose | | |
| Factory | | |
### Example: Comparing Built-in vs Skill

```bash
# Run the same prompt with the built-in tool
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  -o results-builtin.jsonl

# Run the same prompt with a custom skill installed
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  --cwd /project/with/typescript-lsp-skill \
  -o results-skill.jsonl

# Compare trajectories - which used better tools? faster? more accurate?
diff <(jq '.toolCalls' results-builtin.jsonl) <(jq '.toolCalls' results-skill.jsonl)
```
## Execution Environment

**Recommendation:** Run evaluations in Docker containers for consistent, isolated execution.

```bash
# Build and run with Docker Compose
docker compose -f docker-compose.acp.yml run --rm acp-harness

# Or build directly
docker build -f Dockerfile.acp -t acp-harness .
docker run --rm -e ANTHROPIC_API_KEY acp-harness
```
Docker provides:
- Consistent environment across local and CI
- Filesystem isolation without app-level sandboxing
- Reproducible results for training data generation
See `assets/` for example container configurations:

- `Dockerfile.acp` - Base container with Bun and git
- `docker-compose.acp.yml` - Compose file with volume mounts for results
## Non-Goals

This harness is optimized for TypeScript/JavaScript projects using Bun. It is not designed for:

- **Python projects** - Use SWE-bench or the Braintrust Python SDK
- **Academic model benchmarking** - Use EleutherAI's lm-evaluation-harness
- **IDE integrations** - Use the Copilot Evaluation Harness
- **SaaS observability** - Use Braintrust or Langfuse directly
## Quick Reference

| Resource | Description |
|---|---|
| `scripts/run-harness.ts` | Execute prompts against an agent, capture trajectories |
| `client-api.md` | Client configuration and helpers |
| `output-formats.md` | JSONL schemas, format options |
| `downstream.md` | Integration patterns (Braintrust, jq, LLM-as-judge) |
| `llm-judge-templates.md` | Evaluation prompt templates |
## Evaluation Workflow

```mermaid
flowchart LR
    Prompts["prompts.jsonl"] --> Harness["run-harness.ts"]
    Agent["ACP Agent"] --> Harness
    Harness --> Summary["summary.jsonl"]
    Harness --> Judge["results.md + results.full.jsonl"]
    Summary --> JQ["jq analysis"]
    Judge --> LLM["LLM-as-judge"]
```

1. **Prepare** - Create `prompts.jsonl` with evaluation cases
2. **Execute** - Run the harness against the target agent
3. **Capture** - Trajectories are streamed to output files
4. **Analyze** - Pipe to downstream tools for scoring
## Harness Script

### Basic Usage

```bash
bun scripts/run-harness.ts <prompts.jsonl> --agent <command> [options]
```

### Arguments

| Flag | Description | Default |
|---|---|---|
| `<prompts.jsonl>` | Input file with evaluation prompts | Required |
| `--agent` | ACP agent command | |
| `-o` | Output file/path | stdout |
| `--cwd` | Working directory for agent | current |
| | Request timeout in ms | |
| `--format` | Output format: `summary`, `judge` | `summary` |
| `--progress` | Show progress to stderr | `false` |
| | Append to output file | `false` |
| `--mcp-server` | MCP server config JSON (repeatable) | none |
### Examples

```bash
# Summary format (default) - minimal JSONL
bun scripts/run-harness.ts prompts.jsonl -o results.jsonl

# Judge format - creates two files for two-tier evaluation
bun scripts/run-harness.ts prompts.jsonl --format judge -o results
# Creates: results.md (summary with step IDs) + results.full.jsonl (complete trajectory)

# With MCP server (stdio transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem","/data"]}'

# With MCP server (HTTP transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"http","name":"api","url":"http://localhost:3000"}'

# Different agent (Droid ACP adapter)
bun scripts/run-harness.ts prompts.jsonl --agent droid-acp -o results.jsonl

# Stream with progress
bun scripts/run-harness.ts prompts.jsonl --progress -o results.jsonl
```
### Input Format

Each line in `prompts.jsonl` is a JSON object:

```jsonl
{"id":"test-001","input":"Create a primary button","expected":"should contain <button>","metadata":{"category":"ui"}}
{"id":"test-002","input":"Write a bThread for form validation","metadata":{"category":"behavioral"}}
```

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier |
| `input` | Yes | Prompt text for the agent |
| `expected` | No | Expected output (for downstream scoring) |
| `metadata` | No | Tags, category, difficulty for filtering |
| `timeout` | No | Override the default timeout for this prompt |
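Because prompt files are plain JSONL, they can also be generated programmatically. A minimal sketch assuming the field schema above (the script name and case list are illustrative):

```ts
// make-prompts.ts - generate prompts.jsonl from a case list (illustrative)
type PromptCase = {
  id: string
  input: string
  expected?: string
  metadata?: Record<string, string>
}

const cases: PromptCase[] = [
  {
    id: 'test-001',
    input: 'Create a primary button',
    expected: 'should contain <button>',
    metadata: { category: 'ui' },
  },
  {
    id: 'test-002',
    input: 'Write a bThread for form validation',
    metadata: { category: 'behavioral' },
  },
]

// One JSON object per line with a trailing newline - standard JSONL
await Bun.write('prompts.jsonl', cases.map((c) => JSON.stringify(c)).join('\n') + '\n')
```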
## Output Formats

### Summary Format (default)

Minimal JSONL for quick metrics and analysis:

```jsonl
{"id":"test-001","input":"Create a button","output":"I created...","toolCalls":["Write"],"status":"passed","duration":1234}
```
### Judge Format (two-tier)

Creates two files for LLM-as-judge evaluation:

**`<output>.md`** - Markdown summary with step IDs and code previews:

````markdown
## Evaluation Record: test-001

**Input:** Create a primary button

**Trajectory:**
1. [THOUGHT] I'll create a styled button template... [->test-001-step-1]
2. [TOOL:Write] -> completed (234ms) [->test-001-step-2]
   File: src/button.tsx (847 chars)
   ```tsx
   import { createStyles } from 'plaited'
   type ButtonProps = { label: string
   // ... 30 lines omitted ...
   export const Button = ({ label }: ButtonProps) => (
     <button {...styles.btn}>{label}</button>
   )
   ```
3. [MESSAGE] I created the button... [->test-001-step-3]

**Output:** I created the button template with primary styling.
**Metadata:** category=ui, agent=claude-code-acp
**Status:** passed
**Duration:** 1234ms
````

**`<output>.full.jsonl`** - Complete trajectory with step IDs for correlation:

```jsonl
{"id":"test-001","input":"...","output":"...","trajectory":[{"type":"thought","content":"...","timestamp":100,"stepId":"test-001-step-1"},{"type":"tool_call","name":"Write","status":"completed","input":{...},"output":{...},"duration":234,"stepId":"test-001-step-2"}],...}
```
Usage patterns by judge context window:

| Judge Model | Strategy |
|---|---|
| Gemini (1M+ tokens) | Feed `<output>.full.jsonl` directly |
| Claude/GPT-4 (128-200k) | Use `<output>.md` for most runs |
| Smaller models | Use `<output>.md`, retrieve specific steps from `<output>.full.jsonl` by ID as needed |
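Step IDs make the two tiers easy to correlate. A minimal retrieval sketch for the small-context case, assuming the full-trajectory schema shown above (the script and file names are illustrative):

```ts
// get-step.ts - look up one trajectory step by its stepId (illustrative)
// Usage: bun get-step.ts results.full.jsonl test-001-step-2
type Step = { stepId: string; type: string; [key: string]: unknown }
type FullRecord = { id: string; trajectory: Step[] }

const [path, stepId] = Bun.argv.slice(2)
const lines = (await Bun.file(path).text()).trim().split('\n')

for (const line of lines) {
  const record = JSON.parse(line) as FullRecord
  const step = record.trajectory.find((s) => s.stepId === stepId)
  if (step) {
    // Hand just this step to a small-context judge instead of the whole trajectory
    console.log(JSON.stringify(step, null, 2))
    break
  }
}
```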
## Programmatic Usage

```ts
import { createACPClient, createPrompt, summarizeResponse } from 'plaited/acp'

// Requires: npm install -g @zed-industries/claude-code-acp
const client = createACPClient({
  command: ['claude-code-acp'],
  cwd: '/path/to/project',
})

await client.connect()
const session = await client.createSession()

const { updates } = await client.promptSync(
  session.id,
  createPrompt('Create a button with hover state')
)

// Full trajectory is in updates
const summary = summarizeResponse(updates)
console.log({
  text: summary.text,
  toolCalls: summary.completedToolCalls,
  hasErrors: summary.hasErrors,
})

await client.disconnect()
```
See `client-api.md` for complete API documentation.
## Downstream Integration

The harness outputs standard JSONL that pipes to any tool:

```bash
# Filter with jq
cat results.jsonl | jq 'select(.metadata.category == "ui")'

# Count tool usage
cat results.jsonl | jq -s 'map(.toolCalls | length) | add'

# Feed full trajectory to Gemini (large context)
cat results.full.jsonl | your-gemini-judge.ts
```
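The `expected` field from the input format carries through for exactly this kind of scoring. A minimal custom-scorer sketch, assuming summary-format records and crude substring matching (`score.ts` and the matching rule are illustrative, not a harness feature):

```ts
// score.ts - substring scorer over summary-format results (illustrative)
// Usage: bun score.ts results.jsonl
type Result = { id: string; output: string; expected?: string }

const lines = (await Bun.file(Bun.argv[2]).text()).trim().split('\n')

for (const line of lines) {
  const r = JSON.parse(line) as Result
  if (!r.expected) continue // nothing to score against
  // Crude check: did the agent's final message contain the expected content?
  const needle = r.expected.replace(/^should contain /, '')
  const pass = r.output.includes(needle)
  console.log(JSON.stringify({ id: r.id, score: pass ? 1 : 0 }))
}
```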
See `downstream.md` for integration patterns with Braintrust, Gemini, and custom scorers.
## Evaluation Targets

| Target | How to Evaluate |
|---|---|
| Agent capability | Direct prompts, analyze trajectory quality |
| Skills | Set `--cwd` to a project with the skill, test skill-specific prompts |
| MCP Servers | Use the `--mcp-server` flag, verify tool usage in the trajectory |
| Behavioral programs | Analyze trajectory for bThread coordination patterns |
### Skill Evaluation

```bash
bun scripts/run-harness.ts skill-prompts.jsonl \
  --cwd /project/with/skill \
  -o results.jsonl
```

### MCP Server Evaluation

```bash
bun scripts/run-harness.ts mcp-prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem"]}' \
  -o results.jsonl
```
## Related

- `plaited/acp` - Core ACP client module
- `world-agent` - Training workflow, trajectory generation
- `plaited-behavioral-core` - bThread patterns for agent coordination