Awesome-omni-skill acp-harness

Unified ACP client and evaluation harness. Connect to ACP-compatible agents programmatically, capture full trajectories (tools, thoughts, plans), and pipe to downstream analysis tools.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/tools/acp-harness-majiayu000" ~/.claude/skills/diegosouzapw-awesome-omni-skill-acp-harness && rm -rf "$T"
manifest: skills/tools/acp-harness-majiayu000/SKILL.md

ACP Harness

Purpose

This skill provides a unified toolkit for ACP client usage and agent evaluation, optimized for TypeScript/JavaScript projects using Bun.

  1. ACP Client API - Headless programmatic access to ACP-compatible agents
  2. Evaluation Harness - Run prompts against agents and capture full trajectories

Use this when:

  • Comparing skills across different agents (Claude Code, Cursor, OpenCode, Amp, Goose, Factory)
  • Evaluating built-in tools vs MCP servers vs skills for the same task
  • Generating training data with full trajectory capture
  • Running regression tests in CI/CD pipelines
  • Building multi-agent applications on a headless transport layer

Foundation Use Cases

The harness is a foundation layer - it captures trajectories; scoring happens downstream.

flowchart LR
    Harness["ACP Harness"] -->|"trajectories"| Scoring["Braintrust / Custom Script"]
    Scoring -->|"scores"| Decision["Informed Choices"]

| Use Case | Harness Provides | You Build |
| --- | --- | --- |
| Cross-agent skill eval | Same prompts → multiple agents → trajectories | Scoring pipeline (Braintrust, custom) |
| Tool comparison | Trajectory with tool/skill attribution | Diff analysis, preference data |
| Training data | Structured I/O with tool calls, plans, thoughts | SFT/DPO formatting for world-agent |
| Regression testing | Deterministic prompt → trajectory capture | CI integration, golden file comparison |
| Multi-agent apps | `createACPClient` transport layer | Session management, UI, agent switching |

Agents Supporting Skills

Skills can be installed across multiple agents, enabling cross-agent comparison:

| Agent | Skills Directory | Install Command |
| --- | --- | --- |
| Claude Code | `.claude/skills/` | `./install-workshop.sh --agent claude` |
| Cursor | `.claude/skills/` | `./install-workshop.sh --agent cursor` |
| OpenCode | `.opencode/skill/` | `./install-workshop.sh --agent opencode` |
| Amp | `.agents/skills/` | `./install-workshop.sh --agent amp` |
| Goose | `.claude/skills/` | `./install-workshop.sh --agent goose` |
| Factory | `.factory/skills/` | `./install-workshop.sh --agent factory` |

Example: Comparing Built-in vs Skill

# Run same prompt with built-in tool
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  -o results-builtin.jsonl

# Run same prompt with custom skill installed
bun scripts/run-harness.ts prompts.jsonl \
  --agent claude-code-acp \
  --cwd /project/with/typescript-lsp-skill \
  -o results-skill.jsonl

# Compare trajectories - which used better tools? faster? more accurate?
diff <(jq '.toolCalls' results-builtin.jsonl) <(jq '.toolCalls' results-skill.jsonl)
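
The `diff` above compares raw tool-call arrays line by line. A per-prompt comparison can be sketched in TypeScript as below. This is a minimal sketch, assuming both files use the summary format (`id`, `toolCalls`, `status`, `duration`); `compareRuns` and `parseJsonl` are hypothetical helper names, not part of the harness.

```typescript
// Shape of one summary-format line (fields taken from the summary example in this document).
type SummaryResult = {
  id: string
  toolCalls: string[]
  status: string
  duration: number
}

type ToolDiff = {
  id: string
  onlyInBaseline: string[]
  onlyInCandidate: string[]
  durationDeltaMs: number // candidate minus baseline
}

// Parse JSONL text into objects, skipping blank lines.
function parseJsonl<T>(text: string): T[] {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as T)
}

// Hypothetical helper: pair results by id and report where tool usage differs.
function compareRuns(baselineText: string, candidateText: string): ToolDiff[] {
  const candidates = new Map<string, SummaryResult>()
  for (const result of parseJsonl<SummaryResult>(candidateText)) {
    candidates.set(result.id, result)
  }
  const diffs: ToolDiff[] = []
  for (const base of parseJsonl<SummaryResult>(baselineText)) {
    const other = candidates.get(base.id)
    if (!other) continue // prompt missing from the candidate run
    diffs.push({
      id: base.id,
      onlyInBaseline: base.toolCalls.filter((tool) => !other.toolCalls.includes(tool)),
      onlyInCandidate: other.toolCalls.filter((tool) => !base.toolCalls.includes(tool)),
      durationDeltaMs: other.duration - base.duration,
    })
  }
  return diffs
}
```

Under Bun, the two inputs could be read with `await Bun.file('results-builtin.jsonl').text()` and `await Bun.file('results-skill.jsonl').text()`. The helper only reports differences; judging which run is better remains a downstream decision.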

Execution Environment

Recommendation: Run evaluations in Docker containers for consistent, isolated execution.

# Build and run with Docker Compose
docker compose -f docker-compose.acp.yml run --rm acp-harness

# Or build directly
docker build -f Dockerfile.acp -t acp-harness .
docker run --rm -e ANTHROPIC_API_KEY acp-harness

Docker provides:

  • Consistent environment across local and CI
  • Filesystem isolation without app-level sandboxing
  • Reproducible results for training data generation

See assets/ for example container configurations:

  • `Dockerfile.acp` - Base container with Bun and git
  • `docker-compose.acp.yml` - Compose file with volume mounts for results

Non-Goals

This harness is optimized for TypeScript/JavaScript projects using Bun. It is not designed for:

Quick Reference

| Resource | Description |
| --- | --- |
| `scripts/run-harness.ts` | Execute prompts against agent, capture trajectories |
| `client-api.md` | `createACPClient` configuration, helpers |
| `output-formats.md` | JSONL schemas, format options |
| `downstream.md` | Integration patterns (Braintrust, jq, LLM-as-judge) |
| `llm-judge-templates.md` | Evaluation prompt templates |

Evaluation Workflow

flowchart LR
    Prompts["prompts.jsonl"] --> Harness["run-harness.ts"]
    Agent["ACP Agent"] --> Harness
    Harness --> Summary["summary.jsonl"]
    Harness --> Judge["results.md + results.full.jsonl"]
    Summary --> JQ["jq analysis"]
    Judge --> LLM["LLM-as-judge"]

  1. Prepare - Create `prompts.jsonl` with evaluation cases
  2. Execute - Run harness against target agent
  3. Capture - Trajectories streamed to output files
  4. Analyze - Pipe to downstream tools for scoring

Harness Script

Basic Usage

bun scripts/run-harness.ts <prompts.jsonl> --agent <command> [options]

Arguments

| Flag | Description | Default |
| --- | --- | --- |
| `prompts.jsonl` | Input file with evaluation prompts | Required |
| `-a, --agent` | ACP agent command | `"claude-code-acp"` |
| `-o, --output` | Output file/path | stdout |
| `-c, --cwd` | Working directory for agent | current |
| `-t, --timeout` | Request timeout in ms | `60000` |
| `-f, --format` | Output format: `summary`, `judge` | `summary` |
| `--progress` | Show progress to stderr | false |
| `--append` | Append to output file | false |
| `--mcp-server` | MCP server config JSON (repeatable) | none |

Examples

# Summary format (default) - minimal JSONL
bun scripts/run-harness.ts prompts.jsonl -o results.jsonl

# Judge format - creates two files for two-tier evaluation
bun scripts/run-harness.ts prompts.jsonl --format judge -o results
# Creates: results.md (summary with step IDs) + results.full.jsonl (complete trajectory)

# With MCP server (stdio transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem","/data"]}'

# With MCP server (HTTP transport)
bun scripts/run-harness.ts prompts.jsonl \
  --mcp-server '{"type":"http","name":"api","url":"http://localhost:3000"}'

# Different agent (Droid ACP adapter)
bun scripts/run-harness.ts prompts.jsonl --agent droid-acp -o results.jsonl

# Stream with progress
bun scripts/run-harness.ts prompts.jsonl --progress -o results.jsonl

Input Format

Each line in `prompts.jsonl`:

{"id":"test-001","input":"Create a primary button","expected":"should contain <button>","metadata":{"category":"ui"}}
{"id":"test-002","input":"Write a bThread for form validation","metadata":{"category":"behavioral"}}

| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique identifier |
| `input` | Yes | Prompt text for the agent |
| `expected` | No | Expected output (for downstream scoring) |
| `metadata` | No | Tags, category, difficulty for filtering |
| `timeout` | No | Override default timeout for this prompt |
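
Checking the input file before a long run can catch malformed lines early. The following is a minimal sketch, assuming only the field rules listed above (`id` and `input` required, `id` unique, `timeout` numeric when present); `validatePrompts` is a hypothetical helper, not something the harness ships.

```typescript
// Fields from the Input Format description; only id and input are required.
type PromptCase = {
  id: string
  input: string
  expected?: string
  metadata?: Record<string, unknown>
  timeout?: number
}

// Hypothetical helper: returns human-readable problems, or an empty list when the file is valid.
function validatePrompts(jsonl: string): string[] {
  const problems: string[] = []
  const seenIds = new Set<string>()

  jsonl.split('\n').forEach((line, index) => {
    if (line.trim() === '') return // ignore blank lines
    const where = `line ${index + 1}`

    let parsed: unknown
    try {
      parsed = JSON.parse(line)
    } catch {
      problems.push(`${where}: not valid JSON`)
      return
    }
    if (typeof parsed !== 'object' || parsed === null) {
      problems.push(`${where}: not a JSON object`)
      return
    }

    const candidate = parsed as Partial<PromptCase>
    if (typeof candidate.id !== 'string' || candidate.id === '') {
      problems.push(`${where}: missing "id"`)
    } else if (seenIds.has(candidate.id)) {
      problems.push(`${where}: duplicate id "${candidate.id}"`)
    } else {
      seenIds.add(candidate.id)
    }
    if (typeof candidate.input !== 'string' || candidate.input === '') {
      problems.push(`${where}: missing "input"`)
    }
    if (candidate.timeout !== undefined && typeof candidate.timeout !== 'number') {
      problems.push(`${where}: "timeout" must be a number`)
    }
  })

  return problems
}
```

An empty result means every non-blank line parsed and satisfied the required fields. The check says nothing about whether `expected` is usable by a particular scorer.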

Output Formats

Summary Format (default)

Minimal JSONL for quick metrics and analysis:

{"id":"test-001","input":"Create a button","output":"I created...","toolCalls":["Write"],"status":"passed","duration":1234}
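
The summary fields are enough for quick aggregate metrics. The sketch below computes a pass rate, mean duration, and per-tool counts from summary-format JSONL text. It assumes `duration` is in milliseconds and treats any `status` other than `"passed"` as not passed, since the document only shows the `"passed"` value; `summarizeRun` is a hypothetical helper name.

```typescript
// One summary-format line, using the fields shown in the example above.
type SummaryResult = {
  id: string
  toolCalls: string[]
  status: string
  duration: number
}

type RunStats = {
  total: number
  passed: number
  passRate: number // fraction between 0 and 1
  meanDurationMs: number
  toolCounts: Record<string, number>
}

// Hypothetical helper: aggregate a whole results file into a few headline numbers.
function summarizeRun(jsonl: string): RunStats {
  const results = jsonl
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as SummaryResult)

  const total = results.length
  const passed = results.filter((result) => result.status === 'passed').length
  const totalDuration = results.reduce((sum, result) => sum + result.duration, 0)

  // Count how many times each tool name appears across all records.
  const toolCounts: Record<string, number> = {}
  for (const result of results) {
    for (const tool of result.toolCalls) {
      toolCounts[tool] = (toolCounts[tool] ?? 0) + 1
    }
  }

  return {
    total,
    passed,
    passRate: total === 0 ? 0 : passed / total,
    meanDurationMs: total === 0 ? 0 : totalDuration / total,
    toolCounts,
  }
}
```

An empty file yields zeros rather than `NaN`, which keeps the output safe to serialize in CI logs.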

Judge Format (two-tier)

Creates two files for LLM-as-judge evaluation:

**`<output>.md`** - Markdown summary with step IDs and code previews:

## Evaluation Record: test-001

**Input:** Create a primary button

**Trajectory:**
1. [THOUGHT] I'll create a styled button template... [->test-001-step-1]
2. [TOOL:Write] -> completed (234ms) [->test-001-step-2]
   File: src/button.tsx (847 chars)
   ```tsx
   import { createStyles } from 'plaited'

   type ButtonProps = {
     label: string

   // ... 30 lines omitted ...

   export const Button = ({ label }: ButtonProps) => (
     <button {...styles.btn}>{label}</button>
   )
   ```
3. [MESSAGE] I created the button... [->test-001-step-3]

**Output:** I created the button template with primary styling.
**Metadata:** category=ui, agent=claude-code-acp
**Status:** passed
**Duration:** 1234ms



**`<output>.full.jsonl`** - Complete trajectory with step IDs for correlation:

```jsonl
{"id":"test-001","input":"...","output":"...","trajectory":[{"type":"thought","content":"...","timestamp":100,"stepId":"test-001-step-1"},{"type":"tool_call","name":"Write","status":"completed","input":{...},"output":{...},"duration":234,"stepId":"test-001-step-2"}],...}
```

Usage patterns by judge context window:

| Judge Model | Strategy |
| --- | --- |
| Gemini (1M+ tokens) | Feed `results.full.jsonl` directly |
| Claude/GPT-4 (128-200k) | Use `results.full.jsonl` for most runs |
| Smaller models | Use `results.md`, retrieve specific steps by ID as needed |
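
For the smaller-model strategy, a judge that sees a step ID such as `test-001-step-2` in `results.md` needs a way to pull the matching entry out of `results.full.jsonl`. A minimal sketch follows, assuming each line carries a `trajectory` array whose entries have a `stepId` field as in the example record; `findStep` is a hypothetical helper.

```typescript
// A trajectory entry; only the fields used for lookup are typed, the rest pass through.
type TrajectoryStep = {
  type: string
  stepId: string
  [key: string]: unknown
}

type FullResult = {
  id: string
  trajectory: TrajectoryStep[]
}

// Hypothetical helper: scan the full JSONL text and return the first step with a matching ID.
function findStep(fullJsonl: string, stepId: string): TrajectoryStep | undefined {
  for (const line of fullJsonl.split('\n')) {
    if (line.trim() === '') continue // ignore blank lines
    const record = JSON.parse(line) as FullResult
    const step = record.trajectory.find((entry) => entry.stepId === stepId)
    if (step) return step
  }
  return undefined
}
```

The linear scan is adequate for occasional lookups. For many lookups against a large file, building a `Map` keyed by `stepId` once would avoid re-parsing every line.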

Programmatic Usage

import { createACPClient, createPrompt, summarizeResponse } from 'plaited/acp'

// Requires: npm install -g @zed-industries/claude-code-acp
const client = createACPClient({
  command: ['claude-code-acp'],
  cwd: '/path/to/project',
})

await client.connect()
const session = await client.createSession()

const { updates } = await client.promptSync(
  session.id,
  createPrompt('Create a button with hover state')
)

// Full trajectory is in updates
const summary = summarizeResponse(updates)
console.log({
  text: summary.text,
  toolCalls: summary.completedToolCalls,
  hasErrors: summary.hasErrors
})

await client.disconnect()

See client-api.md for complete API documentation.

Downstream Integration

The harness outputs standard JSONL that pipes to any tool:

# Filter with jq
cat results.jsonl | jq 'select(.metadata.category == "ui")'

# Count tool usage
cat results.jsonl | jq -s 'map(.toolCalls | length) | add'

# Feed full trajectory to Gemini (large context)
cat results.full.jsonl | your-gemini-judge.ts

See downstream.md for integration patterns with Braintrust, Gemini, and custom scorers.

Evaluation Targets

| Target | How to Evaluate |
| --- | --- |
| Agent capability | Direct prompts, analyze trajectory quality |
| Skills | Set `--cwd` to project with skill, test skill-specific prompts |
| MCP Servers | Use `--mcp-server` flag, verify tool usage in trajectory |
| Behavioral programs | Analyze trajectory for bThread coordination patterns |

Skill Evaluation

bun scripts/run-harness.ts skill-prompts.jsonl \
  --cwd /project/with/skill \
  -o results.jsonl

MCP Server Evaluation

bun scripts/run-harness.ts mcp-prompts.jsonl \
  --mcp-server '{"type":"stdio","name":"fs","command":["mcp-filesystem"]}' \
  -o results.jsonl
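
When the harness is launched from a script rather than typed by hand, the `--mcp-server` JSON is easier to build from typed objects than to quote manually. The sketch below assumes only the two config shapes shown in this document's examples (`stdio` with `command`, `http` with `url`); the `McpServerConfig` type and `mcpServerArgs` helper are hypothetical and may not cover every field the harness accepts.

```typescript
// Config shapes inferred from the --mcp-server examples in this document.
type McpServerConfig =
  | { type: 'stdio'; name: string; command: string[] }
  | { type: 'http'; name: string; url: string }

// Hypothetical helper: emit one "--mcp-server <json>" pair per server (the flag is repeatable).
function mcpServerArgs(servers: McpServerConfig[]): string[] {
  const args: string[] = []
  for (const server of servers) {
    args.push('--mcp-server', JSON.stringify(server))
  }
  return args
}

// Example: argument list for a run with one stdio server and one HTTP server.
const exampleArgs = mcpServerArgs([
  { type: 'stdio', name: 'fs', command: ['mcp-filesystem', '/data'] },
  { type: 'http', name: 'api', url: 'http://localhost:3000' },
])
console.log(exampleArgs.length)
```

Passing the result as an argument array to a process-spawning API, instead of joining it into a shell string, sidesteps shell quoting of the embedded JSON.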

Related

  • plaited/acp - Core ACP client module
  • world-agent - Training workflow, trajectory generation
  • plaited-behavioral-core - bThread patterns for agent coordination