Awesome-omni-skill web-search-agent-evals

Development assistant for web search agent evaluations across multiple CLI agents

Install

Source · Clone the upstream repo:

git clone https://github.com/diegosouzapw/awesome-omni-skill

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ai-agents/web-search-agent-evals" ~/.claude/skills/diegosouzapw-awesome-omni-skill-web-search-agent-evals && rm -rf "$T"

Manifest: skills/ai-agents/web-search-agent-evals/SKILL.md
Safety · Automated scan (medium risk)

This is a pattern-based risk scan, not a security review. Our crawler flagged:

  • global npm install
  • references to .env files
  • references to API keys

Always read a skill's source content before installing. These patterns alone don't mean the skill is malicious, but they warrant attention.
Source Content

Web Search Agent Evaluations

Development assistant for running and comparing web search capabilities across CLI agents.

Overview

Evaluates 4 agents (Claude Code, Gemini, Droid, Codex) with 2 search tools (builtin, You.com MCP), for 8 agent×tool pairings.

Key Features:

  • Headless adapters - Schema-driven CLI agent execution via @plaited/agent-eval-harness
  • Flag-based architecture - Single service per agent, mode selected via environment variables
  • Type-safe constants - MCP server definitions in TypeScript
  • Isolated execution - Each pairing runs in its own Docker container

Architecture:

  • agent-schemas/ - Headless adapter JSON schemas
  • mcp-servers.ts - TypeScript MCP server constants
  • docker/entrypoint - Bun shell script for runtime config
  • scripts/ - Type-safe execution and comparison CLI tools
  • docker/ - Container infrastructure

Quick Commands

Run Evaluations

# Full dataset (151 prompts), k=5 — all 8 agent×provider scenarios
bun run trials

# Quick smoke test (5 random prompts, single trial)
bun run trials -- --count 5 -k 1

# Specific agent or provider
bun run trials -- --agent claude-code --search-provider builtin
bun run trials -- --agent gemini --search-provider you

# Trial type presets
bun run trials -- --trial-type capability   # k=10, deep exploration
bun run trials -- --trial-type regression   # k=3, fast regression check

# Custom k value
bun run trials -- -k 7

# Control parallelism
bun run trials -- -j 4                      # Limit to 4 containers
bun run trials -- --prompt-concurrency 4    # 4 prompts per container

# Direct Docker (manual testing)
docker compose run --rm -e SEARCH_PROVIDER=builtin claude-code
docker compose run --rm -e SEARCH_PROVIDER=you -e PROMPT_COUNT=5 gemini

Compare Results

Comparisons are written to data/comparisons/YYYY-MM-DD/.

# Latest date auto-detected
bun run compare

# Statistical analysis with bootstrap confidence intervals
bun run compare:stat

# Specific date or filter
bun run compare -- --run-date 2026-02-18
bun run compare -- --agent droid
bun run compare -- --search-provider builtin
bun run compare -- --trial-type capability

# View results
cat data/comparisons/2026-02-18/all-builtin-weighted.json | jq '.capability'
cat data/comparisons/2026-02-18/builtin-vs-you-weighted.json | jq '.headToHead.capability'

Comparison strategies:

  • weighted (default) - Capability, reliability, and consistency weighted scoring
  • statistical - Bootstrap sampling with 95% confidence intervals (sketched below)
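
As a rough illustration of the statistical strategy, here is a minimal bootstrap sketch in TypeScript; the resample count and the per-prompt pass-rate input are assumptions, not necessarily how the harness computes its intervals:

// Minimal bootstrap sketch: resample per-prompt pass rates with replacement
// and report a 95% confidence interval for the mean pass rate.
const bootstrapCI = (passRates: number[], resamples = 10_000): [number, number] => {
  const means: number[] = []
  for (let i = 0; i < resamples; i++) {
    let sum = 0
    for (let j = 0; j < passRates.length; j++) {
      sum += passRates[Math.floor(Math.random() * passRates.length)]
    }
    means.push(sum / passRates.length)
  }
  means.sort((a, b) => a - b)
  return [means[Math.floor(resamples * 0.025)], means[Math.floor(resamples * 0.975)]]
}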

Generate Report

Generate a comprehensive REPORT.md from the comparison results:

# Latest date auto-detected
bun run report

# Specific date
bun run report -- --run-date 2026-02-18

# Preview without writing
bun run report -- --dry-run

Report includes:

  • Executive summary with best capability, reliability, and performance
  • Quality rankings with pass@k and pass^k scores (see the formula sketch after this list)
  • Performance rankings (latency P50/P90/P99)
  • Flakiness analysis with top flaky prompts
  • MCP tool impact analysis (builtin vs MCP comparison)
  • Tool call statistics (P50/P90/P99/mean per provider)
  • Tool call distribution histograms
  • Failing prompts list (pass@k = 0%) with query text
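
For reference, the example numbers in the Results section (passRate 0.8, k=5 → passAtK≈0.999, passExpK≈0.328) are consistent with these standard formulas; a minimal sketch, assuming the harness derives both from the observed pass rate p:

// pass@k: probability that at least one of k trials passes.
// pass^k: probability that all k trials pass.
// e.g. p=0.8, k=5 → passAtK ≈ 0.99968, passExpK ≈ 0.32768
const passAtK = (p: number, k: number): number => 1 - (1 - p) ** k
const passExpK = (p: number, k: number): number => p ** k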

Output:

data/comparisons/YYYY-MM-DD/REPORT.md

Calibrate Grader

Interactive wizard to sample failures and review grader accuracy. Helps distinguish between agent failures (agent got it wrong) and grader bugs (agent was correct, grader too strict).

# Interactive calibration (recommended)
bun run calibrate

Interactive prompts:

  1. Run date - Select from available dated runs
  2. Agents - Multi-select via numbers or "all"
  3. Search providers - Multi-select via numbers or "all"
  4. Sample count - Number of failures to sample (default: 5)

Output:

data/calibration/{date}-{agent}-{provider}.md

What calibration reveals:

  • Grader too strict - Agent gave correct answer, grader rejected valid paraphrasing
  • Hint too vague - Grader can't tell good from bad answers
  • Real failures - Agent genuinely gave wrong/incomplete answer

See @agent-eval-harness calibration docs for grader calibration concepts.

Parallelization

The evaluation harness supports two levels of parallelization:

Container-Level Concurrency (-j, --concurrency)

Controls how many Docker containers (agent×provider scenarios) run simultaneously.

bun run trials              # Unlimited (default, all 8 scenarios at once)
bun run trials -- -j 4     # Limit to 4 containers
bun run trials -- -j 1     # Sequential (debugging)

Use cases:

  • Unlimited (default) - All scenarios at once; the I/O-bound workload handles it fine
  • -j 4 - Limit concurrency if hitting API rate limits
  • -j 2 - Conservative, for low-resource machines
  • -j 1 - Sequential execution for debugging

Prompt-Level Concurrency (--prompt-concurrency)

Controls how many prompts run in parallel within each container.

bun run trials -- --prompt-concurrency 4    # 4 prompts (moderate parallelism)
bun run trials -- --prompt-concurrency 1    # Sequential (default, safest)
bun run trials -- --prompt-concurrency 8    # 8 prompts (high memory, CI only)

How it works:

  • Uses the harness -j flag with --workspace-dir for isolation (see the sketch after this list)
  • Each prompt gets its own workspace directory
  • Web searches are I/O-bound, so parallel prompts maximize network bandwidth
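
A sketch of how the entrypoint might assemble those flags; buildHarnessArgs is a hypothetical helper, though -j and --workspace-dir are the flags named above:

// Hypothetical helper: forward prompt-level concurrency to the harness -j
// flag and give each prompt an isolated workspace directory.
const buildHarnessArgs = (promptConcurrency: number, runDir: string): string[] => [
  '-j', String(promptConcurrency),            // prompts in parallel within the container
  '--workspace-dir', `${runDir}/workspaces`,  // per-prompt isolation root (path is illustrative)
]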

Performance comparison:

| Config | Containers | Prompts/Container | Full run (151 prompts, k=5) |
| --- | --- | --- | --- |
| Default | unlimited | 1 | ~2.5 hrs |
| Faster | unlimited | 4 | ~40 min |
| CI (high memory) | unlimited | 8 | ~20 min |

Warning: Stream-mode agents (claude-code, droid) use ~400-500MB RSS per prompt process. With --prompt-concurrency 8, that's 3-4GB per container, making OOM kills likely in Docker (see issue #45).

Prompts

Prompts live in a flat data/prompts/ directory. The format differs by search provider:

  • Builtin mode: just the query (e.g., "What are the best free icon libraries...")
  • MCP mode: "Use {server-name} and answer\n{query}", plus MCP metadata (see the example lines after the table below)
| File | Prompts | Metadata | Use With |
| --- | --- | --- | --- |
| prompts.jsonl | 151 | No MCP | SEARCH_PROVIDER=builtin |
| prompts-you.jsonl | 151 | mcpServer="ydc-server", expectedTools=["you-search"] | SEARCH_PROVIDER=you |
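
For illustration, hypothetical lines from each file (the id field is assumed to match the results format; the query is truncated as in the example above):

{"id":"websearch-001","input":"What are the best free icon libraries..."}
{"id":"websearch-001","input":"Use ydc-server and answer\nWhat are the best free icon libraries...","mcpServer":"ydc-server","expectedTools":["you-search"]}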

The entrypoint automatically selects the correct prompt file based on SEARCH_PROVIDER. To run a random subset, pass PROMPT_COUNT (or --count N via the CLI):

bun run trials -- --count 5    # 5 random prompts from full dataset
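
A minimal sketch of how the entrypoint could draw that subset; the selection logic below is an assumption (only PROMPT_COUNT and the prompt files are given in the source):

// Hypothetical subset selection: Fisher-Yates shuffle the prompt lines,
// keep the first PROMPT_COUNT, and write them to a temp file for the run.
const lines = (await Bun.file('/eval/data/prompts/prompts.jsonl').text())
  .split('\n')
  .filter(Boolean)
for (let i = lines.length - 1; i > 0; i--) {
  const j = Math.floor(Math.random() * (i + 1))
  ;[lines[i], lines[j]] = [lines[j], lines[i]]
}
const count = Number(process.env.PROMPT_COUNT ?? lines.length)
await Bun.write('/tmp/prompts-subset.jsonl', lines.slice(0, count).join('\n') + '\n')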

Results

All trial results are written to flat dated directories:

data/results/YYYY-MM-DD/
├── claude-code/
│   ├── builtin.jsonl
│   └── you.jsonl
├── gemini/
├── droid/
└── codex/

Each .jsonl line is a TrialResult:

{"id":"websearch-001","input":"...","k":5,"passRate":0.8,"passAtK":0.999,"passExpK":0.328,"trials":[...]}

Versioning:

git add data/results/ && git commit -m "feat: trial run YYYY-MM-DD"

Compare runs:

bun run compare                             # Latest date auto-detected
bun run compare -- --run-date 2026-02-18   # Specific date

Adding a New Agent

1. Create Headless Adapter Schema

Create agent-schemas/<agent>.json:

{
  "command": ["<agent-cli>", "--flag", "{input}"],
  "outputEvents": {
    "match": { "path": "$.type" },
    "patterns": {
      "text": { "value": "text" },
      "tool_call": { "value": "tool_call" },
      "tool_result": { "value": "tool_result" }
    }
  },
  "result": {
    "contentPath": "$.output",
    "errorPath": "$.error"
  },
  "mode": "stream",
  "env": ["AGENT_API_KEY"]
}

Key fields:

  • command - CLI invocation with the {input} placeholder
  • outputEvents.match.path - JSONPath to the event type field
  • patterns - Map event types to standard names
  • result.contentPath - JSONPath to extract the final output
  • mode - "stream" (persistent process) or "iterative" (new process per turn)
  • env - Required environment variables

Test schema:

bunx @plaited/agent-eval-harness adapter:check -- \
  bunx @plaited/agent-eval-harness headless --schema agent-schemas/<agent>.json

2. Create Dockerfile

Create docker/<agent>.Dockerfile:

FROM base

# Install agent CLI
RUN npm install -g <agent-cli>

# Copy entrypoint and MCP config
COPY docker/entrypoint /entrypoint
COPY mcp-servers.ts /eval/mcp-servers.ts

RUN chmod +x /entrypoint

CMD ["/entrypoint"]

Verify installation:

docker build -t test-<agent> -f docker/<agent>.Dockerfile .
docker run --rm test-<agent> <agent> --version

3. Add Docker Compose Service

Add a service entry to docker-compose.yml:

<agent>:
  build:
    context: .
    dockerfile: docker/<agent>.Dockerfile
  volumes:
    - ./agent-schemas:/eval/agent-schemas:ro
    - ./data:/eval/data
    - ./scripts:/eval/scripts:ro
  working_dir: /workspace
  env_file: .env
  environment:
    - AGENT=<agent>
    - SEARCH_PROVIDER=${SEARCH_PROVIDER:-builtin}

4. Update TypeScript Entrypoint

Edit docker/entrypoint to add the agent to the configureMcp() function:

const configureMcp = async (agent: string, tool: McpServerKey): Promise<void> => {
  const server = MCP_SERVERS[tool]
  const apiKey = server.auth ? process.env[server.auth.envVar] : undefined

  switch (agent) {
    // ... existing cases ...

    case '<agent>': {
      await $`<agent> mcp add ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
      console.log('✓ Agent MCP server added')
      break
    }
  }
}

If needed, add a timeout in buildTrialsCommand():

switch (AGENT) {
  case '<agent>':
    cmd.push('--timeout', '120000')  // 2 minutes
    break
}

5. Update Scripts

Edit scripts/shared/shared.constants.ts to add the agent to ALL_AGENTS:

export const ALL_AGENTS: Agent[] = ["claude-code", "gemini", "droid", "codex", "<agent>"]

Also update the Agent type in scripts/shared/shared.types.ts:

type Agent = "claude-code" | "gemini" | "droid" | "codex" | "<agent>"

6. Test

docker compose build <agent>
docker compose run --rm -e SEARCH_PROVIDER=builtin <agent>
docker compose run --rm -e SEARCH_PROVIDER=you <agent>

Adding a New MCP Tool

1. Add to mcp-servers.ts

export const MCP_SERVERS = {
  you: { /* ... existing */ },
  exa: {
    name: 'exa-server',
    type: 'http' as const,
    url: 'https://api.exa.ai/mcp',
    auth: {
      type: 'bearer' as const,
      envVar: 'EXA_API_KEY',
    },
  },
} as const

export type McpServerKey = keyof typeof MCP_SERVERS

2. Update docker/entrypoint

Add the new tool's case to configureMcp() for each agent:

case 'claude-code': {
  await $`claude mcp add --transport http ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
  break
}

case 'gemini': {
  await $`gemini mcp add --transport http --header "Authorization: Bearer ${apiKey}" ${server.name} ${server.url}`.quiet()
  break
}

case 'droid': {
  await $`droid mcp add ${server.name} ${server.url} --type http --header "Authorization: Bearer ${apiKey}"`.quiet()
  break
}

case 'codex': {
  const configDir = `${process.env.HOME}/.codex`
  await $`mkdir -p ${configDir}`.quiet()
  const config = `[mcp_servers.${server.name}]
url = "${server.url}"
bearer_token_env_var = "${server.auth?.envVar}"
`
  await Bun.write(`${configDir}/config.toml`, config)
  break
}

3. Update Environment Files

Add the key to .env and .env.example:

EXA_API_KEY=your_api_key_here

4. Generate MCP Prompt Sets

Use the generate-mcp-prompts script to create MCP variant files with proper metadata:

# Generate variants for new MCP server
bun scripts/generate-mcp-prompts.ts --mcp-key exa

# Creates:
# - data/prompts/prompts-exa.jsonl

The script prepends "Use {server-name} and answer\n" to each query and adds MCP metadata (server name and expected tools).

The entrypoint automatically handles provider-specific prompt files:

const promptFile = SEARCH_PROVIDER === "builtin"
  ? `/eval/data/prompts/prompts.jsonl`
  : `/eval/data/prompts/prompts-${SEARCH_PROVIDER}.jsonl`  // e.g., prompts-exa.jsonl

Note: scripts/run-trials.ts automatically picks up new MCP servers from mcp-servers.ts, so no manual updates are needed.

5. Test

docker compose build
docker compose run --rm -e SEARCH_PROVIDER=exa claude-code
bun run trials -- --search-provider exa --count 5 -k 1

Schema Format Reference

Current agent schemas:

| Schema | Agent | Mode | Status |
| --- | --- | --- | --- |
| claude-code.json | Claude Code | stream | ✅ Tested |
| gemini.json | Gemini CLI | iterative | ✅ Tested |
| droid.json | Droid CLI | stream | ✅ Tested |
| codex.json | Codex CLI | stream | ✅ Tested |

Session Modes:

  • stream - Process stays alive; multi-turn conversations via stdin (see the sketch below)
  • iterative - New process per turn; history passed as context
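
A rough sketch of the difference in Bun/TypeScript; the function names and the {input} substitution are illustrative, not the harness API:

import { spawn } from 'bun'

// iterative mode: spawn a fresh process per turn, folding prior history
// into the prompt that replaces the {input} placeholder.
const iterativeTurn = async (command: string[], history: string[], turn: string) => {
  const prompt = [...history, turn].join('\n')
  const proc = spawn(command.map((arg) => (arg === '{input}' ? prompt : arg)))
  return await new Response(proc.stdout).text()
}

// stream mode: one long-lived process (spawned elsewhere with stdin piped);
// each turn is written to stdin and the process survives between turns.
const streamTurn = (proc: { stdin: { write: (chunk: string) => void } }, turn: string) => {
  proc.stdin.write(turn + '\n')
}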

Related Skills