# Benchmark Manager
Create and manage AILANG eval benchmarks. Use when user asks to create benchmarks, fix benchmark issues, debug failing benchmarks, or analyze benchmark results.
## Installation

```bash
# Clone the full registry:
git clone https://github.com/majiayu000/claude-skill-registry

# Or install just this skill into ~/.claude/skills in one step:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/benchmark-manager" ~/.claude/skills/majiayu000-claude-skill-registry-benchmark-manager && rm -rf "$T"
```
This skill (`skills/data/benchmark-manager/SKILL.md`) manages AILANG evaluation benchmarks with correct prompt integration, debugging workflows, and best practices learned from real benchmark failures.
## Quick Start
Debugging a failing benchmark:
```bash
# 1. Show the full prompt that models see
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse

# 2. Test a benchmark with a specific model
ailang eval-suite --models claude-haiku-4-5 --benchmarks json_parse

# 3. Check benchmark YAML for common issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/json_parse.yml
```
## When to Use This Skill
Invoke this skill when:
- User asks to create a new benchmark
- User asks to debug/fix a failing benchmark
- User wants to understand why models generate wrong code
- User asks about benchmark YAML format
- Benchmarks show 0% pass rate despite language support
## CRITICAL: `prompt` vs `task_prompt`
This is the most important concept for benchmark management.
### The Problem (v0.4.8 Discovery)
Benchmarks have TWO different prompt fields with VERY different behavior:
| Field | Behavior | Use When |
|---|---|---|
| `prompt:` | REPLACES the teaching prompt entirely | Testing raw model capability (rare) |
| `task_prompt:` | APPENDS to the teaching prompt | Normal benchmarks (99% of cases) |
### Why This Matters
```yaml
# BAD - Model never sees AILANG syntax!
prompt: |
  Write a program that prints "Hello"

# GOOD - Model sees teaching prompt + task
task_prompt: |
  Write a program that prints "Hello"
```
With `prompt:`, models generate Python/pseudo-code because they never learn AILANG syntax.
### How Prompts Combine
From `internal/eval_harness/spec.go` (lines 91-93):

```go
fullPrompt := basePrompt // Teaching prompt from prompts/v0.4.x.md
if s.TaskPrompt != "" {
	fullPrompt = fullPrompt + "\n\n## Task\n\n" + s.TaskPrompt
}
```
The teaching prompt teaches AILANG syntax; `task_prompt` adds the specific task.
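The combination is easy to reproduce by hand when debugging. A rough shell equivalent of the Go logic above (illustrative only; the teaching-prompt path and version are examples, not fixed locations):

```bash
# Sketch of the harness behavior, not the real implementation
base=$(cat prompts/v0.4.8.md)   # teaching prompt (example path/version)
task='Write a program that prints "Hello"'
printf '%s\n\n## Task\n\n%s\n' "$base" "$task"
```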
## Available Scripts
### scripts/show_full_prompt.sh

Shows the complete prompt that models receive for a benchmark.
Usage:
```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh <benchmark_id>

# Example:
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse
```
### scripts/check_benchmark.sh

Validates a benchmark YAML file for common issues.
Usage:
```bash
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/<name>.yml
```
Checks for:
- Using `prompt:` instead of `task_prompt:` (warning)
- Missing required fields
- Invalid capability names
- Syntax errors in YAML
### scripts/test_benchmark.sh

Runs a quick single-model test of a benchmark.
Usage:
```bash
.claude/skills/benchmark-manager/scripts/test_benchmark.sh <benchmark_id> [model]

# Examples:
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse claude-haiku-4-5
```
## Benchmark YAML Format

### Required Fields
```yaml
id: my_benchmark                 # Unique identifier (snake_case)
description: "Short description of what this tests"
languages: ["python", "ailang"]
entrypoint: "main"               # Function to call
caps: ["IO"]                     # Required capabilities
difficulty: "easy|medium|hard"
expected_gain: "low|medium|high"

task_prompt: |                   # ALWAYS use task_prompt, not prompt!
  Write a program in <LANG> that:
  1. Does something
  2. Prints the result

  Output only the code, no explanations.

expected_stdout: |               # Exact expected output
  expected output here
```
### Capability Names

Valid capabilities: `IO`, `FS`, `Clock`, `Net`
```yaml
# File I/O
caps: ["IO"]

# HTTP requests
caps: ["Net", "IO"]

# File system operations
caps: ["FS", "IO"]
```
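To survey which capabilities existing benchmarks actually declare, a quick grep over the suite can help (a sketch; assumes single-line `caps:` entries like the examples above):

```bash
# Count each distinct caps declaration across all benchmarks
grep -h '^caps:' benchmarks/*.yml | sort | uniq -c | sort -rn
```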
## Creating New Benchmarks

### Step 1: Determine Requirements
- What language feature/capability is being tested?
- Can models solve this with just the teaching prompt?
- What's the expected output?
### Step 2: Write the Benchmark
```yaml
id: my_new_benchmark
description: "Test feature X capability"
languages: ["python", "ailang"]
entrypoint: "main"
caps: ["IO"]
difficulty: "medium"
expected_gain: "medium"

task_prompt: |
  Write a program in <LANG> that:
  1. Clear description of task
  2. Another step
  3. Print the result

  Output only the code, no explanations.

expected_stdout: |
  exact expected output
```
### Step 3: Validate and Test
```bash
# Check for issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/my_new_benchmark.yml

# Test with cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_new_benchmark
```
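Results are written to `eval_results/` (see Notes at the end); a quick peek at the most recent files (the exact file layout may vary by version):

```bash
# List the newest result files first
ls -t eval_results/ | head -5
```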
## Debugging Failing Benchmarks

### Symptom: 0% Pass Rate Despite Language Support
Check 1: Is it using `task_prompt:`?

```bash
grep -E "^prompt:" benchmarks/failing_benchmark.yml
# If this returns a match, change to task_prompt:
```
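If the field name is the only problem, a one-line rename fixes it (GNU sed shown; on macOS/BSD use `sed -i ''`, and review the diff before committing):

```bash
# Rename the field in place; assumes prompt: starts at column 1
sed -i 's/^prompt:/task_prompt:/' benchmarks/failing_benchmark.yml
```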
Check 2: What prompt do models see?
```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh failing_benchmark
```
Check 3: Is the teaching prompt up to date?
```bash
# After editing prompts/v0.x.x.md, you MUST rebuild:
make quick-install
```
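To confirm the rebuild took effect, dump the full prompt again and check that your edits appear (uses the script above; the `head` count is arbitrary):

```bash
# The first lines should reflect the edited teaching prompt
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh failing_benchmark | head -40
```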
### Symptom: Models Copy Template Instead of Solving Task
The teaching prompt includes a template structure. If models copy it verbatim:
- Make sure the task is clearly different from the examples in the teaching prompt
- Check that `task_prompt` explicitly describes what to do
- Consider whether the task description is ambiguous
### Symptom: compile_error on Valid Syntax
Common AILANG-specific issues models get wrong:
- print expects a string: use `print(show(42))`, not `print(42)`
- No `%` operator
- Wrong keyword
- No `for` loops
If models consistently make these mistakes, the teaching prompt needs improvement (use prompt-manager skill).
## Common Mistakes
### 1. Using `prompt:` Instead of `task_prompt:`

```yaml
# WRONG - Models never see AILANG syntax
prompt: |
  Write code that...

# CORRECT - Teaching prompt + task
task_prompt: |
  Write code that...
```
### 2. Forgetting to Rebuild After Prompt Changes

```bash
# After editing prompts/v0.x.x.md:
make quick-install  # REQUIRED!
```
### 3. Putting Hints in Benchmarks

```yaml
# WRONG - Hints in benchmark
task_prompt: |
  Write code that prints 42.
  Hint: Use print(show(42)) in AILANG.

# CORRECT - No hints; if models fail, fix the teaching prompt
task_prompt: |
  Write code that prints 42.
```
If models need AILANG-specific hints, the teaching prompt is incomplete. Use the prompt-manager skill to fix it.
### 4. Testing Too Many Models at Once

```bash
# WRONG - Expensive and slow for debugging
ailang eval-suite --full --benchmarks my_test

# CORRECT - Use one cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_test
```
## Resources

### Reference Guide

See `resources/reference.md` for:
- Complete list of valid benchmark fields
- Capability reference
- Example benchmarks
### Related Skills
- prompt-manager: When benchmark failures indicate teaching prompt issues
- eval-analyzer: For analyzing results across many benchmarks
- use-ailang: For writing correct AILANG code
## Notes
- Benchmarks live in the `benchmarks/` directory
- Eval results go to the `eval_results/` directory
- The teaching prompt is embedded in the binary; rebuild after changes
- Use the `<LANG>` placeholder in `task_prompt`; it's replaced with "AILANG" or "Python"
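A rough illustration of that last substitution (the harness performs it internally; this sed one-liner is for intuition only):

```bash
# The same task_prompt yields one prompt per target language
task='Write a program in <LANG> that prints the result.'
echo "$task" | sed 's/<LANG>/AILANG/'   # -> ... in AILANG ...
echo "$task" | sed 's/<LANG>/Python/'   # -> ... in Python ...
```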