pinchbench

Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.

Install

# Clone the upstream repo
git clone https://github.com/aiskillstore/marketplace

# Claude Code: install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiskillstore/marketplace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/pinchbench/pinchbench" ~/.claude/skills/aiskillstore-marketplace-pinchbench && rm -rf "$T"

Manifest: skills/pinchbench/pinchbench/SKILL.md
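
To verify the install landed where the copy command above put it:

# Confirm the skill directory exists
ls ~/.claude/skills/aiskillstore-marketplace-pinchbench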

PinchBench Benchmark Skill

PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Results are collected on a public leaderboard at pinchbench.com.

Prerequisites

  • Python 3.10+
  • uv package manager (install command below)
  • OpenClaw instance (this agent)
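
If uv is not already installed, Astral's standalone installer is the quickest route on macOS and Linux (see docs.astral.sh/uv for other options):

# Install uv (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh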

Quick Start

cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload

Available Tasks (23)

Task · Category · Description
task_00_sanity · Basic · Verify agent works
task_01_calendar · Productivity · Calendar event creation
task_02_stock · Research · Stock price lookup
task_03_blog · Writing · Blog post creation
task_04_weather · Coding · Weather script
task_05_summary · Analysis · Document summarization
task_06_events · Research · Conference research
task_07_email · Writing · Email drafting
task_08_memory · Memory · Context retrieval
task_09_files · Files · File structure creation
task_10_workflow · Integration · Multi-step API workflow
task_11_clawdhub · Skills · ClawHub interaction
task_12_skill_search · Skills · Skill discovery
task_13_image_gen · Creative · Image generation
task_14_humanizer · Writing · Text humanization
task_15_daily_summary · Productivity · Daily digest
task_16_email_triage · Email · Inbox triage
task_17_email_search · Email · Email search
task_18_market_research · Research · Market analysis
task_19_spreadsheet_summary · Analysis · Spreadsheet analysis
task_20_eli5_pdf_summary · Analysis · PDF simplification
task_21_openclaw_comprehension · Knowledge · OpenClaw docs comprehension
task_22_second_brain · Memory · Knowledge management

Command Line Options

Option · Description
--model · Model identifier (e.g., anthropic/claude-sonnet-4)
--suite · all, automated-only, or comma-separated task IDs
--output-dir · Results directory (default: results/)
--timeout-multiplier · Scale task timeouts for slower models
--runs · Number of runs per task for averaging
--no-upload · Skip uploading to leaderboard
--register · Request new API token for submissions
--upload FILE · Upload previous results JSON
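
These flags compose. For example, assuming --timeout-multiplier accepts a plain numeric factor, this averages three runs per task with doubled timeouts and keeps results local:

# Three runs per task, 2x timeouts, no leaderboard upload
uv run benchmark.py --model anthropic/claude-sonnet-4 --runs 3 --timeout-multiplier 2 --no-upload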

Token Registration

To submit results to the leaderboard:

# Register for an API token (one-time)
uv run benchmark.py --register

# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4

Results

Results are saved as JSON in the output directory:

# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json

# Show failed tasks
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json

# Calculate overall score
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json
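
For anything beyond a one-liner, the same files are easy to process in Python. A minimal sketch, assuming each results file carries the tasks[].task_id and tasks[].grading.mean fields that the jq filters above rely on:

import json
from pathlib import Path

# Print the overall average and the failing tasks for every results file
for path in sorted(Path("results").glob("*.json")):
    data = json.loads(path.read_text())
    scores = [t["grading"]["mean"] for t in data["tasks"]]
    failed = [t["task_id"] for t in data["tasks"] if t["grading"]["mean"] < 0.5]
    print(f"{path.name}: average={sum(scores) / len(scores):.3f}, failed={failed}")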

Adding Custom Tasks

Create a markdown file in tasks/ following TASK_TEMPLATE.md. Each task needs:

  • YAML frontmatter (id, name, category, grading_type, timeout)
  • Prompt section
  • Expected behavior
  • Grading criteria
  • Automated checks (Python grading function; sketch after this list)
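
TASK_TEMPLATE.md defines the real grading contract; the sketch below is only a hypothetical illustration of what an automated check might look like (the grade name, its signature, and the result fields are assumptions, not the template's actual interface):

# Hypothetical grading function; see TASK_TEMPLATE.md for the actual contract
def grade(result: dict) -> float:
    """Return a score in [0, 1] for one task run."""
    output = result.get("output", "")  # assumed field holding the agent's final output
    return 1.0 if "event created" in output.lower() else 0.0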

Leaderboard

View results at pinchbench.com. The leaderboard shows:

  • Model rankings by overall score
  • Per-task breakdowns
  • Historical performance trends