pinchbench
Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.
install
source · Clone the upstream repo
git clone https://github.com/aiskillstore/marketplace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/aiskillstore/marketplace "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/pinchbench/pinchbench" ~/.claude/skills/aiskillstore-marketplace-pinchbench \
  && rm -rf "$T"
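To confirm the skill landed, list the install directory (the path comes from the command above):

```bash
ls ~/.claude/skills/aiskillstore-marketplace-pinchbench
```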
manifest:
skills/pinchbench/pinchbench/SKILL.md
PinchBench Benchmark Skill
PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Results are collected on a public leaderboard at pinchbench.com.
Prerequisites
- Python 3.10+
- uv package manager
- OpenClaw instance (this agent)
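If uv is not already installed, the uv project's standalone installer works on most systems:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```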
Quick Start
```bash
cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload
```
Available Tasks (23)
| Category | Description |
|---|---|
| Basic | Verify agent works |
| Productivity | Calendar event creation |
| Research | Stock price lookup |
| Writing | Blog post creation |
| Coding | Weather script |
| Analysis | Document summarization |
| Research | Conference research |
| Writing | Email drafting |
| Memory | Context retrieval |
| Files | File structure creation |
| Integration | Multi-step API workflow |
| Skills | ClawHub interaction |
| Skills | Skill discovery |
| Creative | Image generation |
| Writing | Text humanization |
| Productivity | Daily digest |
| | Inbox triage |
| | Email search |
| Research | Market analysis |
| Analysis | Spreadsheet analysis |
| Analysis | PDF simplification |
| Knowledge | OpenClaw docs comprehension |
| Memory | Knowledge management |

Task IDs (e.g., `task_01_calendar`, `task_02_stock`) are the values passed to `--suite`.
Command Line Options
| Option | Description |
|---|---|
| `--model` | Model identifier (e.g., `anthropic/claude-sonnet-4`) |
| `--suite` | Task suite: `automated-only` or comma-separated task IDs |
| | Results directory (default: `results`) |
| | Scale task timeouts for slower models |
| | Number of runs per task for averaging |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request new API token for submissions |
| | Upload previous results JSON |
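For instance, combining the documented flags to run a single task locally without uploading:

```bash
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar --no-upload
```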
Token Registration
To submit results to the leaderboard:
```bash
# Register for an API token (one-time)
uv run benchmark.py --register

# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4
```
Results
Results are saved as JSON in the output directory:
```bash
# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json

# Show failed tasks
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json

# Calculate overall score
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json
```
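Judging from the queries above, each results file contains a `tasks` array whose entries carry a `task_id` and a `grading.mean` score; a minimal sketch of the assumed shape (values illustrative):

```json
{
  "tasks": [
    { "task_id": "task_01_calendar", "grading": { "mean": 0.92 } },
    { "task_id": "task_02_stock", "grading": { "mean": 0.45 } }
  ]
}
```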
Adding Custom Tasks
Create a markdown file in `tasks/` following `TASK_TEMPLATE.md`. Each task needs (see the sketch after this list):
- YAML frontmatter (id, name, category, grading_type, timeout)
- Prompt section
- Expected behavior
- Grading criteria
- Automated checks (Python grading function)
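A hypothetical example, assuming the section headings mirror the list above; the exact frontmatter fields and the grading-function signature (the `grade(transcript)` shape below is illustrative) are defined by `TASK_TEMPLATE.md`, not confirmed here:

````markdown
---
id: task_99_hello
name: Hello check
category: Basic
grading_type: automated
timeout: 60
---

## Prompt
Greet the user and state today's date.

## Expected Behavior
The agent replies with a greeting that includes the current date.

## Grading Criteria
Full credit if the reply contains both a greeting and a date.

## Automated Checks
```python
def grade(transcript: str) -> float:
    # Illustrative grading logic: check for a greeting and any digit.
    has_greeting = "hello" in transcript.lower()
    has_date = any(ch.isdigit() for ch in transcript)
    return 1.0 if has_greeting and has_date else 0.0
```
````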
Leaderboard
View results at pinchbench.com. The leaderboard shows:
- Model rankings by overall score
- Per-task breakdowns
- Historical performance trends