# Skillgrade · skillgrade-setup
Sets up and runs skillgrade evaluation pipelines for Agent Skills. Use when initializing eval configurations, running trials, reviewing results, or integrating with CI. Don't use for writing grader scripts, general test authoring, or non-agentic documentation.
## Install
**Source** · Clone the upstream repo:

```sh
git clone https://github.com/mgechev/skillgrade
```

**Claude Code** · Install into `~/.claude/skills/`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/mgechev/skillgrade "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/skillgrade-setup" ~/.claude/skills/mgechev-skillgrade-skillgrade-setup && rm -rf "$T"
```
**Manifest**: `skills/skillgrade-setup/SKILL.md`
## Skillgrade Evaluation Setup
### Procedures
#### Step 1: Install Skillgrade
- Verify Node.js 20+ and Docker are available.
- Run `npm i -g skillgrade` to install the CLI globally.
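A minimal pre-flight sketch covering only the requirements named above (Node.js 20+, Docker) and the install command from this step:

```sh
# Confirm prerequisites, then install the skillgrade CLI globally.
node --version                    # should report v20 or newer
docker info > /dev/null 2>&1 && echo "Docker is running" || echo "Start Docker first"
npm i -g skillgrade
```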
#### Step 2: Initialize an Eval Configuration
- Navigate to the skill directory (must contain a `SKILL.md`).
- Set the appropriate API key environment variable (`GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY`).
- Run `skillgrade init` to generate an `eval.yaml` with AI-powered tasks and graders.
- If an `eval.yaml` already exists, pass `--force` to overwrite: `skillgrade init --force`.
- Without an API key, a well-commented template is generated instead.
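A typical session, assuming an Anthropic key; the directory name and key value are placeholders, and the commands are the ones listed in this step:

```sh
cd my-skill                          # hypothetical skill directory containing SKILL.md
export ANTHROPIC_API_KEY=sk-ant-...  # placeholder; any one supported provider key works
skillgrade init                      # generates eval.yaml with AI-powered tasks and graders
# If an eval.yaml already exists and you want to regenerate it:
skillgrade init --force
```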
#### Step 3: Configure `eval.yaml`
- Read `references/eval-yaml-spec.md` for the full configuration schema.
- Define one or more tasks under the `tasks:` key. Each task requires:
  - `name`: unique task identifier
  - `instruction`: what the agent should accomplish
  - `workspace`: files to copy into the evaluation container
  - `graders`: one or more scoring mechanisms (see the `skillgrade-graders` skill)
- Optionally configure `defaults:` for agent, provider, trials, timeout, and threshold (a minimal example follows below).
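As a rough illustration of the keys described above (not an excerpt of `references/eval-yaml-spec.md`), an `eval.yaml` might look like the following; the instruction text, workspace path, grader shape, and default values are invented for the example, so check the exact syntax against the spec:

```yaml
# Illustrative sketch only; confirm field syntax against references/eval-yaml-spec.md.
defaults:
  agent: claude        # agents named in this document: gemini, claude, codex
  provider: docker     # or local
  trials: 15
  timeout: 600         # assumed to be seconds
  threshold: 0.8       # matches the default pass-rate threshold noted in Step 6

tasks:
  - name: fix-linting                     # unique task identifier
    instruction: Fix every linting error without changing runtime behavior.
    workspace: ./fixtures/fix-linting     # files copied into the evaluation container
    graders:                              # scoring mechanisms; see the skillgrade-graders skill
      - type: deterministic               # grader shape here is an assumption
      - type: llm_rubric
```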
#### Step 4: Run Evaluations
- Select an appropriate preset based on the evaluation goal:
  - `--smoke` (5 trials): Quick capability check.
  - `--reliable` (15 trials): Reliable pass rate estimate.
  - `--regression` (30 trials): High-confidence regression detection.
- Run the evaluation: `skillgrade --smoke`.
- Run a specific eval by name: `skillgrade --eval=fix-linting`.
- Run multiple evals: `skillgrade --eval=fix-linting,write-tests`.
- Run only deterministic graders (skip LLM calls): `skillgrade --grader=deterministic`.
- Run only LLM rubric graders: `skillgrade --grader=llm_rubric`.
- The agent is auto-detected from the API key. Override with `--agent=gemini|claude|codex`.
- Override the provider with `--provider=docker|local`.
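A few runs combining the flags above; every flag appears in this step, but combining them in a single invocation is an assumption:

```sh
# Quick capability check (5 trials) with the agent auto-detected from the API key
skillgrade --smoke

# Reliable pass-rate estimate for two named evals, deterministic graders only;
# combining these flags is assumed to be supported
skillgrade --reliable --eval=fix-linting,write-tests --grader=deterministic

# High-confidence regression run, forcing the agent and skipping Docker
skillgrade --regression --agent=claude --provider=local
```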
#### Step 5: Review Results
- Run `skillgrade preview` for a CLI report.
- Run `skillgrade preview browser` to open the web UI at `http://localhost:3847`.
- Reports are saved to `$TMPDIR/skillgrade/<skill-name>/results/`. Override with `--output=DIR`.
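The review commands in sequence; attaching `--output` to the run command (rather than to `preview`) is an assumption:

```sh
skillgrade preview            # CLI report
skillgrade preview browser    # web UI at http://localhost:3847

# Write reports somewhere other than $TMPDIR
skillgrade --smoke --output=./skillgrade-results
```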
#### Step 6: Integrate with CI
- Add a GitHub Actions step that installs skillgrade, navigates to the skill directory, and runs with `--regression --ci --provider=local` (a sketch follows below).
- Use `--provider=local` in CI; the runner is already an ephemeral sandbox, so Docker adds overhead without benefit.
- The `--ci` flag causes a non-zero exit code if the pass rate falls below `--threshold` (default: 0.8).
- Read `references/ci-example.md` for a complete workflow template.
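A hedged sketch of such a step; this is not the template from `references/ci-example.md`, the job scaffolding, skill path, and secret name are invented, and only the install command and the `--regression --ci --provider=local` flags come from this document:

```yaml
# Workflow fragment (omits the top-level name: and on: keys)
jobs:
  skill-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run skillgrade regression evals
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}  # example provider secret
        run: |
          npm i -g skillgrade
          cd skills/my-skill                                   # hypothetical skill path
          skillgrade --regression --ci --provider=local
```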
### Error Handling
- If `skillgrade init` fails with "No SKILL.md found," verify the current directory contains a valid `SKILL.md` file.
- If evaluation hangs, check Docker is running and the container has network access for API calls.
- If all trials fail with "No API key," ensure the environment variable is exported, not just set inline for a different command.
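To illustrate the last point, an inline assignment only applies to the single command it prefixes, so later `skillgrade` runs will not see it; the key value below is a placeholder:

```sh
# Inline assignment: visible only to this one command
ANTHROPIC_API_KEY=sk-ant-... skillgrade init     # later runs will NOT see the key

# Exported: visible to every subsequent command in the shell session
export ANTHROPIC_API_KEY=sk-ant-...
skillgrade --smoke
```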