Skillgrade · skillgrade-setup

Sets up and runs skillgrade evaluation pipelines for Agent Skills. Use when initializing eval configurations, running trials, reviewing results, or integrating with CI. Don't use for writing grader scripts, general test authoring, or non-agentic documentation.

Install

Source · Clone the upstream repo:

  git clone https://github.com/mgechev/skillgrade

Claude Code · Install into ~/.claude/skills/:

  T=$(mktemp -d) && git clone --depth=1 https://github.com/mgechev/skillgrade "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/skillgrade-setup" ~/.claude/skills/mgechev-skillgrade-skillgrade-setup && rm -rf "$T"

Manifest: skills/skillgrade-setup/SKILL.md

Source content

Skillgrade Evaluation Setup

Procedures

Step 1: Install Skillgrade

  1. Verify Node.js 20+ and Docker are available (a preflight sketch follows this list).
  2. Run npm i -g skillgrade to install the CLI globally.
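A preflight sketch using standard commands (nothing here is specific to skillgrade):

  # Confirm prerequisites before installing.
  node --version                                   # should print v20.x or newer
  docker info > /dev/null 2>&1 && echo "Docker daemon is running"

  # Install the CLI globally.
  npm i -g skillgrade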

Step 2: Initialize an Eval Configuration

  1. Navigate to the skill directory (it must contain a SKILL.md).
  2. Set the appropriate API key environment variable (GEMINI_API_KEY, ANTHROPIC_API_KEY, or OPENAI_API_KEY).
  3. Run skillgrade init to generate an eval.yaml with AI-powered tasks and graders (a worked sequence follows this list).
  4. If an eval.yaml already exists, pass --force to overwrite: skillgrade init --force.
  5. Without an API key, a well-commented template is generated instead.
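The worked sequence as a minimal sketch; the skill path and key value are placeholders, and any of the three supported key variables works:

  cd path/to/my-skill                   # directory containing SKILL.md
  export ANTHROPIC_API_KEY=sk-ant-...   # placeholder; use a real key
  skillgrade init                       # writes eval.yaml with AI-powered tasks and graders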

Step 3: Configure eval.yaml

  1. Read references/eval-yaml-spec.md for the full configuration schema.
  2. Define one or more tasks under the tasks: key. Each task requires:
    • name: unique task identifier
    • instruction: what the agent should accomplish
    • workspace: files to copy into the evaluation container
    • graders: one or more scoring mechanisms (see the skillgrade-graders skill)
  3. Optionally configure defaults: for agent, provider, trials, timeout, and threshold. A sketch of the overall file shape follows this list.
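A minimal sketch of that file shape. The top-level keys and required task fields are as listed above; the exact value formats (workspace entries, grader blocks, timeout units) are assumptions to verify against references/eval-yaml-spec.md:

  defaults:
    agent: claude            # gemini | claude | codex
    provider: docker         # docker | local
    trials: 15
    timeout: 300             # assumed to be seconds; confirm in the spec
    threshold: 0.8           # minimum pass rate

  tasks:
    - name: fix-linting                 # unique task identifier
      instruction: Fix all lint errors in src/ without changing behavior.
      workspace:
        - fixtures/lint-project/        # copied into the evaluation container
      graders:
        - type: deterministic           # grader config: see the skillgrade-graders skill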

Step 4: Run Evaluations

  1. Select an appropriate preset based on the evaluation goal:
    • --smoke (5 trials): quick capability check.
    • --reliable (15 trials): reliable pass-rate estimate.
    • --regression (30 trials): high-confidence regression detection.
  2. Run the evaluation: skillgrade --smoke.
  3. Run a specific eval by name: skillgrade --eval=fix-linting.
  4. Run multiple evals: skillgrade --eval=fix-linting,write-tests.
  5. Run only deterministic graders (skipping LLM calls): skillgrade --grader=deterministic.
  6. Run only LLM rubric graders: skillgrade --grader=llm_rubric.
  7. The agent is auto-detected from the API key. Override with --agent=gemini|claude|codex.
  8. Override the provider with --provider=docker|local. A combined invocation is sketched after this list.
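A combined invocation, assuming these flags can be used together in one run (the eval names are the illustrative ones from above):

  # Quick 5-trial check, deterministic graders only (no LLM calls).
  skillgrade --smoke --grader=deterministic

  # 30-trial regression run of two named evals with explicit agent and provider.
  skillgrade --regression --eval=fix-linting,write-tests --agent=claude --provider=docker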

Step 5: Review Results

  1. Run skillgrade preview for a CLI report.
  2. Run skillgrade preview browser to open the web UI at http://localhost:3847.
  3. Reports are saved to $TMPDIR/skillgrade/<skill-name>/results/. Override with --output=DIR. A run-then-review sequence is sketched below.
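The sequence, using only commands from the steps above:

  skillgrade --smoke             # results land in $TMPDIR/skillgrade/<skill-name>/results/
  skillgrade preview             # terminal report
  skillgrade preview browser     # web UI at http://localhost:3847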

Step 6: Integrate with CI

  1. Add a GitHub Actions step that installs skillgrade, navigates to the skill directory, and runs with --regression --ci --provider=local. A workflow sketch follows this list.
  2. Use --provider=local in CI: the runner is already an ephemeral sandbox, so Docker adds overhead without benefit.
  3. The --ci flag causes a non-zero exit code if the pass rate falls below --threshold (default: 0.8).
  4. Read references/ci-example.md for a complete workflow template.
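A minimal workflow sketch, assuming the skill lives at skills/my-skill and an Anthropic key is stored as a repository secret; treat references/ci-example.md as the authoritative template:

  name: skill-evals
  on: [pull_request]

  jobs:
    evaluate:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - uses: actions/setup-node@v4
          with:
            node-version: 20
        - name: Install skillgrade
          run: npm i -g skillgrade
        - name: Run evaluations
          working-directory: skills/my-skill            # hypothetical path
          env:
            ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          run: skillgrade --regression --ci --provider=local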

Error Handling

  • If skillgrade init fails with "No SKILL.md found," verify the current directory contains a valid SKILL.md file.
  • If evaluation hangs, check Docker is running and the container has network access for API calls.
  • If all trials fail with "No API key," ensure the environment variable is exported, not just set inline for a different command; the distinction is illustrated below.
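An illustration of that last pitfall (the key value is a placeholder):

  # Inline assignment applies only to the single command it prefixes:
  ANTHROPIC_API_KEY=sk-ant-... skillgrade init
  skillgrade --smoke            # fails with "No API key": the variable is gone

  # Exporting makes the key available to every subsequent command:
  export ANTHROPIC_API_KEY=sk-ant-...
  skillgrade --smoke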