# claude-skill-registry · data-refresh-eval

Build and refresh eval datasets from Front, run routing evals, and analyze agent response quality.

## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-refresh-eval" ~/.claude/skills/majiayu000-claude-skill-registry-data-refresh-eval && rm -rf "$T"
```

Manifest: `skills/data/data-refresh-eval/SKILL.md`
# Data Refresh & Eval Skill
Workflow for keeping the eval dataset fresh and running quality checks on agent responses.
## Quick Start
```bash
cd ~/Code/skillrecordings/support/packages/cli

# Refresh dataset from Front (last 30 days, 200 responses max)
bun src/index.ts dataset build --since $(date -d "30 days ago" +%Y-%m-%d) --limit 200 --output data/eval-dataset.json

# Run routing eval
bun src/index.ts eval routing data/eval-dataset.json
```
## Dataset Commands
### Build fresh dataset
```bash
# Recent data (recommended for ongoing work)
bun src/index.ts dataset build --since 2025-01-01 --limit 200 --output data/eval-dataset.json

# App-specific
bun src/index.ts dataset build --app total-typescript --limit 100 --output data/tt-dataset.json

# Include conversation history for context
bun src/index.ts dataset build --since 2025-01-01 --include-history --output data/dataset-with-history.json

# Only labeled responses (good/bad)
bun src/index.ts dataset build --labeled-only --output data/labeled-only.json
```
### Convert to evalite format
```bash
bun src/index.ts dataset to-evalite -i data/eval-dataset.json -o data/evalite-format.json
```
## Running Evals
### Routing eval (default thresholds)
```bash
bun src/index.ts eval routing data/eval-dataset.json
```
### Custom thresholds
```bash
bun src/index.ts eval routing data/eval-dataset.json \
  --min-precision 0.95 \
  --min-recall 0.98 \
  --max-fp-rate 0.02 \
  --max-fn-rate 0.01
```
### JSON output for CI/automation
```bash
bun src/index.ts eval routing data/eval-dataset.json --json
```
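As a sketch of how this might sit in CI: the pre-deploy workflow below checks `$?`, which implies a failing eval exits non-zero, so the exit code can gate a pipeline step while the JSON report is kept as an artifact. The gate script and report file name here are illustrative, not part of the CLI.

```bash
#!/usr/bin/env bash
# Hypothetical CI gate: assumes `eval routing` exits non-zero when a
# threshold is violated (implied by the pre-deploy workflow below).
set -uo pipefail
cd ~/Code/skillrecordings/support/packages/cli

if bun src/index.ts eval routing data/eval-dataset.json --json > eval-report.json; then
  echo "Routing eval passed"
else
  echo "Routing eval failed; see eval-report.json" >&2
  exit 1
fi
```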
## Response Analysis
### Find bad responses for debugging
```bash
# List responses rated "bad"
bun src/index.ts responses list --rating bad

# Get details with conversation context
bun src/index.ts responses get <actionId> --context

# Export bad responses for analysis
bun src/index.ts responses export --rating bad -o bad-responses.json
```
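If the exported file mirrors the dataset schema documented below (an assumption about the export format), `jq` can show where bad responses cluster, e.g. by `agentResponse.category`:

```bash
# Tally bad responses by category. Assumes bad-responses.json is a JSON
# array whose entries match the dataset schema (agentResponse.category).
jq '[.[] | .agentResponse.category // "uncategorized"]
    | group_by(.) | map({(.[0]): length}) | add' bad-responses.json
```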
### Analyze unrated responses
```bash
bun src/index.ts responses list --rating unrated --limit 50
```
## Recommended Workflow
### Daily data refresh
```bash
cd ~/Code/skillrecordings/support/packages/cli

# 1. Pull fresh data
bun src/index.ts dataset build --since $(date -d "7 days ago" +%Y-%m-%d) --limit 100 --output data/eval-dataset.json

# 2. Check dataset stats
jq 'length' data/eval-dataset.json

# 3. Run eval
bun src/index.ts eval routing data/eval-dataset.json

# 4. Check for failures
bun src/index.ts responses list --rating bad --limit 10
```
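To run this unattended, one option is a thin wrapper script driven by cron; the script path and log location below are illustrative.

```bash
#!/usr/bin/env bash
# daily-refresh.sh - illustrative wrapper around the steps above.
set -euo pipefail
cd ~/Code/skillrecordings/support/packages/cli

bun src/index.ts dataset build --since "$(date -d '7 days ago' +%Y-%m-%d)" --limit 100 --output data/eval-dataset.json
bun src/index.ts eval routing data/eval-dataset.json

# Example crontab entry (09:00 daily):
#   0 9 * * * /path/to/daily-refresh.sh >> ~/eval-refresh.log 2>&1
```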
### Pre-deploy validation
```bash
# 1. Build comprehensive dataset
bun src/index.ts dataset build --since 2025-01-01 --limit 500 --output data/full-dataset.json

# 2. Run eval with strict thresholds
bun src/index.ts eval routing data/full-dataset.json --min-precision 0.95 --min-recall 0.98 --json

# 3. Check exit code
echo "Exit code: $?"
```
## Dataset Schema
Each eval point includes:
- `id`: Action ID
- `app`: App slug (total-typescript, aihero, etc.)
- `conversationId`: Front conversation ID
- `customerEmail`: Customer email (if available)
- `triggerMessage`: The inbound message that triggered the response (`subject`, `body`, `timestamp`)
- `agentResponse`: The agent's drafted response (`text`, `category`, `timestamp`)
- `label`: `"good" | "bad" | undefined`
- `labeledBy`: Who approved/rejected
- `conversationHistory`: (optional) Full message history
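A quick way to sanity-check the shape with `jq` (field names as listed above; that the file is a flat JSON array of these objects is an assumption, consistent with the `jq 'length'` check in the daily workflow):

```bash
# Peek at the first entry's top-level fields
jq '.[0] | {id, app, conversationId, label, labeledBy}' data/eval-dataset.json

# How much of the dataset is labeled
jq '{total: length, labeled: [.[] | select(.label)] | length}' data/eval-dataset.json
```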
## Environment
Required in `.env.local`:

```bash
FRONT_API_TOKEN=   # Front API access
DATABASE_URL=      # Database connection
```
## Troubleshooting
"FRONT_API_TOKEN environment variable required"
```bash
source apps/front/.env.local  # or set in .env.local at repo root
```
### Dataset builds slowly
Builds are throttled by Front API rate limits. Use `--limit` to control batch size.
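One way to soften the hit is to split the refresh into smaller per-app builds (the `--app` flag is shown above); the app slugs and pause length here are illustrative.

```bash
# Smaller per-app builds with breathing room between Front API bursts
for app in total-typescript aihero; do
  bun src/index.ts dataset build --app "$app" --limit 50 --output "data/${app}-dataset.json"
  sleep 30  # illustrative pause between batches
done
```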
### No labeled data
Labels come from HITL (human-in-the-loop) approvals/rejections. New responses start unlabeled.