Claude-skill-registry data-refresh-eval

Build and refresh eval datasets from Front, run routing evals, and analyze agent response quality.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-refresh-eval" ~/.claude/skills/majiayu000-claude-skill-registry-data-refresh-eval && rm -rf "$T"
manifest: skills/data/data-refresh-eval/SKILL.md
source content

Data Refresh & Eval Skill

Workflow for keeping the eval dataset fresh and running quality checks on agent responses.

Quick Start

cd ~/Code/skillrecordings/support/packages/cli

# Refresh dataset from Front (last 30 days, 200 responses max)
bun src/index.ts dataset build --since $(date -d "30 days ago" +%Y-%m-%d) --limit 200 --output data/eval-dataset.json

# Run routing eval
bun src/index.ts eval routing data/eval-dataset.json

Dataset Commands

Build fresh dataset

# Recent data (recommended for ongoing work)
bun src/index.ts dataset build --since 2025-01-01 --limit 200 --output data/eval-dataset.json

# App-specific
bun src/index.ts dataset build --app total-typescript --limit 100 --output data/tt-dataset.json

# Include conversation history for context
bun src/index.ts dataset build --since 2025-01-01 --include-history --output data/dataset-with-history.json

# Only labeled responses (good/bad)
bun src/index.ts dataset build --labeled-only --output data/labeled-only.json

Convert to evalite format

bun src/index.ts dataset to-evalite -i data/eval-dataset.json -o data/evalite-format.json

Running Evals

Routing eval (default thresholds)

bun src/index.ts eval routing data/eval-dataset.json

Custom thresholds

bun src/index.ts eval routing data/eval-dataset.json \
  --min-precision 0.95 \
  --min-recall 0.98 \
  --max-fp-rate 0.02 \
  --max-fn-rate 0.01

JSON output for CI/automation

bun src/index.ts eval routing data/eval-dataset.json --json

Response Analysis

Find bad responses for debugging

# List responses rated "bad"
bun src/index.ts responses list --rating bad

# Get details with conversation context
bun src/index.ts responses get <actionId> --context

# Export bad responses for analysis
bun src/index.ts responses export --rating bad -o bad-responses.json

Analyze unrated responses

bun src/index.ts responses list --rating unrated --limit 50

Recommended Workflow

Daily data refresh

cd ~/Code/skillrecordings/support/packages/cli

# 1. Pull fresh data
bun src/index.ts dataset build --since $(date -d "7 days ago" +%Y-%m-%d) --limit 100 --output data/eval-dataset.json

# 2. Check dataset stats
cat data/eval-dataset.json | jq 'length'

# 3. Run eval
bun src/index.ts eval routing data/eval-dataset.json

# 4. Check for failures
bun src/index.ts responses list --rating bad --limit 10

Pre-deploy validation

# 1. Build comprehensive dataset
bun src/index.ts dataset build --since 2025-01-01 --limit 500 --output data/full-dataset.json

# 2. Run eval with strict thresholds
bun src/index.ts eval routing data/full-dataset.json --min-precision 0.95 --min-recall 0.98 --json

# 3. Check exit code
echo "Exit code: $?"

Dataset Schema

Each eval point includes:

  • id
    - Action ID
  • app
    - App slug (total-typescript, aihero, etc.)
  • conversationId
    - Front conversation ID
  • customerEmail
    - Customer email (if available)
  • triggerMessage
    - The inbound message that triggered the response
    • subject
      ,
      body
      ,
      timestamp
  • agentResponse
    - The agent's drafted response
    • text
      ,
      category
      ,
      timestamp
  • label
    - "good" | "bad" | undefined
  • labeledBy
    - Who approved/rejected
  • conversationHistory
    - (optional) Full message history

Environment

Required in

.env.local
:

FRONT_API_TOKEN=          # Front API access
DATABASE_URL=             # Database connection

Troubleshooting

"FRONT_API_TOKEN environment variable required"

source apps/front/.env.local
# or set in .env.local at repo root

Dataset building slowly

Front API rate limits. Use

--limit
to control batch size.

No labeled data

Labels come from HITL approvals/rejections. New responses start unlabeled.