Claude-skill-registry finetune-generate

Use when generating synthetic training data for multi-turn conversation fine-tuning. Triggers: design artifacts are ready, you need to generate conversations, or you are ready to assess quality. Requires finetune-design first.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/finetune-generate" ~/.claude/skills/majiayu000-claude-skill-registry-finetune-generate && rm -rf "$T"
manifest: skills/data/finetune-generate/SKILL.md

Fine-tune Generate

Iteratively generate and filter training data until quality stabilizes.

Prerequisites

Complete finetune-design first. You need:

  • Model choice and token constraints
  • Input taxonomy
  • Evaluation rubric with calibration examples
  • Persona template
  • User simulator, assistant, and system prompts

Outputs

By the end of this phase, you will have:

  • training_data.jsonl: Filtered, sliced training examples
  • generation_stats.md: Pass rates, criterion breakdown, iterations
  • prompt_versions/: History of prompt iterations

The Core Loop

This is the most important part of the entire pipeline.

┌─────────────────────────────────────────────────────────────┐
│  TIGHT LOOP (5 transcripts per iteration)                   │
│                                                             │
│  1. Generate 5 transcripts                                  │
│  2. Assess with rubric (all backends)                       │
│  3. HUMAN REVIEWS both transcripts AND assessments          │
│  4. Iterate based on human judgment                         │
│  5. Repeat until ≥70% pass rate AND human satisfied         │
│                                                             │
│  Then: Scale to full volume                                 │
└─────────────────────────────────────────────────────────────┘

Why 5 Transcripts?

  • Small enough for a human to actually READ each one carefully
  • Fast feedback (minutes, not hours)
  • See patterns without wasting compute
  • Iterate while context is fresh

Why Human-in-the-Loop? (Non-Negotiable)

Human review is required, not optional. The human reviews BOTH transcripts AND assessment results:

Human reviews...    | Looking for...
--------------------|---------------------------------------------
Transcripts         | Quality issues the rubric might miss
Assessment results  | False positives (passed but shouldn't have)
Assessment results  | False negatives (failed but seems fine)
Both together       | Gaps in what the rubric even checks

Without human review:

  • You're optimizing against a potentially broken metric
  • False positives silently corrupt training data
  • Rubric blind spots never get discovered
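
One way to make the review concrete is to record a human pass/fail verdict per transcript and compare it with the assessor's verdict. The sketch below is illustrative only: the function name and the dictionary shapes are assumptions, not part of the skill's scripts.

```python
def review_mismatches(rubric_pass: dict, human_pass: dict) -> tuple:
    """Compare assessor verdicts with human verdicts for one batch.

    Both arguments map a transcript id to True (passed) or False (failed).
    Returns (false_positives, false_negatives) as sorted lists of ids.
    """
    # Passed the rubric, but the human reviewer says it should not have.
    false_positives = sorted(
        t for t, ok in rubric_pass.items() if ok and not human_pass[t]
    )
    # Failed the rubric, but the human reviewer thinks it is fine.
    false_negatives = sorted(
        t for t, ok in rubric_pass.items() if not ok and human_pass[t]
    )
    return false_positives, false_negatives


rubric = {"t1": True, "t2": True, "t3": False}
human = {"t1": True, "t2": False, "t3": True}
fp, fn = review_mismatches(rubric, human)
print("false positives:", fp, "false negatives:", fn)
```

False positives point at missing or too-lenient criteria; false negatives point at criteria wording that is too strict.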

Red Flags: Rationalizations to Resist

Rationalization | Reality
----------------|--------
"Human review slows us down" | Skipping review = optimizing against broken metric. 1 hour of review saves days of bad data.
"Pass rate is high, must be fine" | High pass rate with single backend misses 20-30% of issues. Multi-backend + human review required.
"We can add calibration examples later" | Without calibration examples, backends disagree silently. Add them during design.
"The rubric is complete" | Rubrics evolve (e.g., 12→18 criteria). New failure modes emerge.
"One assessor backend is enough" | Single backend gave transcript 1000 a perfect 1.0; other backends caught 4 failures.
"Let's just scale and filter later" | Scaling before 70% pass rate wastes compute. Fix prompts first.

If you catch yourself using any of these rationalizations: STOP. Follow the gates.

Dual Iteration

You iterate on TWO things, not one:

When you see...                          | Iterate on...
-----------------------------------------|------------------------------------------
Transcript quality issues                | Generation prompts (user-sim, assistant)
Assessment seems wrong                   | Assessor prompt, criteria wording
Backend disagreement                     | Calibration examples for that criterion
Missing failure mode                     | Add new criterion to rubric
Pass rates high but something feels off  | Run expert role-play critique
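
To find which criteria need calibration examples, it helps to list the criteria on which backends return different verdicts. This is a minimal sketch under assumed data shapes; the backend names and criterion ids (CQ1, CQ2) are hypothetical.

```python
from collections import defaultdict


def disagreements(scores: dict) -> list:
    """Return criteria on which the backends did not all agree.

    `scores` maps backend name -> {criterion id -> passed (bool)}.
    """
    verdicts = defaultdict(set)
    for backend_result in scores.values():
        for criterion, passed in backend_result.items():
            verdicts[criterion].add(passed)
    # More than one distinct verdict means at least two backends disagree.
    return sorted(c for c, seen in verdicts.items() if len(seen) > 1)


batch_scores = {
    "claude": {"CQ1": True, "CQ2": True},
    "gemini": {"CQ1": True, "CQ2": False},
}
print(disagreements(batch_scores))
```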

The rubric is never "done." In one project, criteria evolved: 12 → 14 → 16 → 17 → 18.

Expert role-play critique is required. Periodically have Claude role-play domain experts who critique your rubric and a small transcript batch directly. This catches blind spots that are invisible from your own perspective. See assessment-guide.md#expert-role-play-critique.


Workflow

Step 1: Tight Iteration Loop

For each batch of 5 transcripts:

  1. Generate 5 transcripts using two-agent simulation
  2. Assess with rubric using multiple backends (Claude, Gemini, GPT-5)
  3. Human reviews both transcripts and assessments:
    • Read each transcript: Is this actually good?
    • Read each assessment: Did the rubric catch what matters?
    • Note: false positives, false negatives, missing criteria
  4. Iterate based on human judgment:
    • Fix generation prompts (if transcript quality issues)
    • Fix assessor prompt/criteria (if assessment issues)
    • Add calibration examples (if edge cases found)
  5. Repeat until quality stabilizes

Gate (before scaling):

Condition                           | Action
------------------------------------|------------------------
≥70% pass rate AND human satisfied  | Proceed to scale
50-70% OR human sees issues         | Continue iterating
<50%                                | Major revision needed

Reference: generation-guide.md, assessment-guide.md
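
The gate in Step 1 can be written down as a small function so that the decision is applied the same way on every batch. This is a sketch; the function name and return strings are assumptions, and the human verdict is passed in explicitly because the gate never opens on pass rate alone.

```python
def gate_decision(pass_rate: float, human_satisfied: bool) -> str:
    """Map a batch pass rate (0.0 to 1.0) and the reviewer's verdict to an action."""
    if pass_rate < 0.50:
        return "major revision"
    if pass_rate >= 0.70 and human_satisfied:
        return "scale"
    # 50-70%, or the human still sees issues at any pass rate of 50% or more.
    return "iterate"


batch_passed = [True, True, False, True, True]  # one batch of 5 transcripts
rate = sum(batch_passed) / len(batch_passed)
print(rate, gate_decision(rate, human_satisfied=False))
```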


Step 2: Scale Generation

Once the tight loop stabilizes:

  1. Generate target volume (100+ transcripts)
  2. Continue assessment with same multi-backend approach
  3. Periodic human spot-checks (every 20-50 transcripts)
  4. Track statistics (pass rate, criterion breakdown)

Warning signs during scale:

  • Pass rate drifting down → Revisit prompts
  • New failure patterns emerging → Add criteria
  • Perfect scores (1.0) → Suspiciously high, investigate
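
Two of these warning signs can be checked automatically as each assessment arrives. The sketch below is one possible monitor; the class name, the window of 20 transcripts, and the warning strings are assumptions. New failure patterns still need a human to notice them.

```python
from collections import deque


class ScaleMonitor:
    """Flags perfect scores and a rolling pass rate that drifts below the gate."""

    def __init__(self, window: int = 20, threshold: float = 0.70):
        self.recent = deque(maxlen=window)  # most recent pass/fail results
        self.threshold = threshold

    def record(self, score: float, passed: bool) -> list:
        """Record one assessed transcript and return any warnings it triggers."""
        self.recent.append(passed)
        warnings = []
        if score == 1.0:
            warnings.append("perfect score: investigate")
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) < self.threshold:
            warnings.append("rolling pass rate below threshold: revisit prompts")
        return warnings


monitor = ScaleMonitor(window=4)
for score, passed in [(1.0, True), (0.5, False), (0.5, False), (0.5, False)]:
    print(monitor.record(score, passed))
```

The drift warning only fires once the window is full, so the first few transcripts of a run do not trigger it on their own.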

Step 3: Audit Patterns

Run quantitative analysis on the full dataset to catch issues invisible in spot-checks:

Check                  | Red Flag                         | Action
-----------------------|----------------------------------|---------------------------------
Phrase repetition      | Any phrase in >50% of responses  | Add to anti-patterns, regenerate
Structural rigidity    | 100% same format                 | Vary response structure
Response length ratio  | Avg >2x user length              | Tighten length constraints
Praise distribution    | Late responses 2x more praise    | Adjust tone consistency
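
The first and third checks in the table are easy to approximate in a few lines. The sketch below treats a "phrase" as a word trigram and measures length in whitespace-separated words; both choices, and the function names, are assumptions for illustration only.

```python
import re


def repeated_phrases(responses: list, n: int = 3, threshold: float = 0.5) -> list:
    """Return word n-grams that occur in more than `threshold` of the responses."""
    counts = {}
    for text in responses:
        words = re.findall(r"[a-z']+", text.lower())
        # A set, so each n-gram counts at most once per response.
        ngrams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        for gram in ngrams:
            counts[gram] = counts.get(gram, 0) + 1
    cutoff = threshold * len(responses)
    return sorted(g for g, c in counts.items() if c > cutoff)


def average_words(messages: list) -> float:
    return sum(len(m.split()) for m in messages) / len(messages)


def length_ratio(user_msgs: list, assistant_msgs: list) -> float:
    """Average assistant length divided by average user length (both non-empty)."""
    return average_words(assistant_msgs) / average_words(user_msgs)


responses = [
    "That sounds really hard. Tell me more.",
    "That sounds really hard for you.",
    "What happened next?",
]
print(repeated_phrases(responses))
print(length_ratio(["a b"], ["a b c d e f"]))
```

A ratio above 2.0 from `length_ratio` corresponds to the red flag in the table.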

Gate: No audit red flags

Reference: assessment-guide.md#audit-patterns


Step 4: Fixup or Reject

For failing transcripts, decide whether to fix or reject:

Failure Type                    | Action
--------------------------------|----------------------------------------------
Soft failures (language, tone)  | Attempt fixup with entailment constraint
Safety gate failures            | Truncate at failure point or reject entirely
Structural issues               | Usually reject

Entailment constraint: Fixed response must naturally lead to user's next message. If fix breaks continuity → truncate instead.

If >30% need fixup: Generation prompts need revision.

Reference: assessment-guide.md#fixup-strategy
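
The fix-or-reject table and the 30% rule in Step 4 can be sketched as two small helpers. The failure-type labels and function names are hypothetical, and the 30% is interpreted here as a share of all assessed transcripts, which is an assumption the text does not spell out.

```python
def triage(failure_type: str, fix_preserves_continuity: bool = True) -> str:
    """Choose an action for one failing transcript.

    `failure_type` is one of "soft", "safety", "structural" (assumed labels).
    """
    if failure_type == "soft":
        # Entailment constraint: the fixed response must still lead naturally
        # to the user's next message; otherwise truncate instead.
        return "fixup" if fix_preserves_continuity else "truncate"
    if failure_type == "safety":
        return "truncate or reject"
    return "reject"  # structural issues are usually rejected


def prompts_need_revision(n_fixup: int, n_assessed: int) -> bool:
    """True when more than 30% of assessed transcripts needed a fixup."""
    return n_fixup / n_assessed > 0.30


print(triage("soft"), "|", triage("soft", fix_preserves_continuity=False))
print(prompts_need_revision(n_fixup=35, n_assessed=100))
```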


Step 5: Slice for Training

Create training examples from full transcripts:

50-turn transcript → ~8-10 training examples via slicing

Slicing strategy:

  • Random slice points (seeded by transcript ID for reproducibility)
  • Minimum 3 exchanges before first slice
  • 2-5 exchange gaps between slices
  • Always include final turn
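
A minimal sketch of this slicing strategy follows. It assumes a transcript is a list of exchanges (one user message plus one assistant reply) and that each training example is the transcript prefix ending at a slice point; `slice_points` is a hypothetical name. Because the final turn is always included, the last gap may be shorter than 2 exchanges.

```python
import random


def slice_points(transcript_id: str, n_exchanges: int,
                 min_first: int = 3, gap: tuple = (2, 5)) -> list:
    """Return the exchange counts at which to cut training examples."""
    # Seeding with the transcript id makes the points reproducible across runs.
    rng = random.Random(transcript_id)
    points = []
    pos = min_first  # at least `min_first` exchanges before the first slice
    while pos < n_exchanges:
        points.append(pos)
        pos += rng.randint(*gap)  # 2-5 exchanges until the next slice
    points.append(n_exchanges)  # always include the final turn
    return points


points = slice_points("transcript-001", n_exchanges=25)
print(points)
# Each training example would then be exchanges[:p] for p in points.
```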

Token validation:

  • Each slice must be under your token limit (e.g., 16K)
  • Long transcripts may need truncation

Leakage prevention:

  • Split by transcript/persona FIRST
  • Then slice within each split
  • Never let slices from the same transcript appear in both train and validation
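
One way to enforce this ordering is to assign the split from a stable hash of the transcript or persona id, and only then slice. The sketch below is an assumption about how that could look; `assign_split` and the 10% validation fraction are not taken from the skill's scripts.

```python
import hashlib


def assign_split(group_id: str, val_fraction: float = 0.1) -> str:
    """Deterministically assign a transcript or persona id to a split.

    Every slice cut from the same group inherits this split, so no
    transcript can end up in both train and validation.
    """
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return "validation" if bucket < val_fraction else "train"


for persona_id in ["persona-1", "persona-2", "persona-3"]:
    print(persona_id, assign_split(persona_id))
```

Hashing on the persona id instead of the transcript id also keeps all transcripts of one persona in the same split.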

Reference: assessment-guide.md#slicing-strategy

Optional: Use the hugging-face-dataset-creator skill when ready to push training_data.jsonl to HF Hub.


Infrastructure

Checkpointing

Write progress after each transcript, not at the end:

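# personas, generate_transcript, and save_immediately come from your own pipeline
for persona in personas:
    transcript = generate_transcript(persona)
    save_immediately(transcript)  # Persist each transcript right away; don't batch writes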
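
A possible shape for this pattern is an append-only JSONL file that doubles as the resume point after a crash. In the sketch below the record schema (`persona_id`, `transcript`), the persona `id` key, and the `run` and `load_done` helpers are all assumptions; `generate_transcript` is passed in as a callable so the example stays self-contained.

```python
import json
from pathlib import Path


def load_done(path: Path) -> set:
    """Return the persona ids that already have a saved transcript."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["persona_id"] for line in f if line.strip()}


def save_immediately(record: dict, path: Path) -> None:
    """Append one record as a JSON line; the file is closed after each write."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def run(personas: list, generate_transcript, path: Path) -> None:
    done = load_done(path)
    for persona in personas:
        if persona["id"] in done:
            continue  # already generated in an earlier run
        transcript = generate_transcript(persona)
        save_immediately({"persona_id": persona["id"], "transcript": transcript}, path)
```

Rerunning `run` with the same path skips finished personas, so an interrupted job loses at most the transcript that was in flight.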

Retry with Backoff

API failures will happen. Use exponential backoff:

  • Claude: 7 attempts, 1-hour max wait
  • Google: Extract retry delay from error message
  • OpenAI: Standard exponential backoff
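
A generic exponential backoff wrapper might look like the sketch below. The defaults mirror the Claude settings listed above (7 attempts, 1-hour cap); the base delay, the jitter, and the `with_backoff` name are assumptions. It catches every exception for brevity, whereas real code would catch only the provider's transient errors, and it does not parse a provider-supplied retry delay.

```python
import random
import time


def with_backoff(call, max_attempts: int = 7, base_delay: float = 1.0,
                 max_delay: float = 3600.0, sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries from parallel workers.
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter exists so tests can pass a stub instead of actually waiting.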

Progress Tracking

Track throughout generation:

  • Transcripts generated / target
  • Transcripts assessed / generated
  • Pass rate (rolling and cumulative)
  • Criterion failure breakdown
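
These counters fit in a small dataclass. The sketch below is one possible layout; the field names, the 20-transcript rolling window, and the criterion ids in the usage lines are assumptions.

```python
from collections import Counter, deque
from dataclasses import dataclass, field


@dataclass
class Progress:
    target: int
    generated: int = 0
    assessed: int = 0
    passed: int = 0
    failures: Counter = field(default_factory=Counter)  # criterion id -> count
    recent: deque = field(default_factory=lambda: deque(maxlen=20))

    def record_assessment(self, failed_criteria: list) -> None:
        """Record one assessed transcript; an empty list means it passed."""
        ok = not failed_criteria
        self.assessed += 1
        self.passed += int(ok)
        self.recent.append(ok)
        self.failures.update(failed_criteria)

    @property
    def cumulative_pass_rate(self) -> float:
        return self.passed / self.assessed if self.assessed else 0.0

    @property
    def rolling_pass_rate(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 0.0


progress = Progress(target=100, generated=3)
progress.record_assessment([])
progress.record_assessment(["CQ3"])
progress.record_assessment(["CQ3", "CQ7"])
print(f"{progress.generated}/{progress.target} generated, "
      f"{progress.assessed}/{progress.generated} assessed, "
      f"pass rate {progress.cumulative_pass_rate:.0%}, "
      f"top failures {progress.failures.most_common(2)}")
```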

Reference: assessment-guide.md#infrastructure


Resources

Resource                    | What It Contains
----------------------------|----------------------------------------------------------------
code/SETUP-REFERENCE.md     | Script templates: generate.py, assess.py, slice.py
code/infrastructure.py      | Copy-paste ready: LLM backend, retry strategies, checkpointing
examples/therapy-domain.md  | Complete therapy example: prompts, flaw patterns, criteria

Done When

  • Target training example count reached
  • Pass rate stable across last 2-3 batches (≥70%)
  • Human satisfied with transcript quality
  • Audit patterns within thresholds
  • training_data.jsonl validated

Next Phase

finetune-train