Claude-skill-registry finetune-design

Use when preparing to fine-tune an LLM for multi-turn conversations, before generating any training data. Triggers: starting a fine-tuning project, needing to define evaluation criteria, designing conversation data generation.

Install

Source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/finetune-design" ~/.claude/skills/majiayu000-claude-skill-registry-finetune-design && rm -rf "$T"

Manifest: skills/data/finetune-design/SKILL.md

Source content

Fine-tune Design

Design all artifacts needed before generating training data for multi-turn conversation fine-tuning.

Inputs

  • Domain to fine-tune for (customer support, coaching, tutoring, etc.)
  • Deployment constraints (hardware, offline requirement, budget)
  • Access to domain expertise (or ability to research it)

Outputs

By the end of this phase, you will have:

  • model-choice.md — Selected model with documented tradeoffs
  • config/input-taxonomy.yaml — Topics, styles, difficulty, edge cases
  • config/rubric.yaml — Binary criteria with calibration examples
  • config/persona-template.yaml — Diversity dimensions and distributions
  • config/prompts/user_sim.md — User simulator prompt
  • config/prompts/assistant.md — Assistant generation prompt
  • config/system-prompt.md — System prompt for training data
  • base-model-eval-results.md — Baseline evaluation results

Required Technique: Expert Role-Play Critique

Apply this to EVERY design artifact. Role-play domain experts (real or fictional) to stress-test your designs before committing.

Experts to consider for each artifact:

  • Taxonomy: Domain practitioners, user researchers, edge case specialists
  • Rubric: Quality experts, safety specialists, methodology creators
  • Personas: User advocates, accessibility experts, diverse user representatives
  • Prompts: Domain practitioners, AI safety researchers, communication experts

Process:

  1. Identify 5-7 relevant experts for your domain
  2. Have Claude role-play each expert critiquing your design
  3. Ask: "What would pass this but still be inadequate? What user populations does this miss?"
  4. Synthesize feedback into improvements
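
If you prefer to script this critique loop rather than run it ad hoc, a minimal sketch follows. The expert list, questions, and build_critique_prompt helper are illustrative assumptions, not part of the skill's code.

# Sketch of scripting the expert critique loop. Experts and questions are
# placeholders; swap in ones relevant to your domain.
EXPERTS = [
    "veteran domain practitioner",
    "user researcher",
    "edge-case and safety specialist",
]

CRITIQUE_QUESTIONS = [
    "What would pass this design but still be inadequate?",
    "What user populations does this miss?",
]

def build_critique_prompt(expert: str, artifact_name: str, artifact_text: str) -> str:
    """Assemble one role-play critique prompt for one expert and one design artifact."""
    questions = "\n".join(f"- {q}" for q in CRITIQUE_QUESTIONS)
    return (
        f"You are a {expert}. Critique this {artifact_name} for blind spots "
        f"and missing coverage.\n\n{artifact_text}\n\nAnswer specifically:\n{questions}"
    )

for expert in EXPERTS:
    print(build_critique_prompt(expert, "input taxonomy", "topics: [billing, returns, shipping]"))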

This catches blind spots that are invisible from your own perspective. Through expert critique, one project discovered 6 critical rubric gaps that would have corrupted its training data.

Full guide: assessment-guide.md#expert-role-play-critique


Workflow

Step 1: Base Model Selection

Select the model you'll fine-tune based on:

  • Context window: Max conversation length you can train on
  • Quantization support: GGUF, MLX, QAT for local deployment
  • Base capability: Evaluate before committing
  • Training cost: LoRA/QLoRA vs full fine-tune
  • Deployment target: Ollama, llama.cpp, MLX

Gate: Model chosen with documented tradeoffs in model-choice.md

Reference: model-selection-guide.md


Step 2: Token Economics

Determine training constraints based on cost:

Cost impact by tokens per example:

  • <8K: Cheapest, short conversations only
  • 8-16K: Cost-effective, moderate conversations
  • 16-32K: Expensive, long conversations
  • >32K: Very expensive, may require special handling

Constraint: Plan max conversation length based on your budget. 16K is a practical ceiling for most projects.
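
A few lines of arithmetic make the budget concrete before you commit to a length ceiling. The example counts and per-token price below are placeholders, not real provider rates; substitute your own numbers.

# Back-of-the-envelope token budget. All numbers are hypothetical placeholders.
num_examples = 1000               # planned training conversations
avg_tokens_per_example = 12_000   # must stay under your max transcript length
epochs = 2
price_per_million_tokens = 3.00   # USD, placeholder training rate

total_tokens = num_examples * avg_tokens_per_example * epochs
estimated_cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} training tokens, roughly ${estimated_cost:,.2f}")
# 24,000,000 training tokens, roughly $72.00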

Gate: Max transcript token length defined

Reference: model-selection-guide.md#token-economics


Step 3: Input Taxonomy

Define the distribution of inputs to generate. A good taxonomy has multiple dimensions:

  • WHAT: What are they asking about? (Topics, subtopics)
  • HOW: How do they communicate? (Style, verbosity, tone)
  • WHO: Who are they? (Demographics, context)
  • DIFFICULTY: How hard is this to handle? (Easy, medium, hard)
  • EDGE CASES: What should trigger special handling? (Boundaries, safety)

Key lesson: Allocate ~15% to edge cases. Without explicit representation, the model won't learn to handle them.
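
A weighted cross-product is easiest to see in code. The sketch below samples one scenario per draw from hypothetical dimension values; the real dimensions and weights belong in config/input-taxonomy.yaml.

import random

# Hypothetical dimensions and weights; config/input-taxonomy.yaml holds the real ones.
TAXONOMY = {
    "topic":      (["billing", "returns", "shipping", "account"], [0.35, 0.25, 0.25, 0.15]),
    "style":      (["terse", "casual", "formal"], [0.20, 0.60, 0.20]),
    "difficulty": (["easy", "medium", "hard"], [0.40, 0.40, 0.20]),
    "edge_case":  ([None, "boundary", "safety"], [0.85, 0.10, 0.05]),  # ~15% edge cases
}

def sample_scenario(rng: random.Random) -> dict:
    """Draw one point from the cross-product of all dimensions, respecting weights."""
    return {dim: rng.choices(values, weights)[0] for dim, (values, weights) in TAXONOMY.items()}

rng = random.Random()
print(sample_scenario(rng))
# e.g. {'topic': 'returns', 'style': 'casual', 'difficulty': 'easy', 'edge_case': None}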

→ Apply Expert Role-Play: Have domain experts critique your taxonomy for missing topics and user types.

Gate: Weighted taxonomy with cross-product dimensions in config/input-taxonomy.yaml

Reference: taxonomy-guide.md


Step 4: Evaluation Rubric

Design quality criteria for assessing generated conversations.

Critical requirements:

  • Binary judgments (YES/NO/NA) — not numeric scales
  • Grouped into weighted categories
  • Safety gates that auto-reject on failure
  • 3-8 calibration examples per criterion (essential for multi-backend consistency)

Why calibration examples are non-negotiable: During generation, you'll run assessment with multiple LLM backends (Claude, GPT, Gemini) to catch blind spots. Without calibration examples, backends interpret criteria differently — 20-30% disagreement is common. Calibration examples anchor consistent interpretation.
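
If you want to quantify that drift during a pilot, a small sketch of measuring cross-backend disagreement; the backend names and verdicts are made up for illustration.

from itertools import combinations

def disagreement_rate(verdicts_by_backend: dict) -> float:
    """Share of (criterion, backend-pair) comparisons where two backends disagree.
    verdicts_by_backend maps backend name -> {criterion id: 'YES'/'NO'/'NA'}."""
    backends = list(verdicts_by_backend)
    criteria = verdicts_by_backend[backends[0]].keys()
    pairs = list(combinations(backends, 2))
    disagreements = sum(
        verdicts_by_backend[a][c] != verdicts_by_backend[b][c]
        for a, b in pairs for c in criteria
    )
    return disagreements / (len(pairs) * len(criteria))

verdicts = {
    "backend_a": {"CQ1": "YES", "CQ2": "NO", "CQ3": "YES", "CQ4": "YES"},
    "backend_b": {"CQ1": "YES", "CQ2": "YES", "CQ3": "YES", "CQ4": "NO"},
}
print(f"{disagreement_rate(verdicts):.0%}")  # 50%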

Structure:

categories:
  comprehension:
    weight: 0.15
    criteria: [CQ1, CQ2]
  # ... more categories

criteria:
  CQ1:
    name: "Accurate understanding"
    question: "Does the response demonstrate accurate understanding?"
    na_valid: false  # Must always be assessable
    calibration_examples:
      - type: PASS
        context: "..."
        response: "..."
        reasoning: "..."
      - type: FAIL
        # ...

safety_gates: [CQ8, CQ9]  # Any failure = auto-reject
pass_threshold: 0.80

→ Apply Expert Role-Play: Have quality experts critique your criteria for blind spots and edge cases.

Gate: Rubric with calibration examples in config/rubric.yaml

Reference: rubric-guide.md


Step 5: Persona Template

Design user diversity for realistic training data.

Dimensions to define:

  • Communication style (terse, verbose, emotional, analytical)
  • Behavior patterns / "flaws" (resistance, deflection, etc.)
  • Domain-specific attributes (varies by domain)

Key lesson: Flaws vary per message, not per conversation. Real people have good days and bad days.

→ Apply Expert Role-Play: Have user advocates critique your personas for missing populations and unrealistic patterns.

persona_template:
  communication_style:
    options: [terse, casual, formal, stream-of-consciousness]
    weights: [0.15, 0.50, 0.25, 0.10]

  flaw_patterns:
    primary: # 50% chance per message
    secondary: # 20% chance each per message

  # 20% of personas should have NO flaw patterns
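
To make "flaws vary per message" concrete, here is a sketch of per-message sampling using the probabilities above; the persona structure and flaw names are placeholders.

import random

def sample_message_flaws(persona: dict, rng: random.Random) -> list:
    """Decide which of a persona's flaw patterns appear in ONE message.
    Sampled per message, so the same persona can be cooperative in one
    turn and deflecting in the next."""
    flaws = []
    if persona.get("primary_flaw") and rng.random() < 0.50:   # 50% chance per message
        flaws.append(persona["primary_flaw"])
    for flaw in persona.get("secondary_flaws", []):
        if rng.random() < 0.20:                               # 20% chance each per message
            flaws.append(flaw)
    return flaws

rng = random.Random()
persona = {"primary_flaw": "deflection", "secondary_flaws": ["vagueness", "resistance"]}
# Roughly 20% of personas would instead have no flaw patterns at all.
for turn in range(3):
    print(turn, sample_message_flaws(persona, rng))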

Gate: Persona template with distributions in config/persona-template.yaml

Reference: persona-guide.md


Step 6: Prompts

Create the three prompts for data generation:

  • User simulator: Generate realistic user messages with flaws
  • Assistant: Generate high-quality responses
  • System prompt: What gets baked into training data

Key lessons for assistant prompt:

  • Length matching: Target 1.0-1.5x user word count, hard limit 2x
  • Tentative language for interpretations ("I wonder if..." not "You are...")
  • Question discipline: At most 1-2 questions per response
  • Anti-patterns list: Specific phrases to avoid
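
The length-matching and question-discipline rules can be enforced mechanically during generation. A rough sketch using the thresholds from the list above (the helper names are ours, not from the skill's code):

def length_ok(user_msg: str, assistant_msg: str) -> bool:
    """Target 1.0-1.5x the user's word count; anything over the 2x hard limit fails."""
    user_words = max(len(user_msg.split()), 1)
    return len(assistant_msg.split()) / user_words <= 2.0

def question_count_ok(assistant_msg: str, max_questions: int = 2) -> bool:
    """At most 1-2 questions per response."""
    return assistant_msg.count("?") <= max_questions

user = "My package never arrived and I'm getting pretty worried, can you help me figure out where it is?"
reply = "I wonder if the delay is on the carrier's side. Could you share the tracking number?"
print(length_ok(user, reply), question_count_ok(reply))  # True True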

→ Apply Expert Role-Play: Have domain experts critique your prompts for missing requirements and problematic patterns.

Gate: All three prompts drafted

Reference: generation-guide.md (in finetune-generate)


Step 7: Base Model Evaluation

Before committing to fine-tune, evaluate the base model on your rubric.

Process:

  1. Generate 10-20 test scenarios covering your taxonomy
  2. Have base model respond to each
  3. Assess with your rubric
  4. Calculate pass rate
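
The arithmetic for step 4 of this process is simple but worth pinning down. The sketch below assumes per-scenario pass/fail verdicts already exist (the skill's code/infrastructure.py is listed as containing the scoring utilities); the results list is hypothetical and only shows the calculation.

def pass_rate(results: list) -> float:
    """Fraction of test scenarios whose base-model response passed the rubric."""
    return sum(results) / len(results)

def recommendation(rate: float) -> str:
    """Mirror the decision gate below; thresholds come from this guide, not universal truths."""
    if rate > 0.70:
        return "Base model may be sufficient; try prompt engineering first."
    if rate >= 0.50:
        return "Fine-tuning likely helpful."
    return "Fine-tuning needed."

results = [True, False, True, False, False, True, False, False, True, False]  # 10 hypothetical scenarios
rate = pass_rate(results)
print(f"{rate:.0%} -> {recommendation(rate)}")  # 40% -> Fine-tuning needed.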

Decision gate:

  • >70%: Base model may be sufficient. Consider prompt engineering first.
  • 50-70%: Fine-tuning likely helpful. Moderate improvement expected.
  • <50%: Fine-tuning needed. Significant improvement expected.

Gate: Base model evaluated, decision to proceed documented in base-model-eval-results.md

A Note on Numbers

All numeric parameters in these guides (15% edge cases, 50%/20% flaw probabilities, 0.80 pass threshold, etc.) are starting points from one successful project, not universal truths. Calibrate them for your domain based on pilot generation results and human review.

Red Flags: Rationalizations to Resist

  • "Base model is obviously not good enough" → Evaluate anyway. You need baseline numbers for comparison.
  • "I'll use numeric scales (1-5), it's fine" → Numeric scales drift across assessors. Binary judgments are consistent.
  • "Calibration examples are overkill" → Without examples, backends interpret criteria differently. 20-30% disagreement.
  • "Edge cases are rare, skip them" → Without ~15% edge case representation, the model fails at boundaries.
  • "I know what users want, skip taxonomy" → Your intuition is biased. Formal taxonomy ensures coverage.
  • "Expert role-play takes too long" → 1 hour of critique catches blind spots that corrupt 100+ transcripts. Do it.

Done When

  • All 8 output files created
  • Expert role-play critique applied to taxonomy, rubric, personas, and prompts
  • Base model evaluated against rubric
  • Decision to proceed with fine-tuning documented
  • Ready to start finetune-generate phase

Resources

  • code/SETUP-REFERENCE.md · Project structure and file templates
  • code/infrastructure.py · Copy-paste ready: LLM backend, checkpointing, slicing, scoring
  • examples/therapy-domain.md · Complete therapy domain example: taxonomy, flaws, rubric criteria

Next Phase

finetune-generate