skilllibrary · instruction-tuning
Performs supervised fine-tuning on instruction-following data using TRL SFTTrainer, chat templates, response-only loss masking, and sequence packing. Covers Alpaca/ShareGPT/OpenAI data formats, data mixing, and quality filtering. Use when training a model to follow instructions or hold conversations.
Install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/instruction-tuning" ~/.claude/skills/merceralex397-collab-skilllibrary-instruction-tuning && rm -rf "$T"
manifest: 12-ai-llm-training-architecture-and-research/instruction-tuning/SKILL.md
Purpose
Train LLMs to follow instructions and hold multi-turn conversations via supervised fine-tuning (SFT) on curated instruction data, using HuggingFace TRL, chat templates, response-only loss masking, and efficient sequence packing.
When to use this skill
Use this skill when:
- training a base model to follow instructions or hold multi-turn conversations
- formatting datasets into Alpaca, ShareGPT, or OpenAI chat format for SFT
- configuring SFTTrainer with chat templates and response-only loss masking
- designing data mixing strategies across conversation, QA, coding, and math tasks
- implementing sequence packing to maximize training throughput
Do not use this skill when
- the task is domain-specific fine-tuning on non-instruction data (use `fine-tuning` for task adaptation with LoRA/QLoRA)
- the goal is RLHF, DPO, or preference-based alignment (use `preference-optimization`)
- you only need to evaluate a model's instruction-following (use `eval-dataset-design`)
Operating procedure
- Choose data format. Three standard formats (a conversion sketch follows this procedure):
  - Alpaca: `{"instruction": "...", "input": "...", "output": "..."}`
  - ShareGPT: `{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}`
  - OpenAI: `{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}`
- Apply chat template. Use the tokenizer's built-in template:

  ```python
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
  formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
  ```

  Ensure the template matches the model's expected format (ChatML, Llama, Zephyr, etc.).
- Configure response-only loss masking. Train only on assistant turns; mask user/system tokens from the loss. In TRL, use DataCollatorForCompletionOnlyLM with the assistant response template token:

  ```python
  from trl import DataCollatorForCompletionOnlyLM

  collator = DataCollatorForCompletionOnlyLM(
      response_template="<|assistant|>",
      tokenizer=tokenizer,
  )
  ```

- Mix data sources. Combine diverse tasks: 40% general conversation, 20% QA, 20% code, 10% math, 10% creative writing. Adjust proportions based on the target use case. Oversample underrepresented categories (a mixing sketch follows this procedure).
- Filter for quality. Remove duplicates (exact + near-duplicate via MinHash). Filter by response length (drop responses under 10 tokens). Remove examples with toxic content via a classifier. Optionally score with a reward model and keep the top 80% (a filtering sketch follows this procedure).
- Configure SFTTrainer with packing. Packing concatenates multiple short examples into one sequence to fill max_seq_length, improving GPU utilization.

  ```python
  from trl import SFTTrainer, SFTConfig

  config = SFTConfig(
      output_dir="./sft-output",
      max_seq_length=2048,
      packing=True,
      num_train_epochs=2,
      per_device_train_batch_size=4,
      gradient_accumulation_steps=4,
      learning_rate=2e-5,
      lr_scheduler_type="cosine",
      warmup_ratio=0.03,
      bf16=True,
      logging_steps=10,
      save_strategy="epoch",
  )
  trainer = SFTTrainer(
      model=model,
      args=config,
      train_dataset=dataset,
      data_collator=collator,
      tokenizer=tokenizer,
  )
  ```

- Evaluate. Measure held-out loss, MT-Bench score, AlpacaEval win rate, or a custom instruction-following eval. Compare against the base model and prior SFT checkpoints.
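The three formats are interconvertible, and most pipelines normalize everything to the OpenAI messages format before applying the chat template. A minimal conversion sketch; field names follow the examples in the format step, while the helper names are illustrative:

```python
# Convert Alpaca- and ShareGPT-style records to the OpenAI messages format.
# Field names follow the format examples above; helper names are illustrative.

def alpaca_to_messages(example: dict) -> dict:
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n\n" + example["input"]
    return {"messages": [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["output"]},
    ]}

def sharegpt_to_messages(example: dict) -> dict:
    role_map = {"human": "user", "gpt": "assistant", "system": "system"}
    return {"messages": [
        {"role": role_map[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
    ]}
```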
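For the mixing step, one option is interleave_datasets from the HuggingFace datasets library, which samples from each source with a given probability; stopping_strategy="all_exhausted" revisits smaller sources until the largest is consumed, which gives the oversampling behavior described above. A sketch assuming each category already lives in its own JSONL file (paths are placeholders):

```python
from datasets import load_dataset, interleave_datasets

# One Dataset per category; file paths here are placeholders.
sources = [
    load_dataset("json", data_files="conversation.jsonl", split="train"),
    load_dataset("json", data_files="qa.jsonl", split="train"),
    load_dataset("json", data_files="code.jsonl", split="train"),
    load_dataset("json", data_files="math.jsonl", split="train"),
    load_dataset("json", data_files="creative.jsonl", split="train"),
]
ratios = [0.40, 0.20, 0.20, 0.10, 0.10]  # proportions from the mixing step

mixed = interleave_datasets(
    sources,
    probabilities=ratios,
    seed=42,
    stopping_strategy="all_exhausted",  # revisit small sources (oversampling)
)
```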
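And for the quality-filtering step, a sketch of the length floor and MinHash near-duplicate removal, assuming the datasketch library and records already normalized to messages format; the toxicity and reward-model passes are omitted here since they depend on external models. tokenizer is the one loaded in the chat-template step, and examples is an assumed in-memory list of records:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    # Hash the set of lowercased whitespace tokens into a MinHash signature.
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.9, num_perm=128)  # Jaccard >= 0.9 counts as near-dup
kept = []
for i, ex in enumerate(examples):  # examples: assumed list of messages-format dicts
    response = ex["messages"][-1]["content"]
    if len(tokenizer(response)["input_ids"]) < 10:  # length floor from the step above
        continue
    sig = minhash_of(response)
    if lsh.query(sig):  # near-duplicate of an example already kept
        continue
    lsh.insert(str(i), sig)
    kept.append(ex)
```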
Decision rules
- Use response-only loss masking for all instruction tuning — training on user prompts wastes capacity and can cause parroting.
- Enable packing when average example length < 50% of max_seq_length (see the sketch after this list).
- If the model struggles with multi-turn, increase the proportion of multi-turn conversation data.
- Learning rate 2e-5 for full SFT, 2e-4 if combining with LoRA (delegate to the fine-tuning skill for adapter config).
- Minimum dataset: 10k high-quality instruction pairs; 50k+ for general-purpose chat models.
- If perplexity on a held-out general set degrades by >5%, reduce training data volume or add regularization.
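A quick way to apply the packing rule: estimate the average tokenized length on a random sample and compare it to max_seq_length. A sketch assuming the tokenizer from the procedure and a dataset whose raw text sits in a "text" column (the column name is illustrative):

```python
import random

# Estimate average example length from a sample to decide whether to pack.
texts = list(dataset["text"])  # assumed column name; adjust to your schema
sample = random.sample(texts, k=min(1000, len(texts)))
avg_len = sum(len(tokenizer(t)["input_ids"]) for t in sample) / len(sample)

max_seq_length = 2048
if avg_len < 0.5 * max_seq_length:
    print(f"avg {avg_len:.0f} tokens < 50% of {max_seq_length}: set packing=True")
else:
    print(f"avg {avg_len:.0f} tokens: packing gains little; consider packing=False")
```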
Output requirements
- Data pipeline — format, cleaning steps, mixing ratios, final dataset size
- Training config — SFTConfig params, chat template, loss masking setup, packing config
- Training logs — loss curves, learning rate schedule, gradient norms
- Evaluation report — held-out loss, MT-Bench/AlpacaEval scores, qualitative examples
- Model artifacts — saved model/adapter, tokenizer, training config YAML
References
- TRL SFTTrainer: https://huggingface.co/docs/trl/sft_trainer
- Alpaca dataset format: https://github.com/tatsu-lab/stanford_alpaca
- ShareGPT format: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
- MT-Bench: Zheng et al., "Judging LLM-as-a-Judge" (arXiv:2306.05685)
- LIMA: Zhou et al., "LIMA: Less Is More for Alignment" (arXiv:2305.11206)
Related skills
- fine-tuning
- preference-optimization
- synthetic-data-generation
- eval-dataset-design
Failure handling
- If training loss plateaus immediately, verify chat template produces correctly tokenized sequences — print a decoded example to confirm.
- If the model outputs empty or repeated tokens, check that the loss mask is not accidentally masking all tokens (log the fraction of unmasked tokens per batch; the diagnostic sketch after this list covers this check).
- If multi-turn performance is poor, ensure conversation history is included in the input, not just the last user turn.
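A combined diagnostic sketch for the first two failure modes, assuming the tokenizer, the formatted string, and the collator from the procedure; -100 is the ignore index that HuggingFace loss functions skip:

```python
# Decode one formatted example to verify the chat template renders as expected,
# then check what fraction of the batch contributes to the loss (-100 = masked).
ids = tokenizer(formatted)["input_ids"]
print(tokenizer.decode(ids))  # eyeball: roles, special tokens, response template

batch = collator([{"input_ids": ids}])
labels = batch["labels"]
unmasked = (labels != -100).float().mean().item()
print(f"fraction of tokens contributing to loss: {unmasked:.2%}")
if unmasked == 0.0:
    print("WARNING: every token is masked; response_template probably never matched")
```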