Skilllibrary preference-optimization
Implement preference-based alignment using DPO, IPO, KTO, or ORPO with TRL's DPOTrainer. Covers preference data formatting (prompt/chosen/rejected), beta tuning, loss variants, and reference model management. Use when aligning a fine-tuned LLM with human or AI preferences. Do not use for supervised fine-tuning, reward model training, or PPO-based RLHF.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/preference-optimization" ~/.claude/skills/merceralex397-collab-skilllibrary-preference-optimization && rm -rf "$T"
manifest:
12-ai-llm-training-architecture-and-research/preference-optimization/SKILL.md
Purpose
Align language models with human preferences using Direct Preference Optimization (DPO) and related methods via TRL's DPOTrainer, covering data preparation, loss configuration, beta tuning, and evaluation of alignment quality.
When to use this skill
Use this skill when:
- Setting up DPO training: `from trl import DPOTrainer, DPOConfig`
- Configuring DPO: `DPOConfig(beta=0.1, loss_type="sigmoid", learning_rate=5e-7, max_length=1024)`
- Preparing preference datasets in `{"prompt": ..., "chosen": ..., "rejected": ...}` format (an example record follows this list)
- Choosing between DPO variants: standard DPO, IPO (bounded loss), KTO (unpaired preferences), ORPO (odds-ratio)
- Managing reference model: frozen copy vs implicit reference via LoRA
- Deciding DPO vs full RLHF (PPO with reward model) for a given alignment task
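A single record in this format looks like the following; the text is made up purely for illustration:

```python
# One preference comparison: "chosen" and "rejected" are alternative completions
# of the same prompt, with "chosen" the preferred one.
preference_example = {
    "prompt": "Summarize the main idea of DPO in one sentence.",
    "chosen": "DPO trains the policy directly on preference pairs, using the reference model to keep updates close to the SFT starting point.",
    "rejected": "DPO is a kind of learning rate schedule.",
}
```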
Do not use this skill when
- The task is supervised fine-tuning on instruction data — use `instruction-tuning` or `fine-tuning`
- The task is training a reward model for PPO-based RLHF — use `reward-modeling`
- The task is about generating synthetic preference data — use `synthetic-data-generation`
- The task has no alignment or preference learning component
Operating procedure
- Prepare preference data: Format the dataset as `{"prompt": str, "chosen": str, "rejected": str}`. Each row is one comparison. Source from human annotations, AI feedback (Constitutional AI), or ranked model outputs. Ensure chosen/rejected share the same prompt. Load with `datasets.load_dataset()`.
- Initialize models: Load the SFT model as the policy. For standard DPO, create a frozen reference copy: `ref_model = AutoModelForCausalLM.from_pretrained(model_path)`. With LoRA/QLoRA, omit `ref_model` — TRL uses the LoRA-free base model as the implicit reference.
- Configure DPO training: set `beta=0.1` (KL penalty strength), `loss_type="sigmoid"` (standard DPO loss), `learning_rate=5e-7` (lower than the SFT LR), `per_device_train_batch_size=4`, `gradient_accumulation_steps=4`, `max_length=1024`, `max_prompt_length=512`, and `num_train_epochs=1` (1-3 epochs typical) in a `DPOConfig`.
- Initialize trainer: `trainer = DPOTrainer(model=model, ref_model=ref_model, args=training_args, train_dataset=dataset, tokenizer=tokenizer)`. Call `trainer.train()`. An end-to-end sketch follows this list.
- Monitor training: Track `rewards/chosen` (should increase), `rewards/rejected` (should decrease), `rewards/margins` (should widen), and `rewards/accuracies` (fraction where chosen scores higher). Log with wandb.
- Evaluate alignment: Test on held-out preference pairs. Run MT-Bench or AlpacaEval for general chat quality. Check for over-optimization: if `rewards/margins` grows but eval quality drops, reduce beta or stop early.
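Putting the steps together, a minimal sketch of the pipeline might look like this. The model path and dataset file are placeholders, and the trainer's keyword arguments can differ slightly across TRL versions (recent releases accept `processing_class` in place of `tokenizer`):

```python
# Minimal DPO pipeline following the steps above; paths are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

model_path = "path/to/sft-checkpoint"  # assumption: your SFT model directory
# JSONL with one {"prompt", "chosen", "rejected"} object per line
dataset = load_dataset("json", data_files="prefs.jsonl", split="train")

model = AutoModelForCausalLM.from_pretrained(model_path)
ref_model = AutoModelForCausalLM.from_pretrained(model_path)  # frozen reference copy
# With LoRA/QLoRA adapters, pass ref_model=None; TRL falls back to the
# adapter-free base weights as the implicit reference.
tokenizer = AutoTokenizer.from_pretrained(model_path)

training_args = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,                        # KL penalty strength
    loss_type="sigmoid",             # standard DPO loss
    learning_rate=5e-7,              # 5-10x lower than the SFT LR
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_length=1024,
    max_prompt_length=512,
    num_train_epochs=1,              # 1-3 epochs typical
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,             # newer TRL: processing_class=tokenizer
)
trainer.train()
```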
Decision rules
- DPO (`loss_type="sigmoid"`): Default choice. Direct optimization of `L = -log(σ(β * (log π(y_w|x)/π_ref(y_w|x) - log π(y_l|x)/π_ref(y_l|x))))`. Simple, stable, well-understood. A worked sketch of this loss follows the list.
- IPO (`loss_type="ipo"`): Adds regularization to prevent overfitting to preference noise. Use when data is noisy or small (<10k pairs).
- KTO (Kahneman-Tversky): Works with unpaired data (only "good" or "bad" examples, not paired comparisons). Use when paired preference data is expensive to collect.
- ORPO: Combines SFT and preference optimization in one step, no reference model needed. Use to reduce training pipeline complexity.
- `beta=0.1` is the standard starting point. Lower beta (0.01-0.05) for weaker alignment; higher (0.2-0.5) for stronger but riskier constraint.
- Use a `learning_rate` 5-10x lower than the SFT LR. DPO is sensitive to high LR — it causes reward hacking.
- 1 epoch is usually sufficient. Multiple epochs risk overfitting to preference data.
- DPO is preferred over PPO-RLHF when: no reward model exists, compute is limited, or simplicity is valued. PPO-RLHF is preferred when: fine-grained reward shaping is needed or the reward signal is complex.
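To make the loss variants concrete, here is a small PyTorch sketch, written for this document rather than taken from TRL's internals, that computes the sigmoid DPO loss and the IPO variant from per-sequence log-probabilities:

```python
# Illustrative computation of the DPO and IPO losses from summed sequence
# log-probs under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def preference_losses(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # log π(y_w|x) - log π_ref(y_w|x), and the same for y_l
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = chosen_logratios - rejected_logratios

    dpo_loss = -F.logsigmoid(beta * logits)        # standard sigmoid DPO loss
    ipo_loss = (logits - 1.0 / (2.0 * beta)) ** 2  # IPO's bounded squared loss

    # Implicit rewards of the kind reported as rewards/chosen and rewards/rejected
    chosen_rewards = beta * chosen_logratios
    rejected_rewards = beta * rejected_logratios
    return dpo_loss.mean(), ipo_loss.mean(), chosen_rewards, rejected_rewards
```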
Output requirements
- DPO Config — Complete `DPOConfig` with beta, loss_type, LR, batch size, max lengths
- Data Spec — Preference dataset format, size, source (human/AI), and quality checks applied
- Training Metrics — Reward margins, chosen/rejected reward curves, training loss, accuracy on held-out pairs
- Alignment Evaluation — MT-Bench scores, AlpacaEval win rates, or task-specific alignment benchmarks
References
- Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (arxiv 2305.18290)
- Azar et al., "A General Theoretical Paradigm to Understand Learning from Human Feedback" — IPO (arxiv 2310.12036)
- Ethayarajh et al., "KTO: Model Alignment as Prospect Theoretic Optimization" (arxiv 2402.01306)
- Hong et al., "ORPO: Monolithic Preference Optimization without Reference Model" (arxiv 2403.07691)
- TRL documentation: https://huggingface.co/docs/trl/dpo_trainer — HuggingFace TRL library (`from trl import DPOTrainer, DPOConfig`)
Related skills
- `fine-tuning` — SFT stage that precedes DPO in the standard alignment pipeline
- `reward-modeling` — alternative path using explicit reward models with PPO
- `safety-alignment` — broader alignment goals that DPO serves
- `dataset-curation` — quality of preference data directly impacts DPO outcomes
Failure handling
- If `rewards/margins` plateau near zero after 100+ steps, the model is not learning preferences — check data quality, increase beta, or verify chosen/rejected are meaningfully different (see the monitoring sketch after this list).
- If `rewards/accuracies` hits 1.0 early in training, the preference pairs are too easy — add harder near-boundary comparisons.
- If eval quality degrades while training metrics improve, this is reward over-optimization — reduce beta, stop training earlier, or use IPO.
- If OOM with the full reference model, switch to LoRA-based DPO (implicit reference) or use `ref_model=None` with ORPO.
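One way to catch the first two failure modes as they happen is a logging callback. The sketch below is a hypothetical helper built on `transformers.TrainerCallback`; the metric keys match the names listed under Monitor training, and the thresholds are arbitrary examples:

```python
# Hypothetical training-health check; attach with trainer.add_callback(...).
from transformers import TrainerCallback

class PreferenceHealthCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        margins = logs.get("rewards/margins")
        accuracy = logs.get("rewards/accuracies")
        # Margins still near zero after 100+ steps: preferences are not being learned.
        if margins is not None and state.global_step > 100 and abs(margins) < 1e-3:
            print(f"step {state.global_step}: margins near zero; check data or raise beta")
        # Accuracy saturates almost immediately: preference pairs may be too easy.
        if accuracy is not None and accuracy >= 0.99 and state.global_step < 100:
            print(f"step {state.global_step}: accuracy saturated early; pairs may be too easy")

# Usage: trainer.add_callback(PreferenceHealthCallback()) before trainer.train()
```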