Skilllibrary reward-modeling
Trains and evaluates reward models that score LLM outputs for RLHF pipelines using Bradley-Terry preference modeling and TRL's RewardTrainer. Use when building a scalar reward signal from pairwise human preferences, diagnosing reward hacking, or choosing between process and outcome reward models. Do not use for the RL policy optimization step itself.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/reward-modeling" ~/.claude/skills/merceralex397-collab-skilllibrary-reward-modeling && rm -rf "$T"
manifest:
12-ai-llm-training-architecture-and-research/reward-modeling/SKILL.md
Purpose
Train a scalar reward model from pairwise human preference data to provide the training signal for RLHF or best-of-N sampling. The reward model takes a prompt + completion and outputs a single scalar score. It is trained so that preferred completions score higher than rejected ones under the Bradley-Terry model.
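For reference, here is a minimal sketch of the Bradley-Terry objective on a batch of preference pairs; the `torch` usage and the example reward values are illustrative only, not part of this skill's required interface.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    Under the Bradley-Terry model, P(chosen beats rejected) =
    sigmoid(r_chosen - r_rejected), so this is the negative log-likelihood
    of the observed human preference.
    """
    # logsigmoid is the numerically stable form of log(sigmoid(x)).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative scalar rewards for two preference pairs.
print(bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9])))
```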
When to use this skill
Use this skill when:
- training a reward model from pairwise preference data (chosen vs rejected completions)
- implementing the Bradley-Terry loss: `-log(sigmoid(r_chosen - r_rejected))`
- configuring TRL's `RewardTrainer` or building a custom reward training loop
- diagnosing reward hacking or overoptimization in an RLHF pipeline
- choosing between process reward models (PRM) and outcome reward models (ORM)
- evaluating reward model agreement with held-out human preferences
Do not use this skill when
- the task is the PPO/DPO policy optimization step — prefer `preference-optimization`
- the task is building safety-specific reward signals — prefer `safety-alignment`
- the task is designing the preference annotation interface or guidelines — prefer `eval-dataset-design`
Operating procedure
- Prepare preference data. Each example must have a `prompt`, a `chosen` completion, and a `rejected` completion. Format as a HuggingFace Dataset with columns matching TRL expectations. Ensure annotator agreement ≥70% on a validation split. (A minimal dataset sketch follows this procedure.)
- Initialize the reward model. Start from the same base model (or SFT checkpoint) that the policy will use. Add a value head that projects the final hidden state to a scalar:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# A single-label classification head is the value head: it maps the final
# hidden state to one scalar reward, the form RewardTrainer expects.
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
```
- Configure training. Use `RewardConfig` to set the learning rate (1e-5 to 5e-6), max sequence length, and gradient accumulation:
```python
from trl import RewardTrainer, RewardConfig

config = RewardConfig(
    output_dir="./reward_model",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=1.5e-5,
    max_length=2048,
)
trainer = RewardTrainer(
    model=reward_model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```
- Apply the Bradley-Terry loss. The default TRL loss is `loss = -log(sigmoid(r_chosen - r_rejected))`. This maximizes the margin between chosen and rejected scores. Monitor the mean margin during training — it should increase steadily.
- Evaluate the reward model (a held-out evaluation sketch follows this procedure):
- Agreement: Measure accuracy on held-out preference pairs (target: >70%).
- Reward distribution: Plot score histograms for chosen vs rejected — distributions should separate clearly.
- Calibration: Check that reward differences correlate with annotator confidence levels.
- Guard against reward hacking. When used in RLHF, the policy will exploit reward model weaknesses. Mitigations:
- Apply a KL penalty: `reward_total = reward - β * KL(policy || reference)` with β ∈ [0.01, 0.1].
- Use reward model ensembles and take the conservative (minimum) score.
- Monitor output length — reward hacking often manifests as length gaming.
- Choose PRM vs ORM. Use outcome reward models (ORM) for general preference alignment. Use process reward models (PRM) for step-by-step reasoning tasks (math, code) where intermediate steps can be scored individually.
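As referenced in the data-preparation step, a minimal sketch of the expected preference dataset follows. The `prompt`/`chosen`/`rejected` column names mirror the step above; the example rows are invented placeholders, not real annotations.

```python
from datasets import Dataset

# Toy preference pairs; real data should come from human annotation.
train_dataset = Dataset.from_dict({
    "prompt": [
        "Explain what a reward model does in RLHF.",
        "Summarize the Bradley-Terry model in one sentence.",
    ],
    "chosen": [
        "A reward model assigns a scalar score to a completion so that preferred outputs score higher.",
        "It models the probability that one item beats another as a function of their score difference.",
    ],
    "rejected": [
        "Reward models are classifiers.",
        "It is a kind of neural network.",
    ],
})
```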
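A rough sketch of the held-out evaluation from the evaluate step is below. The `score_completion` helper and the single-label classification head are assumptions for illustration; adapt the scoring call to however your checkpoint exposes its scalar output.

```python
import torch

def score_completion(model, tokenizer, prompt: str, completion: str) -> float:
    """Scalar reward for prompt + completion (assumes a 1-label classification head)."""
    inputs = tokenizer(prompt + completion, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

def evaluate_agreement(model, tokenizer, eval_dataset) -> dict:
    """Fraction of pairs where chosen outscores rejected, plus the mean margin."""
    correct, margins = 0, []
    for ex in eval_dataset:
        r_c = score_completion(model, tokenizer, ex["prompt"], ex["chosen"])
        r_r = score_completion(model, tokenizer, ex["prompt"], ex["rejected"])
        correct += int(r_c > r_r)
        margins.append(r_c - r_r)
    return {
        "agreement": correct / len(margins),         # target: > 0.70
        "mean_margin": sum(margins) / len(margins),  # should be clearly positive
    }
```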
Decision rules
- Train for only 1 epoch on preference data to avoid overfitting — reward models overfit quickly.
- If agreement on held-out data is <65%, the preference data quality is insufficient; collect more annotations before proceeding.
- Use a base model size ≥ policy model size for the reward model when possible (larger RMs are more robust).
- Prefer margin-based evaluation (mean reward gap) over raw accuracy for tracking training progress.
- If the reward model assigns systematically higher scores to longer outputs, add a length penalty or normalize by length.
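The length-gaming rule above can be checked with a quick diagnostic like the sketch below; the 0.3 correlation threshold is an arbitrary illustrative alert level, not a standard.

```python
import statistics

def length_bias_report(rewards: list[float], lengths: list[int]) -> dict:
    """Pearson correlation between reward and completion length; high values suggest length gaming."""
    n = len(rewards)
    mean_r, mean_l = statistics.mean(rewards), statistics.mean(lengths)
    cov = sum((r - mean_r) * (l - mean_l) for r, l in zip(rewards, lengths)) / (n - 1)
    corr = cov / (statistics.stdev(rewards) * statistics.stdev(lengths))
    return {"reward_length_corr": corr, "flag": corr > 0.3}
```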
Output requirements
- Reward Model Checkpoint — saved model with value head, tokenizer, and training config
- Evaluation Report — held-out accuracy, mean chosen/rejected margin, reward distribution plots
- Data Summary — preference dataset statistics: size, source, annotator agreement, domain coverage
- Integration Notes — how to load the reward model for PPO training or best-of-N sampling (see the loading sketch below)
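For the Integration Notes deliverable, a best-of-N consumer of the checkpoint might look like the sketch below; the `./reward_model` path, the single-label classification head, and the `best_of_n` helper are placeholders to adapt.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path to the trained reward model checkpoint.
reward_model = AutoModelForSequenceClassification.from_pretrained("./reward_model", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("./reward_model")
reward_model.eval()

def best_of_n(prompt: str, candidates: list[str]) -> str:
    """Return the candidate completion with the highest scalar reward."""
    scores = []
    for c in candidates:
        inputs = tokenizer(prompt + c, return_tensors="pt", truncation=True, max_length=2048)
        with torch.no_grad():
            scores.append(reward_model(**inputs).logits[0, 0].item())
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```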
References
Read these only when relevant:
- TRL RewardTrainer docs: https://huggingface.co/docs/trl/reward_trainer
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT)
- Stiennon et al., "Learning to summarize from human feedback"
- Lightman et al., "Let's Verify Step by Step" (process reward models)
Related skills
- preference-optimization — PPO/DPO policy training that consumes the reward model
- safety-alignment — safety-specific reward signals and refusal training
- eval-dataset-design — designing preference annotation tasks
- synthetic-data-generation — generating synthetic preference pairs
Failure handling
- If reward model accuracy plateaus below 65%, audit the preference data for label noise, contradictory pairs, or insufficient prompt diversity (a quick audit sketch follows this list).
- If reward scores collapse to a narrow range, reduce learning rate and check for gradient issues in the value head.
- If the policy exploits the reward model during RLHF, increase the KL penalty β or retrain the RM with adversarial examples from the exploiting policy.
- If PRM training fails to improve step-level accuracy, verify that step-level annotations are consistent and that the granularity matches the model's generation pattern.
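For the first failure mode above, one cheap audit is to flag directly contradictory pairs, i.e. the same prompt with the same two completions labeled in both orders; the sketch assumes the plain `prompt`/`chosen`/`rejected` columns used earlier.

```python
def find_contradictions(dataset) -> list[int]:
    """Indices of examples whose (prompt, chosen, rejected) also appears with the labels swapped."""
    seen, contradictions = {}, []
    for i, ex in enumerate(dataset):
        swapped = (ex["prompt"], ex["rejected"], ex["chosen"])
        if swapped in seen:
            contradictions.extend([seen[swapped], i])
        seen[(ex["prompt"], ex["chosen"], ex["rejected"])] = i
    return sorted(set(contradictions))
```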