Claude-skill-registry grpo

Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into ~/.claude/skills/:

```bash
T=$(mktemp -d) && \
  git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && \
  mkdir -p ~/.claude/skills && \
  cp -r "$T/skills/data/grpo" ~/.claude/skills/majiayu000-claude-skill-registry-grpo && \
  rm -rf "$T"
```

Manifest: skills/data/grpo/SKILL.md · source content follows.
Group Relative Policy Optimization (GRPO)
Overview
GRPO is a reinforcement learning method for LLM alignment. It generates multiple completions per prompt, scores them with a reward function, and updates the policy to favor higher-reward responses using group-relative advantages rather than a learned value baseline. This skill includes patterns for training thinking/reasoning models.
Quick Reference
| Component | Purpose |
|---|---|
| `GRPOTrainer` | RL trainer for policy optimization |
| `GRPOConfig` | Training hyperparameters |
| `reward_funcs` | Reward function(s) for scoring |
| `completion_ids` | Token IDs passed to reward functions (no re-tokenization) |
| `beta` | KL penalty coefficient (0.1 typical) |
| `num_generations` | Completions per prompt (2-4) |
| `learning_rate` | 1e-5 (10x lower than SFT) |
| `</think>` (token ID 151668) | Thinking boundary for Qwen3-Thinking models |
Critical Environment Setup
```python
import os
from dotenv import load_dotenv

load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"

# CRITICAL: Set BEFORE importing unsloth/TRL
os.environ["ACCELERATE_MIXED_PRECISION"] = "bf16"
```
Critical Import Order
```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

# Then TRL imports
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset
import torch
```
Warning: Setting `ACCELERATE_MIXED_PRECISION` after imports may cause training issues.
GRPO Concepts
How GRPO Works
- Generate multiple completions for each prompt
- Score completions with reward function(s)
- Compute relative advantages within each group (see the sketch after this list)
- Update policy to favor higher-reward completions
- Apply KL penalty to prevent divergence from reference
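The group-relative baseline in step 3 is the piece that replaces PPO's critic: each completion's advantage is its reward standardized against the other completions of the same prompt. A minimal illustrative sketch (TRL computes this internally; the function name here is ours):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of completions."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # No learned value function is needed: the group itself is the baseline.
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions of one prompt, scored by some reward function:
print(group_relative_advantages([0.5, 1.0, -1.0, 1.0]))
# Higher-reward completions get positive advantage, lower-reward ones negative.
```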
Key Differences from PPO
| Aspect | GRPO | PPO |
|---|---|---|
| Baseline | Group relative | Value function |
| Critic | Not needed | Required |
| Memory | Lower | Higher |
| Stability | Good | Can be unstable |
Setup
Load Model
```python
from unsloth import FastLanguageModel

# Standard model
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    max_seq_length=512,
    load_in_4bit=True,
)

# Thinking model (for reasoning tasks)
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=1024,  # Increased for thinking content
    load_in_4bit=True,
)

# Setup pad token (required for GRPO)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
```
Apply LoRA
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```
Dataset Format
```python
# GRPO requires prompts only (completions generated during training)
dataset = Dataset.from_dict({
    "prompt": [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": "What is recursion?"}],
            tokenize=False,
            add_generation_prompt=True,
        ),
        # ... more prompts
    ]
})
```
Reward Functions
Simple Reward Function
```python
def length_reward(completions, prompts=None, **kwargs):
    """Reward based on response length."""
    # **kwargs absorbs extra arguments TRL passes (e.g. completion_ids)
    rewards = []
    for completion in completions:
        length = len(completion.split())
        if length < 5:
            rewards.append(-1.0)
        elif length < 50:
            rewards.append(1.0)
        else:
            rewards.append(0.5)
    return rewards
```
LLM-as-Judge Reward
```python
def llm_judge_reward(completions, prompts, **kwargs):
    """Use another LLM to score responses (judge_model defined elsewhere)."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        score = judge_model.evaluate(prompt, completion)
        rewards.append(score)
    return rewards
```
Rule-Based Reward
```python
def format_reward(completions, prompts=None, **kwargs):
    """Reward proper formatting."""
    rewards = []
    for completion in completions:
        score = 0.0
        if completion.endswith("."):
            score += 0.5
        if not completion.startswith(" "):
            score += 0.5
        rewards.append(score)
    return rewards
```
Composite Rewards
```python
def combined_reward(completions, prompts, **kwargs):
    """Combine multiple reward signals."""
    length_scores = length_reward(completions)
    format_scores = format_reward(completions)
    return [0.5 * l + 0.5 * f for l, f in zip(length_scores, format_scores)]
```
Thinking-Aware Reward Function (Token-Based)
Use the `completion_ids` parameter from TRL for efficient token-based parsing:
```python
THINK_END_TOKEN_ID = 151668  # </think> token for Qwen3-Thinking models

def thinking_reward_fn(completions, prompts=None, completion_ids=None, **kwargs):
    """
    Token-based reward function using completion_ids provided by TRL.

    Benefits over string matching:
    - No re-tokenization overhead (faster training)
    - Exact token boundaries (no regex edge cases)
    - Consistent with inference code pattern

    Scoring:
    - No </think> token: -1.0 (strongly penalized)
    - Short thinking (<10 tokens): 0.3
    - Medium thinking (10-30 tokens): 0.7
    - Long thinking (>30 tokens): 1.0
    - Bonus +0.1 for self-questioning (contains '?')
    """
    rewards = []
    for completion, comp_ids in zip(completions, completion_ids):
        # Token-based detection using the </think> token ID
        if THINK_END_TOKEN_ID in comp_ids:
            end_idx = comp_ids.index(THINK_END_TOKEN_ID)
            thinking_length = end_idx  # Token count before </think>

            # String-based content analysis for question detection
            thinking_content = completion.split('</think>')[0]
            has_self_questions = '?' in thinking_content

            # Score based on thinking token count
            if thinking_length < 10:
                reward = 0.3  # Minimal thinking
            elif thinking_length < 30:
                reward = 0.7 + (0.1 if has_self_questions else 0)
            else:
                reward = 1.0 + (0.1 if has_self_questions else 0)
        else:
            reward = -1.0  # No </think> token found
        rewards.append(reward)
    return rewards
```
Key insight: TRL passes `completion_ids` directly to reward functions, eliminating re-tokenization overhead.
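Before relying on the hard-coded ID, it is worth confirming it against the tokenizer you actually loaded; 151668 applies to Qwen3-Thinking vocabularies, and other model families use different IDs. A small sanity check, assuming `</think>` is a single token as it is for Qwen3:

```python
# Verify the hard-coded </think> token ID against the loaded tokenizer.
think_end_id = tokenizer.convert_tokens_to_ids("</think>")
assert think_end_id == THINK_END_TOKEN_ID, f"Unexpected </think> token ID: {think_end_id}"
```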
Multi-Objective Thinking Reward (Token-Based)
```python
THINK_END_TOKEN_ID = 151668  # </think> token for Qwen3-Thinking models

def comprehensive_thinking_reward(completions, prompts=None, completion_ids=None, **kwargs):
    """
    Evaluate multiple aspects of thinking quality using token IDs.

    Scoring breakdown:
    - Has </think> token: +0.3
    - Thinking depth (20+ tokens): +0.3
    - Structured sentences: +0.2
    - Self-questioning: +0.1
    - Step-by-step reasoning: +0.1
    """
    rewards = []
    for completion, comp_ids in zip(completions, completion_ids):
        score = 0.0
        # Token-based boundary detection
        if THINK_END_TOKEN_ID in comp_ids:
            score += 0.3  # Has proper </think> token
            end_idx = comp_ids.index(THINK_END_TOKEN_ID)
            thinking_length = end_idx  # Token count

            # Extract thinking content for text analysis
            thinking = completion.split('</think>')[0]

            # Depth (token count from IDs)
            if thinking_length >= 20:
                score += 0.3
            elif thinking_length >= 10:
                score += 0.2

            # Structure (sentences in text)
            sentences = thinking.count('.') + thinking.count('!')
            if sentences >= 2:
                score += 0.2

            # Self-questioning
            if '?' in thinking:
                score += 0.1

            # Step-by-step reasoning
            if any(w in thinking.lower() for w in ['first', 'then', 'next', 'finally']):
                score += 0.1
        else:
            score = -0.5  # Penalize missing </think> token
        rewards.append(score)
    return rewards
```
GRPOTrainer Configuration
Basic Configuration
```python
from trl import GRPOConfig

grpo_config = GRPOConfig(
    output_dir="./grpo_output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=100,
    learning_rate=1e-5,
    fp16=not is_bf16_supported(),
    bf16=is_bf16_supported(),
    optim="adamw_8bit",
    max_completion_length=128,
    num_generations=4,
    beta=0.1,
)
```
Key Parameters
| Parameter | Typical Values | Effect |
|---|---|---|
| `beta` | 0.01-0.1 | KL penalty strength |
| `num_generations` | 2-8 | Completions per prompt |
| `max_completion_length` | 64-256 | Generation length |
| `learning_rate` | 1e-6 to 1e-5 | Lower than SFT |
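For orientation, here is where `beta` enters a simplified form of the GRPO objective (following the DeepSeekMath formulation; TRL's implementation adds PPO-style clipping and other details):

$$
\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\, \hat{A}_i\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)
$$

where $\hat{A}_i$ is the group-relative advantage of completion $o_i$; a larger `beta` pulls the policy more strongly toward the reference model.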
Training
Basic Training Loop
```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    reward_funcs=length_reward,
)
trainer.train()
```
Multiple Reward Functions
```python
# Weights for combining reward functions are set via GRPOConfig
# (reward_weights), not as a GRPOTrainer argument
grpo_config.reward_weights = [0.5, 0.5]

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    reward_funcs=[length_reward, format_reward],
)
```
Troubleshooting
Reward Hacking
Symptom: Model exploits reward function (e.g., always outputs same length)
Fix:
- Add diversity penalties
- Use multiple reward signals
- Cap maximum reward (see the wrapper sketch below)
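A sketch of the last two fixes combined: wrap an existing reward function so scores are clipped to a fixed range and exact-duplicate completions are penalized. The wrapper and the penalty values are illustrative, not a TRL API:

```python
def robust_reward(completions, prompts=None, **kwargs):
    """Illustrative anti-hacking wrapper: clip scores, penalize duplicates."""
    base = length_reward(completions)        # any base reward function from above
    seen = set()
    rewards = []
    for completion, score in zip(completions, base):
        score = max(-1.0, min(1.0, score))   # cap the reward range
        key = completion.strip()
        if key in seen:                      # repeated output: likely reward hacking
            score -= 0.5
        seen.add(key)
        rewards.append(score)
    return rewards
```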
KL Divergence Too High
Symptom: Policy diverges too far from reference
Fix:
- Increase `beta` (stronger KL penalty)
- Reduce `learning_rate`
- Fewer training steps
Training Instability
Symptom: Loss spikes or NaN
Fix:
- Lower `learning_rate` to 5e-6
- Reduce `num_generations` to 2
- Check reward scale, which should be roughly -1 to 1 (see the sanity check below)
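A quick way to check the reward scale before training is to score a few hand-written completions (the samples below are illustrative):

```python
# Reward-scale sanity check: scores should fall roughly in [-1, 1].
sample_completions = [
    "Recursion is when a function calls itself to solve smaller subproblems.",
    "Yes.",
    "word " * 200,
]
scores = length_reward(sample_completions)
print(f"min={min(scores):.2f}  max={max(scores):.2f}  "
      f"mean={sum(scores) / len(scores):.2f}")
```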
Memory Issues
Symptom: OOM with multiple generations
Fix:
- Reduce `num_generations` to 2
- Use gradient checkpointing
- Reduce `max_completion_length`
Kernel Shutdown (Jupyter)
GRPO training uses significant GPU memory; shut down the kernel at the end of a run to release it:
```python
import IPython

print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
```
Important: Always run this at the end of training notebooks before switching to different models.
When to Use This Skill
Use when:
- Aligning models with human preferences
- Optimizing for specific behaviors
- Post-SFT refinement
- Building reward-driven systems
- Simpler alternative to PPO
Cross-References
- `bazzite-ai-jupyter:sft` · Pre-training before GRPO
- `bazzite-ai-jupyter:dpo` · Simpler preference learning (no reward model)
- `bazzite-ai-jupyter:rloo` · Alternative RL method with lower variance
- `bazzite-ai-jupyter:reward` · Training reward models for GRPO
- `bazzite-ai-jupyter:peft` · LoRA for efficient RL
- `bazzite-ai-jupyter:inference` · Fast inference with vLLM
- `bazzite-ai-ollama:api` · Reward model inference