Claude-code-templates torchforge-rl-training
Provides guidance for PyTorch-native agentic RL using torchforge, Meta's library that separates infrastructure from algorithm code. Use it when you want clean RL abstractions, easy algorithm experimentation, or scalable training with Monarch and TorchTitan.
```bash
git clone https://github.com/davila7/claude-code-templates

T=$(mktemp -d) && git clone --depth=1 https://github.com/davila7/claude-code-templates "$T" && mkdir -p ~/.claude/skills && cp -r "$T/cli-tool/components/skills/ai-research/post-training-torchforge" ~/.claude/skills/davila7-claude-code-templates-torchforge-rl-training && rm -rf "$T"
```
cli-tool/components/skills/ai-research/post-training-torchforge/SKILL.md
torchforge: PyTorch-Native Agentic RL Library
torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research by letting you focus on algorithms while handling distributed training, inference, and weight sync automatically.
When to Use torchforge
Choose torchforge when you need:
- Clean separation between RL algorithms and infrastructure
- PyTorch-native abstractions (no Ray dependency)
- Easy algorithm experimentation (GRPO, DAPO, SAPO in ~100 lines)
- Scalable training with Monarch actor system
- Integration with TorchTitan for model parallelism
Consider alternatives when:
- You need production-ready stability → use miles or verl
- You want Megatron-native training → use slime
- You need stable APIs → torchforge is experimental and its interfaces may change
Key Features
- Algorithm isolation: Implement RL algorithms without touching infrastructure
- Scalability: From single GPU to thousands via Monarch
- Modern stack: TorchTitan (training), vLLM (inference), TorchStore (sync)
- Loss functions: GRPO, DAPO, CISPO, GSPO, SAPO built-in
Architecture Overview
```
┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code)                            │
│ - Define reward models, loss functions, sampling         │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer                                          │
│ - Episode, Group dataclasses                             │
│ - Service interfaces (async/await)                       │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch)                           │
│ ├── Trainer (TorchTitan FSDP)                            │
│ ├── Generator (vLLM inference)                           │
│ ├── Reference Model (frozen KL baseline)                 │
│ └── Reward Actors (compute rewards)                      │
└─────────────────────────────────────────────────────────┘
```
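The diagram maps onto a simple control flow: the application layer awaits completions from the generator, scores them with reward actors, and hands the scored groups to the trainer. The sketch below illustrates only that shape; the service classes and method names are stand-ins, not the real torchforge API.

```python
# Minimal sketch of the control flow, with placeholder services so it runs
# standalone. The real torchforge services are Monarch actors backed by vLLM
# (generation) and TorchTitan (training); these stubs only mimic the shape.
import asyncio
import random

class FakeGenerator:
    async def generate(self, prompt: str, n: int) -> list[str]:
        return [f"{prompt} -> candidate {i}" for i in range(n)]

class FakeRewardActor:
    async def score(self, prompt: str, completion: str) -> float:
        return random.random()

class FakeTrainer:
    async def step(self, completions: list[str], rewards: list[float]) -> None:
        print(f"train_step on {len(completions)} completions, "
              f"mean reward {sum(rewards) / len(rewards):.2f}")

async def train_loop(generator, reward_actor, trainer, prompts, n_samples=4):
    for prompt in prompts:
        completions = await generator.generate(prompt, n_samples)        # rollout
        rewards = [await reward_actor.score(prompt, c) for c in completions]
        await trainer.step(completions, rewards)                         # policy update

asyncio.run(train_loop(FakeGenerator(), FakeRewardActor(), FakeTrainer(), ["2+2=?"]))
```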
Installation
```bash
# Create environment
conda create -n forge python=3.12
conda activate forge

# Install (handles PyTorch nightly + dependencies)
./scripts/install.sh

# Verify
python -c "import torch, forge, vllm; print('OK')"
```
ROCm Installation
```bash
./scripts/install_rocm.sh
```
Quick Start
SFT Training (2+ GPUs)
```bash
python -m apps.sft.main --config apps/sft/llama3_8b.yaml
```
GRPO Training (3+ GPUs)
```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```
Workflow 1: GRPO Training for Math Reasoning
Use this workflow for training reasoning models with group-relative advantages.
Prerequisites Checklist
- 3+ GPUs (GPU0: trainer, GPU1: ref_model, GPU2: generator)
- Model from HuggingFace Hub
- Training dataset (GSM8K, MATH, etc.)
Step 1: Create Configuration
```yaml
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"

dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true

training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4

grpo:
  n_samples: 8        # Responses per prompt
  clip_low: 0.2
  clip_high: 0.28
  beta: 0.1           # KL penalty coefficient
  temperature: 0.7

services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true
```
Step 2: Define Reward Function
```python
# rewards.py
import re

# Reward functions are in forge.data.rewards
from forge.data.rewards import MathReward, ThinkingReward

# Or define your own reward function
class CustomMathReward:
    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Extract answer from response
        match = re.search(r'\\boxed{([^}]+)}', response)
        if not match:
            return 0.0
        answer = match.group(1).strip()
        return 1.0 if answer == target else 0.0
```
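These rewards feed GRPO's group-relative advantages: the n_samples completions for a prompt are scored, and each reward is normalized against the others in its group. The apps/grpo code does this internally; the snippet below is only a sketch of the standard formulation (details such as dividing by the std may differ).

```python
# Sketch of group-relative advantages in the standard GRPO formulation:
# normalize each reward against the other completions of the same prompt.
# Illustrative only; torchforge's apps/grpo code computes this for you.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [n_samples] rewards for one prompt's group of completions."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])   # e.g. from CustomMathReward
print(group_advantages(rewards))               # correct answers get positive advantage
```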
Step 3: Launch Training
```bash
python -m apps.grpo.main --config config/grpo_math.yaml
```
Step 4: Monitor Progress
- Check W&B dashboard for loss curves
- Verify entropy is decreasing (policy becoming more deterministic)
- Monitor KL divergence (should stay bounded)
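If you want a KL number beyond what the dashboard already logs, it can be estimated from the per-token logprobs the trainer has on hand. The helper below is not a torchforge API; it is a sketch using the common low-variance k3 estimator.

```python
# Sketch: estimate per-token KL(pi || pi_ref) from sampled-token logprobs using
# the k3 estimator exp(d) - 1 - d, where d = ref_logprobs - logprobs.
# Hypothetical helper for extra logging; not part of torchforge.
import torch

def kl_estimate(logprobs, ref_logprobs, padding_mask):
    d = ref_logprobs - logprobs
    per_token_kl = torch.exp(d) - 1 - d          # always >= 0
    return (per_token_kl * padding_mask).sum() / padding_mask.sum()

logprobs = torch.tensor([[-1.2, -0.8, -2.0]])
ref_logprobs = torch.tensor([[-1.0, -1.0, -1.5]])
mask = torch.ones_like(logprobs)
print(kl_estimate(logprobs, ref_logprobs, mask).item())
```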
Workflow 2: Custom Loss Function
Use this workflow to implement new RL algorithms.
Step 1: Create Loss Class
```python
# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn


class CustomLoss(nn.Module):
    def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
        super().__init__()
        self.clip_range = clip_range
        self.beta = beta

    def forward(
        self,
        logprobs: torch.Tensor,
        ref_logprobs: torch.Tensor,
        advantages: torch.Tensor,
        padding_mask: torch.Tensor,
    ) -> torch.Tensor:
        # Compute importance ratio
        ratio = torch.exp(logprobs - ref_logprobs)

        # Clipped policy gradient
        clipped_ratio = torch.clamp(
            ratio, 1 - self.clip_range, 1 + self.clip_range
        )
        pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)

        # KL penalty
        kl = ref_logprobs - logprobs

        # Apply mask and aggregate
        masked_loss = (pg_loss + self.beta * kl) * padding_mask
        loss = masked_loss.sum() / padding_mask.sum()
        return loss
```
Step 2: Integrate into Application
```python
# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

# In training loop
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```
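Before launching a full run, the loss can be smoke-tested on random tensors to confirm it returns a finite scalar and that gradients flow. The shapes below are arbitrary.

```python
# Smoke test for CustomLoss on fake data: batch=2, seq_len=5.
import torch
from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

logprobs = torch.randn(2, 5, requires_grad=True)
ref_logprobs = torch.randn(2, 5)
advantages = torch.randn(2, 5)
padding_mask = torch.ones(2, 5)
padding_mask[1, 3:] = 0                      # simulate a shorter response

loss = loss_fn(logprobs, ref_logprobs, advantages, padding_mask)
loss.backward()
assert torch.isfinite(loss), "loss is not finite"
print(loss.item())
```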
Workflow 3: Multi-GPU Distributed Training
Use this workflow for scaling to multiple GPUs or nodes.
Configuration for Distributed
```yaml
# config/distributed.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"

parallelism:
  tensor_parallel_degree: 2       # Split model across GPUs
  pipeline_parallel_degree: 1
  data_parallel_shard_degree: 2

services:
  generator:
    procs: 2                      # 2 processes for TP=2
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 2
    num_replicas: 1
    with_gpus: true
```
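The degrees multiply: in the usual TorchTitan-style accounting, a training mesh needs tensor_parallel_degree × pipeline_parallel_degree × data_parallel_shard_degree GPUs. How torchforge maps that product onto procs and num_replicas should be checked against its shipped configs; the arithmetic itself is just:

```python
# Back-of-envelope GPU count for the trainer mesh in the config above
# (general FSDP/TP/PP accounting; verify the procs mapping against
# torchforge's shipped configs for your version).
tp = 2   # tensor_parallel_degree
pp = 1   # pipeline_parallel_degree
dp = 2   # data_parallel_shard_degree

print(tp * pp * dp)   # -> 4 GPUs for training, plus the generator's GPUs
```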
Launch with SLURM
```bash
# Submit job
sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh
```
Launch Locally (Multi-GPU)
```bash
# 8 GPU setup
python -m apps.grpo.main \
  --config config/distributed.yaml \
  --trainer.procs 4 \
  --generator.procs 4
```
Core API Reference
Training Batch Format
torchforge uses dictionary-based batches for training:
```python
# inputs: list of dicts with torch.Tensor values
inputs = [{"tokens": torch.Tensor}]

# targets: list of dicts with training signals
targets = [{
    "response": torch.Tensor,
    "ref_logprobs": torch.Tensor,
    "advantages": torch.Tensor,
    "padding_mask": torch.Tensor,
}]

# train_step returns loss as float
loss = trainer.train_step(inputs, targets)
```
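Responses in a batch usually have different lengths, so the per-token tensors are right-padded and paired with a padding_mask that the loss uses to ignore pad positions. The apps assemble these batches for you; the snippet below is an illustrative sketch using the field names above.

```python
# Sketch: right-pad variable-length responses and build the matching
# padding_mask. Illustrative only; apps/grpo builds these batches internally.
import torch
from torch.nn.utils.rnn import pad_sequence

responses = [torch.tensor([11, 12, 13, 14]), torch.tensor([21, 22])]
ref_lp = [torch.randn(4), torch.randn(2)]        # per-token reference logprobs

targets = [{
    "response": pad_sequence(responses, batch_first=True, padding_value=0),
    "ref_logprobs": pad_sequence(ref_lp, batch_first=True, padding_value=0.0),
    "advantages": torch.randn(2, 1),             # e.g. one advantage per sample
    "padding_mask": pad_sequence(
        [torch.ones(len(r)) for r in responses], batch_first=True, padding_value=0.0
    ),
}]
print(targets[0]["padding_mask"])                # 1s for real tokens, 0s for padding
```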
Completion
Generated output from vLLM:
```python
@dataclass
class Completion:
    text: str                 # Generated text
    token_ids: list[int]      # Token IDs
    logprobs: list[float]     # Log probabilities
    metadata: dict            # Custom metadata
```
Built-in Loss Functions
Loss functions are in the forge.losses module:
```python
from forge.losses import SimpleGRPOLoss, ReinforceLoss

# SimpleGRPOLoss for GRPO training
loss_fn = SimpleGRPOLoss(beta=0.1)

# Forward pass
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```
ReinforceLoss
```python
from forge.losses.reinforce_loss import ReinforceLoss

# With optional importance ratio clipping
loss_fn = ReinforceLoss(clip_ratio=0.2)
```
Common Issues and Solutions
Issue: Not Enough GPUs
Symptoms: "Insufficient GPU resources" error
Solutions:
```yaml
# Reduce service requirements
services:
  generator:
    procs: 1
    with_gpus: true
  trainer:
    procs: 1
    with_gpus: true
  # Remove ref_model (uses generator weights)
```
Or use CPU for reference model:
```yaml
ref_model:
  with_gpus: false
```
Issue: OOM During Generation
Symptoms: CUDA OOM in vLLM
Solutions:
```yaml
# Reduce batch size
grpo:
  n_samples: 4      # Reduce from 8

# Or reduce sequence length
training:
  seq_len: 2048
```
Issue: Slow Weight Sync
Symptoms: Long pauses between training and generation
Solutions:
```bash
# Enable RDMA (if available)
export TORCHSTORE_USE_RDMA=1
```

```yaml
# Or reduce sync frequency
training:
  sync_interval: 10   # Sync every 10 steps
```
Issue: Policy Collapse
Symptoms: Entropy drops to zero, reward stops improving
Solutions:
```yaml
# Increase KL penalty
grpo:
  beta: 0.2           # Increase from 0.1

# Or add entropy bonus
training:
  entropy_coef: 0.01
```
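If you add the entropy bonus in a custom loss rather than through config, it is typically computed from the full-vocabulary logits and subtracted from the loss, scaled by entropy_coef. The helper below is a hypothetical sketch, not a torchforge API.

```python
# Sketch: masked mean token entropy from full-vocabulary logits, returned as a
# bonus term to add to the loss. Hypothetical helper, not a torchforge API.
import torch
import torch.nn.functional as F

def entropy_bonus(logits, padding_mask, entropy_coef=0.01):
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)              # [batch, seq]
    mean_entropy = (entropy * padding_mask).sum() / padding_mask.sum()
    return -entropy_coef * mean_entropy                     # add to the loss

logits = torch.randn(2, 5, 32_000)      # [batch, seq, vocab]
mask = torch.ones(2, 5)
print(entropy_bonus(logits, mask).item())
```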
Resources
- Documentation: https://meta-pytorch.org/torchforge
- GitHub: https://github.com/meta-pytorch/torchforge
- Discord: https://discord.gg/YsTYBh6PD9
- TorchTitan: https://github.com/pytorch/torchtitan
- Monarch: https://github.com/meta-pytorch/monarch