# Hermes-agent slime-rl-training

Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang framework. Use when training GLM models, implementing custom data generation workflows, or needing tight Megatron-LM integration for RL scaling.

Clone the repository:

```bash
git clone https://github.com/NousResearch/hermes-agent
```

Or install just this skill:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/NousResearch/hermes-agent "$T" && mkdir -p ~/.claude/skills && cp -r "$T/optional-skills/mlops/slime" ~/.claude/skills/nousresearch-hermes-agent-slime-rl-training-08cba0 && rm -rf "$T"
```

Source: `optional-skills/mlops/slime/SKILL.md`

# slime: LLM Post-Training Framework for RL Scaling
slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.
## When to Use slime
Choose slime when you need:
- Megatron-LM native training with SGLang inference
- Custom data generation workflows with flexible data buffers
- Training GLM, Qwen3, DeepSeek V3, or Llama 3 models
- Research-grade framework with production backing (Z.ai)
Consider alternatives when:
- You need enterprise-grade stability features → use miles
- You want flexible backend swapping → use verl
- You need PyTorch-native abstractions → use torchforge
## Key Features
- Training: Megatron-LM with full parallelism support (TP, PP, DP, SP)
- Rollout: SGLang-based high-throughput generation with router
- Data Buffer: Flexible prompt management and sample storage
- Models: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3
## Architecture Overview

```text
┌─────────────────────────────────────────────────────────┐
│                       Data Buffer                       │
│  - Prompt initialization and management                 │
│  - Custom data generation and filtering                 │
│  - Rollout sample storage                               │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘
```
## Installation

```bash
# Recommended: Docker
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it slimerl/slime:latest /bin/bash

# Inside the container
cd /root/slime && pip install -e . --no-deps
```
### From Source

```bash
git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .
```
## Quick Start: GRPO Training

```bash
# Source model configuration
source scripts/models/qwen3-4B.sh

# Launch training
python train.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 4 \
  --rollout-num-gpus 4 \
  --advantage-estimator grpo \
  --use-kl-loss --kl-loss-coef 0.001 \
  --rollout-batch-size 32 \
  --n-samples-per-prompt 8 \
  --global-batch-size 256 \
  --num-rollout 3000 \
  --prompt-data /path/to/data.jsonl \
  ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}
```
## Workflow 1: Standard GRPO Training
Use this workflow for training reasoning models with group-relative advantages.
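GRPO scores each response against the other samples drawn for the same prompt. A minimal sketch of the group-relative advantage idea (an illustration of the algorithm, not slime's internal implementation, which may differ in detail):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Normalize each reward against its prompt group's mean and std."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Rewards for 8 samples of one prompt (cf. --n-samples-per-prompt 8):
print(grpo_advantages([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]))
```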
### Prerequisites Checklist
- Docker environment or Megatron-LM + SGLang installed
- Model checkpoint (HuggingFace or Megatron format)
- Training data in JSONL format
### Step 1: Prepare Data

Each line of `data.jsonl` is one JSON object:

```json
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}
```
Or with chat format:
{ "prompt": [ {"role": "system", "content": "You are a math tutor."}, {"role": "user", "content": "What is 15 + 27?"} ], "label": "42" }
### Step 2: Configure Model
Choose a pre-configured model script:
```bash
# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...

# Source your model
source scripts/models/qwen3-4B.sh
```
### Step 3: Launch Training

```bash
python train.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --advantage-estimator grpo \
  --use-kl-loss \
  --kl-loss-coef 0.001 \
  --prompt-data /path/to/train.jsonl \
  --input-key prompt \
  --label-key label \
  --apply-chat-template \
  --rollout-batch-size 32 \
  --n-samples-per-prompt 8 \
  --global-batch-size 256 \
  --num-rollout 3000 \
  --save-interval 100 \
  --eval-interval 50 \
  ${MODEL_ARGS[@]}
```
### Step 4: Monitor Training

- Check TensorBoard:

  ```bash
  tensorboard --logdir outputs/
  ```

- Verify that reward curves are increasing
- Monitor GPU utilization across nodes
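To check reward curves programmatically, you can read the event files directly. A sketch assuming slime logs a scalar tag such as `rollout/reward` (tag names vary; list them first to find the real ones):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("outputs/")   # TensorBoard log directory
acc.Reload()
print(acc.Tags()["scalars"])         # discover the actual scalar tags
rewards = [e.value for e in acc.Scalars("rollout/reward")]  # assumed tag name
print("last 5 reward points:", rewards[-5:])
```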
## Workflow 2: Asynchronous Training
Use async mode for higher throughput by overlapping rollout and training.
### When to Use Async
- Large models with long generation times
- High GPU idle time in synchronous mode
- Sufficient memory for buffering
### Launch Async Training

```bash
python train_async.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --advantage-estimator grpo \
  --async-buffer-size 4 \
  --prompt-data /path/to/train.jsonl \
  ${MODEL_ARGS[@]}
```
### Async-Specific Parameters

```bash
--async-buffer-size 4          # Number of rollouts to buffer
--update-weights-interval 2    # Sync weights every N rollouts
```
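Conceptually, the buffer decouples the two loops so generation keeps running while the trainer consumes. A toy sketch of the pattern using `asyncio.Queue` (slime's actual scheduler is more involved):

```python
import asyncio

async def rollout_worker(buffer: asyncio.Queue):
    for step in range(8):
        await asyncio.sleep(0.1)             # stand-in for generation
        await buffer.put(f"rollout-{step}")  # blocks once the buffer is full

async def trainer(buffer: asyncio.Queue):
    for _ in range(8):
        batch = await buffer.get()
        await asyncio.sleep(0.2)             # stand-in for a training step
        print("trained on", batch)

async def main():
    buffer = asyncio.Queue(maxsize=4)        # cf. --async-buffer-size 4
    await asyncio.gather(rollout_worker(buffer), trainer(buffer))

asyncio.run(main())
```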
## Workflow 3: Multi-Turn Agentic Training
Use this workflow for training agents with tool use or multi-step reasoning.
### Prerequisites
- Custom generate function for multi-turn logic
- Tool/environment interface
### Step 1: Define Custom Generate Function

```python
# custom_generate.py
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        conversation = sample.prompt
        for turn in range(args.max_turns):
            # Generate the next assistant response
            response = await generate_single(conversation)
            # Check for a tool call in the response
            tool_call = extract_tool_call(response)
            if tool_call:
                tool_result = execute_tool(tool_call)
                conversation.append({"role": "assistant", "content": response})
                conversation.append({"role": "tool", "content": tool_result})
            else:
                break
        sample.response = response
        sample.reward = compute_reward(sample)
    return samples
```
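The example leaves helpers like `extract_tool_call` undefined. A minimal sketch, assuming the model emits `<tool_call>{...}</tool_call>` blocks (the real format depends on your chat template):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_call(response: str):
    """Return the parsed tool-call dict, or None if the model answered directly."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```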
### Step 2: Launch with Custom Function

```bash
python train.py \
  --custom-generate-function-path custom_generate.py \
  --max-turns 5 \
  --prompt-data /path/to/agent_data.jsonl \
  ${MODEL_ARGS[@]}
```
See `examples/search-r1/` for a complete multi-turn search example.
## Configuration Reference
### Three Argument Categories
slime uses three types of arguments:
1. **Megatron Arguments** (passed directly):

   ```bash
   --tensor-model-parallel-size 2
   --pipeline-model-parallel-size 1
   --num-layers 32
   --hidden-size 4096
   ```
2. **SGLang Arguments** (prefixed with `--sglang-`):

   ```bash
   --sglang-mem-fraction-static 0.8
   --sglang-context-length 8192
   --sglang-log-level INFO
   ```
3. **slime Arguments**:

   ```bash
   # Resource allocation
   --actor-num-nodes 1
   --actor-num-gpus-per-node 8
   --rollout-num-gpus 8
   --colocate                    # Share GPUs between training/inference

   # Data
   --prompt-data /path/to/data.jsonl
   --input-key prompt
   --label-key label

   # Training loop
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --global-batch-size 256

   # Algorithm
   --advantage-estimator grpo    # or: gspo, ppo, reinforce_plus_plus
   --use-kl-loss
   --kl-loss-coef 0.001
   ```
### Key Constraints

```text
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
```

Example: 32 × 8 = 256 × 1
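A quick way to sanity-check a configuration before launching (a hypothetical helper, not part of slime):

```python
def check_batch_config(rollout_batch_size, n_samples_per_prompt, global_batch_size):
    """Return the num_steps_per_rollout implied by the constraint above."""
    total = rollout_batch_size * n_samples_per_prompt
    assert total % global_batch_size == 0, (
        f"{total} samples per rollout is not a multiple of "
        f"global_batch_size={global_batch_size}"
    )
    return total // global_batch_size

print(check_batch_config(32, 8, 256))  # -> 1 optimizer step per rollout
```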
## Data Buffer System
slime's data buffer enables flexible data management:
### Basic Data Source

```python
class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from the dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass
```
### Buffered Data Source (Off-Policy)

```python
class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        return select_best(buffer, num_samples)
```
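`buffer_filter` above calls an undefined `select_best`; one plausible implementation is reward-prioritized selection (an illustration, not slime's built-in behavior):

```python
def select_best(buffer, num_samples):
    """Take the highest-reward samples; the rest stay buffered for later."""
    ranked = sorted(buffer, key=lambda s: s.reward, reverse=True)
    return ranked[:num_samples]
```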
## Common Issues and Solutions
### Issue: SGLang Engine Crash

**Symptoms:** Inference engine dies mid-training

**Solutions:**

```bash
# Enable fault tolerance
--use-fault-tolerance

# Increase memory allocation
--sglang-mem-fraction-static 0.85

# Reduce batch size
--rollout-batch-size 16
```
### Issue: Weight Sync Timeout

**Symptoms:** Training hangs after rollout

**Solutions:**

```bash
# Increase sync interval
--update-weights-interval 5

# Use colocated mode (no network transfer)
--colocate
```
### Issue: OOM During Training

**Symptoms:** CUDA OOM in the backward pass

**Solutions:**

```bash
# Enable gradient checkpointing
--recompute-activations

# Reduce micro-batch size
--micro-batch-size 1

# Enable sequence parallelism
--sequence-parallel
```
### Issue: Slow Data Loading

**Symptoms:** GPUs idle during data fetch

**Solutions:**

```bash
# Increase data workers
--num-data-workers 4

# Use streaming dataset
--streaming-data
```
## Supported Models
| Model Family | Configurations |
|---|---|
| GLM | GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B |
| Qwen | Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5 |
| DeepSeek | V3, V3.1, R1 |
| Llama | Llama 3 (8B, 70B) |
| Others | Kimi K2, Moonlight-16B |
Each model has pre-configured scripts in `scripts/models/`.
## Advanced Topics
### Co-location Mode

Share GPUs between training and inference to cut the total GPU footprint. SGLang's static memory pool must shrink (e.g., `--sglang-mem-fraction-static 0.4`) to leave headroom for Megatron's training state:

```bash
python train.py \
  --colocate \
  --actor-num-gpus-per-node 8 \
  --sglang-mem-fraction-static 0.4 \
  ${MODEL_ARGS[@]}
```
### Custom Reward Model

```python
# custom_rm.py
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()
```

Register it with:

```bash
--custom-rm-path custom_rm.py
```
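For verifiable tasks a model-free reward is often enough. A minimal rule-based sketch (a hypothetical class; note it takes labels, unlike the model-based interface above):

```python
class ExactMatchReward:
    def compute_reward(self, prompts, responses, labels):
        """1.0 for an exact answer match, 0.0 otherwise."""
        return [1.0 if resp.strip() == label.strip() else 0.0
                for resp, label in zip(responses, labels)]
```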
### Multi-Task Evaluation

```bash
--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16
```
## Resources
- Documentation: https://thudm.github.io/slime/
- GitHub: https://github.com/THUDM/slime
- Blog: https://lmsys.org/blog/2025-07-09-slime/
- Examples: See the `examples/` directory for 14+ worked examples