miles: Enterprise-Grade RL for Large-Scale Model Training
miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.
When to Use miles
Choose miles when you need:
- Training of 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
- FP8 training or INT4 quantization-aware training (QAT)
- Bit-wise identical train-inference alignment
- Speculative RL for maximum throughput
- Production stability with enterprise support
Consider alternatives when:
- You want the research-grade original → use slime
- You need flexible backend swapping → use verl
- You want PyTorch-native abstractions → use torchforge
Key Features
Low-Precision Training
- Unified FP8: End-to-end FP8 for both inference and training
- INT4 QAT: 1TB models on single-machine VRAM (H200)
- Rollout Routing Replay (R3): Bit-wise expert alignment for MoE
Performance Optimizations
- Speculative RL: 25%+ rollout speedup with online SFT draft models
- Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
- Partial Rollout: Recycle half-finished trajectories
Train-Inference Alignment
- TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction
- Kernel-level optimization: FlashAttention-3, DeepGEMM integration
Installation
    # Recommended: Docker
    docker pull radixark/miles:latest
    docker run --rm --gpus all --ipc=host --shm-size=16g \
      -it radixark/miles:latest /bin/bash

    # From source
    git clone https://github.com/radixark/miles.git
    cd miles
    pip install -r requirements.txt
    pip install -e .
Quick Start
miles inherits slime's configuration system. Basic training:
    python train.py \
      --advantage-estimator grpo \
      --model-name qwen3-30b-a3b \
      --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
      --rollout-batch-size 512 \
      --n-samples-per-prompt 8
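The grpo estimator and the 8 samples per prompt in the command above go together: GRPO scores each response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization as it is commonly described. It is illustrative only and not taken from the miles source; the function name and the eps value are assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages for the samples of ONE prompt.

    Hypothetical helper: each reward is centered on the group mean and
    scaled by the group (population) standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight rewards for one prompt, matching --n-samples-per-prompt 8.
group = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = grpo_advantages(group)
```

Responses that beat the group average get a positive advantage and the rest get a negative one. A group in which every sample has the same reward yields all-zero advantages and therefore no learning signal.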
Workflow 1: Large MoE Training
Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.
Prerequisites Checklist
- H100/H200 GPUs with FP8 support
- MoE model (DeepSeek V3, Qwen3-MoE)
- Docker environment with miles
Step 1: Environment Setup
    # FP8 block scaling (recommended for stability)
    export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
    export CUDA_DEVICE_MAX_CONNECTIONS=1
Step 2: Configure Training
    python train.py \
      --actor-num-gpus-per-node 8 \
      --rollout-num-gpus 8 \
      --hf-checkpoint /path/to/deepseek-v3 \
      --advantage-estimator grpo \
      --tensor-model-parallel-size 8 \
      --expert-model-parallel-size 4 \
      --prompt-data /path/to/data.jsonl \
      --num-rollout 3000
Verification Checklist
- Model loads without errors
- Routing decisions are consistent
- No NaN/Inf in loss values
Workflow 2: Speculative RL Training
Use this workflow for maximum rollout throughput with EAGLE speculative decoding.
How Speculative RL Works
- Small draft model generates candidate tokens
- Target model verifies in parallel
- Draft model updated via online SFT to track policy
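The first two steps above can be sketched with toy models. This is the basic greedy draft-then-verify loop, not EAGLE's drafting scheme and not SGLang's implementation. Both models here are plain Python functions invented for illustration.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]


def speculative_step(prefix: list[int], draft: NextToken,
                     target: NextToken, k: int) -> list[int]:
    """One draft-then-verify step with greedy acceptance.

    Returns the tokens appended to `prefix` in this step.
    """
    # 1. The draft model proposes k candidate tokens autoregressively.
    proposed: list[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each position. A real engine does this in
    #    one batched forward pass; here it is a plain loop.
    accepted: list[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)

    # 3. All k candidates accepted: the target contributes one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def toy_target(ctx: list[int]) -> int:
    """Toy target model: always counts up by one."""
    return ctx[-1] + 1


def toy_draft(ctx: list[int]) -> int:
    """Toy draft model: agrees with the target except right after token 2."""
    return 99 if ctx[-1] == 2 else ctx[-1] + 1


out = speculative_step([0], toy_draft, toy_target, k=4)
```

In this run the draft proposes four tokens, the target accepts the first two and corrects the third, so the step still yields three tokens for a single round of target verification. The closer the draft tracks the policy, the longer the accepted prefix, which is why the draft is kept up to date with online SFT.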
Step 1: Enable Speculative Decoding
miles supports EAGLE speculative decoding via SGLang:
    python train.py \
      --actor-num-gpus-per-node 8 \
      --hf-checkpoint /path/to/target-model \
      --sglang-speculative-algorithm EAGLE \
      --sglang-speculative-num-steps 3 \
      --sglang-speculative-eagle-topk 1 \
      --sglang-speculative-num-draft-tokens 4 \
      --sglang-speculative-draft-model-path /path/to/draft-model \
      --advantage-estimator grpo \
      --prompt-data /path/to/data.jsonl
Step 2: Enable Online MTP Training (Optional)
For online SFT of the draft model during training, add:
    --mtp-num-layers 1 \
    --enable-mtp-training \
    --mtp-loss-scaling-factor 0.2
Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace.
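As a rough intuition for the loss scaling factor, the sketch below shows how an auxiliary MTP loss is typically folded into the main training loss: the per-layer MTP losses are averaged, scaled, and added. Whether miles combines the terms exactly this way is an assumption; the function name and the numbers are illustrative.

```python
def combined_loss(main_loss: float, mtp_losses: list[float],
                  scaling_factor: float) -> float:
    """Main loss plus the scaled average of per-layer MTP losses (assumed form)."""
    if not mtp_losses:
        return main_loss
    return main_loss + scaling_factor * sum(mtp_losses) / len(mtp_losses)


# One MTP layer and a scaling factor of 0.2, as in the flags above.
total = combined_loss(main_loss=1.5, mtp_losses=[2.0], scaling_factor=0.2)
```

Under this reading, a small factor keeps the draft-model objective from dominating the policy update.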
Expected Speedup
- Standard rollout: Baseline
- Speculative RL: 25-40% faster rollout
- With partial rollout: Additional 10-15% throughput
Configuration Reference
miles inherits all slime arguments. See the slime API Reference for the complete list.
Cluster Resources (from slime)
    --actor-num-nodes 1
    --actor-num-gpus-per-node 8
    --rollout-num-gpus 8
    --rollout-num-gpus-per-engine 2
    --colocate
Megatron Parallelism (from slime)
    --tensor-model-parallel-size 8
    --pipeline-model-parallel-size 2
    --expert-model-parallel-size 4  # MoE expert parallelism
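A quick sanity check before launching is that the number of training GPUs is a multiple of the tensor-parallel size times the pipeline-parallel size, since Megatron divides what is left into data-parallel replicas. The sketch below covers only that arithmetic. Expert parallelism adds further constraints that are not checked here, and the 4-node cluster is a hypothetical example.

```python
def data_parallel_size(total_gpus: int, tp: int, pp: int) -> int:
    """Data-parallel replicas left after tensor and pipeline parallelism."""
    model_parallel = tp * pp
    if total_gpus % model_parallel != 0:
        raise ValueError(
            f"{total_gpus} GPUs is not a multiple of TP*PP = {model_parallel}"
        )
    return total_gpus // model_parallel


# TP=8 and PP=2 as in the flags above, on a hypothetical 4 nodes x 8 GPUs.
dp = data_parallel_size(total_gpus=32, tp=8, pp=2)
```

With TP=8 and PP=2 each model replica spans 16 GPUs, so the flag groups in this reference appear to be independent examples rather than one combined configuration for a single 8-GPU node.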
Speculative Decoding (miles-specific)
    --sglang-speculative-algorithm EAGLE
    --sglang-speculative-num-steps 3
    --sglang-speculative-eagle-topk 1
    --sglang-speculative-num-draft-tokens 4
    --sglang-enable-draft-weights-cpu-backup
    --sglang-speculative-draft-model-path /your/draft/model/path
Online MTP Training (miles-specific)
    --mtp-num-layers 1
    --enable-mtp-training
    --mtp-loss-scaling-factor 0.2
Key Features (Conceptual)
The following features are documented in miles, but the specific CLI flags may vary between versions. Consult the miles repository for the latest configuration.
Unified FP8 Pipeline
End-to-end FP8 sampling and training, which eliminates the quantization-induced discrepancy between rollout and training that can cause RL collapse in MoE models.
Rollout Routing Replay (R3)
Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.
How R3 Works:
- During SGLang inference, expert routing decisions are recorded
- Routing decisions are stored in sample.rollout_routed_experts
- During Megatron training, routing is replayed instead of recomputed
- Ensures identical expert selection between train and inference
INT4 Quantization-Aware Training
Enables single-machine deployment of 1TB+ models (e.g., on H200).
Memory Savings with INT4:
| Model Size | BF16 VRAM | INT4 VRAM | Reduction |
|---|---|---|---|
| 70B | 140GB | 45GB | 3.1x |
| 235B | 470GB | 150GB | 3.1x |
| 671B | 1.3TB | 420GB | 3.1x |
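The BF16 column follows directly from two bytes per parameter. The sketch below reproduces that arithmetic and the reduction ratio for the 70B row. It is a weight-only estimate with a hypothetical helper name, and it ignores activations, optimizer state, and KV cache.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return params_billion * bits_per_param / 8


bf16_70b = weight_memory_gb(70, 16)  # two bytes per parameter
int4_70b = weight_memory_gb(70, 4)   # half a byte per parameter
table_reduction = 140 / 45           # ratio of the table's 70B row
```

The INT4 figures in the table (45GB for 70B) sit above the 35GB weight-only estimate, presumably because of quantization scales and layers kept at higher precision. That explanation is an assumption, not a statement from the miles documentation.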
Train-Inference Alignment
miles achieves "exactly 0 KL divergence" between training and inference through:
- Flash Attention 3
- DeepGEMM
- Batch-invariant kernels from Thinking Machines Lab
- torch.compile integration
Sample Data Structure
miles uses the same Sample dataclass as slime, with the rollout_routed_experts field for MoE routing replay:
    @dataclass
    class Sample:
        prompt: str | list[dict]
        tokens: list[int]
        response: str
        reward: float | dict
        loss_mask: list[int]
        status: Status
        metadata: dict
        rollout_log_probs: list[float]
        rollout_routed_experts: list[list[int]]  # MoE routing for R3
See the slime API Reference for the complete Sample definition.
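One simple use of the recorded routing is to check how evenly experts were used for a sample. The sketch below uses a trimmed stand-in class rather than the real Sample, and it assumes that rollout_routed_experts holds one list of expert ids per token. The actual layout in miles may differ (for example, it may also be split per layer).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SampleStub:
    """Hypothetical stand-in with only the fields this example needs."""
    tokens: list[int] = field(default_factory=list)
    rollout_routed_experts: list[list[int]] = field(default_factory=list)


def expert_usage(sample: SampleStub) -> Counter:
    """Count how often each expert id appears across the sample's tokens."""
    return Counter(
        expert
        for per_token in sample.rollout_routed_experts
        for expert in per_token
    )


sample = SampleStub(
    tokens=[11, 12, 13],
    # Assumed layout: the top-2 expert ids chosen for each token.
    rollout_routed_experts=[[0, 3], [3, 5], [0, 3]],
)
usage = expert_usage(sample)
```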
Common Issues and Solutions
Issue: FP8 Training Collapse
Symptoms: Loss explodes, NaN values
Solutions:
- Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
- Reduce learning rate: --lr 5e-7
- Ensure MoE routing is consistent between train/inference
Issue: Speculative Draft Drift
Symptoms: Low acceptance rate over time
Solutions:
- Enable online MTP training to keep draft model aligned
- Reduce speculative steps: --sglang-speculative-num-steps 2
- Use CPU backup: --sglang-enable-draft-weights-cpu-backup
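Draft drift is easier to catch when the acceptance rate is tracked over time. The sketch below compares the average of the earliest and latest measurements. How accepted and drafted token counts are obtained from the rollout engine is not shown; the helper names, window size, and drop threshold are all hypothetical.

```python
def acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of drafted tokens that the target model accepted."""
    return accepted / drafted if drafted else 0.0


def is_drifting(rates: list[float], window: int, drop: float) -> bool:
    """True when the mean of the last `window` rates is more than `drop`
    below the mean of the first `window` rates."""
    if len(rates) < 2 * window:
        return False
    early = sum(rates[:window]) / window
    late = sum(rates[-window:]) / window
    return early - late > drop


# Hypothetical acceptance rates logged over six rollout iterations.
rates = [0.80, 0.78, 0.79, 0.70, 0.62, 0.55]
drifting = is_drifting(rates, window=3, drop=0.1)
```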
Issue: Train-Inference Mismatch
Symptoms: Policy divergence, reward collapse
Solutions:
- Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
- Verify log probs match between SGLang and Megatron
- Enable R3 for MoE models
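The first two solutions can be illustrated on per-token log-probs. The sketch below measures the gap between training and rollout log-probs and then applies truncated (TIS-style) and masked (MIS-style) importance weights as they are commonly described. The cap parameter is a hypothetical name and is not claimed to have the same meaning as --tis-threshold; the log-prob values are invented.

```python
import math


def mean_abs_diff(train_lp: list[float], rollout_lp: list[float]) -> float:
    """Average absolute gap between training and rollout log-probs."""
    return sum(abs(t - r) for t, r in zip(train_lp, rollout_lp)) / len(train_lp)


def importance_ratios(train_lp: list[float], rollout_lp: list[float]) -> list[float]:
    """Per-token ratio pi_train / pi_rollout computed from log-probs."""
    return [math.exp(t - r) for t, r in zip(train_lp, rollout_lp)]


def truncate(ratios: list[float], cap: float) -> list[float]:
    """TIS-style weights: clip each ratio at `cap`."""
    return [min(w, cap) for w in ratios]


def mask(ratios: list[float], cap: float) -> list[float]:
    """MIS-style weights: zero out tokens whose ratio exceeds `cap`."""
    return [w if w <= cap else 0.0 for w in ratios]


# Hypothetical log-probs: the two engines disagree on the second token.
rollout_lp = [-1.0, -2.0, -0.5]
train_lp = [-1.0, -1.0, -0.5]

gap = mean_abs_diff(train_lp, rollout_lp)
ratios = importance_ratios(train_lp, rollout_lp)
tis_weights = truncate(ratios, cap=2.0)
mis_weights = mask(ratios, cap=2.0)
```

A gap of zero means the two engines agree and every weight is 1. Truncation keeps the mismatched token with a bounded weight, while masking drops it from the loss entirely.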
Supported Models
| Family | Models | MoE Support |
|---|---|---|
| DeepSeek | R1, V3, V3.2 | Full |
| Qwen | 2, 2.5, 3 (including MoE) | Full |
| Llama | 3, 3.1, 3.3, 4 | Dense only |
| Gemma | 2, 3, 3N | Dense only |
| GLM | 4.5, 4.6, 4.7 | Dense only |
| MiniMax | M2, M2.1 | Full |
Resources
- GitHub: https://github.com/radixark/miles
- Introduction Blog: https://lmsys.org/blog/2025-11-19-miles/
- Slime (upstream): https://github.com/THUDM/slime
- SGLang: https://github.com/sgl-project/sglang