Skilllibrary pretraining-pipeline
Build and configure LLM pretraining pipelines covering data loading (streaming/sharded), distributed training (FSDP, DeepSpeed ZeRO), learning rate schedules (warmup + cosine decay), gradient accumulation, checkpointing, and monitoring (loss curves, MFU, gradient norms). Use when setting up accelerate/deepspeed configs or torch.distributed training loops. Do not use for fine-tuning, inference, or model architecture design.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/pretraining-pipeline" ~/.claude/skills/merceralex397-collab-skilllibrary-pretraining-pipeline && rm -rf "$T"
manifest:
12-ai-llm-training-architecture-and-research/pretraining-pipeline/SKILL.md
Purpose
Build end-to-end LLM pretraining pipelines covering tokenized data loading, distributed training configuration (FSDP/DeepSpeed), learning rate scheduling, gradient accumulation, checkpointing strategy, and training monitoring using accelerate, deepspeed, torch.distributed, and wandb.
When to use this skill
Use this skill when:
- Configuring distributed training: accelerate launch --config_file accelerate_config.yaml train.py
- Setting up DeepSpeed ZeRO stages: ds_config.json with "zero_optimization": {"stage": 2} (a config sketch follows this list)
- Designing LR schedules: warmup steps + cosine decay to min_lr = peak_lr * 0.1
- Computing effective batch size: effective_batch = per_device_batch * num_gpus * gradient_accumulation_steps
- Implementing checkpointing: frequency, async saving, training resumption from checkpoint
- Monitoring training: loss curves, gradient norms, learning rate, MFU (Model FLOPs Utilization) via wandb
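For orientation, a minimal sketch that ties the ZeRO-2 config and the effective-batch-size arithmetic above together. It uses standard DeepSpeed config keys (train_micro_batch_size_per_gpu, gradient_accumulation_steps, gradient_clipping, bf16, zero_optimization); the numeric values are illustrative assumptions, not recommendations from this skill.

```python
# Sketch: write a minimal DeepSpeed ZeRO-2 config and check the effective batch size.
# The numeric values below are illustrative assumptions, not prescribed settings.
import json

num_gpus = 8                      # assumed cluster size
per_device_batch = 4              # micro-batch per GPU
gradient_accumulation_steps = 32
seq_len = 2048

effective_batch = per_device_batch * num_gpus * gradient_accumulation_steps
tokens_per_step = effective_batch * seq_len   # ~2.1M tokens with these numbers

ds_config = {
    "train_micro_batch_size_per_gpu": per_device_batch,
    "gradient_accumulation_steps": gradient_accumulation_steps,
    "gradient_clipping": 1.0,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

print(f"effective batch = {effective_batch} sequences, ~{tokens_per_step / 1e6:.1f}M tokens/step")
```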
Do not use this skill when
- The task is about model architecture (layer types, attention variants) — use model-architecture
- The task is fine-tuning or LoRA on an existing model — use fine-tuning
- The task is about tokenizer training or vocabulary design — use tokenizer-design
- The task is about inference serving or deployment — use serving-architecture
Operating procedure
- Prepare tokenized data: Tokenize the corpus offline into packed binary shards. Use datasets.load_dataset("path", streaming=True) for large datasets. Pack sequences to max_seq_length with <eos> separators to avoid padding waste. Store as memory-mapped files or webdataset .tar shards for efficient streaming (see the packing sketch after this list).
- Configure distributed strategy (an FSDP wrapping sketch follows this list):
- FSDP (torch.distributed.fsdp): Shards model parameters, gradients, and optimizer states across GPUs. Use FullyShardedDataParallel(model, sharding_strategy=ShardingStrategy.FULL_SHARD) for maximum memory savings.
- DeepSpeed ZeRO-1: Shards optimizer states only. Minimal communication overhead.
- DeepSpeed ZeRO-2: Shards optimizer states + gradients. Good balance of memory and speed.
- DeepSpeed ZeRO-3: Shards everything including parameters. Required for models that don't fit on a single GPU even without optimizer states.
- For multi-node: combine with tensor parallelism (Megatron-style) for models >70B.
- Set learning rate schedule: Standard: linear warmup for ~2000 steps to peak LR, then cosine decay. Peak LR by model size: ~3e-4 for 1B, ~1.5e-4 for 7B, ~1e-4 for 13B+. Set min_lr = peak_lr * 0.1. Use AdamW with betas=(0.9, 0.95), weight_decay=0.1. A schedule sketch follows this list.
- Configure gradient accumulation: Set gradient_accumulation_steps so effective_batch_size = per_gpu_batch * num_gpus * grad_accum_steps reaches the target (typically 2M-4M tokens). Example: 8 GPUs × 4 per-device × 32 accum steps × 2048 seq_len = ~2M tokens.
- Set checkpointing strategy: Save every 500-1000 steps. Use async checkpoint saving (torch.distributed.checkpoint or DeepSpeed async checkpointing) to avoid training stalls. Keep the last 3-5 checkpoints plus milestone saves. Always save optimizer state for resumption. A save/resume sketch follows this list.
- Configure monitoring: Log training_loss, gradient_norm, learning_rate, tokens_per_second, and MFU to wandb. Set gradient norm clipping: max_grad_norm=1.0. Alert on loss spikes (>2x rolling average). Accumulation, clipping, and logging are combined in the training-loop sketch after this list.
- Validate pipeline: Run a 100-step smoke test before full training. Verify: loss decreases, gradient norms are stable (typically 0.1-10.0), the LR schedule matches the config, and checkpoint save/resume round-trips correctly.
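Sketch for the data-preparation step above, assuming documents are already tokenized to lists of token IDs and that IDs fit in uint16 (vocab < 65536); the shard filename and toy inputs are placeholders.

```python
# Sketch: pack tokenized documents into fixed-length sequences separated by <eos>,
# then store them as a uint16 shard that can be memory-mapped at training time.
import numpy as np

def pack_documents(token_docs, max_seq_length, eos_id):
    """Concatenate docs with <eos> separators and slice into full sequences (no padding)."""
    buffer, sequences = [], []
    for doc in token_docs:
        buffer.extend(doc + [eos_id])
        while len(buffer) >= max_seq_length:
            sequences.append(buffer[:max_seq_length])
            buffer = buffer[max_seq_length:]
    return sequences  # leftover tokens in `buffer` are dropped or carried to the next shard

def write_shard(sequences, path):
    arr = np.asarray(sequences, dtype=np.uint16)   # assumes token IDs fit in uint16
    np.save(path, arr)                             # or np.memmap for very large shards

# Usage with toy data (illustrative token IDs only)
docs = [[5, 6, 7, 8], [9, 10], [11, 12, 13, 14, 15, 16, 17]]
seqs = pack_documents(docs, max_seq_length=8, eos_id=0)
write_shard(seqs, "shard_00000.npy")
```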
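Sketch for the distributed-strategy step, showing the FULL_SHARD wrapping named above. It assumes torch.distributed is already initialized (e.g. launched via torchrun) and omits the auto-wrap policy a real transformer run would add.

```python
# Sketch: wrap a model with FSDP full sharding (params, grads, optimizer state).
# Assumes the process group is already initialized and CUDA is available; a real
# transformer run would also pass an auto_wrap_policy targeting its block class
# so each layer becomes its own FSDP unit.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel, ShardingStrategy

def wrap_with_fsdp(model: torch.nn.Module) -> FullyShardedDataParallel:
    return FullyShardedDataParallel(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,   # maximum memory savings
        device_id=torch.cuda.current_device(),
    )
```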
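Sketch of the warmup + cosine schedule with the min_lr floor and AdamW settings from the LR step. peak_lr, warmup_steps, and total_steps are illustrative for a ~1B model, and the Linear module stands in for the real model.

```python
# Sketch: linear warmup to peak_lr, then cosine decay to min_lr = 0.1 * peak_lr,
# with AdamW betas=(0.9, 0.95) and weight_decay=0.1. Values are illustrative.
import math
import torch

peak_lr, min_lr = 3e-4, 3e-5          # ~1B-scale peak LR; min_lr = peak_lr * 0.1
warmup_steps, total_steps = 2000, 100_000

def lr_at(step: int) -> float:
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

model = torch.nn.Linear(8, 8)          # stand-in; substitute the real (wrapped) model
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)
# LambdaLR scales the base LR, so divide by peak_lr to recover lr_at(step) exactly
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: lr_at(s) / peak_lr)
```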
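Sketch of one optimizer step combining the gradient-accumulation, clipping, and monitoring steps above. It assumes an HF-style causal LM whose forward returns .loss, an iterator of token batches, and a prior wandb.init(); model_params and peak_flops are assumed values used only for a rough per-GPU MFU estimate.

```python
# Sketch: one optimizer step with gradient accumulation, clipping, and wandb logging.
# Assumes `batches` yields LongTensors of shape (per_device_batch, seq_len) and that
# wandb.init() has already been called.
import time
import torch
import wandb

grad_accum_steps = 32
max_grad_norm = 1.0
model_params = 1.0e9          # assumed parameter count
peak_flops = 312e12           # assumed A100 BF16 peak per GPU

def train_one_step(model, optimizer, scheduler, batches, step):
    t0, tokens = time.time(), 0
    optimizer.zero_grad(set_to_none=True)
    for _, batch in zip(range(grad_accum_steps), batches):
        loss = model(input_ids=batch, labels=batch).loss
        (loss / grad_accum_steps).backward()       # scale so grads average over micro-batches
        tokens += batch.numel()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()
    tokens_per_sec = tokens / (time.time() - t0)
    mfu = 6 * model_params * tokens_per_sec / peak_flops   # rough; ignores attention FLOPs
    wandb.log({"training_loss": loss.item(),               # last micro-batch loss
               "gradient_norm": grad_norm.item(),
               "learning_rate": scheduler.get_last_lr()[0],
               "tokens_per_second": tokens_per_sec,
               "MFU": mfu}, step=step)
```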
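Sketch of the checkpointing step as a single-process save/resume with a keep-last-N policy. A multi-GPU run would swap torch.save/torch.load for torch.distributed.checkpoint or DeepSpeed's save_checkpoint/load_checkpoint, but the bookkeeping (optimizer, scheduler, RNG, step) is the same.

```python
# Sketch: save every N steps with optimizer, scheduler, and RNG state for exact
# resumption, keeping only the most recent `keep_last` checkpoints.
import glob
import os
import torch

def save_checkpoint(model, optimizer, scheduler, step, ckpt_dir, keep_last=5):
    os.makedirs(ckpt_dir, exist_ok=True)
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),     # required for exact resumption
        "scheduler": scheduler.state_dict(),
        "rng": torch.get_rng_state(),
        "step": step,
    }, os.path.join(ckpt_dir, f"step_{step:08d}.pt"))
    # prune old checkpoints, keeping the most recent `keep_last`
    ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "step_*.pt")))
    for old in ckpts[:-keep_last]:
        os.remove(old)

def resume_checkpoint(model, optimizer, scheduler, ckpt_dir):
    ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "step_*.pt")))
    if not ckpts:
        return 0
    state = torch.load(ckpts[-1], map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    torch.set_rng_state(state["rng"])
    return state["step"] + 1   # next step to run
```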
Decision rules
- Use FSDP for PyTorch-native workflows; use DeepSpeed for maximum memory efficiency or when ZeRO-3 offloading to CPU/NVMe is needed
- ZeRO-2 is the default for models that fit in aggregate GPU memory; ZeRO-3 only when model params alone exceed total GPU VRAM
- Activation checkpointing (gradient_checkpointing=True) trades ~30% speed for ~50% memory reduction — enable for memory-constrained setups
- Batch size ramp: start at 1/4 of the target batch size for the first 5% of training, then ramp to full. Stabilizes early training.
- If MFU < 40% on A100s, investigate data loading bottlenecks, communication overhead, or kernel inefficiency (see the MFU estimate sketch after this list)
- For runs >1 day, log to persistent storage (wandb, tensorboard on shared FS) — never rely only on stdout
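A back-of-envelope version of the MFU check above; the 6N-FLOPs-per-token approximation and the A100 BF16 peak of 312 TFLOP/s are assumptions, and attention FLOPs are ignored.

```python
# Sketch: rough per-GPU MFU estimate used for the <40% decision rule.
def estimate_mfu(n_params: float, tokens_per_sec_per_gpu: float,
                 peak_flops_per_gpu: float = 312e12) -> float:
    """Model FLOPs Utilization; default peak is A100 BF16 dense throughput."""
    achieved_flops = 6 * n_params * tokens_per_sec_per_gpu   # fwd + bwd matmuls per token
    return achieved_flops / peak_flops_per_gpu

# Example: a 7B model pushing 3,000 tokens/s per A100 (illustrative numbers)
print(f"MFU ~= {estimate_mfu(7e9, 3000):.0%}")   # ~40%
```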
Output requirements
- Training Config — Complete accelerate/deepspeed config with all parallelism, batch size, LR, and optimizer settings
- Data Pipeline Spec — Tokenization format, shard layout, sequence packing strategy, and streaming config
- Compute Plan — GPU count, expected training time, tokens/second target, MFU target, total token budget
- Monitoring Dashboard — wandb project setup with tracked metrics: loss, grad norm, LR, throughput, MFU
References
- DeepSpeed ZeRO: Rajbhandari et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" (arxiv 1910.02054)
- PyTorch FSDP: https://pytorch.org/docs/stable/fsdp.html
- Hoffmann et al., "Training Compute-Optimal Large Language Models" — Chinchilla scaling laws (arxiv 2203.15556)
- HuggingFace Accelerate: https://huggingface.co/docs/accelerate
- Weights & Biases: https://docs.wandb.ai
Related skills
- model-architecture — defines the model structure this pipeline trains
- tokenizer-design — produces the tokenizer and vocab used in data preparation
- moe-architecture — MoE models require expert parallelism in addition to data/tensor parallelism
- training-infrastructure — hardware provisioning and cluster setup for pretraining
Failure handling
- If loss spikes >3x rolling average, reduce LR by 50% and resume from the last stable checkpoint. If persistent, check for data corruption in recent shards. A spike-detection sketch follows this list.
- If gradient norms explode (>100), verify max_grad_norm clipping is active. Reduce LR or batch size. Check for NaN in the data.
- If OOM during training, enable activation checkpointing first, then reduce per-device batch size, then move to a higher ZeRO stage.
- If checkpoint resume produces a different loss than pre-save, verify that optimizer state and RNG state are both restored. DeepSpeed: check that load_checkpoint returns success.
- If MFU drops suddenly mid-training, check for thermal throttling, network degradation, or a slow data loader worker.