Skilllibrary training-infrastructure
Configure distributed LLM training infrastructure—DDP, FSDP, DeepSpeed ZeRO, multi-node orchestration, checkpointing, fault tolerance, and mixed precision. Use when setting up torchrun/accelerate/deepspeed jobs, writing SLURM scripts, tuning NCCL, or debugging GPU memory and communication bottlenecks.
Install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/training-infrastructure" ~/.claude/skills/merceralex397-collab-skilllibrary-training-infrastructure && rm -rf "$T"
manifest:
12-ai-llm-training-architecture-and-research/training-infrastructure/SKILL.md
Purpose
Set up, configure, and debug distributed training infrastructure for large language models—covering parallelism strategies (DDP, FSDP, DeepSpeed ZeRO), multi-node orchestration, checkpointing, fault tolerance, mixed precision, and GPU profiling.
When to use this skill
- Configuring distributed training with the `torchrun`, `accelerate launch`, or `deepspeed` launcher
- Choosing between DDP, FSDP (`ShardingStrategy.FULL_SHARD`), or DeepSpeed ZeRO stages 1/2/3
- Writing SLURM job scripts for multi-node GPU clusters
- Setting up checkpointing (async, distributed, `safetensors` format)
- Debugging OOM errors, NCCL timeouts, or GPU memory fragmentation
- Configuring mixed precision (BF16/FP16) with `torch.autocast` or DeepSpeed config
- Profiling training with `torch.profiler` or NVIDIA Nsight Systems
Do not use this skill when
- The task is model architecture design (layers, attention)—use `model-architecture`
- The task is inference serving or deployment—use `serving-architecture`
- The task is hyperparameter tuning or training recipe design—use `pretraining-pipeline`
Operating procedure
- Select parallelism strategy. For models that fit in one GPU with gradients: use DDP (`torchrun --nproc_per_node=8`). For models requiring sharding: use FSDP or DeepSpeed ZeRO.
  - FSDP: `FullyShardedDataParallel(model, sharding_strategy=ShardingStrategy.FULL_SHARD, mixed_precision=MixedPrecision(param_dtype=torch.bfloat16))` (see the fuller wrapping sketch below this item)
  - DeepSpeed ZeRO-2: `{"zero_optimization": {"stage": 2, "offload_optimizer": {"device": "cpu"}, "allgather_bucket_size": 5e8, "reduce_bucket_size": 5e8}}`
  - For 70B+ models, combine tensor parallelism (Megatron-style column/row splits) with ZeRO-3 or FSDP for the remaining dimensions.
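  A minimal FSDP wrapping sketch, assuming torch >= 2.0, a job launched via `torchrun` (which sets `LOCAL_RANK`), and placeholder names (`build_model`, `MyTransformerBlock`) standing in for your own model code:

  ```python
  import functools
  import os

  import torch
  import torch.distributed as dist
  from torch.distributed.fsdp import (
      FullyShardedDataParallel as FSDP,
      MixedPrecision,
      ShardingStrategy,
  )
  from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

  dist.init_process_group("nccl")
  torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun

  model = build_model().cuda()  # placeholder: your model constructor

  # Shard at transformer-block granularity so each block's parameters are
  # gathered just in time for its forward/backward.
  wrap_policy = functools.partial(
      transformer_auto_wrap_policy,
      transformer_layer_cls={MyTransformerBlock},  # placeholder layer class
  )

  model = FSDP(
      model,
      sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optim state
      auto_wrap_policy=wrap_policy,
      mixed_precision=MixedPrecision(
          param_dtype=torch.bfloat16,
          reduce_dtype=torch.bfloat16,
          buffer_dtype=torch.bfloat16,
      ),
      device_id=torch.cuda.current_device(),
  )
  ```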
- Configure multi-node launch. Use the NCCL backend with `torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$RANK --master_addr=$MASTER --master_port=29500 train.py`. Set `NCCL_IB_DISABLE=0` for InfiniBand clusters; set `NCCL_SOCKET_IFNAME=eth0` for TCP.
- Set up mixed precision. Prefer BF16 on Ampere+ GPUs (no loss scaling needed). For FP16, enable dynamic loss scaling: `torch.cuda.amp.GradScaler(init_scale=2**16)` with `torch.autocast("cuda", dtype=torch.float16)`. In DeepSpeed config: `{"bf16": {"enabled": true}}`. A training-step sketch follows this item.
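  A minimal FP16 training-step sketch with dynamic loss scaling, assuming `model`, `optimizer`, and `dataloader` exist and the model returns an HF-style output with `.loss`; on Ampere+ GPUs, prefer `dtype=torch.bfloat16` and drop the scaler entirely:

  ```python
  import torch

  scaler = torch.cuda.amp.GradScaler(init_scale=2**16)

  for batch in dataloader:
      optimizer.zero_grad(set_to_none=True)
      with torch.autocast("cuda", dtype=torch.float16):
          loss = model(**batch).loss
      scaler.scale(loss).backward()
      scaler.unscale_(optimizer)  # expose true grads so clipping sees real norms
      torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
      scaler.step(optimizer)      # skips the optimizer step on overflow
      scaler.update()             # adjusts the scale after over/underflow
  ```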
- Configure checkpointing. Save every N steps using `torch.distributed.checkpoint.save` for sharded models or `safetensors.torch.save_model()` for single-file. Enable async checkpointing to overlap save I/O with the forward pass. Keep the last K checkpoints; delete older ones. A save/resume sketch follows this item.
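  A sharded save/resume sketch using the `torch.distributed.checkpoint` (DCP) API, assuming torch >= 2.3 and that every rank makes these calls collectively; `step` is a placeholder counter:

  ```python
  import torch.distributed.checkpoint as dcp
  from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

  # Save: DCP writes one shard per rank under ckpt/step_<N>/.
  model_sd, optim_sd = get_state_dict(model, optimizer)
  dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id=f"ckpt/step_{step}")

  # Resume: load in place, then push the states back into model/optimizer.
  model_sd, optim_sd = get_state_dict(model, optimizer)
  dcp.load({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_1000")
  set_state_dict(model, optimizer, model_state_dict=model_sd, optim_state_dict=optim_sd)
  ```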
- Enable fault tolerance. Use `torchrun` elastic launch (`--max_restarts=3`). Implement heartbeat monitoring between nodes. Log to wandb with `WANDB_RESUME=allow` so interrupted runs resume automatically.
- Monitor GPU utilization. Run `nvidia-smi dmon -s u -d 5` during training. Target >80% GPU compute utilization; if lower, profile for communication bottlenecks. Watch for memory fragmentation via `torch.cuda.memory_stats()["allocated_bytes.all.peak"]`.
- Profile bottlenecks. Use `torch.profiler.profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], schedule=torch.profiler.schedule(wait=1, warmup=1, active=3))`. Export to Chrome trace or TensorBoard. Identify whether the bottleneck is compute-bound (increase batch size) or communication-bound (overlap allreduce, use gradient compression). A profiling sketch follows this item.
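  A profiling sketch that captures a few steps and emits a TensorBoard/Chrome trace; `train_step` is a placeholder for your forward/backward/step:

  ```python
  from torch.profiler import (
      ProfilerActivity,
      profile,
      schedule,
      tensorboard_trace_handler,
  )

  prof = profile(
      activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
      schedule=schedule(wait=1, warmup=1, active=3),
      on_trace_ready=tensorboard_trace_handler("./tb_traces"),
      record_shapes=True,
  )
  with prof:
      for step, batch in enumerate(dataloader):
          train_step(batch)
          prof.step()    # advance the wait/warmup/active schedule
          if step >= 5:  # wait(1) + warmup(1) + active(3) steps captured
              break
  ```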
- Write SLURM job script. Include `#SBATCH --gpus-per-node=8`, `#SBATCH --ntasks-per-node=1`, and `srun torchrun ...`. Set `--time` conservatively with checkpoint-resume for long jobs. A skeleton script appears after this list.
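A skeleton SLURM script under the flags above; `train.py` and the 2-node shape are placeholders for your own job:

```bash
#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=24:00:00

# First node in the allocation acts as the rendezvous host.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=29500

# One torchrun per node; torchrun spawns the 8 local workers.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --node_rank="$SLURM_NODEID" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  train.py
```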
Decision rules
- If model fits in GPU memory with optimizer states: use DDP (simplest, fastest).
- If model fits but optimizer states don't: use ZeRO-2 or FSDP `SHARD_GRAD_OP`.
- If model parameters don't fit on one GPU: use ZeRO-3 or FSDP `FULL_SHARD`.
- If model exceeds 70B parameters: combine tensor parallelism with data parallelism.
- Always use BF16 over FP16 when hardware supports it—eliminates loss scaling complexity.
- Prefer `accelerate` for single-codebase portability across DDP/FSDP/DeepSpeed (see the sketch after this list).
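A minimal Accelerate loop, assuming `model`, `optimizer`, and `dataloader` are defined and the model returns an HF-style output with `.loss`; the same code runs under DDP, FSDP, or DeepSpeed depending on the `accelerate config` answers or launch flags:

```python
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(**batch).loss
    accelerator.backward(loss)  # handles scaling/sharding for the active backend
    optimizer.step()
```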
Output requirements
- Parallelism config — strategy chosen, DeepSpeed JSON or FSDP wrapper code, with justification
- Launch command — exact `torchrun`/`accelerate launch`/`deepspeed` command with all flags
- Checkpoint plan — format (sharded vs safetensors), frequency, retention policy, resume procedure
- Resource budget — GPU count, memory per GPU, estimated time, SLURM resource request
References
- DeepSpeed docs: https://www.deepspeed.ai/docs/config-json/
- PyTorch FSDP tutorial: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
- HuggingFace Accelerate: https://huggingface.co/docs/accelerate
- NVIDIA NCCL docs: https://docs.nvidia.com/deeplearning/nccl/user-guide/
- PyTorch distributed checkpoint: https://pytorch.org/docs/stable/distributed.checkpoint.html
Related skills
- `model-architecture` — parallelism strategy depends on model size and layer structure
- `pretraining-pipeline` — training recipe runs on top of the infrastructure configured here
- `serving-architecture` — checkpoint format affects serving load path
- `model-merging` — merging sharded checkpoints requires compatible save format
Failure handling
- If an NCCL timeout occurs, verify network connectivity between nodes, increase the process-group timeout passed to `torch.distributed.init_process_group` (default 1800s), and check for straggler GPUs with `nvidia-smi`. See the sketch after this list.
- If OOM occurs during the forward pass with FSDP, enable `activation_checkpointing` or reduce micro-batch size before switching to CPU offload.
- If loss diverges after switching to FP16, switch to BF16 or increase the `GradScaler` `init_scale`; check for overflow in gradient norms via `torch.nn.utils.clip_grad_norm_`.
- If checkpointing is too slow, enable async save, switch to the `safetensors` format, or write to local NVMe then async-copy to shared storage.
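A minimal sketch of raising that timeout, assuming a `torchrun`-style env:// rendezvous; the two-hour value is illustrative:

```python
from datetime import timedelta

import torch.distributed as dist

# Give slow collectives (e.g., the first allreduce over a cold network)
# more headroom before the NCCL watchdog aborts the job.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```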