AutoResearchClaw distributed-training
Multi-GPU and distributed training patterns with PyTorch DDP. Use when scaling training across GPUs.
install
source · Clone the upstream repo
git clone https://github.com/aiming-lab/AutoResearchClaw
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiming-lab/AutoResearchClaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/researchclaw/skills/builtin/tooling/distributed-training" ~/.claude/skills/aiming-lab-autoresearchclaw-distributed-training && rm -rf "$T"
manifest:
researchclaw/skills/builtin/tooling/distributed-training/SKILL.md
Distributed Training Best Practices
- Prefer DistributedDataParallel (DDP) over DataParallel for multi-GPU training
- Initialize the process group: dist.init_process_group(backend='nccl')
- Use DistributedSampler to shard data across ranks, and call sampler.set_epoch(epoch) each epoch so shuffling differs
- Synchronize batch norm statistics across ranks: nn.SyncBatchNorm.convert_sync_batchnorm(model)
- Save checkpoints only on rank 0 to avoid concurrent writes to the same file
- Scale the learning rate linearly with the world size
- Use gradient accumulation for an effectively larger batch size
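Most of the practices above can be combined in one training function. The sketch below is illustrative, not part of the skill itself: the function name `train`, the toy model, and the toy dataset are assumptions for demonstration. It falls back to the `gloo` backend so it also runs on a CPU-only machine; on real multi-GPU hardware you would launch one process per GPU (e.g. via `torchrun`) with the `nccl` backend.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def train(rank: int, world_size: int) -> None:
    # NCCL is standard for GPUs; fall back to gloo so the sketch runs on CPU too.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}") if torch.cuda.is_available() else torch.device("cpu")

    # Toy model for illustration only.
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    if torch.cuda.is_available():
        # Sync batch-norm statistics across ranks (SyncBatchNorm requires CUDA).
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.to(device)
    ddp_model = DDP(model, device_ids=[rank] if torch.cuda.is_available() else None)

    # DistributedSampler gives each rank a disjoint shard of the dataset.
    dataset = TensorDataset(torch.randn(128, 32), torch.randint(0, 10, (128,)))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Linear LR scaling: single-GPU base LR times the number of workers.
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3 * world_size)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently every epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x.to(device)), y.to(device))
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()

    if rank == 0:
        # Only rank 0 writes; unwrap .module so keys lack the DDP "module." prefix.
        torch.save(ddp_model.module.state_dict(), "ddp_checkpoint.pt")
    dist.destroy_process_group()
```

With `torchrun --nproc_per_node=<num_gpus> script.py`, each process receives its rank via the `RANK`/`LOCAL_RANK` environment variables, which `init_process_group` reads by default.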