AutoResearchClaw distributed-training

Multi-GPU and distributed training patterns with PyTorch DDP. Use when scaling training across GPUs.

install
source · Clone the upstream repo
git clone https://github.com/aiming-lab/AutoResearchClaw
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiming-lab/AutoResearchClaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/researchclaw/skills/builtin/tooling/distributed-training" ~/.claude/skills/aiming-lab-autoresearchclaw-distributed-training && rm -rf "$T"
manifest: researchclaw/skills/builtin/tooling/distributed-training/SKILL.md
source content

Distributed Training Best Practices

  1. Use DistributedDataParallel (DDP) rather than DataParallel for multi-GPU training (see the first sketch after this list)
  2. Initialize the process group: dist.init_process_group(backend='nccl')
  3. Use DistributedSampler to shard the data across ranks
  4. Synchronize batch norm: nn.SyncBatchNorm.convert_sync_batchnorm()
  5. Save checkpoints only on rank 0
  6. Scale the learning rate linearly with world size
  7. Use gradient accumulation for an effectively larger batch size (see the second sketch after this list)
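
A minimal training-loop sketch of items 1-5, assuming a single-node launch via torchrun; the tiny model, the random TensorDataset, the batch size, and the epoch count are placeholder values, not part of the source skill:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and data stand in for a real network and dataset
    model = torch.nn.Sequential(
        torch.nn.Linear(10, 64), torch.nn.BatchNorm1d(64),
        torch.nn.ReLU(), torch.nn.Linear(64, 2),
    ).cuda(local_rank)
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # item 4
    model = DDP(model, device_ids=[local_rank])                   # item 1

    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)                         # item 3: shard across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(3):
        sampler.set_epoch(epoch)                                  # reshuffle differently each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()                                       # DDP all-reduces gradients here
            optimizer.step()
        if dist.get_rank() == 0:                                  # item 5: checkpoint on rank 0 only
            torch.save(model.module.state_dict(), f"ckpt_epoch{epoch}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

A script like this is launched with one process per GPU, e.g. torchrun --nproc_per_node=4 train.py; torchrun supplies the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that init_process_group and the script read.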
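
A second sketch for items 6-7, written as a continuation of the loop above: it assumes init_process_group() has already run and that model, loader, and local_rank exist; base_lr and the accumulation factor of 4 are illustrative values:

import torch
import torch.distributed as dist

base_lr = 0.1                                    # LR tuned for a single GPU
# Item 6: scale the learning rate linearly with the number of processes
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * dist.get_world_size())

accum_steps = 4                                  # item 7: micro-batches per optimizer step
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    x, y = x.cuda(local_rank), y.cuda(local_rank)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()              # keep the accumulated gradient an average, not a sum
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # effective batch = batch_size * accum_steps * world_size
        optimizer.zero_grad()

When accumulating under DDP, the micro-batches that do not end in an optimizer step can be run inside model.no_sync() to skip their redundant gradient all-reduce.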