Claude-code-plugins-plus-skills vastai-performance-tuning
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/vastai-pack/skills/vastai-performance-tuning" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-vastai-performance-tuning && rm -rf "$T"
manifest:
plugins/saas-packs/vastai-pack/skills/vastai-performance-tuning/SKILL.md
Vast.ai Performance Tuning
Overview
Optimize GPU instance selection, startup time, and training throughput on Vast.ai. Key levers: Docker image caching, GPU selection by dlperf score, data pipeline optimization, and multi-GPU scaling.
Prerequisites
- Vast.ai account with active or planned instances
- Understanding of GPU compute bottlenecks
- Profiling tools (nvidia-smi, torch.profiler)
Instructions
Step 1: Optimize Instance Selection by Performance
```bash
# Sort by dlperf (deep learning performance benchmark) instead of price
vastai search offers 'num_gpus=1 gpu_ram>=24 reliability>0.95' \
  --order 'dlperf-' --limit 10

# The dlperf field measures actual GPU compute throughput
# Higher dlperf = faster training even at same GPU model
# Variance within same GPU model can be 20-30%
```
```python
def select_by_performance_per_dollar(offers):
    """Select the offer with best performance per dollar."""
    for o in offers:
        o["perf_per_dollar"] = o.get("dlperf", 0) / max(o["dph_total"], 0.01)
    return max(offers, key=lambda o: o["perf_per_dollar"])
```
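A minimal usage sketch for the helper above, assuming the CLI's `--raw` flag returns the offer list as JSON and that each offer carries `id`, `dlperf`, and `dph_total` fields:

```python
import json
import subprocess

# Run the same search and capture JSON (assumes the vastai CLI's --raw flag)
raw = subprocess.run(
    ["vastai", "search", "offers",
     "num_gpus=1 gpu_ram>=24 reliability>0.95",
     "--order", "dlperf-", "--limit", "10", "--raw"],
    capture_output=True, text=True, check=True,
).stdout
offers = json.loads(raw)

best = select_by_performance_per_dollar(offers)
print(f"Offer {best['id']}: dlperf={best['dlperf']:.0f}, "
      f"${best['dph_total']:.2f}/hr, perf/$={best['perf_per_dollar']:.0f}")
```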
Step 2: Reduce Instance Startup Time
```bash
# Use smaller, pre-cached Docker images
# FAST:   nvidia/cuda:12.1.1-runtime-ubuntu22.04 (~2GB, widely cached)
# MEDIUM: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime (~4GB)
# SLOW:   custom-image:latest with pip install at build (~10GB+)

# Pre-install deps in the image, not in onstart

# BAD (slow startup):
vastai create instance $ID --image pytorch/pytorch:latest \
  --onstart-cmd "pip install transformers datasets wandb"

# GOOD (fast startup):
# Build custom image with all deps pre-installed
```
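A sketch of the "good" path, assuming any registry Vast.ai hosts can pull from and an illustrative image name; the point is that onstart only launches training and never installs packages:

```bash
# Bake dependencies into the image at build time (image name is illustrative)
docker build -t yourrepo/train-env:cuda12.1 .
docker push yourrepo/train-env:cuda12.1

# Fast startup: onstart launches training only, no pip installs
vastai create instance $ID --image yourrepo/train-env:cuda12.1 \
  --onstart-cmd "python /workspace/train.py"
```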
Step 3: Data Pipeline Optimization
```python
# Profile GPU utilization on the instance
# SSH into instance and run:
"""
watch -n 1 nvidia-smi
# Check if GPU util is <80% → data bottleneck
# Common fixes for low GPU utilization:
# 1. Increase DataLoader num_workers
# 2. Use pin_memory=True
# 3. Pre-fetch data to local SSD (not NFS)
# 4. Use WebDataset or FFCV for streaming datasets
"""

# Optimize PyTorch DataLoader
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,            # Match CPU cores on instance
    pin_memory=True,          # Faster GPU transfer
    prefetch_factor=2,        # Pre-load 2 batches per worker
    persistent_workers=True,  # Don't respawn workers each epoch
)
```
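If `nvidia-smi` is ambiguous, a short `torch.profiler` pass (listed in Prerequisites) can confirm the split between CPU-side data loading and GPU kernels. This is a sketch that assumes `model`, `loader`, `loss_fn`, and `optimizer` are already defined:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Profile ~10 training steps (model, loader, loss_fn, optimizer assumed to exist)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (x, y) in enumerate(loader):
        x = x.cuda(non_blocking=True)
        y = y.cuda(non_blocking=True)
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step >= 10:
            break

# If dataloader/CPU ops dominate total time over CUDA kernels,
# the input pipeline is the bottleneck, not the GPU.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```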
Step 4: GPU Memory Optimization
```python
# Check available VRAM before selecting batch size
import torch

def optimal_batch_size(model, sample_input, gpu_memory_gb):
    """Binary search for largest batch size that fits in VRAM."""
    lo, hi, best = 1, 512, 1
    while lo <= hi:
        mid = (lo + hi) // 2
        try:
            torch.cuda.empty_cache()
            batch = sample_input.repeat(mid, *([1] * (sample_input.dim() - 1)))
            _ = model(batch.cuda())
            best = mid
            lo = mid + 1
        except torch.cuda.OutOfMemoryError:
            hi = mid - 1
            torch.cuda.empty_cache()
    return best
```
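A hypothetical usage example for the probe above; the ResNet-50 model and 224x224 input shape are illustrative, and since the probe only runs a forward pass, leave headroom for gradients and optimizer state when training:

```python
import torch
import torchvision

# Illustrative probe: largest forward-pass batch for ResNet-50 on a 24 GB GPU
model = torchvision.models.resnet50().cuda().eval()
sample = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    bs = optimal_batch_size(model, sample, gpu_memory_gb=24)
print(f"Largest batch size that fits: {bs}")
```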
Step 5: Multi-GPU Scaling
```bash
# Search for multi-GPU offers (NVLink preferred for training)
vastai search offers 'num_gpus>=4 gpu_name=A100 total_flops>=100' \
  --order 'dph_total' --limit 5

# Use torchrun for distributed training
ssh -p $PORT root@$HOST "torchrun --nproc_per_node=4 train.py --batch-size 128"
```
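For the `train.py` side, a minimal torchrun-compatible skeleton (a sketch with a toy model and synthetic data, not the pack's actual script) showing the DDP wiring torchrun expects:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic data, just to show the wiring
    model = DDP(torch.nn.Linear(512, 10).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(4096, 512), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=128, sampler=sampler,
                        num_workers=4, pin_memory=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle shards across ranks each epoch
        for x, y in loader:
            out = model(x.cuda(local_rank, non_blocking=True))
            loss = torch.nn.functional.cross_entropy(
                out, y.cuda(local_rank, non_blocking=True))
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```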
GPU Performance Reference
| GPU | VRAM | FP16 TFLOPS | Typical $/hr | Best For |
|---|---|---|---|---|
| RTX 4090 | 24GB | 82.6 | $0.15-0.30 | Fine-tuning, inference |
| A100 40GB | 40GB | 77.97 | $0.80-1.50 | Training medium models |
| A100 80GB | 80GB | 77.97 | $1.00-2.00 | Training large models |
| H100 SXM | 80GB | 267 | $2.50-4.00 | High-throughput training |
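A quick back-of-the-envelope read of the table, using the midpoint of each price range (illustrative only, since Vast.ai prices vary by host):

```python
# FP16 TFLOPS and mid-range $/hr taken from the table above
gpus = {
    "RTX 4090":  (82.6,  0.225),
    "A100 40GB": (77.97, 1.15),
    "A100 80GB": (77.97, 1.50),
    "H100 SXM":  (267.0, 3.25),
}

for name, (tflops, dph) in gpus.items():
    print(f"{name:10s} {tflops / dph:7.1f} FP16 TFLOPS per $/hr")
```

On raw FP16 throughput per dollar the RTX 4090 leads by a wide margin, which is why the table marks it as the fit for fine-tuning and inference rather than large-scale training.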
Output
- Performance-per-dollar offer selection
- Optimized Docker image for fast startup
- Data pipeline tuning (DataLoader, pin_memory, workers)
- GPU memory optimization with auto batch sizing
- Multi-GPU scaling with torchrun
Error Handling
| Error | Cause | Solution |
|---|---|---|
| Low GPU utilization (<50%) | Data pipeline bottleneck | Increase `num_workers`, use `pin_memory=True` |
| OOM during training | Batch size too large | Use `optimal_batch_size()` or gradient accumulation (see the sketch below) |
| Slow instance startup | Large Docker image | Pre-install deps in image, not onstart |
| Poor multi-GPU scaling | Communication bottleneck | Use NVLink-connected GPUs, reduce sync frequency |
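For the OOM row, a minimal gradient-accumulation sketch (the accumulation count is illustrative; `model`, `loader`, `loss_fn`, and `optimizer` are assumed to exist, as in Step 3):

```python
accum_steps = 4  # effective batch = DataLoader batch_size * accum_steps

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x.cuda()), y.cuda())
    (loss / accum_steps).backward()  # scale so accumulated grads average out
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```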
Resources
Next Steps
For cost optimization, see vastai-cost-tuning.
Examples
Profile first: SSH into the instance and run `watch nvidia-smi` during training. If GPU-Util < 80%, the bottleneck is data loading, not compute.

Best value GPU: Use `perf_per_dollar` scoring to find hosts where the same GPU model runs faster due to better cooling or fewer co-tenants.