# coreweave-performance-tuning

## Install

Clone the upstream repo:

```bash
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
```

Or install the skill directly into `~/.claude/skills/` for Claude Code:

```bash
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/plugins/saas-packs/coreweave-pack/skills/coreweave-performance-tuning" \
       ~/.claude/skills/jeremylongshore-claude-code-plugins-coreweave-performance-tuning \
  && rm -rf "$T"
```

Manifest: `plugins/saas-packs/coreweave-pack/skills/coreweave-performance-tuning/SKILL.md`
## CoreWeave Performance Tuning

### GPU Selection by Workload
| Workload | Recommended GPU | Why |
|---|---|---|
| LLM inference (7-13B) | A100 80GB | Good balance of memory and cost |
| LLM inference (70B+) | 8xH100 | NVLink for tensor parallelism |
| Image generation | L40 | Good for diffusion models |
| Training (large models) | 8xH100 SXM5 | Fastest interconnect |
| Batch processing | A100 40GB | Cost-effective |
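
Once a GPU class is chosen, pods are pinned to it through node affinity on CoreWeave's Kubernetes. A minimal sketch, assuming a `gpu.nvidia.com/class` node label and an `A100_PCIE_80GB` class value (label keys and class names vary by cluster; check `kubectl get nodes --show-labels` against your environment):

```yaml
# Pin a pod to a specific GPU class via node affinity.
# Label key and values are assumptions -- verify your cluster's node labels.
apiVersion: v1
kind: Pod
metadata:
  name: llama-inference
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/class
                operator: In
                values:
                  - A100_PCIE_80GB
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```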
### Inference Optimization

```yaml
# Continuous batching with vLLM
containers:
  - name: vllm
    args:
      - "--model=meta-llama/Llama-3.1-8B-Instruct"
      - "--max-num-batched-tokens=8192"
      - "--max-num-seqs=256"
      - "--gpu-memory-utilization=0.90"
      - "--enable-prefix-caching"
      - "--dtype=float16"
```
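
For context, here is a minimal Deployment sketch wrapping that container spec; the name matches the `inference-server` target used by the HPA below, but the image tag and replica count are illustrative assumptions. For multi-GPU models such as Llama-70B, you would also set `--tensor-parallel-size` and request a matching number of GPUs.

```yaml
# Minimal vLLM Deployment sketch; names and image tag are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model=meta-llama/Llama-3.1-8B-Instruct"
            - "--max-num-batched-tokens=8192"
            - "--max-num-seqs=256"
            - "--gpu-memory-utilization=0.90"
            - "--enable-prefix-caching"
            - "--dtype=float16"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
```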
### Autoscaling Tuning

```yaml
# HPA based on GPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "70"
```
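
The `Pods` metric above resolves only if `DCGM_FI_DEV_GPU_UTIL` is exposed through the custom metrics API, typically via the NVIDIA DCGM exporter scraped by Prometheus plus prometheus-adapter. A sketch of an adapter rule, assuming the exporter attaches `namespace` and `pod` labels (exact series and label names depend on your scrape config):

```yaml
# prometheus-adapter rule exposing DCGM GPU utilization as a per-pod metric.
# Series and label names are assumptions; confirm against your Prometheus setup.
rules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: { resource: "namespace" }
        pod: { resource: "pod" }
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
      as: "DCGM_FI_DEV_GPU_UTIL"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```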
### Performance Benchmarks

| Metric | A100-80GB | H100-80GB |
|---|---|---|
| Llama-8B tokens/sec | ~2,000 | ~4,500 |
| Llama-70B tokens/sec | ~200 (4x GPUs) | ~500 (4x GPUs) |
| Cold start (vLLM) | 30-60s | 20-40s |
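
The table figures reflect aggregate throughput under continuous batching, so a single request will measure lower; a quick single-stream check is still useful for spotting regressions. A rough sketch, assuming an `inference-server` Service exposing the vLLM OpenAI-compatible API on port 8000 (hostname, port, and model name are assumptions):

```bash
# Rough single-stream tokens/sec check against a vLLM OpenAI-compatible endpoint.
# Adjust host, port, and model to match your deployment; requires curl and jq.
RESP=$(curl -s -w '\n%{time_total}' http://inference-server:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"Explain NVLink in one paragraph.","max_tokens":256}')
SECS=$(echo "$RESP" | tail -n1)
TOKENS=$(echo "$RESP" | head -n1 | jq '.usage.completion_tokens')
echo "$TOKENS $SECS" | awk '{printf "%.1f tokens/sec\n", $1/$2}'
```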
### Next Steps

For cost optimization, see the companion skill `coreweave-cost-tuning`.