claude-code-plugins · coreweave-performance-tuning

Install

Source · Clone the upstream repo:
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/coreweave-pack/skills/coreweave-performance-tuning" ~/.claude/skills/jeremylongshore-claude-code-plugins-coreweave-performance-tuning && rm -rf "$T"

Manifest: plugins/saas-packs/coreweave-pack/skills/coreweave-performance-tuning/SKILL.md

Source content

CoreWeave Performance Tuning

GPU Selection by Workload

| Workload | Recommended GPU | Why |
|---|---|---|
| LLM inference (7-13B) | A100 80GB | Good balance of memory and cost |
| LLM inference (70B+) | 8xH100 | NVLink for tensor parallelism |
| Image generation | L40 | Good for diffusion models |
| Training (large models) | 8xH100 SXM5 | Fastest interconnect |
| Batch processing | A100 40GB | Cost-effective |
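
To actually land a workload on one of these GPU classes, the usual pattern is a nodeSelector (or node affinity) on the cluster's GPU-class node label plus a GPU resource request. Below is a minimal sketch; the gpu.nvidia.com/class label key and its value are assumptions about how your CoreWeave cluster labels nodes (verify with `kubectl get nodes --show-labels`), and the pod name and image are illustrative.

# Minimal sketch: pin an inference pod to A100 80GB nodes.
apiVersion: v1
kind: Pod
metadata:
  name: llama-8b-inference                 # hypothetical name
spec:
  nodeSelector:
    gpu.nvidia.com/class: A100_PCIE_80GB   # assumed label key/value
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest       # assumed image
      resources:
        limits:
          nvidia.com/gpu: 1                # standard NVIDIA device plugin resource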

Inference Optimization

# Continuous batching with vLLM
containers:
  - name: vllm
    args:
      - "--model=meta-llama/Llama-3.1-8B-Instruct"
      - "--max-num-batched-tokens=8192"
      - "--max-num-seqs=256"
      - "--gpu-memory-utilization=0.90"
      - "--enable-prefix-caching"
      - "--dtype=float16"

Autoscaling Tuning

# HPA based on per-pod GPU utilization (requires dcgm-exporter metrics
# exposed through a custom-metrics adapter; see the sketch after this block)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "70"
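
The DCGM_FI_DEV_GPU_UTIL pods metric is not available to the HPA out of the box: dcgm-exporter publishes it to Prometheus, and an adapter such as prometheus-adapter must re-expose it through the custom metrics API. A minimal sketch of such an adapter rule follows, assuming prometheus-adapter and a dcgm-exporter scrape whose labels are named exported_namespace/exported_pod; adjust both to match your relabeling.

# Hypothetical prometheus-adapter rule (goes in the adapter's rules config).
rules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
    resources:
      overrides:
        exported_namespace: {resource: "namespace"}   # assumed label name
        exported_pod: {resource: "pod"}               # assumed label name
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
      as: "DCGM_FI_DEV_GPU_UTIL"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

Once the adapter serves it, `kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1` should list the metric, and the HPA above can consume it.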

Performance Benchmarks

| Metric | A100-80GB | H100-80GB |
|---|---|---|
| Llama-8B tokens/sec | ~2,000 | ~4,500 |
| Llama-70B tokens/sec | ~200 (4x) | ~500 (4x) |
| Cold start (vLLM) | 30-60s | 20-40s |


Next Steps

For cost optimization, see coreweave-cost-tuning.