skypilot
Use when launching cloud VMs, Kubernetes pods, or Slurm jobs for GPU/TPU/CPU workloads, training or fine-tuning models on cloud GPUs, deploying inference servers (vllm, TGI, etc.) with autoscaling, writing or debugging SkyPilot task YAML files, using spot/preemptible instances for cost savings, comparing GPU prices across clouds, managing compute across 25+ clouds, Kubernetes, Slurm, and on-prem clusters with failover between them, troubleshooting resource availability or SkyPilot errors, or optimizing cost and GPU availability.
git clone https://github.com/skypilot-org/skypilot
T=$(mktemp -d) && git clone --depth=1 https://github.com/skypilot-org/skypilot "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agent/skills/skypilot" ~/.claude/skills/skypilot-org-skypilot-skypilot && rm -rf "$T"
agent/skills/skypilot/SKILL.md

SkyPilot Skill
SkyPilot is a unified framework to run AI workloads on any cloud, Slurm, or Kubernetes. It provides a single interface to launch clusters, run jobs, and serve models across 25+ clouds (AWS, GCP, Azure, CoreWeave, Nebius, Lambda, Together AI, RunPod, and more), Kubernetes clusters, and Slurm clusters.
When to Use SkyPilot
Use SkyPilot when you need to:
- Manage compute resources on any cloud, Slurm, or Kubernetes cluster
- Launch CPU/GPU/TPU instances (GB300, GB200, B200, H200, H100, etc.) on any cloud, Kubernetes, or Slurm
- Run training, fine-tuning, or batch inference jobs
- Serve models with autoscaling and multi-cloud replicas (SkyServe)
- Run long-running jobs with automatic lifecycle management and recovery (managed jobs)
- Find the cheapest or most available GPU across clouds
Don't use SkyPilot for:
- Local-only workloads (use Docker/conda directly)
Capabilities: When to Use What
SkyPilot has three core abstractions. Use the right one for each stage of your workflow:
1. SkyPilot Clusters (`sky launch` / `sky exec`) — Interactive development and debugging
- Use during initial development, debugging, and experimentation
- Launch a cluster, SSH in or connect VSCode/Cursor (`code --remote ssh-remote+CLUSTER`), iterate quickly
- Cluster stays up until you stop/down it or autostop triggers
- Best for: prototyping, debugging, short experiments
2. Managed Jobs (`sky jobs launch`) — Long-running training and batch jobs
- Use when submitting long-running jobs that should run unattended
- Manages the full lifecycle: provisioning, execution, recovery, and teardown
- Automatically recovers from spot preemptions, quota limits, and transient failures
- Works across clouds, Kubernetes, and Slurm (handles preemptions and quota)
- Best for: training runs, fine-tuning, hyperparameter sweeps, batch inference
3. SkyServe (`sky serve up`) — Production model serving
- Use when serving models at scale with autoscaling
- Start with `sky launch` + open port to test your serving setup, then use `sky serve up` to scale
- Provides load balancing, autoscaling, and multi-cloud replicas
- Best for: model serving endpoints, API services
Before You Start (Agent Bootstrap)
Bootstrap to confirm SkyPilot is installed, connected to an API server, and has cloud credentials. Once confirmed, skip straight to the user's task.
Step 1: Check installation and API server connectivity
sky api info
| Output contains | Meaning | Next action |
|---|---|---|
| Server version and status | Server is running and connected | Bootstrap done. Skip to user's task. |
| No server connected | No local or remote API server is configured | Go to "Start or connect a server" below. |
| Remote server unreachable or auth expired | The configured remote server cannot be reached | Tell the user and suggest reconnecting. |
| SkyPilot not installed | The `sky` command is not found | Go to "Install SkyPilot" below. |
Install SkyPilot (only if `sky` command not found):
pip install "skypilot[aws,gcp,kubernetes]" # Pick clouds the user needs
Ask the user which clouds they need if unclear, then re-run `sky api info`.
Start or connect a server (only if no server is connected):
Ask the user:
Do you have an existing SkyPilot API server to connect to, or should I start one locally?
- Connect to existing server: `sky api login -e <API_SERVER_URL>` — get the URL from the user.
- Start locally: `sky api start`
After either path, re-run `sky api info` to confirm the server is reachable.
Step 2: Check cloud credentials (only for fresh setups — skip if the server was already running)
sky check -o json
This shows which clouds are enabled or disabled. If the user's target cloud is not enabled, guide them through credential setup (see Troubleshooting).
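If a cloud is disabled, the fix is usually local credential setup followed by a re-check. A minimal sketch, assuming the target cloud is AWS (swap in the user's actual cloud):

```bash
# Configure AWS credentials locally (or on the machine running the API server)
aws configure        # prompts for access key, secret key, and default region

# Re-check only that cloud to confirm it is now enabled
sky check aws
```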
Essential Commands
Use `-o json` with status/query commands to get structured JSON output instead of tables.
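For example, to inspect the structure of the JSON before parsing it (the exact fields vary by command, so treat the printed output as the source of truth):

```bash
# Pretty-print the JSON so the available fields are easy to see
sky status -o json | python3 -m json.tool | head -40
```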
Clusters — interactive development and debugging:
| Command | Description |
|---|---|
| `sky launch -c CLUSTER task.yaml` | Launch a cluster or run a task |
| `sky exec CLUSTER task.yaml` | Run task on existing cluster (skips provisioning); syncs workdir each time |
| `sky exec CLUSTER task.yaml -d` | Same, but detach immediately (don't stream logs) |
| `sky status -o json` | Show all clusters |
| `sky logs CLUSTER [JOB_ID]` | Stream job logs from a cluster |
| `sky logs CLUSTER JOB_ID --no-follow` | Print existing logs and exit immediately |
| `sky logs CLUSTER JOB_ID --tail 50` | Print last 50 lines of logs and exit |
| `sky logs CLUSTER JOB_ID --status` | Exit with code 0=succeeded, 100=failed, 101=not finished, 102=not found, 103=cancelled |
| `sky queue CLUSTER -o json` | List jobs on a cluster with status (structured JSON) |
| `sky stop CLUSTER` / `sky start CLUSTER` | Stop/restart to save costs (preserves disk) |
| `sky down CLUSTER` | Tear down a cluster completely |
| `sky gpus list` | List available GPU types across clouds |
Managed Jobs — long-running unattended workloads:
| Command | Description |
|---|---|
| `sky jobs launch task.yaml` | Launch a managed job (auto lifecycle + recovery) |
| `sky jobs queue -o json` | Show all managed jobs and their status |
| `sky jobs logs JOB_ID` | Stream logs from a managed job |
| `sky jobs cancel JOB_ID` | Cancel a managed job |
SkyServe — model serving with autoscaling:
| Command | Description |
|---|---|
| `sky serve up serve.yaml -n SERVICE` | Start a model serving service |
| `sky serve status SERVICE` | Show service status and endpoint URL |
| `sky serve update SERVICE new.yaml` | Update a running service (rolling) |
| `sky serve down SERVICE` | Tear down a service |
For complete CLI reference, see CLI Reference.
Quick Start
```bash
# Launch a GPU cluster
sky launch -c mycluster --gpus H100 -- nvidia-smi

# Run a task from YAML
sky launch -c mycluster task.yaml

# SSH into cluster
ssh mycluster

# Connect VSCode or Cursor to the cluster for interactive development
code --remote ssh-remote+mycluster /home/user/sky_workdir
# or: cursor --remote ssh-remote+mycluster /home/user/sky_workdir

# Tear down
sky down mycluster
```
Task YAML Structure
The task YAML is SkyPilot's primary interface. All fields are optional.
```yaml
# task.yaml
name: my-training-job

# Local directory to sync to remote ~/sky_workdir
workdir: .

# Number of nodes (for distributed training)
num_nodes: 1

resources:
  # GPU/TPU accelerators (SkyPilot auto-selects the cheapest cloud/region)
  accelerators: H200:8

  # Optional: pin to a specific cloud/region/infra
  # infra: aws   # or aws/us-east-1, k8s, ssh/my-pool
  # If infra is left out, SkyPilot automatically fails over across all
  # enabled clouds/regions to find the cheapest available option.

  # Use spot instances for cost savings
  use_spot: false

  # Disk size in GB
  disk_size: 256

  # Open ports for serving
  ports: 8080

# Environment variables (accessible in file_mounts, setup, and run)
envs:
  MODEL_NAME: my-model
  BATCH_SIZE: 32

# Setup: runs once on cluster creation, cached on reuse
setup: |
  pip install torch transformers

# Run: the main command
run: |
  python train.py --model $MODEL_NAME --batch-size $BATCH_SIZE
```
For complete YAML schema including file mounts, environment variables set by SkyPilot, and advanced fields, see YAML Specification.
GPU and Cloud Selection
IMPORTANT: Let SkyPilot choose the cloud and region. Do NOT manually pick a cloud/region/instance by parsing `sky gpus list` output. SkyPilot's optimizer automatically selects the cheapest available option across all enabled clouds. Only specify `infra:` when the user explicitly requests a specific cloud or region.
Default behavior (recommended): Just specify the GPU type. SkyPilot finds the cheapest cloud/region automatically:
```yaml
resources:
  accelerators: H200:8  # SkyPilot picks the cheapest cloud/region with H200:8
```
If the user doesn't specify a GPU type, ask them what GPU they need (or what model/workload they're running so you can recommend one). Do NOT run `sky gpus list` and pick for them — present options and let the user decide, or use `any_of` to let SkyPilot maximize availability:
```yaml
# Let SkyPilot choose from multiple acceptable GPU types (cheapest wins)
resources:
  any_of:
    - accelerators: H100:8
    - accelerators: A100-80GB:8
    - accelerators: A100:8
```
Use `ordered` only when the user has a strict preference:
```yaml
# Try H100 first on AWS, fall back to GCP, then A100
resources:
  ordered:
    - infra: aws/us-east-1
      accelerators: H100:8
    - infra: gcp/us-central1
      accelerators: H100:8
    - infra: aws/us-west-2
      accelerators: A100-80GB:8
```
Only set `infra:` when the user explicitly says something like "use AWS" or "run on GCP us-central1":
```yaml
resources:
  infra: aws  # User asked for AWS specifically
  accelerators: H100:8
```
Cluster Lifecycle
```bash
# Launch and run a task
sky launch -c mycluster task.yaml

# Launch with autostop at launch time (preferred: saves cost, no follow-up command needed)
sky launch -c mycluster task.yaml -i 30          # stop after 30 min idle
sky launch -c mycluster task.yaml -i 30 --down   # tear down after 30 min idle

# Override or pass environment variables via CLI
sky launch -c mycluster task.yaml --env MODEL_NAME=llama3 --env BATCH_SIZE=64

# Re-run a different task on the same cluster (fast, skips provisioning)
sky exec mycluster another_task.yaml

# Run an inline command
sky exec mycluster -- python train.py --epochs 10

# Set autostop after launch (use if you forgot to set -i at launch time)
sky autostop mycluster -i 30          # stop after 30 min idle, preserving disk (can restart with sky start)
sky autostop mycluster -i 30 --down   # tear down after 30 min idle (disk is deleted, cannot restart)

# Stop to save costs, restart later
sky stop mycluster
sky start mycluster

# Tear down completely
sky down mycluster
```
Workdir Sync Behavior
`workdir:` is synced to `~/sky_workdir` on the remote via rsync before every `sky exec`. rsync is additive — deleted local files are NOT removed from the remote. This can cause experiments to run against stale build artifacts or old configs.
To ensure a clean slate, SSH and wipe before `sky exec`:
ssh mycluster "rm -rf ~/sky_workdir" sky exec mycluster task.yaml
Or clean inside `run:` if only specific artifacts need removal:
```yaml
run: |
  find ~/sky_workdir/build -name '*.o' -delete 2>/dev/null || true
  cd ~/sky_workdir && make
```
Managed Jobs
Use `sky jobs launch` for long-running jobs that should run unattended. SkyPilot manages the full lifecycle — provisioning, execution, recovery from preemptions/quota/failures, and teardown:
```yaml
# managed-job.yaml
name: training-job

resources:
  accelerators: A100:8

run: |
  python train.py --resume-from-checkpoint
```
```bash
# Launch as managed job
sky jobs launch managed-job.yaml

# Check status
sky jobs queue -o json

# Stream logs
sky jobs logs <job_id>

# Cancel
sky jobs cancel <job_id>
```
Checkpoint pattern: Your training script should save checkpoints to persistent storage (cloud bucket or volume) and resume from the latest checkpoint on restart. SkyPilot handles the cluster recovery; your script handles the state recovery.
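A minimal sketch of that pattern as a managed-job YAML, assuming a hypothetical `train.py` that accepts a `--checkpoint-dir` flag and an existing bucket `s3://my-ckpt-bucket` (both are illustrative, not part of SkyPilot):

```yaml
# checkpointed-job.yaml (illustrative)
name: resumable-training

resources:
  accelerators: A100:8
  use_spot: true      # cheaper, but preemptible; recovery relies on checkpoints

file_mounts:
  # Mount a persistent bucket so checkpoints survive preemption and recovery
  /checkpoints:
    source: s3://my-ckpt-bucket   # hypothetical bucket name
    mode: MOUNT

run: |
  # train.py is assumed to save to, and resume from, the newest checkpoint in --checkpoint-dir
  python train.py --checkpoint-dir /checkpoints
```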
SkyServe: Model Serving
```yaml
# serve.yaml
resources:
  accelerators: A100:1
  ports: 8080

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080

service:
  readiness_probe: /v1/models
  replica_policy:
    min_replicas: 1
    max_replicas: 3
    target_qps_per_replica: 5
```
```bash
# Start service
sky serve up serve.yaml -n my-llm

# Check status / get endpoint
sky serve status my-llm
sky serve status my-llm --endpoint

# Update (rolling)
sky serve update my-llm new-serve.yaml

# Tear down
sky serve down my-llm
```
Common Workflows
Fine-Tuning Workflow
- Write task YAML with `setup` (install deps) and `run` (training command) — a minimal sketch follows this list
- Use `file_mounts` or `workdir` to sync code
- `sky launch -c train task.yaml` to launch
- `sky logs train` to monitor
- `sky exec train -- python eval.py` to evaluate on same cluster
- `sky down train` when done
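A minimal sketch of such a task YAML, assuming hypothetical `finetune.py` and `eval.py` scripts in the local workdir (the scripts and their flags are illustrative):

```yaml
# finetune.yaml (illustrative)
name: finetune-run
workdir: .            # syncs finetune.py and eval.py to ~/sky_workdir on the cluster

resources:
  accelerators: A100-80GB:8

envs:
  BASE_MODEL: meta-llama/Llama-3.1-8B-Instruct

setup: |
  pip install torch transformers datasets

run: |
  python finetune.py --model $BASE_MODEL --output-dir ./outputs
```

Because `workdir: .` syncs the whole local directory, the later `sky exec train -- python eval.py` step finds `eval.py` already on the cluster.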
Hyperparameter Sweep
- Create parameterized YAML with `envs` (see the sketch after this list)
- Launch multiple managed jobs:

  ```bash
  for lr in 1e-4 1e-5 1e-6; do
    sky jobs launch sweep.yaml --env LR=$lr --name sweep-lr-$lr
  done
  ```

- Monitor with `sky jobs queue -o json`
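A minimal sketch of such a `sweep.yaml`, assuming a hypothetical `train.py` that takes the learning rate via an `--lr` flag (the script and flag are illustrative):

```yaml
# sweep.yaml (illustrative)
name: sweep

resources:
  accelerators: A100:1

envs:
  LR: "1e-4"          # default; each job overrides it with --env LR=...

run: |
  python train.py --lr $LR
```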
Model Serving Deployment
- Write serve YAML with `service:` section
- `sky serve up serve.yaml -n my-service`
- Get endpoint: `sky serve status my-service --endpoint`
- Update model: `sky serve update my-service updated.yaml`
Parallel Experiment Submission
Use `sky exec -d` to submit jobs to multiple VMs without blocking, then collect results:
```bash
# Submit all experiments (detached, returns after job is queued)
for i in 1 2 3 4; do
  sky exec exp-vm-0$i task.yaml --env LR=1e-$i -d
done

# Get the latest job ID from a cluster
job_id=$(sky queue exp-vm-01 -o json \
  | python3 -c "import sys, json; jobs = json.load(sys.stdin).get('exp-vm-01', []); print(max(j['job_id'] for j in jobs) if jobs else '')")

# Wait for a specific job and fetch last 50 lines
sky logs exp-vm-01 $job_id --status && sky logs exp-vm-01 $job_id --tail 50

# Check all jobs across a cluster at once
sky queue exp-vm-01 -o json
```
Agent Feedback Loop
When using SkyPilot programmatically, follow this loop:
- Validate: `sky launch --dryrun task.yaml` (check resource availability/cost)
- Launch: `sky launch -c mycluster task.yaml`
- Monitor: `sky status -o json` and `sky queue mycluster -o json`
- Wait for completion: `sky logs mycluster <JOB_ID>` (streams logs so you can observe progress and react to stalls; blocks until job finishes; get JOB_ID from `sky queue mycluster -o json`). For long-running jobs where you don't need intermediate output, use `sky logs mycluster <JOB_ID> --status` instead (blocks silently, exits 0 on success).
- Inspect output: `sky logs mycluster <JOB_ID> --no-follow` or `sky logs mycluster <JOB_ID> --tail 100`
- Debug: `ssh mycluster` (interactive)
- Iterate: `sky exec mycluster updated_task.yaml` (run on existing cluster)
- Cleanup: `sky down mycluster`
Never poll with `sky queue` + `sleep` — use `sky logs CLUSTER JOB_ID` to stream logs and block until done. Use `--status` if you only need the exit code, or `--tail N` to fetch recent output after completion.
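For example, instead of a polling loop, block directly on the job (job ID 3 is just a placeholder here):

```bash
# Blocks while streaming output; returns when job 3 ends
sky logs mycluster 3

# Afterwards, grab only the last lines if that is all you need
sky logs mycluster 3 --tail 50
```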
Common Agent Mistakes
| Mistake | Why it's wrong | Do this instead |
|---|---|---|
| Manually picking cloud/region from `sky gpus list` output | SkyPilot optimizer does this automatically and better | Just set `accelerators:` and let SkyPilot choose |
| Using `sky launch` for long-running unattended jobs | No recovery if preempted or interrupted | Use `sky jobs launch` for unattended work |
| Forgetting `sky down` or autostop after work is done | Wastes money on idle clusters | Always clean up, or use `-i <minutes>` / `--down` at launch |
| Hardcoding `infra:` without the user asking | Limits availability and increases cost | Only set `infra:` when the user explicitly requests a cloud |
| Not using `envs` for configurable values | Hard to reuse or override from CLI | Use `envs` in YAML + `--env` for parameterization |
| Running `sky launch` without `-c NAME` | Creates a randomly-named cluster, hard to reference | Always name clusters with `-c` |
| Parsing table output from status commands | Table formatting is for humans, fragile to parse | Use `-o json` for structured output |
| Using deprecated `cloud` / `region` / `zone` fields | Deprecated in favor of `infra` | Use `infra:` instead |
| Polling job status with `sky queue` + `sleep` | Wastes tokens, introduces timing bugs, fragile | Use `sky logs` to block until done |
| Assuming workdir sync removes remote files | rsync is additive; old remote files persist across `sky exec` calls | SSH in and manually clean `~/sky_workdir`, or clean in the `run:` script |
| Not using `--tail` when only the last output matters | Streaming full logs wastes tokens for long jobs | Use `sky logs --tail N` for the last N lines |
Common Issues Quick Reference
| Issue | Solution |
|---|---|
| GPU not available | Use `any_of` for fallback, or try different regions/clouds |
| Setup takes too long | SkyPilot caches setup; use `sky exec` to skip it on reruns |
| Task fails silently | Check `sky logs CLUSTER` or `ssh CLUSTER` to debug |
| Cluster stuck in INIT | `sky down CLUSTER` and relaunch |
| Preemption/quota | Use `sky jobs launch` (managed jobs) for automatic recovery and lifecycle management |
| Port not accessible | Ensure `ports:` is set in resources and security groups allow traffic |
| File sync slow | Use cloud bucket mounts instead of `workdir` for large datasets |
| Credentials error | Run `sky check` and inspect which clouds are disabled |
References
For detailed reference documentation:
- CLI Reference — All commands and flags
- YAML Specification — Complete task YAML schema, file mounts, environment variables
- Python SDK — Programmatic API and SDK usage
- Advanced Patterns — Multi-cloud, distributed training, production patterns
- Troubleshooting — Error diagnosis and solutions
- Examples — Copy-paste task YAML examples