skypilot
Use when launching cloud VMs, Kubernetes pods, or Slurm jobs for GPU/TPU/CPU workloads, training or fine-tuning models on cloud GPUs, deploying inference servers (vllm, TGI, etc.) with autoscaling, writing or debugging SkyPilot task YAML files, using spot/preemptible instances for cost savings, comparing GPU prices across clouds, managing compute across 25+ clouds, Kubernetes, Slurm, and on-prem clusters with failover between them, troubleshooting resource availability or SkyPilot errors, or optimizing cost and GPU availability.
git clone https://github.com/skypilot-org/skypilot
T=$(mktemp -d) && git clone --depth=1 https://github.com/skypilot-org/skypilot "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agent/skills/skypilot" ~/.claude/skills/skypilot-org-skypilot-skypilot && rm -rf "$T"
agent/skills/skypilot/SKILL.md

SkyPilot Skill
SkyPilot is a unified framework to run AI workloads on any cloud, Slurm, or Kubernetes. It provides a single interface to launch clusters, run jobs, and serve models across 25+ clouds (AWS, GCP, Azure, CoreWeave, Nebius, Lambda, Together AI, RunPod, and more), Kubernetes clusters, and Slurm clusters.
When to Use SkyPilot
Use SkyPilot when you need to:
- Manage compute resources on any cloud, Slurm, or Kubernetes cluster
- Launch CPU/GPU/TPU instances (GB300, GB200, B200, H200, H100, etc.) on any cloud, Kubernetes, or Slurm
- Run training, fine-tuning, or batch inference jobs
- Serve models with autoscaling and multi-cloud replicas (SkyServe)
- Run long-running jobs with automatic lifecycle management and recovery (managed jobs)
- Find the cheapest or most available GPU across clouds
Don't use SkyPilot for:
- Local-only workloads (use Docker/conda directly)
Capabilities: When to Use What
SkyPilot has three core abstractions. Use the right one for each stage of your workflow:
1. SkyPilot Clusters (`sky launch` / `sky exec`) — Interactive development and debugging
- Use during initial development, debugging, and experimentation
- Launch a cluster, SSH in or connect VSCode/Cursor (`code --remote ssh-remote+CLUSTER`), iterate quickly
- Cluster stays up until you stop/down it or autostop triggers
- Best for: prototyping, debugging, short experiments
2. Managed Jobs (`sky jobs launch`) — Long-running training and batch jobs
- Use when submitting long-running jobs that should run unattended
- Manages the full lifecycle: provisioning, execution, recovery, and teardown
- Automatically recovers from spot preemptions, quota limits, and transient failures
- Works across clouds, Kubernetes, and Slurm (handles preemptions and quota)
- Best for: training runs, fine-tuning, hyperparameter sweeps, batch inference
3. SkyServe (`sky serve up`) — Production model serving
- Use when serving models at scale with autoscaling
- Start with `sky launch` + open port to test your serving setup, then use `sky serve up` to scale
- Provides load balancing, autoscaling, and multi-cloud replicas
- Best for: model serving endpoints, API services
Before You Start (Agent Bootstrap)
Bootstrap to confirm SkyPilot is installed, connected to an API server, and has cloud credentials. Once confirmed, skip straight to the user's task.
Step 1: Check installation and API server connectivity
sky api info
| Output contains | Meaning | Next action |
|---|---|---|
| Server version and status | Server is running and connected | Bootstrap done. Skip to user's task. |
| No server connected | No local or remote API server is configured | Go to "Start or connect a server" below. |
| Remote server unreachable or auth expired | The configured remote server cannot be reached | Tell the user and suggest reconnecting. |
| SkyPilot not installed | The `sky` command is not found | Go to "Install SkyPilot" below. |
Install SkyPilot (only if `sky` command not found):
pip install "skypilot[aws,gcp,kubernetes]" # Pick clouds the user needs
Ask the user which clouds they need if unclear, then re-run `sky api info`.
Start or connect a server (only if no server is connected):
Ask the user:
Do you have an existing SkyPilot API server to connect to, or should I start one locally?
- Connect to existing server: `sky api login -e <API_SERVER_URL>` — get the URL from the user.
- Start locally: `sky api start`
After either path, re-run `sky api info` to confirm the server is reachable.
Step 2: Check cloud credentials (only for fresh setups — skip if the server was already running)
sky check -o json
This shows which clouds are enabled or disabled. If the user's target cloud is not enabled, guide them through credential setup (see Troubleshooting).
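If a cloud is disabled, the fix is usually local credential setup followed by a re-check. A minimal sketch, assuming the target cloud is AWS (swap in the user's actual cloud):

```bash
# Configure AWS credentials locally (or on the machine running the API server)
aws configure        # prompts for access key, secret key, and default region

# Re-check only that cloud to confirm it is now enabled
sky check aws
```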
Essential Commands
Use `-o json` with status/query commands to get structured JSON output instead of tables.
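For example, to inspect the structure of the JSON before parsing it (the exact fields vary by command, so treat the printed output as the source of truth):

```bash
# Pretty-print the JSON so the available fields are easy to see
sky status -o json | python3 -m json.tool | head -40
```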
Clusters — interactive development and debugging:
| Command | Description |
|---|---|
| `sky launch -c CLUSTER task.yaml` | Launch a cluster or run a task |
| `sky exec CLUSTER task.yaml` | Run task on existing cluster (skips provisioning); syncs workdir each time |
| `sky exec CLUSTER task.yaml -d` | Same, but detach immediately (don't stream logs) |
| `sky status -o json` | Show all clusters |
| `sky logs CLUSTER [JOB_ID]` | Stream job logs from a cluster |
| `sky logs CLUSTER JOB_ID --no-follow` | Print existing logs and exit immediately |
| `sky logs CLUSTER JOB_ID --tail 50` | Print last 50 lines of logs and exit |
| `sky logs CLUSTER JOB_ID --status` | Exit with code 0=succeeded, 100=failed, 101=not finished, 102=not found, 103=cancelled |
| `sky queue CLUSTER -o json` | List jobs on a cluster with status (structured JSON) |
| `sky stop CLUSTER` / `sky start CLUSTER` | Stop/restart to save costs (preserves disk) |
| `sky down CLUSTER` | Tear down a cluster completely |
| `sky gpus list` | List available GPU types across clouds |
Managed Jobs — long-running unattended workloads:
| Command | Description |
|---|---|
| `sky jobs launch task.yaml` | Launch a managed job (auto lifecycle + recovery) |
| `sky jobs queue -o json` | Show all managed jobs and their status |
| `sky jobs logs JOB_ID` | Stream logs from a managed job |
| `sky jobs cancel JOB_ID` | Cancel a managed job |
SkyServe — model serving with autoscaling:
| Command | Description |
|---|---|
| `sky serve up serve.yaml -n SERVICE` | Start a model serving service |
| `sky serve status SERVICE` | Show service status and endpoint URL |
| `sky serve update SERVICE new.yaml` | Update a running service (rolling) |
| `sky serve down SERVICE` | Tear down a service |
For complete CLI reference, see CLI Reference.
Quick Start
```bash
# Launch a GPU cluster
sky launch -c mycluster --gpus H100 -- nvidia-smi

# Run a task from YAML
sky launch -c mycluster task.yaml

# SSH into cluster
ssh mycluster

# Connect VSCode or Cursor to the cluster for interactive development
code --remote ssh-remote+mycluster /home/user/sky_workdir
# or: cursor --remote ssh-remote+mycluster /home/user/sky_workdir

# Tear down
sky down mycluster
```
Task YAML Structure
The task YAML is SkyPilot's primary interface. All fields are optional.
```yaml
# task.yaml
name: my-training-job

# Local directory to sync to remote ~/sky_workdir
workdir: .

# Number of nodes (for distributed training)
num_nodes: 1

resources:
  # GPU/TPU accelerators (SkyPilot auto-selects the cheapest cloud/region)
  accelerators: H200:8

  # Optional: pin to a specific cloud/region/infra
  # infra: aws   # or aws/us-east-1, k8s, ssh/my-pool
  # If infra is left out, SkyPilot automatically fails over across all
  # enabled clouds/regions to find the cheapest available option.

  # Use spot instances for cost savings
  use_spot: false

  # Disk size in GB
  disk_size: 256

  # Open ports for serving
  ports: 8080

# Environment variables (accessible in file_mounts, setup, and run)
envs:
  MODEL_NAME: my-model
  BATCH_SIZE: 32

# Setup: runs once on cluster creation, cached on reuse
setup: |
  pip install torch transformers

# Run: the main command
run: |
  python train.py --model $MODEL_NAME --batch-size $BATCH_SIZE
```
For complete YAML schema including file mounts, environment variables set by SkyPilot, and advanced fields, see YAML Specification.
GPU and Cloud Selection
IMPORTANT: Let SkyPilot choose the cloud and region. Do NOT manually pick a cloud/region/instance by parsing `sky gpus list` output. SkyPilot's optimizer automatically selects the cheapest available option across all enabled clouds. Only specify `infra:` when the user explicitly requests a specific cloud or region.
Default behavior (recommended): Just specify the GPU type. SkyPilot finds the cheapest cloud/region automatically:
```yaml
resources:
  accelerators: H200:8  # SkyPilot picks the cheapest cloud/region with H200:8
```
If the user doesn't specify a GPU type, ask them what GPU they need (or what model/workload they're running so you can recommend one). Do NOT run `sky gpus list` and pick for them — present options and let the user decide, or use `any_of` to let SkyPilot maximize availability:
```yaml
# Let SkyPilot choose from multiple acceptable GPU types (cheapest wins)
resources:
  any_of:
    - accelerators: H100:8
    - accelerators: A100-80GB:8
    - accelerators: A100:8
```
Use `ordered` only when the user has a strict preference:
```yaml
# Try H100 first on AWS, fall back to GCP, then A100
resources:
  ordered:
    - infra: aws/us-east-1
      accelerators: H100:8
    - infra: gcp/us-central1
      accelerators: H100:8
    - infra: aws/us-west-2
      accelerators: A100-80GB:8
```
Only set `infra:` when the user explicitly says something like "use AWS" or "run on GCP us-central1":
```yaml
resources:
  infra: aws  # User asked for AWS specifically
  accelerators: H100:8
```
Cluster Lifecycle
```bash
# Launch and run a task
sky launch -c mycluster task.yaml

# Launch with autostop at launch time (preferred: saves cost, no follow-up command needed)
sky launch -c mycluster task.yaml -i 30          # stop after 30 min idle
sky launch -c mycluster task.yaml -i 30 --down   # tear down after 30 min idle

# Override or pass environment variables via CLI
sky launch -c mycluster task.yaml --env MODEL_NAME=llama3 --env BATCH_SIZE=64

# Re-run a different task on the same cluster (fast, skips provisioning)
sky exec mycluster another_task.yaml

# Run an inline command
sky exec mycluster -- python train.py --epochs 10

# Set autostop after launch (use if you forgot to set -i at launch time)
sky autostop mycluster -i 30          # stop after 30 min idle, preserving disk (can restart with sky start)
sky autostop mycluster -i 30 --down   # tear down after 30 min idle (disk is deleted, cannot restart)

# Stop to save costs, restart later
sky stop mycluster
sky start mycluster

# Tear down completely
sky down mycluster
```
Workdir Sync Behavior
`workdir:` is synced to `~/sky_workdir` on the remote via rsync before every `sky exec`. rsync is additive — deleted local files are NOT removed from the remote. This can cause experiments to run against stale build artifacts or old configs.
To ensure a clean slate, SSH and wipe before `sky exec`:
ssh mycluster "rm -rf ~/sky_workdir" sky exec mycluster task.yaml
Or clean inside `run:` if only specific artifacts need removal:
```yaml
run: |
  find ~/sky_workdir/build -name '*.o' -delete 2>/dev/null || true
  cd ~/sky_workdir && make
```
Managed Jobs
Use `sky jobs launch` for long-running jobs that should run unattended. SkyPilot manages the full lifecycle — provisioning, execution, recovery from preemptions/quota/failures, and teardown:
```yaml
# managed-job.yaml
name: training-job

resources:
  accelerators: A100:8

run: |
  python train.py --resume-from-checkpoint
```
```bash
# Launch as managed job
sky jobs launch managed-job.yaml

# Check status
sky jobs queue -o json

# Stream logs
sky jobs logs <job_id>

# Cancel
sky jobs cancel <job_id>
```
Checkpoint pattern: Your training script should save checkpoints to persistent storage (cloud bucket or volume) and resume from the latest checkpoint on restart. SkyPilot handles the cluster recovery; your script handles the state recovery.
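A minimal sketch of that pattern as a managed-job YAML, assuming a hypothetical `train.py` that accepts a `--checkpoint-dir` flag and an existing bucket `s3://my-ckpt-bucket` (both are illustrative, not part of SkyPilot):

```yaml
# checkpointed-job.yaml (illustrative)
name: resumable-training

resources:
  accelerators: A100:8
  use_spot: true      # cheaper, but preemptible; recovery relies on checkpoints

file_mounts:
  # Mount a persistent bucket so checkpoints survive preemption and recovery
  /checkpoints:
    source: s3://my-ckpt-bucket   # hypothetical bucket name
    mode: MOUNT

run: |
  # train.py is assumed to save to, and resume from, the newest checkpoint in --checkpoint-dir
  python train.py --checkpoint-dir /checkpoints
```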
SkyServe: Model Serving
```yaml
# serve.yaml
resources:
  accelerators: A100:1
  ports: 8080

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080

service:
  readiness_probe: /v1/models
  replica_policy:
    min_replicas: 1
    max_replicas: 3
    target_qps_per_replica: 5
```
```bash
# Start service
sky serve up serve.yaml -n my-llm

# Check status / get endpoint
sky serve status my-llm
sky serve status my-llm --endpoint

# Update (rolling)
sky serve update my-llm new-serve.yaml

# Tear down
sky serve down my-llm
```
Common Workflows
Fine-Tuning Workflow
- Write task YAML with `setup` (install deps) and `run` (training command) — a minimal sketch follows this list
- Use `file_mounts` or `workdir` to sync code
- `sky launch -c train task.yaml` to launch
- `sky logs train` to monitor
- `sky exec train -- python eval.py` to evaluate on same cluster
- `sky down train` when done
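A minimal sketch of such a task YAML, assuming hypothetical `finetune.py` and `eval.py` scripts in the local workdir (the scripts and their flags are illustrative):

```yaml
# finetune.yaml (illustrative)
name: finetune-run
workdir: .            # syncs finetune.py and eval.py to ~/sky_workdir on the cluster

resources:
  accelerators: A100-80GB:8

envs:
  BASE_MODEL: meta-llama/Llama-3.1-8B-Instruct

setup: |
  pip install torch transformers datasets

run: |
  python finetune.py --model $BASE_MODEL --output-dir ./outputs
```

Because `workdir: .` syncs the whole local directory, the later `sky exec train -- python eval.py` step finds `eval.py` already on the cluster.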
Hyperparameter Sweep
- Create parameterized YAML with `envs` (see the sketch after this list)
- Launch multiple managed jobs:

  ```bash
  for lr in 1e-4 1e-5 1e-6; do
    sky jobs launch sweep.yaml --env LR=$lr --name sweep-lr-$lr
  done
  ```

- Monitor with `sky jobs queue -o json`
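A minimal sketch of such a `sweep.yaml`, assuming a hypothetical `train.py` that takes the learning rate via an `--lr` flag (the script and flag are illustrative):

```yaml
# sweep.yaml (illustrative)
name: sweep

resources:
  accelerators: A100:1

envs:
  LR: "1e-4"          # default; each job overrides it with --env LR=...

run: |
  python train.py --lr $LR
```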
Model Serving Deployment
- Write serve YAML with `service:` section
- `sky serve up serve.yaml -n my-service`
- Get endpoint: `sky serve status my-service --endpoint`
- Update model: `sky serve update my-service updated.yaml`
Parallel Experiment Submission
Use `sky exec -d` to submit jobs to multiple VMs without blocking, then collect results:
```bash
# Submit all experiments (detached, returns after job is queued)
for i in 1 2 3 4; do
  sky exec exp-vm-0$i task.yaml --env LR=1e-$i -d
done

# Get the latest job ID from a cluster
job_id=$(sky queue exp-vm-01 -o json \
  | python3 -c "import sys, json; jobs = json.load(sys.stdin).get('exp-vm-01', []); print(max(j['job_id'] for j in jobs) if jobs else '')")

# Wait for a specific job and fetch last 50 lines
sky logs exp-vm-01 $job_id --status && sky logs exp-vm-01 $job_id --tail 50

# Check all jobs across a cluster at once
sky queue exp-vm-01 -o json
```
Agent Feedback Loop
When using SkyPilot programmatically, follow this loop:
- Validate: `sky launch --dryrun task.yaml` (check resource availability/cost)
- Launch: `sky launch -c mycluster task.yaml`
- Monitor: `sky status -o json` and `sky queue mycluster -o json`
- Wait for completion: `sky logs mycluster <JOB_ID>` (streams logs so you can observe progress and react to stalls; blocks until job finishes; get JOB_ID from `sky queue mycluster -o json`). For long-running jobs where you don't need intermediate output, use `sky logs mycluster <JOB_ID> --status` instead (blocks silently, exits 0 on success).
- Inspect output: `sky logs mycluster <JOB_ID> --no-follow` or `sky logs mycluster <JOB_ID> --tail 100`
- Debug: `ssh mycluster` (interactive)
- Iterate: `sky exec mycluster updated_task.yaml` (run on existing cluster)
- Cleanup: `sky down mycluster`
Never poll with `sky queue` + `sleep` — use `sky logs CLUSTER JOB_ID` to stream logs and block until done. Use `--status` if you only need the exit code, or `--tail N` to fetch recent output after completion.
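For example, instead of a polling loop, block directly on the job (job ID 3 is just a placeholder here):

```bash
# Blocks while streaming output; returns when job 3 ends
sky logs mycluster 3

# Afterwards, grab only the last lines if that is all you need
sky logs mycluster 3 --tail 50
```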
Common Agent Mistakes
| Mistake | Why it's wrong | Do this instead |
|---|---|---|
| Manually picking cloud/region from `sky gpus list` output | SkyPilot optimizer does this automatically and better | Just set `accelerators:` and let SkyPilot choose |
| Using `sky launch` for long-running unattended jobs | No recovery if preempted or interrupted | Use `sky jobs launch` for unattended work |
| Forgetting `sky down` or autostop after work is done | Wastes money on idle clusters | Always clean up, or use `-i <minutes>` / `--down` at launch |
| Hardcoding `infra:` without the user asking | Limits availability and increases cost | Only set `infra:` when the user explicitly requests a cloud |
| Not using `envs` for configurable values | Hard to reuse or override from CLI | Use `envs` in YAML + `--env` for parameterization |
| Running `sky launch` without `-c NAME` | Creates a randomly-named cluster, hard to reference | Always name clusters with `-c` |
| Parsing table output from status commands | Table formatting is for humans, fragile to parse | Use `-o json` for structured output |
| Using deprecated `cloud` / `region` / `zone` fields | Deprecated in favor of `infra` | Use `infra:` instead |
| Polling job status with `sky queue` + `sleep` | Wastes tokens, introduces timing bugs, fragile | Use `sky logs` to block until done |
| Assuming workdir sync removes remote files | rsync is additive; old remote files persist across `sky exec` calls | SSH in and manually clean `~/sky_workdir`, or clean in the `run:` script |
| Not using `--tail` when only the last output matters | Streaming full logs wastes tokens for long jobs | Use `sky logs --tail N` for the last N lines |
Common Issues Quick Reference
| Issue | Solution |
|---|---|
| GPU not available | Use `any_of` for fallback, or try different regions/clouds |
| Setup takes too long | SkyPilot caches setup; use `sky exec` to skip it on reruns |
| Task fails silently | Check `sky logs CLUSTER` or `ssh CLUSTER` to debug |
| Cluster stuck in INIT | `sky down CLUSTER` and relaunch |
| Preemption/quota | Use `sky jobs launch` (managed jobs) for automatic recovery and lifecycle management |
| Port not accessible | Ensure `ports:` is set in resources and security groups allow traffic |
| File sync slow | Use cloud bucket mounts instead of `workdir` for large datasets |
| Credentials error | Run `sky check` and inspect which clouds are disabled |
References
For detailed reference documentation:
- CLI Reference — All commands and flags
- YAML Specification — Complete task YAML schema, file mounts, environment variables
- Python SDK — Programmatic API and SDK usage
- Advanced Patterns — Multi-cloud, distributed training, production patterns
- Troubleshooting — Error diagnosis and solutions
- Examples — Copy-paste task YAML examples