Cc-skills pueue-job-orchestration

Manage long-running jobs and batch processing with pueue queue orchestration, CLI telemetry, and companion monitoring tools (noti, ntfy, mprocs, task-spooler). Use whenever the user wants to queue a job, run tasks on bigblack, manage long-running processes, do batch processing, set up pueue callbacks or priorities, get notified when jobs finish (noti/ntfy), watch multiple processes (mprocs), or needs GPU workstation job management. Also use for cache population tasks that run in the background. Do NOT use for simple foreground shell commands that complete quickly or for job scheduling via cron/launchd.

install

source · Clone the upstream repo

git clone https://github.com/terrylica/cc-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/terrylica/cc-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/devops-tools/skills/pueue-job-orchestration" ~/.claude/skills/terrylica-cc-skills-pueue-job-orchestration && rm -rf "$T"

manifest: plugins/devops-tools/skills/pueue-job-orchestration/SKILL.md

source content

Pueue Job Orchestration

Universal CLI telemetry layer and job management — every command routed through pueue gets precise timing, exit code capture, full stdout/stderr logs, environment snapshots, and callback-on-completion.

Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.

Overview

Pueue is a Rust CLI tool for managing shell command queues. It provides:

Daemon persistence - Survives SSH disconnects, crashes, reboots
Disk-backed queue - Auto-resumes after any failure
Group-based parallelism - Control concurrent jobs per group
Easy failure recovery - Restart failed jobs with one command
Full telemetry - Timing, exit codes, stdout/stderr logs, env snapshots per task

When to Route Through Pueue

Operation	Route Through Pueue?	Why
Any command >30 seconds	Always	Telemetry, persistence, log capture
Batch operations (>3 items)	Always	Parallelism control, failure isolation
Build/test pipelines	Recommended	`--after` DAGs, group monitoring
Data processing	Always	Checkpoint resume, state management
Quick one-off commands (<5s)	Optional	Overhead is ~100ms, but you get logs
Interactive commands (editors, REPLs)	Never	Pueue can't handle stdin interaction

When to Use This Skill

Use this skill when the user mentions:

Trigger	Example
Running tasks on BigBlack	"Run this on bigblack"
Long-running data processing	"Populate the cache for all symbols"
Batch/parallel operations	"Process these 70 jobs"
SSH remote execution	"Execute this overnight on the GPU server"
Cache population	"Fill the ClickHouse cache"
Pueue features	"Set up a callback", "delay this job"

Quick Reference

Check Status

# Local
pueue status

# Remote (BigBlack)
ssh bigblack "~/.local/bin/pueue status"

Queue a Job

# Local (with working directory)
pueue add -w ~/project -- python long_running_script.py

# Local (simple)
pueue add -- python long_running_script.py

# Remote (BigBlack)
ssh bigblack "~/.local/bin/pueue add -w ~/project -- uv run python script.py"

# With group (for parallelism control)
pueue add --group p1 --label "BTCUSDT@1000" -w ~/project -- python populate.py --symbol BTCUSDT

Monitor Jobs

pueue follow <id>         # Watch job output in real-time
pueue log <id>            # View completed job output
pueue log <id> --full     # Full output (not truncated)

Manage Jobs

pueue restart <id>         # ⚠ Creates NEW task (see warning below)
pueue restart --in-place <id>  # Restarts task in-place (no new ID)
pueue restart --all-failed # ⚠ Restarts ALL failed across ALL groups
pueue kill <id>            # Kill running job
pueue clean                # Remove completed jobs from list
pueue reset                # Clear all jobs (use with caution)

CRITICAL WARNING —

pueue restart

semantics:

pueue restart <id>

does NOT restart the task in-place. It creates a brand new task with a new ID, copying the command from the original. The original stays as Done/Failed. This causes exponential task growth when used in loops or by autonomous agents. In a 2026-03-04 incident, agents calling

pueue restart

on failed tasks grew 60 jobs to ~12,800.

Use
--in-place
if you truly need to restart:
```
pueue restart --in-place <id>
```
Verify before restart: Read
```
pueue log <id>
```
to check if the failure is persistent (missing data, bad args) — retrying will never help
Never use
--all-failed
without
```
--group
```
filter — it restarts every failed task across ALL groups

Host Configuration

Host	Location	GPU	Parallelism Groups
BigBlack	`~/.local/bin/pueue`	RTX 4090	p1 (16), p2 (2), p3 (3), p4 (1)
LittleBlack	`~/.local/bin/pueue`	RTX 2080 Ti	p1 (8), p2 (2)
Local (macOS)	`/opt/homebrew/bin/pueue`	N/A	default

Core Workflows

1. Queue Single Remote Job

# Step 1: Verify daemon is running
ssh bigblack "~/.local/bin/pueue status"

# Step 2: Queue the job
ssh bigblack "~/.local/bin/pueue add --label 'my-job' -- cd ~/project && uv run python script.py"

# Step 3: Monitor progress
ssh bigblack "~/.local/bin/pueue follow <id>"

2. Batch Job Submission (Multiple Symbols)

For rangebar cache population or similar batch operations:

# Use the pueue-populate.sh script
ssh bigblack "cd ~/rangebar-py && ./scripts/pueue-populate.sh setup"   # One-time
ssh bigblack "cd ~/rangebar-py && ./scripts/pueue-populate.sh phase1"  # Queue Phase 1
ssh bigblack "cd ~/rangebar-py && ./scripts/pueue-populate.sh status"  # Check progress

3. Configure Parallelism Groups

# Create groups with different parallelism limits
pueue group add fast      # Create 'fast' group
pueue parallel 4 --group fast  # Allow 4 parallel jobs

pueue group add slow
pueue parallel 1 --group slow  # Sequential execution

# Queue jobs to specific groups
pueue add --group fast -- echo "fast job"
pueue add --group slow -- echo "slow job"

4. Handle Failed Jobs

# Check what failed
pueue status | grep Failed

# View error output FIRST (distinguish transient vs persistent)
pueue log <id>

# Restart specific job IN-PLACE (no new task created)
pueue restart --in-place <id>

# ⚠ NEVER blindly restart all failed — classify failures first
# pueue restart --all-failed  # DANGEROUS: no group filter, creates duplicates

Failure classification before restart: Exit code 1 (app error) may be persistent (missing data, bad args — will never succeed). Exit code 137 (OOM) or 143 (SIGTERM) may be transient. Always read

pueue log

before restarting.

Troubleshooting

Issue	Cause	Solution
`pueue: command not found`	Not in PATH	Use full path: `~/.local/bin/pueue`
`Connection refused`	Daemon not running	Start with `pueued -d`
Jobs stuck in Queued	Group paused or at limit	Check `pueue status` , `pueue start`
SSH disconnect kills jobs	Not using Pueue	Queue via Pueue instead of direct SSH
Job fails immediately	Wrong working directory	Use `pueue add -w /path` or `cd /path && pueue add`

Priority Scheduling (

--priority

)

Higher priority number = runs first when a queue slot opens:

# Urgent validation (runs before queued lower-priority jobs)
pueue add --priority 10 -- python validate_critical.py

# Normal compute (default priority is 0)
pueue add -- python train_model.py

# Low-priority background task
pueue add --priority -5 -- python cleanup_logs.py

Priority only affects queued jobs waiting for an open slot. Running jobs are not preempted.

Per-Task Environment Override (

pueue env

)

Inject or override environment variables on stashed or queued tasks:

# Create a stashed job
JOB_ID=$(pueue add --stashed --print-task-id -- python train.py)

# Set environment variables (NOTE: separate args, NOT KEY=VALUE)
pueue env set "$JOB_ID" BATCH_SIZE 64
pueue env set "$JOB_ID" LEARNING_RATE 0.001

# Enqueue when ready
pueue enqueue "$JOB_ID"

Syntax:

pueue env set <id> KEY VALUE

-- the key and value are separate positional arguments.

Constraint: Only works on stashed/queued tasks. Cannot modify environment of running tasks.

Relationship to mise.toml

[env]

: mise

[env]

remains the SSoT for default environment. Use

pueue env set

only for one-off overrides (e.g., hyperparameter sweeps) without modifying config files.

Blocking Wait (

pueue wait

)

Block until tasks complete -- simpler than polling loops for scripts:

# Wait for specific task
pueue wait 42

# Wait for all tasks in a group
pueue wait --group mygroup

# Wait for ALL tasks across all groups
pueue wait --all

# Wait quietly (no progress output)
pueue wait 42 --quiet

# Wait for tasks to reach a specific status
pueue wait --status queued

Script Integration Pattern

# Queue -> wait -> process results
TASK_ID=$(pueue add --print-task-id -- python etl_pipeline.py)
pueue wait "$TASK_ID" --quiet
EXIT_CODE=$(pueue status --json | jq -r ".tasks[\"$TASK_ID\"].status.Done.result" 2>/dev/null)
if [ "$EXIT_CODE" = "Success" ]; then
    echo "Pipeline succeeded"
    pueue log "$TASK_ID" --full
else
    echo "Pipeline failed"
    pueue log "$TASK_ID" --full >&2
fi

Companion CLI Tools (bigblack)

Four lightweight CLI tools complement pueue for monitoring and notifications. All installed on bigblack at

/usr/local/bin/

(noti, mprocs) or via apt (ntfy, task-spooler).

noti — Wrap Any Command with Telegram Notification

Best for one-off "notify me when this finishes" workflows. Native Telegram support.

# Wrap any command — sends Telegram when it finishes
noti -g mise run kintsugi:catchup

# With execution time in message
noti -g -e python long_script.py

# Custom title/message
noti -g -t "Deploy done" -m "bigblack updated" mise run deploy:bigblack

Config:

~/.config/noti/noti.yaml

+ env vars

NOTI_TELEGRAM_TOKEN

NOTI_TELEGRAM_CHAT_ID

NOTI_DEFAULT=telegram

~/.bashrc

ntfy — Push Notifications from Scripts/Callbacks

Best for programmatic notifications from scripts, pueue callbacks, or curl one-liners.

# One-liner from any script
ntfy pub --title "Backfill done" odb-alerts "BTCUSDT@500 complete"

# With priority and tags
ntfy pub --priority high --tags "warning" odb-alerts "Job failed"

# curl variant (works anywhere, no binary needed)
curl -d "Backup finished" ntfy.sh/my-topic

# Subscribe to topic (poll mode)
ntfy sub --poll odb-alerts

Pueue callback integration — add to

~/.config/pueue/pueue.yml

callback: 'ntfy pub --title "pueue: {{label}}" --tags "{{status}}" odb-alerts "Task #{{id}} {{status}} ({{command}})"'

mprocs — Multi-Process TUI

Best for interactive SSH sessions watching multiple processes side-by-side.

# Watch multiple pueue jobs in split panes
mprocs "pueue follow 3" "pueue follow 7" "pueue follow 8"

# Monitor services
mprocs "journalctl -fu opendeviationbar-sidecar" "journalctl -fu opendeviationbar-kintsugi"

Requires interactive TTY (SSH directly, not via scripts).

task-spooler (tsp) — Simplest Job Queue

Best for quick ad-hoc sequential/parallel tasks without pueue group setup.

# Queue tasks (default: 1 parallel)
tsp python train.py
tsp python validate.py

# Set parallel slots
tsp -S 3

# List queue
tsp -l

# View job output
tsp -c 0

# Combine with noti
tsp bash -c 'python train.py && noti -g -m "training done"'

Tool Selection Guide

Need to...
├── Get notified when a command finishes? → noti -g <command>
├── Send notification from a script/callback? → ntfy pub topic "msg"
├── Watch multiple logs/processes live? → mprocs "cmd1" "cmd2"
├── Quick sequential queue (no groups needed)? → tsp <command>
└── Persistent queue with groups, priorities? → pueue add <command>

Deep-Dive References

Topic	Reference
Installation (macOS, Linux, systemd, launchd)	Installation Guide
Production lessons: `--after` chaining, forensic audit, per-year parallelization, pipeline monitoring	Production Lessons
State bloat prevention, bulk xargs -P submission, two-tier architecture (300K+ jobs), crash recovery	State Management & Bulk Submission
ClickHouse thread tuning, parallelism sizing formula, live tuning	ClickHouse Tuning
Callback hooks, template variables, delayed scheduling ( `--delay` )	Callbacks & Scheduling
python-dotenv secrets pattern, rangebar-py integration	Environment & Secrets
Claude Code integration, synchronous wrapper, telemetry queries	Claude Code Integration
All `pueue.yml` settings (shared, client, daemon, profiles)	Pueue Config Reference

Pueue vs Temporal: When to Use Which

For structured, repeatable workflows needing durability and dedup, consider Temporal (

pip install temporalio

brew install temporal

Dimension	Pueue	Temporal
Best for	Ad-hoc shell commands, SSH remote jobs	Structured, repeatable workflows
Setup	Single binary, zero infra	Server + database (dev: `temporal server start-dev` )
Dedup	None — `restart` creates new tasks	Workflow ID uniqueness (built-in)
Retry policy	None — manual or external agent	Per-activity: max attempts, backoff, non-retryable errors
Parallelism	`pueue parallel N --group G`	`max_concurrent_activities=N` on worker
Visibility	`pueue status` (text, scales poorly)	Web UI with pagination, search, event history

Recommendation: Keep pueue for ad-hoc work. Migrate structured pipelines to Temporal when dedup and retry policies matter. See

devops-tools:distributed-job-safety

for invariants that apply to both.

Hook:
```
itp-hooks/posttooluse-reminder.ts
```
- Reminds to use Pueue for detected long-running commands
Reference: Pueue GitHub
Companion tools: noti (command wrapper → Telegram), ntfy (pub-sub notifications), mprocs (multi-process TUI), task-spooler (minimal job queue)
SOTA Alternative: Temporal — durable workflow orchestration with built-in dedup, retry, visibility
Issue: rangebar-py#77 - Original implementation
Issue: rangebar-py#88 - Production deployment lessons

Post-Execution Reflection

After this skill completes, check before closing:

Did the command succeed? — If not, fix the instruction or error table that caused the failure.
Did parameters or output change? — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
Was a workaround needed? — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.

Only update if the issue is real and reproducible — not speculative.

Cc-skills pueue-job-orchestration

Pueue Job Orchestration

Overview

When to Route Through Pueue

When to Use This Skill

Quick Reference

Check Status

Queue a Job

Monitor Jobs

Manage Jobs

Host Configuration

Core Workflows

1. Queue Single Remote Job

2. Batch Job Submission (Multiple Symbols)

3. Configure Parallelism Groups

4. Handle Failed Jobs

Troubleshooting

Priority Scheduling (
`--priority`
)

Per-Task Environment Override (
`pueue env`
)

Blocking Wait (
`pueue wait`
)

Script Integration Pattern

Companion CLI Tools (bigblack)

noti — Wrap Any Command with Telegram Notification

ntfy — Push Notifications from Scripts/Callbacks

mprocs — Multi-Process TUI

task-spooler (tsp) — Simplest Job Queue

Tool Selection Guide

Deep-Dive References

Pueue vs Temporal: When to Use Which

Related

Post-Execution Reflection

Cc-skills pueue-job-orchestration

Pueue Job Orchestration

Overview

When to Route Through Pueue

When to Use This Skill

Quick Reference

Check Status

Queue a Job

Monitor Jobs

Manage Jobs

Host Configuration

Core Workflows

1. Queue Single Remote Job

2. Batch Job Submission (Multiple Symbols)

3. Configure Parallelism Groups

4. Handle Failed Jobs

Troubleshooting

Priority Scheduling (--priority)

Per-Task Environment Override (pueue env)

Blocking Wait (pueue wait)

Script Integration Pattern

Companion CLI Tools (bigblack)

noti — Wrap Any Command with Telegram Notification

ntfy — Push Notifications from Scripts/Callbacks

mprocs — Multi-Process TUI

task-spooler (tsp) — Simplest Job Queue

Tool Selection Guide

Deep-Dive References

Pueue vs Temporal: When to Use Which

Related

Post-Execution Reflection

Priority Scheduling (
`--priority`
)

Per-Task Environment Override (
`pueue env`
)

Blocking Wait (
`pueue wait`
)