Gsd-skill-creator checkpoint-resume-long-job
Persist progress for long-running jobs (batched LLM calls, large ingestions, multi-hour syncs) so that a context reset, crash, or interrupt doesn't lose work. Use whenever a job iterates over N items and completing item K matters independently. Provides a resumable.mjs library pattern plus the skill's invocation heuristics.
git clone https://github.com/Tibsfox/gsd-skill-creator
T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/gsd-meta/checkpoint-resume-long-job" ~/.claude/skills/tibsfox-gsd-skill-creator-checkpoint-resume-long-job && rm -rf "$T"
examples/skills/gsd-meta/checkpoint-resume-long-job/SKILL.md
Checkpoint & Resume for Long Jobs
Any job that takes longer than 5 minutes and iterates over N independent items should checkpoint its progress. Context can reset, processes can crash, users can Ctrl-C. A re-run shouldn't redo completed work.
Triggers
Activate when a job:
- Iterates over ≥ 20 items AND each item takes ≥ 5 seconds, OR
- Is expected to run ≥ 10 minutes total, OR
- Calls external APIs with rate limits or cost per call (LLM, HTTP), OR
- Is not naturally idempotent at the whole-job level
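The triggers above can be read as a single predicate. The following is a minimal sketch; the function name `shouldCheckpoint` and its field names are hypothetical and are not part of resumable.mjs.

```javascript
// Hypothetical helper: encodes the four triggers as one boolean check.
// All field names are illustrative assumptions.
function shouldCheckpoint({
  itemCount = 0,
  secondsPerItem = 0,
  expectedMinutes = 0,
  callsMeteredApi = false,   // rate-limited or pay-per-call (LLM, HTTP)
  idempotentAsWhole = true,  // is re-running the whole job harmless?
} = {}) {
  return (
    (itemCount >= 20 && secondsPerItem >= 5) ||
    expectedMinutes >= 10 ||
    callsMeteredApi ||
    !idempotentAsWhole
  );
}
```

Any one trigger is enough; only the first combines two conditions.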
Shape
The simplest checkpoint is a file listing completed item IDs. On job start: read the file; on each item completion: append its ID; on job restart: skip any ID in the file.
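As a minimal sketch of that shape (not the actual resumable.mjs code), the loop below keeps one completed ID per line. The names `loadDone` and `runResumable` are hypothetical, IDs are assumed to contain no newlines, and `work` is synchronous to keep the sketch short; an async job would await the work before appending.

```javascript
import fs from 'node:fs';

// Read completed IDs (one per line); a missing file means a fresh start.
function loadDone(checkpointFile) {
  if (!fs.existsSync(checkpointFile)) return new Set();
  const lines = fs.readFileSync(checkpointFile, 'utf8').split('\n');
  return new Set(lines.filter(Boolean));
}

// Run `work` for every item not yet recorded, appending each ID on completion.
function runResumable(items, checkpointFile, work) {
  const done = loadDone(checkpointFile);
  let processed = 0;
  for (const item of items) {
    const id = String(item.id);
    if (done.has(id)) continue;   // finished in an earlier run
    work(item);                   // if this throws, the ID is never recorded
    fs.appendFileSync(checkpointFile, id + '\n');
    done.add(id);
    processed += 1;
  }
  return { processed, skipped: items.length - processed };
}
```

The ID is appended only after `work` returns, so an interrupted item is retried on the next run rather than silently skipped.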
Reference library: tools/checkpoint-resume/resumable.mjs
    import { processBatches } from './tools/checkpoint-resume/resumable.mjs';

    await processBatches({
      items: lessons, // [...1713 lessons...], loaded elsewhere
      keyFn: l => l.id,
      checkpointFile: '.planning/sessions/tiebreaker-checkpoint.jsonl',
      batchSize: 5,
      async handler(batch) {
        // your per-batch work
        return batch.map(l => ({ id: l.id, status: 'done' }));
      },
      onProgress({ completed, total, skipped }) {
        console.error(`${completed + skipped}/${total} (${skipped} resumed)`);
      },
    });
On first run, processes all items and appends IDs to the checkpoint file. On resume, reads the file and skips already-processed items.
Checkpoint Formats
| Format | When |
|---|---|
| Append-only JSONL | Most jobs. One line = one completed item. Easy to read, easy to resume. |
| Database column | When items already live in a DB: add a completion-marker column and filter on it at start. |
| Snapshot file | When checkpoint state is a complex structure (progress trees, partial outputs). Write a whole-state JSON every N items. |
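Prefer append-only JSONL. It is crash-safe by design: each completed item is one appended line, so a crash mid-write can damage at most the final line.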
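A sketch of the append-only JSONL format follows, with a loader that tolerates a torn final line. `appendRecord` and `loadRecords` are hypothetical names, and the record shape is an assumption; this is not taken from resumable.mjs.

```javascript
import fs from 'node:fs';

// Append one completed-item record as a single JSONL line.
function appendRecord(checkpointFile, record) {
  fs.appendFileSync(checkpointFile, JSON.stringify(record) + '\n');
}

// Load all records, ignoring any line that does not parse
// (for example, a final line cut short by a crash mid-write).
function loadRecords(checkpointFile) {
  if (!fs.existsSync(checkpointFile)) return [];
  const records = [];
  for (const line of fs.readFileSync(checkpointFile, 'utf8').split('\n')) {
    if (!line.trim()) continue;
    try {
      records.push(JSON.parse(line));
    } catch {
      // Unparseable line: treat that item as not done so it is retried.
    }
  }
  return records;
}
```

One caveat of this sketch: a torn tail with no trailing newline would merge with the next appended record, costing one extra retry on a later resume. Truncating the torn tail on load would avoid that.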
The Trade-off
Checkpointing adds file I/O per item. Usually negligible compared to the work itself. The cost of NOT checkpointing, however, is:
- Wasted LLM calls (money)
- Wasted API quota
- User has to manually figure out where the job stopped
- Worst case: job silently half-completes and corrupts DB state
Anti-patterns
- Checkpointing to in-memory arrays only. If the process dies, so does the checkpoint.
- Non-atomic writes. Use append-only (fsync-safe) or write-temp-then-rename.
- Checkpoint file in /tmp. It WILL get cleaned up. Put it under .planning/sessions/ or a project-local cache dir.
- Not logging the checkpoint file path at start. If the user needs to resume manually, they need to know where to look.
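For snapshot-style checkpoints, the write-temp-then-rename approach mentioned above might look like the sketch below. `writeSnapshot` and `readSnapshot` are hypothetical names, and the atomicity claim assumes the temp file and the target live on the same POSIX filesystem.

```javascript
import fs from 'node:fs';

// Write the whole state to a temp file next to the target, then rename.
// On POSIX filesystems rename() replaces the target atomically, so a reader
// sees either the old snapshot or the new one, never a partial write.
function writeSnapshot(snapshotFile, state) {
  const tmp = `${snapshotFile}.${process.pid}.tmp`;
  const fd = fs.openSync(tmp, 'w');
  try {
    fs.writeSync(fd, JSON.stringify(state));
    fs.fsyncSync(fd);           // flush data to disk before the rename
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tmp, snapshotFile);
}

// Read the last snapshot, or return `fallback` if none has been written yet.
function readSnapshot(snapshotFile, fallback) {
  try {
    return JSON.parse(fs.readFileSync(snapshotFile, 'utf8'));
  } catch (err) {
    if (err.code === 'ENOENT') return fallback;
    throw err;
  }
}
```

A crash before the rename leaves the previous snapshot untouched; the only leftover is a stray .tmp file that can be deleted.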
Invocation Heuristic
Before starting any long job, ask:
- "If my process dies halfway, is the user's work gone?"
- "If I'm Ctrl-C'd at item 500 of 1000, can I pick up at 501?"
- "Does item K depend on item K-1, or are they independent?"
If answers are "yes, no, independent" → use checkpointing.
Example — LLM Tiebreaker (v1.49 release-history work)
Situation: 681 lessons to classify via claude -p, 5 per batch, ~30 sec per batch. Total: ~70 minutes. No checkpointing was in place.
Worst-case loss: 136 wasted LLM calls at batch 137 if context broke. Actual loss: 0 — but only because the run happened to complete first time.
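The figures above follow from the batch arithmetic, sketched here with illustrative variable names:

```javascript
const lessons = 681;
const batchSize = 5;
const secondsPerBatch = 30;

// 681 / 5 = 136.2, so the last batch is partial: 137 batches in total.
const batches = Math.ceil(lessons / batchSize);

// 137 batches * 30 s = 4110 s = 68.5 minutes, i.e. roughly 70.
const totalMinutes = (batches * secondsPerBatch) / 60;

// A failure during the final batch discards every batch completed before it.
const worstCaseWastedCalls = batches - 1;

console.log({ batches, totalMinutes, worstCaseWastedCalls });
```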
Fix: wrap the batch loop in processBatches() from resumable.mjs. On resume, only unprocessed lessons get classified.
Related
- session-observatory-live: log a checkpoint event at every completion
- decision-framework-invoker: long jobs often produce irreversible state
- batch-rewrite-pattern: similar batching shape, different domain