Gsd-skill-creator checkpoint-resume-long-job

Persist progress for long-running jobs (batched LLM calls, large ingestions, multi-hour syncs) so that a context reset, crash, or interrupt doesn't lose work. Use whenever a job iterates over N items and completing item K matters independently. Provides a resumable.mjs library pattern plus the skill's invocation heuristics.

install
source · Clone the upstream repo
git clone https://github.com/Tibsfox/gsd-skill-creator
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/gsd-meta/checkpoint-resume-long-job" ~/.claude/skills/tibsfox-gsd-skill-creator-checkpoint-resume-long-job && rm -rf "$T"
manifest: examples/skills/gsd-meta/checkpoint-resume-long-job/SKILL.md

Checkpoint & Resume for Long Jobs

Any job that takes longer than 5 minutes and iterates over N independent items should checkpoint its progress. Context can reset, processes can crash, users can Ctrl-C. A re-run shouldn't redo completed work.

Triggers

Activate when a job:

  • Iterates over ≥ 20 items AND each item takes ≥ 5 seconds, OR
  • Is expected to run ≥ 10 minutes total, OR
  • Calls external APIs with rate limits or cost per call (LLM, HTTP), OR
  • Is not naturally idempotent at the whole-job level
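
The triggers above can be read as a single boolean check. A minimal sketch follows; the function name and the field names on the job object are hypothetical and chosen only for illustration, not taken from the skill or from resumable.mjs.

```javascript
// Hypothetical helper: field names are illustrative assumptions.
function shouldCheckpoint(job) {
  const {
    itemCount = 0,           // number of items the job iterates over
    secondsPerItem = 0,      // rough per-item duration
    expectedMinutes = 0,     // rough total duration
    callsMeteredApi = false, // rate-limited or pay-per-call API (LLM, HTTP)
    idempotent = true,       // is re-running the whole job harmless?
  } = job;

  // Any one of the four triggers is enough.
  return (
    (itemCount >= 20 && secondsPerItem >= 5) ||
    expectedMinutes >= 10 ||
    callsMeteredApi ||
    !idempotent
  );
}

console.log(shouldCheckpoint({ itemCount: 681, secondsPerItem: 6, expectedMinutes: 70 })); // true
console.log(shouldCheckpoint({ itemCount: 10, secondsPerItem: 1, expectedMinutes: 1 }));   // false
```

The thresholds are the ones listed above; estimates for duration only need to be rough, since the check errs toward checkpointing.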

Shape

The simplest checkpoint is a file listing completed item IDs. On job start: read the file; on each item completion: append its ID; on job restart: skip any ID in the file.
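
As a rough illustration of that read/append/skip cycle, here is a hand-rolled sketch that stores one ID per line. It is not the code in resumable.mjs; the names loadCompleted and runWithCheckpoint are hypothetical, and it assumes IDs contain no newlines.

```javascript
import { appendFileSync, existsSync, mkdirSync, readFileSync } from 'node:fs';
import { dirname } from 'node:path';

// Read the set of completed IDs; a missing file means nothing is done yet.
function loadCompleted(checkpointFile) {
  if (!existsSync(checkpointFile)) return new Set();
  const lines = readFileSync(checkpointFile, 'utf8').split('\n');
  return new Set(lines.map(l => l.trim()).filter(Boolean));
}

// Process items one at a time, skipping IDs already in the checkpoint file.
async function runWithCheckpoint(items, keyFn, handler, checkpointFile) {
  mkdirSync(dirname(checkpointFile), { recursive: true });
  const done = loadCompleted(checkpointFile);
  let processed = 0;
  for (const item of items) {
    const id = String(keyFn(item));
    if (done.has(id)) continue;                // finished in an earlier run
    await handler(item);                       // the actual work
    appendFileSync(checkpointFile, id + '\n'); // record only after success
    done.add(id);
    processed += 1;
  }
  return { processed, skipped: items.length - processed };
}
```

The ID is appended only after the handler resolves, so an item interrupted mid-work is retried on the next run rather than skipped.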

Reference library: tools/checkpoint-resume/resumable.mjs

import { processBatches } from './tools/checkpoint-resume/resumable.mjs';

await processBatches({
  items: lessons, // array of lesson objects, loaded elsewhere (e.g. 1713 lessons)
  keyFn: l => l.id,
  checkpointFile: '.planning/sessions/tiebreaker-checkpoint.jsonl',
  batchSize: 5,
  async handler(batch) {
    // your per-batch work
    return batch.map(l => ({ id: l.id, status: 'done' }));
  },
  onProgress({ completed, total, skipped }) {
    console.error(`${completed + skipped}/${total} (${skipped} resumed)`);
  },
});

On the first run, processBatches processes all items and appends their IDs to the checkpoint file. On resume, it reads the file and skips already-processed items.

Checkpoint Formats

Format | When
Append-only JSONL | Most jobs. One line = one completed item. Easy to read, easy to resume.
Database column | When items already live in a DB: add a processed_at TIMESTAMP column and select WHERE processed_at IS NULL at start.
Snapshot file | When checkpoint state is a complex structure (progress trees, partial outputs). Write a whole-state JSON every N items.

Prefer append-only JSONL. It is crash-safe by design, because completed lines are never rewritten.
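
A crash in the middle of an append typically leaves, at worst, a truncated last line, so a reader can stay tolerant by skipping any line that fails to parse. The sketch below assumes each record is a JSON object with an id field; that record shape and the name parseCheckpoint are assumptions for illustration, not the format resumable.mjs necessarily writes.

```javascript
// Parse JSONL checkpoint text into a set of completed IDs.
// Assumes records look like {"id": "...", ...}; that shape is an assumption.
function parseCheckpoint(text) {
  const completed = new Set();
  for (const line of text.split('\n')) {
    if (line.trim() === '') continue;
    try {
      const record = JSON.parse(line);
      if (record && record.id !== undefined) completed.add(String(record.id));
    } catch {
      // Truncated or corrupt line (e.g. crash mid-write): treat the item as not done.
    }
  }
  return completed;
}

// The last line simulates a write that was cut off by a crash.
const text = '{"id":"a","status":"done"}\n{"id":"b","status":"done"}\n{"id":"c","sta';
console.log([...parseCheckpoint(text)]); // [ 'a', 'b' ]
```

Item c is simply reprocessed on resume, which is the safe direction to err in.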

The Trade-off

Checkpointing adds file I/O per item. Usually negligible compared to the work itself. The cost of NOT checkpointing, however, is:

  • Wasted LLM calls (money)
  • Wasted API quota
  • User has to manually figure out where the job stopped
  • Worst case: job silently half-completes and corrupts DB state

Anti-patterns

  • Checkpointing to in-memory arrays only. If the process dies, so does the checkpoint.
  • Non-atomic writes. Use append-only (fsync-safe) or write-temp-then-rename.
  • Checkpoint file in /tmp. It WILL get cleaned up. Put it under .planning/sessions/ or a project-local cache dir.
  • Not logging the checkpoint file path at start. If the user needs to resume manually, they need to know where to look.
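
For snapshot-style checkpoints, the write-temp-then-rename approach mentioned above might look like the following sketch. The function name writeSnapshot and the temp-file naming are hypothetical; the approach relies on rename replacing the target in a single step, which holds on POSIX filesystems when both paths are on the same filesystem.

```javascript
import { mkdirSync, renameSync, writeFileSync } from 'node:fs';
import { dirname } from 'node:path';

// Write a whole-state snapshot: write a temp file, then rename it into place.
function writeSnapshot(snapshotFile, state) {
  mkdirSync(dirname(snapshotFile), { recursive: true });
  // Temp file lives in the same directory so the rename stays on one filesystem.
  const tmpFile = `${snapshotFile}.${process.pid}.tmp`;
  writeFileSync(tmpFile, JSON.stringify(state));
  // Readers see either the old snapshot or the new one, never a half-written file.
  renameSync(tmpFile, snapshotFile);
}
```

If the process dies before the rename, the previous snapshot is still intact and only a stray .tmp file is left behind.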

Invocation Heuristic

Before starting any long job, ask:

  • "If my process dies halfway, is the user's work gone?"
  • "If I'm Ctrl-C'd at item 500 of 1000, can I pick up at 501?"
  • "Does item K depend on item K-1, or are they independent?"

If the answers are "yes", "no", and "independent" → use checkpointing.

Example — LLM Tiebreaker (v1.49 release-history work)

Situation: 681 lessons to classify via claude -p, 5 per batch, ~30 sec per batch. Total: ~70 minutes. No checkpointing was in place.

Worst-case loss: 136 wasted LLM calls if the context had broken during batch 137, the final batch. Actual loss: 0, but only because the run happened to complete the first time.
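
The figures above follow from simple arithmetic, reproduced here as a quick check. The variable names are illustrative; the inputs are the numbers stated in the situation.

```javascript
const lessons = 681;
const batchSize = 5;
const secondsPerBatch = 30;

const batches = Math.ceil(lessons / batchSize);        // 137
const totalMinutes = (batches * secondsPerBatch) / 60; // 68.5, i.e. roughly 70 minutes
const worstCaseLostCalls = batches - 1;                // 136 completed calls lost if the last batch fails

console.log({ batches, totalMinutes, worstCaseLostCalls });
```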

Fix: wrap the batch loop in processBatches() from resumable.mjs. On resume, only unprocessed lessons get classified.

Related

  • session-observatory-live: log a checkpoint event at every completion
  • decision-framework-invoker: long jobs often produce irreversible state
  • batch-rewrite-pattern: similar batching shape, different domain