# background-jobs-queues
Designs and implements asynchronous job processing with queues, workers, retries, and scheduling. Use when implementing Celery tasks, BullMQ processors, Sidekiq workers, configuring retry policies, setting up dead-letter queues, designing idempotent jobs, scheduling periodic/cron tasks, or troubleshooting stuck/failed jobs.
Clone the full library:

```bash
git clone https://github.com/merceralex397-collab/skilllibrary
```

Or copy just this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/09-backend-api-and-data/background-jobs-queues" ~/.claude/skills/merceralex397-collab-skilllibrary-background-jobs-queues && rm -rf "$T"
```
`09-backend-api-and-data/background-jobs-queues/SKILL.md`

## Purpose
Designs and implements asynchronous job processing: queue systems (Celery, BullMQ, Sidekiq, Hangfire), message brokers (Redis, RabbitMQ, SQS), job lifecycle management (enqueue → process → retry → dead-letter), idempotency, scheduled/periodic jobs, worker concurrency, monitoring, and graceful shutdown.
## When to use this skill
Use this skill when:
- Implementing Celery tasks, BullMQ processors, Sidekiq workers, or Hangfire jobs
- Configuring retry policies (max attempts, exponential backoff, jitter)
- Setting up dead-letter queues for failed jobs
- Designing idempotent jobs (safe to re-run without side effects)
- Configuring scheduled/periodic tasks (celerybeat, node-cron, Sidekiq-Cron)
- Scaling workers (concurrency, prefetch, autoscaling)
- Implementing job chaining, workflows, or fan-out/fan-in patterns
- Debugging stuck, zombie, or poison-message jobs
- Setting up queue monitoring and alerting (queue depth, job duration, failure rate)
## Do not use this skill when

- The task is about synchronous API request handling — prefer `api-contracts` or framework-specific skills
- The task is about real-time bidirectional communication — prefer `realtime-websocket`
- The task is about database-level scheduling (pg_cron, MySQL events) — prefer `postgresql` or `sqlite`
- The task is about rate limiting API endpoints — prefer `rate-limits-retries`
## Operating procedure

1. **Identify the job and its trigger.** Determine what initiates the job (API request, schedule, event, another job). Define the job's input payload — keep it small and serializable (IDs, not full objects). Document the expected duration and resource usage.
2. **Choose the queue system and broker.** Match to the project stack:
   - Python → Celery + Redis/RabbitMQ, or Django-Q, or Dramatiq
   - Node.js → BullMQ + Redis, or SQS consumer
   - Ruby → Sidekiq + Redis, or GoodJob (Postgres-backed)
   - .NET → Hangfire + SQL Server/Redis, or MassTransit + RabbitMQ
   - Cloud-native → AWS SQS + Lambda, GCP Cloud Tasks, Azure Queue Storage
3. **Implement the job as an idempotent function.** The job must produce the same result if executed multiple times with the same input. Use idempotency keys: store the key before processing, check for existence before executing side effects. Return early if already processed.
4. **Configure retry policy.** Set max retries (typically 3–5 for transient failures). Use exponential backoff with jitter: `delay = base * 2^attempt + random(0, jitter)`. Classify errors: transient (network timeout, 503) → retry; permanent (400, validation error) → send to dead-letter immediately. (Steps 3 and 4 are sketched in code after this list.)
5. **Set up dead-letter queue (DLQ).** Jobs that exhaust all retries go to the DLQ. The DLQ must be monitored — alert if depth > 0. Build a reprocessing mechanism: inspect the failed job, fix the root cause, replay from the DLQ.
6. **Configure worker concurrency and resources.** Set concurrency based on job type: I/O-bound jobs → higher concurrency (e.g., Celery `--concurrency=16`); CPU-bound jobs → match to available cores. Set memory limits per worker. Configure prefetch to control how many jobs a worker pulls ahead.
7. **Implement graceful shutdown.** Workers must finish in-flight jobs before stopping. Configure shutdown timeout (e.g., Celery `--soft-time-limit`, BullMQ `connection.disconnect()`). Kubernetes: set `terminationGracePeriodSeconds` > max job duration.
8. **Add monitoring and alerting.** Track: queue depth (jobs waiting), job duration (p50, p95, p99), failure rate, DLQ depth. Tools: Flower (Celery), Bull Board (BullMQ), Sidekiq Web UI, Prometheus + Grafana. Alert on: queue depth growing, DLQ non-empty, job duration exceeding SLA.
9. **Test the job.** Unit test the job function with mocked dependencies. Integration test: enqueue → process → verify side effects. Test retry behavior: simulate a transient failure, verify retry count and backoff. Test idempotency: run the same job twice, verify no duplicate side effects.
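To ground steps 3 and 4, here is a minimal sketch assuming Celery with redis-py as the idempotency-key store. `process_payment`, `charge_payment`, the key format, and the backoff constants are hypothetical placeholders, not a prescribed implementation:

```python
import random

import redis
from celery import Celery

app = Celery("worker", broker="redis://localhost:6379/0")
redis_client = redis.Redis()  # shared store for idempotency keys


class PermanentError(Exception):
    """Validation or authorization failure: do not retry."""


def charge_payment(payment_id: str) -> None:
    ...  # hypothetical side effect, e.g. a call to the payment provider


@app.task(bind=True, max_retries=5)
def process_payment(self, payment_id: str) -> str:
    # Idempotency key: claim it before executing any side effect.
    # SET NX fails if another execution already claimed the key.
    key = f"job:process_payment:{payment_id}"
    if not redis_client.set(key, "in-progress", nx=True, ex=86_400):
        return "already processed"  # safe early return on re-delivery

    try:
        charge_payment(payment_id)
        return "done"
    except PermanentError:
        raise  # permanent failure: fail fast so it routes to the DLQ
    except Exception as exc:
        # Transient failure: exponential backoff with jitter,
        # delay = base * 2^attempt + random(0, jitter)
        delay = 2 * (2 ** self.request.retries) + random.uniform(0, 1)
        redis_client.delete(key)  # release the claim so the retry can run
        raise self.retry(exc=exc, countdown=delay)
```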
## Decision rules

- **Jobs must be idempotent.** If a job cannot be made idempotent, it must use exactly-once semantics with distributed locking — and you must document why idempotency is impossible.
- **Payloads must be small.** Pass IDs and look up the data inside the job. Never pass large objects, file contents, or database rows in the payload. This prevents serialization failures and stale data.
- **Retry only transient failures.** Permanent failures (validation errors, missing records, authorization failures) should go directly to the DLQ — retrying them wastes resources and delays the alert.
- **Every queue has a DLQ.** No exceptions. An unmonitored DLQ is the same as no DLQ.
- **Scheduled jobs need overlap protection.** If a cron job runs every 5 minutes but takes 7 minutes, you'll get overlapping executions. Use a distributed lock (Redis `SET NX EX`) or the framework's built-in overlap prevention; see the lock sketch after this list.
- **Prefer at-least-once delivery.** Exactly-once is expensive and fragile. Design for at-least-once with idempotency.
- **Workers must not store state in memory across jobs.** Each job execution starts clean. Shared state goes in the database or cache.
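A minimal sketch of the overlap-protection rule, assuming redis-py; the lock name, TTL, and `run_report` are hypothetical:

```python
import uuid

import redis

r = redis.Redis()


def run_report() -> None:
    ...  # hypothetical periodic work


def run_scheduled_report() -> None:
    token = str(uuid.uuid4())
    # NX: set only if absent. EX: auto-expire so a crashed holder cannot
    # wedge the schedule. The TTL must exceed the worst-case job duration.
    if not r.set("lock:scheduled-report", token, nx=True, ex=600):
        return  # previous run still in flight: skip instead of overlapping

    try:
        run_report()
    finally:
        # Release only if we still hold the lock (it may have expired and
        # been claimed by a newer run). This check-then-delete is not
        # atomic; production code would use a Lua script or redis-py's
        # built-in Lock helper.
        if r.get("lock:scheduled-report") == token.encode():
            r.delete("lock:scheduled-report")
```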
## Output requirements

- **Job Definition** — function signature, input payload schema, expected duration
- **Queue Configuration** — broker, queue name, concurrency, retry policy
- **Idempotency Strategy** — how duplicate execution is prevented
- **DLQ Plan** — dead-letter queue setup and reprocessing procedure
- **Monitoring** — metrics tracked and alert thresholds
## References

Read these when the task involves the relevant pattern:

- `references/implementation-patterns.md` — Celery task patterns, BullMQ processors, Sidekiq workers, job chaining, scheduled jobs, idempotency keys, DLQ processing
- `references/validation-checklist.md` — idempotency verification, retry policy, DLQ monitoring, graceful shutdown, payload size
- `references/failure-modes.md` — poison messages, OOM workers, retry storms, lost jobs, zombie workers
## Anti-patterns

- **The fire-and-forget job.** Enqueuing a job with no retry policy, no DLQ, and no monitoring. The job fails silently and nobody notices for days.
- **Fat payloads.** Passing entire database rows, file contents, or HTML in the job payload. The payload grows, serialization breaks, broker memory spikes.
- **Non-idempotent side effects.** Sending an email on every retry attempt. Charging a credit card twice. Use idempotency keys or deduplication.
- **Unbounded retries.** `max_retries=None` or `retries: Infinity`. A permanently failing job retries forever, consuming worker capacity.
- **Synchronous job in the request path.** Enqueuing a job and then polling for its result in the HTTP response. This is a synchronous call with extra steps. Use a webhook or a polling endpoint instead.
- **Global worker for all queues.** One worker process consuming from every queue. A slow job in the `email` queue blocks time-sensitive jobs in the `payment` queue. Use dedicated workers per queue or priority queues (see the routing sketch after this list).
- **Cron job without overlap protection.** A scheduled job that takes longer than its interval, causing pile-up. Use locking or skip-if-running logic.
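As one way to fix the global-worker anti-pattern, here is a minimal routing sketch assuming Celery; the task and queue names are hypothetical:

```python
from celery import Celery

app = Celery("worker", broker="redis://localhost:6379/0")

# Route each task family to its own queue so a slow email job can never
# starve time-sensitive payment jobs.
app.conf.task_routes = {
    "tasks.send_email": {"queue": "email"},
    "tasks.process_payment": {"queue": "payment"},
}

# Then run one worker per queue, sized independently, e.g.:
#   celery -A worker worker -Q payment --concurrency=8
#   celery -A worker worker -Q email --concurrency=2
```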
## Related skills

- `rate-limits-retries` — retry strategies, backoff patterns, circuit breakers
- `observability-logging` — structured logging, metrics, alerting
- `orm-patterns` — database access patterns within jobs
- `data-model` — schema for job state tables, idempotency key storage
- `webhooks-events` — event-driven job triggers
## Failure handling

- **If the broker (Redis/RabbitMQ) is unreachable:** Jobs cannot be enqueued. The application must handle this gracefully — either queue to an in-memory fallback with disk persistence, or return an error to the caller. Never silently drop the job.
- **If a worker crashes mid-job:** The job must be re-delivered by the broker (the visibility timeout expires and the message is re-queued). The job must be idempotent to handle this safely. Check the broker's acknowledgement settings — auto-ack before processing means lost jobs on crash.
- **If the DLQ fills up:** This is an operational emergency, not a normal state. Page the on-call engineer. Do not auto-retry from the DLQ without human inspection.
- **If job duration exceeds the timeout:** The worker kills the job (Celery `SoftTimeLimitExceeded`, BullMQ stalled-job recovery). Ensure the job is designed to be re-entrant, or use checkpointing for long-running work.
- **If you cannot determine whether a job is idempotent:** Assume it is not. Add an idempotency key before deploying. Test by running the job twice with the same payload and verifying no duplicate side effects (see the test sketch below).
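A minimal sketch of that run-it-twice check, assuming pytest and the hypothetical `process_payment` task from the operating procedure, executed eagerly via Celery's `apply`:

```python
# `fake_payment_gateway` is a hypothetical test double (a pytest fixture);
# `process_payment` is imported from the sketch task module above.
def test_job_is_idempotent(fake_payment_gateway):
    payload = {"payment_id": "pay_123"}

    process_payment.apply(kwargs=payload)  # first delivery
    process_payment.apply(kwargs=payload)  # simulated re-delivery

    # At-least-once delivery ran the task twice; the charge must not double.
    assert fake_payment_gateway.charge_count("pay_123") == 1
```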