API Long-Running Operations
Long-running operations require an explicit async contract. A 202 Accepted response with an operation resource that clients can poll or subscribe to is the difference between a synchronous bottleneck that times out under load and a scalable pattern where servers process work independently of client connection lifetime.
When to Use
- Designing an API endpoint that performs work exceeding typical HTTP timeout windows (30–60 seconds): video transcoding, report generation, bulk data export, machine learning inference
- Replacing a synchronous endpoint that causes client timeouts or gateway 502/504 errors under load
- Choosing between polling and webhook/callback notification for an async operation consumer
- Writing the long-running operations section of an API style guide or developer portal
- Auditing an existing API for synchronous endpoints that should be converted to async patterns
- Implementing an operation status resource that supports cancellation and progress reporting
Instructions
Key Concepts
- 202 Accepted and the operation resource: When a client submits a request that will take longer than a few seconds, the server immediately returns 202 Accepted with a URL in the Location header pointing to an operation resource that represents the in-progress work. The operation resource is a first-class API resource with its own URL, creation timestamp, status, and eventually a result or error. This decouples the client connection from the execution duration: the client disconnects after receiving 202 and reconnects later to check status. Example: Location: /operations/op_abc123.
- Operation resource schema: Every operation resource must include id (unique operation ID), status (one of pending, running, succeeded, failed, cancelled), created_at (ISO 8601), updated_at, and either result (on success) or error (on failure). Optionally include progress (0–100 integer), estimated_completion (ISO 8601), and metadata (client-supplied context echoed back). Google's AIP-151 defines the canonical operation resource schema: { "name": "operations/abc123", "done": false, "metadata": { "@type": "...", "progress": 42 } }. Stripe's FileLink async pattern and PayPal's PAYOUT resource follow the same structure.
- Polling design (status endpoint): The client polls GET /operations/{id} at an interval to check progress. The response includes the current status. When status is succeeded, the response includes the result or a URL to retrieve it. When status is failed, the response includes a structured error. Best practices: return Retry-After on 202 and on intermediate polling responses to tell the client the suggested polling interval; use exponential backoff in clients (start at 1 second, cap at 30 seconds); document the maximum expected operation duration so clients know when to give up.
- Webhook/callback notification: Instead of polling, clients can register a callback URL on the operation creation request: { "callback_url": "https://client.example/hooks/operation-complete" }. The server sends a webhook delivery to the callback URL when the operation reaches a terminal state (succeeded, failed, cancelled). This eliminates polling traffic and delivers results with lower latency. The callback payload should match the operation resource schema. Implement the same signature verification and retry policy as the webhook system (see api-webhook-design). Provide both polling and callback options: some consumers prefer polling for simplicity; others prefer callbacks for latency.
- Cancellation: Operations that are still pending or running should support cancellation: POST /operations/{id}/cancel returns 200 with the updated operation resource in cancelled state, or 409 if the operation has already reached a terminal state. Cancellation is best-effort: an operation that is in the final stage of execution may complete before the cancellation is processed. Document this clearly: "Cancellation is best-effort; a cancel request does not guarantee the operation will not complete."
- Idempotency and operation deduplication: Operation creation requests should accept an Idempotency-Key header. If a client creates an operation with a key, times out before receiving the 202, and retries with the same key, the server must return the existing operation resource (not start a new one). Without idempotency support on operation creation, a client timeout results in multiple copies of the same work running concurrently. See api-idempotency-keys for key management details.
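The schema, cancellation, and idempotency rules above can be sketched together as a small in-memory store. This is a minimal illustration, not a reference implementation: the OperationStore class, its method names, and the op_ identifier format are hypothetical, and a real service would persist operations in a database and run the work in a background queue.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional, Tuple

TERMINAL = {"succeeded", "failed", "cancelled"}


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


class OperationStore:
    """Hypothetical in-memory store for operation resources."""

    def __init__(self) -> None:
        self._ops = {}      # operation id -> operation resource (dict)
        self._by_key = {}   # Idempotency-Key value -> operation id

    def create(self, idempotency_key: Optional[str] = None) -> Tuple[int, dict]:
        # A retried request carrying a known key gets the existing operation
        # back instead of starting a second copy of the work.
        if idempotency_key is not None and idempotency_key in self._by_key:
            return 202, self._ops[self._by_key[idempotency_key]]
        now = _now()
        op = {
            "id": "op_" + uuid.uuid4().hex[:8],
            "status": "pending",
            "created_at": now,
            "updated_at": now,
        }
        self._ops[op["id"]] = op
        if idempotency_key is not None:
            self._by_key[idempotency_key] = op["id"]
        return 202, op

    def cancel(self, op_id: str) -> Tuple[int, dict]:
        op = self._ops[op_id]
        if op["status"] in TERMINAL:
            return 409, op  # already terminal: nothing left to cancel
        op["status"] = "cancelled"
        op["updated_at"] = _now()
        return 200, op
```

Calling create twice with the same key returns the same operation id, and a second cancel on the same operation yields 409, which mirrors the contract described in the bullets.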
Worked Example
Google Cloud Vision API — async document text detection (AIP-151)
Submit a long-running text detection job:
    POST /v1/files:asyncBatchAnnotate
    Authorization: Bearer ya29.xxx
    Content-Type: application/json

    {
      "requests": [{
        "inputConfig": {
          "gcsSource": { "uri": "gs://my-bucket/document.pdf" },
          "mimeType": "application/pdf"
        },
        "features": [{ "type": "DOCUMENT_TEXT_DETECTION" }],
        "outputConfig": {
          "gcsDestination": { "uri": "gs://my-bucket/output/" }
        }
      }]
    }

    → HTTP/1.1 200 OK
    {
      "name": "projects/my-project/operations/abc123def456",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.vision.v1.OperationMetadata",
        "state": "CREATED",
        "createTime": "2024-04-10T12:00:00Z"
      },
      "done": false
    }
Poll for completion:
    GET /v1/projects/my-project/operations/abc123def456
    Authorization: Bearer ya29.xxx

    → HTTP/1.1 200 OK
    {
      "name": "projects/my-project/operations/abc123def456",
      "metadata": {
        "state": "RUNNING",
        "updateTime": "2024-04-10T12:00:05Z"
      },
      "done": false
    }
Terminal success:
    → HTTP/1.1 200 OK
    {
      "name": "projects/my-project/operations/abc123def456",
      "metadata": {
        "state": "DONE",
        "updateTime": "2024-04-10T12:00:45Z"
      },
      "done": true,
      "response": {
        "@type": "type.googleapis.com/google.cloud.vision.v1.AsyncBatchAnnotateFilesResponse",
        "responses": [{
          "outputConfig": {
            "gcsDestination": { "uri": "gs://my-bucket/output/" }
          }
        }]
      }
    }
Terminal failure:
    {
      "name": "projects/my-project/operations/abc123def456",
      "done": true,
      "error": {
        "code": 5,
        "message": "Input file not found: gs://my-bucket/document.pdf",
        "status": "NOT_FOUND"
      }
    }
REST API pattern (non-Google) — bulk export with callback:
    POST /v1/exports
    Authorization: Bearer token_xxx
    Idempotency-Key: 7f3a9b2c-1e4d-4f8a-9c3b-2e5f6a7d8e9f
    Content-Type: application/json

    {
      "format": "csv",
      "date_range": { "start": "2024-01-01", "end": "2024-03-31" },
      "callback_url": "https://app.acme.com/hooks/export-complete"
    }

    → HTTP/1.1 202 Accepted
    Location: /v1/operations/op_7x9bQ3mR
    Retry-After: 30
    Content-Type: application/json

    {
      "id": "op_7x9bQ3mR",
      "status": "pending",
      "created_at": "2024-04-10T12:00:00Z",
      "updated_at": "2024-04-10T12:00:00Z",
      "estimated_completion": "2024-04-10T12:02:00Z"
    }
When complete, the server sends a callback delivery to https://app.acme.com/hooks/export-complete:

    {
      "id": "op_7x9bQ3mR",
      "status": "succeeded",
      "created_at": "2024-04-10T12:00:00Z",
      "updated_at": "2024-04-10T12:01:47Z",
      "result": {
        "download_url": "https://storage.acme.com/exports/op_7x9bQ3mR.csv",
        "expires_at": "2024-04-11T12:01:47Z",
        "row_count": 142350
      }
    }
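On the consumer side, a callback payload like the one above might be handled roughly as follows. This is a sketch under assumptions: the CallbackReceiver class and its return strings are hypothetical, signature verification is omitted (it belongs to the webhook layer described in api-webhook-design), and deduplication by operation id is one plausible way to cope with retried deliveries rather than a documented requirement.

```python
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


class CallbackReceiver:
    """Hypothetical consumer-side handler for operation callbacks."""

    def __init__(self) -> None:
        self._seen = set()  # operation ids that were already processed

    def handle(self, payload: dict) -> str:
        op_id = payload["id"]
        status = payload["status"]
        if status not in TERMINAL_STATUSES:
            # Callbacks are only expected once an operation is terminal.
            return "ignored"
        if op_id in self._seen:
            # A retried delivery: acknowledge it without reprocessing.
            return "duplicate"
        self._seen.add(op_id)
        if status == "succeeded":
            # The result is inline, so no extra polling request is needed.
            return "fetch:" + payload["result"]["download_url"]
        return "report:" + status
```

Returning a short action string keeps the sketch free of I/O; a real handler would enqueue the download or surface the failure, then respond to the sender quickly.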
Anti-Patterns
- Blocking the HTTP connection for the full operation duration. Holding a connection open for minutes while a background job completes ties up server threads/connections, triggers load balancer and gateway timeouts (typically 30–60 seconds), and provides no recoverability if the connection drops mid-job. Return 202 immediately with a Location header to the operation resource; process the work asynchronously.
- No operation resource (polling a state field on the original resource). Returning { "status": "processing" } on the original resource URL conflates the operation state with the resource state. The resource URL represents the entity (the export, the report); the operation URL represents the specific execution. Use a separate operation resource so the same entity can have multiple historical operations without resource state ambiguity.
- Omitting Retry-After on polling responses. Without a Retry-After hint, clients implement their own polling intervals, often too aggressively (every second) or too conservatively (every 5 minutes). A bulk export that completes in 90 seconds either receives roughly 90 poll requests, nearly all of them unnecessary, or delivers its result about 3.5 minutes late (the first 5-minute poll lands at 300 seconds). Include Retry-After on the 202 and on intermediate poll responses.
- Terminal state responses that do not include the result or error inline. A polling response that says status: succeeded but requires a second request to retrieve the result adds latency and API round-trips. When the operation is complete, include the result or a download URL in the same response that reports succeeded. Only use a separate result endpoint if the result is too large to include inline (e.g., a multi-gigabyte file reference).
- No cancellation support. Operations that cannot be cancelled force clients who submit erroneous or duplicate jobs to wait for completion before they can retry correctly. Long-running jobs that process large datasets waste significant compute if they cannot be stopped early. Always implement POST /operations/{id}/cancel for operations that run for more than a few seconds.
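The Retry-After guidance can be illustrated from the client side. The sketch below assumes two hypothetical helpers, next_poll_delay and poll_until_done: the client honors a delta-seconds Retry-After value when the server sends one and otherwise falls back to exponential backoff starting at 1 second and capped at 30 seconds. The fetch and sleep callables are injected so the loop does not depend on any particular HTTP library.

```python
from typing import Callable, Optional, Tuple

TERMINAL = {"succeeded", "failed", "cancelled"}


def next_poll_delay(attempt: int, retry_after: Optional[str] = None,
                    base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait after poll number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After in delta-seconds form; the HTTP-date form is not
            # handled in this sketch and falls through to backoff.
            return max(0.0, float(retry_after))
        except ValueError:
            pass
    return min(cap, base * (2 ** attempt))


def poll_until_done(fetch: Callable[[], Tuple[dict, Optional[str]]],
                    sleep: Callable[[float], None],
                    max_attempts: int = 20) -> dict:
    """Poll until the operation is terminal or attempts run out.

    `fetch` returns (operation body, Retry-After header value or None).
    """
    for attempt in range(max_attempts):
        op, retry_after = fetch()
        if op["status"] in TERMINAL:
            return op
        sleep(next_poll_delay(attempt, retry_after))
    raise TimeoutError("operation did not reach a terminal state")
```

In production code `sleep` would typically be time.sleep and `fetch` a GET on the operation URL; the max_attempts bound stands in for the documented maximum operation duration.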
Details
AIP-151 and the Google Long-Running Operations Standard
Google's API Improvement Proposal AIP-151 defines the authoritative specification for long-running operations across all Google Cloud APIs. The key requirements:
- The operation resource name follows the pattern {collection}/operations/{id}.
- The done boolean field (false = in progress, true = terminal) must be present on every response.
- On terminal success, a response field contains the result typed with the full protobuf type URL.
- On terminal failure, an error field contains a google.rpc.Status with a gRPC status code, message, and optional details array.
- Operations must support a cancel method: POST {name}:cancel.
- Operations must support a delete method after reaching a terminal state.
For REST APIs not using protobuf, the AIP-151 schema maps naturally: done: true becomes "status": "succeeded" | "failed" (and done: false becomes a non-terminal status such as pending or running), the response and error fields remain, and the name field becomes id with a path-based URL.
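The mapping just described might look like the following. The function name aip151_to_rest is hypothetical, and one detail is an assumption rather than part of AIP-151: because done carries no pending/running distinction, the sketch treats a metadata.state of "CREATED" as pending and any other non-terminal state as running.

```python
def aip151_to_rest(op: dict) -> dict:
    """Map an AIP-151 style operation dict to a flat REST operation."""
    # name "projects/my-project/operations/abc123" -> id "abc123"
    rest = {"id": op["name"].rsplit("/", 1)[-1]}
    if not op.get("done", False):
        # Assumption: derive pending vs. running from metadata.state.
        state = op.get("metadata", {}).get("state")
        rest["status"] = "pending" if state == "CREATED" else "running"
    elif "error" in op:
        rest["status"] = "failed"
        rest["error"] = op["error"]          # error field is kept as-is
    else:
        rest["status"] = "succeeded"
        rest["response"] = op.get("response")  # response field is kept as-is
    return rest
```

Applied to the terminal failure payload from the worked example, this yields an operation with id abc123def456, status failed, and the original error object.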
Real-World Case Study: Twilio Media Processing
Twilio's Media Content API processes uploaded audio and video files asynchronously. When a customer uploads a recording for transcription, Twilio returns an operation resource immediately and sends a status callback to the registered URL when transcription completes. Twilio's published data shows that P50 transcription latency is 8 seconds for short recordings and P99 exceeds 90 seconds for recordings longer than 1 hour.
Before implementing the async pattern, Twilio's synchronous transcription endpoint had a 30-second hard timeout enforced by their API gateway. Recordings longer than a few minutes consistently returned 504 Gateway Timeout. After migrating to the 202 + operation resource pattern, timeout-related support tickets dropped by 99%. The callback mechanism reduced average time-to-result for customers from the polling interval (up to 30 seconds) to under 2 seconds for most recordings.
Source
- Google AIP-151 — Long-Running Operations
- Google Cloud Operations Reference
- Stripe — File Links and Async Patterns
- Microsoft REST API Guidelines — Long Running Operations
- PayPal Async APIs
Process
- Identify endpoints where P99 execution time exceeds 10 seconds; these are candidates for the 202 + operation resource pattern.
- Define the operation resource schema with id, status, created_at, updated_at, and terminal-state result/error fields; publish the schema in developer documentation.
- Return 202 Accepted with a Location header and a Retry-After hint immediately after enqueueing the background job; do not wait for execution to begin.
- Implement GET /operations/{id} for polling and optionally accept callback_url on the creation request for push notification; support POST /operations/{id}/cancel for in-progress operations.
- Run harness validate to confirm skill files are well-formed and related skills are correctly cross-referenced.
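The response-shaping steps in this process can be sketched without committing to a web framework. The helpers accepted_response and poll_response below are hypothetical names; each returns a plain (status, headers, body) triple, and the 30-second default for Retry-After is an assumption taken from the worked example rather than a recommended value.

```python
TERMINAL = {"succeeded", "failed", "cancelled"}


def accepted_response(op: dict, retry_after_s: int = 30,
                      base_path: str = "/operations"):
    """Build the reply sent right after the job is enqueued."""
    headers = {
        "Location": f"{base_path}/{op['id']}",
        "Retry-After": str(retry_after_s),
        "Content-Type": "application/json",
    }
    return 202, headers, op


def poll_response(op: dict, retry_after_s: int = 30):
    """Build the reply for GET /operations/{id}."""
    headers = {"Content-Type": "application/json"}
    if op["status"] not in TERMINAL:
        # Only in-progress operations carry a polling hint; terminal
        # responses include result or error inline instead.
        headers["Retry-After"] = str(retry_after_s)
    return 200, headers, op
```

A framework adapter would translate each triple into its own response object; keeping the logic in plain functions makes the header rules easy to unit test.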
Harness Integration
- Type: knowledge — this skill is a reference document, not a procedural workflow.
- No tools or state — consumed as context by other skills and agents.
- related_skills: api-webhook-design, api-status-codes, api-http-methods, api-error-contracts
Success Criteria
- Endpoints with P99 execution time exceeding 10 seconds return 202 Accepted with a Location header pointing to an operation resource, not a synchronous response.
- The operation resource schema includes id, status, created_at, updated_at, and result or error fields in terminal states.
- Polling responses include a Retry-After header indicating the suggested polling interval.
- Cancellation is supported via POST /operations/{id}/cancel for operations in pending or running state.
- Operation creation accepts an Idempotency-Key header so clients can safely retry creation requests after a timeout.