Skillshub anth-prod-checklist

install

source · Clone the upstream repo

git clone https://github.com/ComeOnOliver/skillshub

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/anth-prod-checklist" ~/.claude/skills/comeonoliver-skillshub-anth-prod-checklist && rm -rf "$T"

manifest: skills/jeremylongshore/claude-code-plugins-plus-skills/anth-prod-checklist/SKILL.md

Anthropic Production Checklist

Overview

Complete checklist for deploying Claude API integrations to production with reliability, observability, and cost controls.

Pre-Launch Checklist

Authentication & Keys

Production API key from dedicated Workspace
Key stored in secret manager (not env files on servers)
Key rotation procedure documented and tested
Separate keys for each environment (dev/staging/prod)
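
The key-handling items above can be combined at startup: resolve the key for the current environment from a secret manager and fail fast if it is missing. The following is a minimal sketch under stated assumptions. `fetch_secret` stands in for whatever secret-manager client you use, and the secret names are hypothetical.

```python
from typing import Callable

# Hypothetical secret names, one per environment; adjust to your secret manager's layout.
SECRET_NAMES = {
    "dev": "anthropic/dev/api-key",
    "staging": "anthropic/staging/api-key",
    "prod": "anthropic/prod/api-key",
}


def load_api_key(env: str, fetch_secret: Callable[[str], str]) -> str:
    """Return the API key for `env`, fetched through the injected secret-manager client."""
    if env not in SECRET_NAMES:
        raise ValueError(f"unknown environment: {env!r}")
    key = fetch_secret(SECRET_NAMES[env]).strip()
    if not key:
        # Fail at startup rather than on the first production request.
        raise RuntimeError(f"empty API key for environment {env!r}")
    return key
```

The returned value can then be passed to the SDK client, for example `anthropic.Anthropic(api_key=load_api_key("prod", fetch_secret))`. Injecting `fetch_secret` keeps the function testable without a real secret store.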

Error Handling

These five error types handled (the API also defines permission_error, not_found_error, and request_too_large):

authentication_error

invalid_request_error

rate_limit_error

api_error

overloaded_error

SDK `maxRetries` (`max_retries` in the Python SDK) set (recommended: 3-5 for production)
Custom error logging with `request-id` captured
Circuit breaker for sustained API failures
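
The circuit breaker item can be sketched as a small wrapper around API calls. This is an illustrative sketch, not part of the SDK: the class names, the default threshold, and the cool-down are assumptions. For brevity it counts every exception as a failure; in practice you would likely count only server-side failures such as `api_error` and `overloaded_error`, not client errors like `invalid_request_error`.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; skipping API call")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A call site would look like `breaker.call(client.messages.create, **kwargs)`, with `CircuitOpenError` handled by whatever degraded path the application offers. The sketch is not thread-safe; add a lock if the breaker is shared across threads.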

Rate Limits & Cost

Usage tier verified at console.anthropic.com
Application-level rate limiting implemented
Cost alerts configured (monthly spend caps)
Model selection optimized (Haiku for simple tasks, Sonnet for complex)
`max_tokens` set to realistic values (not inflated)
Prompt caching enabled for repeated system prompts
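
Application-level rate limiting is often implemented as a token bucket placed in front of the API client. The sketch below is one possible shape, assuming a single-process, single-threaded caller; the class name and parameters are illustrative, and the rate should be chosen to sit below the limits of your usage tier.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate_per_s` per second."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate_per_s)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

When `try_acquire()` returns `False`, the caller can queue the request, delay it, or reject it before it ever reaches the API, which keeps 429 responses from becoming the primary throttle.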

Reliability

Timeout configured (`timeout` parameter, recommended 60-120s)
Graceful degradation when API is unavailable
Health check endpoint tests API connectivity

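import anthropic

# Async client so the probe can be awaited inside an async endpoint;
# reads ANTHROPIC_API_KEY from the environment.
async_client = anthropic.AsyncAnthropic()

# Assumption: substitute any Haiku-class model ID your workspace can access.
HEALTH_CHECK_MODEL = "claude-haiku-4-5"

async def health_check():
    try:
        # Use token counting as a cheap health probe (no generation cost)
        count = await async_client.messages.count_tokens(
            model=HEALTH_CHECK_MODEL,
            messages=[{"role": "user", "content": "ping"}]
        )
        return {"status": "healthy", "tokens": count.input_tokens}
    except Exception as e:
        return {"status": "degraded", "error": str(e)}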
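
Graceful degradation, listed above, can be as simple as routing selected failures to a fallback instead of surfacing them to the user. The helper below is a hypothetical sketch; `with_fallback` and its parameters are illustrative names, not part of any SDK.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T],
                  fallback: Callable[[Exception], T],
                  handled: Tuple[Type[Exception], ...] = (Exception,)) -> T:
    """Run `primary`; if it raises one of `handled`, return `fallback(exc)` instead."""
    try:
        return primary()
    except handled as exc:
        return fallback(exc)
```

In an integration, `primary` would wrap the `messages.create` call and `handled` would be narrowed to availability failures, for example `anthropic.APIConnectionError`, `anthropic.APITimeoutError`, and `anthropic.InternalServerError`, so that request bugs still surface as errors. The fallback might return a cached answer or a canned "temporarily unavailable" message.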

Observability

Request/response logging (redact content, keep metadata)
Latency tracking (p50, p95, p99)
Token usage tracking (input + output per request)
Cost tracking per feature/customer
Error rate alerting (429s, 5xx, timeouts)

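import logging
import time

import anthropic

logger = logging.getLogger("anthropic")
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def tracked_create(**kwargs):
    """Call messages.create and log metadata only (no prompt or response content)."""
    start = time.monotonic()
    try:
        response = client.messages.create(**kwargs)
        duration = time.monotonic() - start
        logger.info(
            "claude_request",
            extra={
                # The SDK exposes the request-id response header as `_request_id`
                "request_id": response._request_id,
                "model": response.model,
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "duration_ms": int(duration * 1000),
                "stop_reason": response.stop_reason,
            }
        )
        return response
    except Exception as e:
        duration = time.monotonic() - start
        logger.error(
            "claude_error",
            extra={
                "error": str(e),
                "error_type": type(e).__name__,
                # API status errors carry a request_id; other exceptions may not
                "request_id": getattr(e, "request_id", None),
                "duration_ms": int(duration * 1000),
            },
        )
        raise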
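
The latency item (p50, p95, p99) can be computed from the `duration_ms` values logged above. In production a metrics backend would normally do this; the sketch below only illustrates the calculation, using the nearest-rank definition of a percentile. The function names are illustrative.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(durations_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize request durations at the percentiles named in the checklist."""
    return {f"p{p}": percentile(durations_ms, p) for p in (50, 95, 99)}
```

Nearest-rank always returns an observed value, which makes the numbers easy to trace back to individual requests. Interpolating definitions give slightly different results on small samples.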

Content Safety

System prompts reviewed for injection resistance
User input validated and length-limited
Output scanned for sensitive data leakage
Content moderation for user-facing responses
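
Two of the items above, input validation and output scanning, can be sketched with the standard library alone. The length limit and the regular expressions below are illustrative assumptions. Real deployments usually need patterns tuned to their own data, and pattern matching alone does not make a system prompt injection-resistant.

```python
import re
from typing import List, Tuple

MAX_INPUT_CHARS = 8000  # hypothetical limit; size it to your prompt budget

# Illustrative patterns only; extend for the sensitive data your application handles.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "api_key": re.compile(r"sk-[A-Za-z0-9_-]{16,}"),
}


def validate_user_input(text: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Drop non-printable characters (keeping newlines and tabs) and enforce a length limit."""
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t").strip()
    if not cleaned:
        raise ValueError("empty input")
    if len(cleaned) > max_chars:
        raise ValueError(f"input exceeds {max_chars} characters")
    return cleaned


def redact_output(text: str) -> Tuple[str, List[str]]:
    """Replace matches of each sensitive pattern and report which labels were found."""
    found = []
    for label, pattern in SENSITIVE_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED {label}]", text)
        if count:
            found.append(label)
    return text, found
```

Returning the list of matched labels lets the caller log that a leak was caught without logging the leaked value itself.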

Infrastructure

Deployment uses canary/rolling strategy
Rollback procedure documented and tested
Runbook created (see `anth-incident-runbook`)
On-call escalation path defined

Alerting Thresholds

| Metric | Warning | Critical |
|---|---|---|
| Error rate (5xx) | > 1% | > 5% |
| p99 latency | > 10s | > 30s |
| 429 rate | > 5/min | > 20/min |
| Daily cost | > 80% budget | > 100% budget |
| Auth failures (401/403) | > 0 | > 0 (immediate) |
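
The thresholds in the table can be encoded as data so that alert evaluation stays in one place. This is a minimal sketch: the metric keys are hypothetical names, and the values are copied from the table.

```python
# (warning, critical) thresholds from the table above; metric keys are hypothetical.
THRESHOLDS = {
    "error_rate_5xx_pct": (1.0, 5.0),
    "p99_latency_s": (10.0, 30.0),
    "rate_429_per_min": (5.0, 20.0),
    "daily_cost_pct_of_budget": (80.0, 100.0),
    "auth_failures": (0.0, 0.0),  # any auth failure is immediately critical
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

Checking the critical bound first means a metric whose two thresholds coincide, as with auth failures, escalates straight to critical.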

Next Steps

For version upgrades, see `anth-upgrade-migration`.