Claude-code-plugins-plus-skills langchain-prod-checklist

install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/langchain-pack/skills/langchain-prod-checklist" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-langchain-prod-checklist && rm -rf "$T"
manifest: plugins/saas-packs/langchain-pack/skills/langchain-prod-checklist/SKILL.md
source content

LangChain Production Checklist

Overview

Comprehensive go-live checklist for deploying LangChain applications to production. Covers configuration, resilience, observability, performance, security, testing, deployment, and cost management.

1. Configuration & Secrets

  • All API keys in a secrets manager (not .env in production)
  • Environment-specific configs (dev/staging/prod) validated with Zod
  • Startup validation fails fast on missing config
  • .env files in .gitignore
// Startup validation
import { z } from "zod";

const ProdConfig = z.object({
  OPENAI_API_KEY: z.string().startsWith("sk-"),
  LANGSMITH_API_KEY: z.string().startsWith("lsv2_"),
  NODE_ENV: z.literal("production"),
});

try {
  ProdConfig.parse(process.env);
} catch (e) {
  console.error("Invalid production config:", e);
  process.exit(1);
}

2. Error Handling & Resilience

  • maxRetries configured on all models (3-5)
  • timeout set on all models (30-60s)
  • Fallback models configured with .withFallbacks()
  • Error responses return safe messages (no stack traces to users)
const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 5,
  timeout: 30000,
}).withFallbacks({
  fallbacks: [new ChatAnthropic({ model: "claude-sonnet-4-20250514" })],
});

3. Observability

  • LangSmith tracing enabled (LANGSMITH_TRACING=true)
  • LANGCHAIN_CALLBACKS_BACKGROUND=true (non-serverless only)
  • Structured logging on all LLM/tool calls
  • Prometheus metrics exported (requests, latency, tokens, errors)
  • Alerting rules configured (error rate >5%, P95 latency >5s)
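
Structured logging on LLM calls can start as a thin wrapper around each invocation. A minimal sketch in plain TypeScript (the loggedInvoke helper and its JSON field names are illustrative assumptions, not a LangChain or LangSmith API):

```typescript
// Minimal structured-logging wrapper. Emits one JSON log line per LLM call
// with latency, token count, and status. Field names are illustrative.
type LlmResult = { text: string; totalTokens?: number };

async function loggedInvoke(
  modelName: string,
  call: () => Promise<LlmResult>,
): Promise<LlmResult> {
  const start = Date.now();
  try {
    const result = await call();
    console.log(JSON.stringify({
      event: "llm_call",
      model: modelName,
      latencyMs: Date.now() - start,
      totalTokens: result.totalTokens ?? null,
      status: "ok",
    }));
    return result;
  } catch (e) {
    console.log(JSON.stringify({
      event: "llm_call",
      model: modelName,
      latencyMs: Date.now() - start,
      status: "error",
      error: (e instanceof Error ? e.message : String(e)).slice(0, 200),
    }));
    throw e; // rethrow so callers still see the failure
  }
}
```

In a real deployment you would hang this off LangChain's callback system rather than wrapping call sites by hand; the JSON shape is what matters for downstream metrics and alerting.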

4. Performance

  • Caching enabled for repeated queries (Redis or SQLite)
  • maxConcurrency
    set on batch operations
  • Streaming enabled for user-facing responses
  • Connection pooling configured
  • Prompt length optimized (no unnecessary verbosity)
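
maxConcurrency caps how many requests a batch runs in parallel. A minimal sketch of the same idea in plain TypeScript (the batchWithLimit helper is illustrative; LangChain does this internally when you set maxConcurrency on .batch()):

```typescript
// Bounded-concurrency batching: at most `limit` calls in flight at once,
// results returned in input order. Illustrative sketch, not a LangChain API.
async function batchWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // claim the next index
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    worker,
  );
  await Promise.all(workers);
  return results;
}
```

Keeping the limit below your provider's rate limit avoids 429 storms during batch jobs.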

5. Security

  • User input isolated in human messages (never in system prompts)
  • Input length limits enforced
  • Prompt injection patterns logged/flagged
  • Tools restricted to allowlisted operations
  • LLM output validated before display (no PII/key leakage)
  • Audit logging on all LLM and tool calls
  • Rate limiting per user/IP
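
Input length limits and injection-pattern flagging can be combined into one pre-flight check before user text reaches the model. A minimal sketch (the character limit and pattern list are illustrative defaults, not a standard):

```typescript
// Pre-flight user-input check: enforce a length cap and flag common
// prompt-injection phrasings. Limit and patterns are illustrative defaults.
const MAX_INPUT_CHARS = 4000;
const INJECTION_PATTERNS = [
  /ignore (all |previous )?instructions/i,
  /reveal (the |your )?system prompt/i,
];

function checkUserInput(input: string): { ok: boolean; reason?: string } {
  if (input.length > MAX_INPUT_CHARS) {
    return { ok: false, reason: "too_long" };
  }
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(input)) {
      return { ok: false, reason: "injection_pattern" };
    }
  }
  return { ok: true };
}
```

Flagged inputs should be logged for review rather than silently dropped; pattern lists catch only known phrasings, so this complements (not replaces) keeping user input out of system prompts.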

6. Testing

  • Unit tests for all chains (using FakeListChatModel, no API calls)
  • Integration tests with real LLMs (gated behind CI secrets)
  • RAG pipeline validation (retrieval relevance + no hallucination)
  • Tool unit tests (valid input, invalid input, error cases)
  • Load testing completed (concurrent users, batch operations)
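
The FakeListChatModel approach boils down to a stub model that returns canned responses, so chain logic is tested without network calls. A minimal stand-in sketch (FakeChatModel and summarize are illustrative, not LangChain classes):

```typescript
// Stand-in for FakeListChatModel-style testing: returns canned responses
// in order, cycling when exhausted. No API calls, fully deterministic.
class FakeChatModel {
  private i = 0;
  constructor(private responses: string[]) {}
  async invoke(_input: string): Promise<string> {
    const response = this.responses[this.i % this.responses.length];
    this.i++;
    return response;
  }
}

// Example chain under test: any object with invoke() can be substituted.
async function summarize(
  model: { invoke(s: string): Promise<string> },
  text: string,
): Promise<string> {
  return model.invoke(`Summarize: ${text}`);
}
```

Because the chain depends only on an invoke() interface, the same test shape works whether the real model is OpenAI, Anthropic, or a fallback chain.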

7. Deployment

  • Health check endpoint returns LLM connectivity status
  • Graceful shutdown handles in-flight requests
  • Rolling deployment (zero downtime)
  • Rollback procedure documented and tested
  • Container resource limits set (memory, CPU)
// Health check endpoint
app.get("/health", async (_req, res) => {
  const checks: Record<string, string> = { server: "ok" };

  try {
    await model.invoke("ping");
    checks.llm = "ok";
  } catch (e: any) {
    checks.llm = `error: ${e.message.slice(0, 100)}`;
  }

  const healthy = Object.values(checks).every((v) => v === "ok");
  res.status(healthy ? 200 : 503).json({ status: healthy ? "healthy" : "degraded", checks });
});

// Graceful shutdown
process.on("SIGTERM", async () => {
  console.log("Shutting down gracefully...");
  server.close(() => process.exit(0));
  setTimeout(() => process.exit(1), 10000); // force after 10s
});

8. Cost Management

  • Token usage tracking callback attached
  • Daily/monthly budget limits enforced
  • Model tiering: cheap model for simple tasks, powerful for complex
  • Cost alerts configured (Slack/email on threshold)
  • Cost per user/tenant tracked
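
Budget enforcement can start as a simple in-process counter consulted before each call. A minimal sketch (the TokenBudget class, its daily limit, and the lack of a reset/persistence policy are illustrative assumptions):

```typescript
// Daily token budget guard: record usage after each call, check allow()
// before the next one. Illustrative sketch; production use needs persistence
// and a daily reset.
class TokenBudget {
  private used = 0;
  constructor(private dailyLimit: number) {}

  record(tokens: number): void {
    this.used += tokens;
  }

  get remaining(): number {
    return Math.max(0, this.dailyLimit - this.used);
  }

  allow(estimatedTokens: number): boolean {
    return this.used + estimatedTokens <= this.dailyLimit;
  }
}
```

Wiring record() into the same callback that does token tracking keeps the budget and the metrics from drifting apart.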

Pre-Launch Validation Script

// Reuses ProdConfig and model from the sections above.
async function validateProduction() {
  const results: Record<string, string> = {};

  // 1. Config
  try {
    ProdConfig.parse(process.env);
    results["Config"] = "PASS";
  } catch { results["Config"] = "FAIL: missing env vars"; }

  // 2. LLM connectivity
  try {
    await model.invoke("ping");
    results["LLM"] = "PASS";
  } catch (e: any) { results["LLM"] = `FAIL: ${e.message.slice(0, 50)}`; }

  // 3. Fallback (fallback = your secondary model, e.g. a ChatAnthropic instance)
  try {
    const fallbackModel = model.withFallbacks({ fallbacks: [fallback] });
    await fallbackModel.invoke("ping");
    results["Fallback"] = "PASS";
  } catch { results["Fallback"] = "FAIL"; }

  // 4. LangSmith
  results["LangSmith"] = process.env.LANGSMITH_TRACING === "true" ? "PASS" : "WARN: disabled";

  // 5. Health endpoint
  try {
    const res = await fetch("http://localhost:8000/health");
    results["Health"] = res.ok ? "PASS" : "FAIL";
  } catch { results["Health"] = "FAIL: not reachable"; }

  console.table(results);
  const allPass = Object.values(results).every((v) => v === "PASS");
  console.log(allPass ? "READY FOR PRODUCTION" : "ISSUES FOUND - FIX BEFORE LAUNCH");
  return allPass;
}

Error Handling

Issue | Cause | Fix
API key missing at startup | Secrets not mounted | Check deployment config
No fallback on outage | .withFallbacks() not configured | Add fallback model
LangSmith trace gaps | Background callbacks in serverless | Set LANGCHAIN_CALLBACKS_BACKGROUND=false
Cache miss storm | Redis down | Implement graceful degradation
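
For the cache-miss-storm case, graceful degradation means a cache failure falls through to the underlying call instead of failing the request. A minimal sketch (all names are illustrative; cacheGet/cacheSet stand in for your Redis client):

```typescript
// Cache wrapper that degrades gracefully: a cache read or write error is
// swallowed and the request is served by the underlying call instead.
async function cachedCall<T>(
  key: string,
  cacheGet: (k: string) => Promise<T | null>,
  cacheSet: (k: string, v: T) => Promise<void>,
  compute: () => Promise<T>,
): Promise<T> {
  try {
    const hit = await cacheGet(key);
    if (hit !== null) return hit;
  } catch {
    // cache backend down: degrade to a direct call
  }
  const value = await compute();
  try {
    await cacheSet(key, value);
  } catch {
    // ignore cache write failures; the response is still correct
  }
  return value;
}
```

Pair this with a circuit breaker or short cacheGet timeout so a slow (rather than dead) Redis does not add latency to every request.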

Resources

Next Steps

After launch, use langchain-observability for monitoring and langchain-incident-runbook for incident response.