18 concrete items to verify before shipping a Claude API integration to production: keys, retries, caching, observability, fallbacks, and cost guardrails.
Most Claude API integrations break the same way: a 529 overload during a traffic spike, an API key in client-side bundles, a runaway agent burning $400 in an hour, or a model upgrade that silently changes output formats. This checklist is the 18 items we verify before any Claude integration goes live. Each item is binary — done or not done.
Authentication and key management
1. API key lives in a secret manager (AWS Secrets Manager, GCP Secret Manager, Vault), never in a .env file committed to git or in client-side JS bundles.
2. A separate API key per environment (dev, staging, prod), each in its own workspace in the Anthropic console so that each environment can carry its own spend limit.
3. Keys rotated on a schedule and on every team-member offboarding.
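As a rough sketch of items 1 and 2, the snippet below resolves the key for the current environment at startup and fails fast if it is missing. The variable naming scheme (ANTHROPIC_API_KEY_PROD and so on) and the load_api_key and redact helpers are assumptions made for illustration; in a real deployment the values would be injected by your secret manager rather than set by hand.

```python
import os

ALLOWED_ENVS = ("dev", "staging", "prod")


def load_api_key(env_name, environ=None):
    """Return the API key for one environment, or raise if it is absent.

    Assumes the secret manager injects one variable per environment,
    named ANTHROPIC_API_KEY_<ENV> (a hypothetical convention).
    """
    environ = os.environ if environ is None else environ
    if env_name not in ALLOWED_ENVS:
        raise ValueError(f"unknown environment: {env_name!r}")
    var = f"ANTHROPIC_API_KEY_{env_name.upper()}"
    key = environ.get(var, "").strip()
    if not key:
        # Fail at startup rather than on the first user request.
        raise RuntimeError(f"{var} is not set; check the secret manager binding")
    return key


def redact(key):
    """Short form that is safe to log: the last four characters only."""
    return "..." + key[-4:]
```

Passing the environment as a parameter keeps the function testable and makes it harder to read the prod key from a dev process by accident.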
Resilience
4. Retries with exponential backoff and jitter on 429 (rate limit) and 529 (overloaded). The official SDKs do this; verify it isn't disabled.
5. A hard request timeout (typically 60–120s) on every call — long enough for Opus on long context, short enough to surface hangs.
6. A circuit breaker that cuts traffic to Claude if >10% of requests fail for >60s, with a static fallback response.
7. Idempotency keys on any request that triggers a side effect downstream of the model output.
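Item 7 is the one teams most often leave abstract, so here is a minimal sketch of it. The key is derived from the logical operation (who, what, with which arguments) rather than from the model call, so a retried or duplicated call maps to the same key. The names idempotency_key and OnceRunner are hypothetical, and the in-memory dict stands in for a durable store with an atomic set-if-absent operation.

```python
import hashlib
import json


def idempotency_key(user_id, action, payload):
    """Deterministic key: the same logical operation always hashes the same."""
    canonical = json.dumps(
        {"user": user_id, "action": action, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class OnceRunner:
    """Runs a side effect at most once per key.

    The dict is an in-memory stand-in for a durable store; it is not
    safe across processes or restarts.
    """

    def __init__(self):
        self._results = {}

    def run(self, key, side_effect):
        if key in self._results:
            return self._results[key]  # replay: skip the side effect
        result = side_effect()
        self._results[key] = result
        return result
```

A refund triggered twice by a retried agent step would then execute once, and the second attempt would receive the stored result.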
Cost guardrails
8. max_tokens set conservatively per call type (e.g. 300 for chat, 1500 for code, 4000 for agents). Never the model's maximum output limit.
9. A daily/monthly spend alert wired to PagerDuty or Slack at 50%, 80%, and 100% of budget.
10. Prompt caching enabled on every reused system prompt or tool definition that meets the minimum cacheable length (about 1024 tokens on many models, higher on some).
11. Per-user rate limits in your own application layer to prevent a single account from running up the bill.
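One common way to implement item 11 is a token bucket per user. The sketch below is an assumption about how you might do it, not a prescribed design: the class names, the capacity, and the refill rate are illustrative, and the buckets live in process memory, so a multi-instance service would need a shared store instead.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_sec`."""

    def __init__(self, capacity, refill_per_sec, now=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self._now = now
        self._tokens = float(capacity)
        self._last = now()

    def allow(self, cost=1.0):
        t = self._now()
        elapsed = t - self._last
        self._last = t
        # Refill for the elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class PerUserLimiter:
    """One bucket per user ID, created lazily on first request."""

    def __init__(self, capacity, refill_per_sec, now=time.monotonic):
        self._args = (capacity, refill_per_sec, now)
        self._buckets = {}

    def allow(self, user_id):
        bucket = self._buckets.get(user_id)
        if bucket is None:
            bucket = self._buckets[user_id] = TokenBucket(*self._args)
        return bucket.allow()
```

Call allow(user_id) before each Claude request and return HTTP 429 from your own API when it is False. The injectable clock exists only to make the limiter testable.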
Observability
12. Every request logged with: model, input tokens, output tokens, cache-read tokens, latency, and a request-correlation ID.
13. Dashboards for p50/p95/p99 latency per model and total spend per day.
14. Sampling of full prompt + response pairs (e.g. 1%) into a queryable store for offline eval and debugging.
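Items 12 and 14 can share one small helper. The sketch below emits one structured log line per request and decides deterministically whether to keep the full prompt and response, by hashing the correlation ID. The field names and both function names are illustrative assumptions; map them to whatever your logging pipeline and the API's usage object actually provide.

```python
import hashlib
import json


def should_sample(correlation_id, rate=0.01):
    """Deterministic sampling: hash the ID into [0, 1) and compare to rate.

    The same request is always either sampled or not, which keeps
    retries and multi-service traces consistent.
    """
    digest = hashlib.sha256(correlation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def request_log_line(correlation_id, model, input_tokens, output_tokens,
                     cache_read_tokens, latency_ms):
    """One JSON object per request, with the fields from item 12."""
    record = {
        "correlation_id": correlation_id,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cache_read_tokens": cache_read_tokens,
        "latency_ms": latency_ms,
        "sampled": should_sample(correlation_id),
    }
    return json.dumps(record, sort_keys=True)
```

Hash-based sampling is a design choice rather than a requirement; a random draw per request works too, but it cannot be reproduced later when you need to know why a given request was or was not stored.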
Model and prompt safety
15. Model version pinned (e.g. claude-sonnet-4-6) — never an unversioned alias in production.
16. An eval suite of 50+ prompts that runs on every model version bump, with regression gates on accuracy and cost.
17. Input validation and output schema validation. Reject obviously malicious prompts at the edge; parse model output against a Pydantic / Zod schema and re-prompt or fail closed on schema violations.
18. A documented rollback path: how to swap models, disable Claude entirely, and serve a fallback if Anthropic has an outage.
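The output half of item 17 might look like the sketch below. To stay dependency-free it uses a hand-rolled check from the standard library instead of Pydantic or Zod, so treat it as a stand-in for a real schema library. The schema fields, the call_model callable, and the re-prompt wording are all hypothetical.

```python
import json

# Hypothetical schema for a sentiment classifier's reply.
REQUIRED_FIELDS = {"sentiment": str, "confidence": float}


class SchemaViolation(Exception):
    pass


def parse_output(raw):
    """Parse model text as JSON and check required fields and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaViolation(f"not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaViolation("top-level value must be an object")
    for name, typ in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), typ):
            raise SchemaViolation(f"field {name!r} missing or not {typ.__name__}")
    return data


def call_with_validation(call_model, prompt, max_attempts=2):
    """Re-prompt on a schema violation, then fail closed.

    `call_model` is any callable taking a prompt string and returning
    the model's text; it stands in for your real client wrapper.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_output(call_model(prompt))
        except SchemaViolation as exc:
            last_error = exc
            prompt = f"{prompt}\n\nYour previous reply was invalid ({exc}). Reply with JSON only."
    raise SchemaViolation(f"failed closed after {max_attempts} attempts: {last_error}")
```

Failing closed means the caller sees an exception and serves its fallback; the malformed output never reaches downstream code.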
What is the most overlooked Claude API production issue?
Unpinned model versions. Teams ship with whatever the docs example used, then a model update silently changes output formatting or token counts and breaks downstream parsers. Always pin the exact version string (e.g. claude-sonnet-4-6) and gate version bumps on a full eval-suite pass.
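A version-bump gate can be as small as a comparison between two eval runs. In the sketch below, the EvalResult shape, the function name, and the thresholds (one point of accuracy, 10% of cost) are assumptions chosen for illustration; the eval suite itself is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model: str
    accuracy: float   # fraction of eval prompts passed, 0..1
    cost_usd: float   # total cost of running the suite


def gate_version_bump(baseline, candidate,
                      max_accuracy_drop=0.01, max_cost_increase=0.10):
    """Return a list of failure reasons; an empty list means the bump passes."""
    failures = []
    if candidate.accuracy < baseline.accuracy - max_accuracy_drop:
        failures.append(
            f"accuracy fell from {baseline.accuracy:.3f} to {candidate.accuracy:.3f}"
        )
    if candidate.cost_usd > baseline.cost_usd * (1 + max_cost_increase):
        failures.append(
            f"cost rose from ${baseline.cost_usd:.2f} to ${candidate.cost_usd:.2f}"
        )
    return failures
```

Run it in CI on the pull request that changes the pinned version string and block the merge whenever the returned list is non-empty.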
Do I need a circuit breaker for the Claude API?
If Claude is on a critical path (user-facing chat, agent runtime), yes. Anthropic occasionally has region-wide capacity events that trigger 529 storms; a circuit breaker that opens for 30–60 seconds after a fault rate threshold prevents your service from cascading the failure to users.
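A minimal circuit breaker along these lines might look like the following. It is a simplified sketch: it trips on the failure ratio inside a sliding window rather than on a sustained fault rate, and it has no half-open state, so the first call after the cooldown goes straight through. The class name, the thresholds, and the guarded_call helper are all assumptions.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_ratio=0.10, window_s=60.0, open_s=30.0,
                 min_calls=20, now=time.monotonic):
        self.failure_ratio = failure_ratio
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls   # avoid tripping on a tiny sample
        self._now = now
        self._events = deque()       # (timestamp, succeeded)
        self._opened_at = None

    def allow(self):
        """True if a request may be sent to the upstream API."""
        if self._opened_at is None:
            return True
        if self._now() - self._opened_at >= self.open_s:
            # Cooldown over: close and start with a clean window.
            self._opened_at = None
            self._events.clear()
            return True
        return False

    def record(self, succeeded):
        t = self._now()
        self._events.append((t, succeeded))
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= t - self.window_s:
            self._events.popleft()
        total = len(self._events)
        failures = sum(1 for _, ok in self._events if not ok)
        if total >= self.min_calls and failures / total > self.failure_ratio:
            self._opened_at = t


def guarded_call(breaker, call, fallback):
    """Return call() unless the breaker is open or the call raises."""
    if not breaker.allow():
        return fallback
    try:
        result = call()
    except Exception:
        breaker.record(False)
        return fallback
    breaker.record(True)
    return result
```

Here fallback is the static response you serve while the breaker is open, for example a canned "assistant is temporarily unavailable" message.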
How do I prevent runaway Claude API spend?
Three layers: (1) a per-call max_tokens cap, (2) per-user rate limits at the application layer, and (3) Anthropic console budget alerts at 50/80/100% of monthly spend wired to PagerDuty. Most production cost incidents are agents looping without a max-iteration guard — cap agent depth explicitly.
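The max-iteration guard is a few lines of code. The sketch below caps both the number of agent steps and the cumulative token count; the step callable, its return shape, and the default limits are hypothetical and would be replaced by your own agent loop and budgets.

```python
class BudgetExceeded(Exception):
    pass


def run_agent(step, max_iterations=10, max_total_tokens=50_000):
    """Drive an agent loop with hard caps on depth and token usage.

    `step(i)` is a hypothetical callable that performs one model call
    plus tool execution and returns (done, tokens_used).
    Returns (steps_taken, total_tokens) on success.
    """
    total_tokens = 0
    for i in range(max_iterations):
        done, tokens_used = step(i)
        total_tokens += tokens_used
        if done:
            return i + 1, total_tokens
        if total_tokens >= max_total_tokens:
            raise BudgetExceeded(f"token budget hit after {i + 1} steps")
    raise BudgetExceeded(f"no result after {max_iterations} iterations")
```

Raising instead of returning a partial result forces the caller to decide what the user sees when an agent is cut off.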
Should I retry on Claude API 529 errors?
Yes. 529 (overloaded) is retriable with exponential backoff. The official SDKs retry up to 2 times by default; increase to 4–6 with full jitter in latency-tolerant paths. Do not add retries for 4xx errors such as 400, 401, or 403, since those indicate request-side bugs that retrying won't fix; 429 is the main 4xx status worth retrying.
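For code paths that do not go through an official SDK, exponential backoff with full jitter can be sketched as follows. ApiError is a hypothetical stand-in for whatever exception your HTTP layer raises, and the base delay, cap, and retry count are illustrative. If you use the official Python SDK, its max_retries client option covers the same ground.

```python
import random
import time

RETRIABLE_STATUS = {429, 529}


class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: a uniform draw from [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retries(call, max_retries=4, sleep=time.sleep, rng=random.random):
    """Call `call()`, retrying only on 429 and 529, up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ApiError as exc:
            if exc.status not in RETRIABLE_STATUS or attempt == max_retries:
                raise  # non-retriable, or out of attempts
            sleep(backoff_delay(attempt, rng=rng))
```

Full jitter spreads retries from many clients across the whole backoff interval, which helps avoid synchronized retry waves during an overload event.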