How to handle HTTP 429 from the Claude API in production. Exponential backoff, retry-after, jitter, and the patterns that keep your traffic flowing without DDoSing yourself.
HTTP 429 from the Claude API means you have exceeded a rate-limit window — usually requests-per-minute (RPM), tokens-per-minute (TPM), or tokens-per-day (TPD) for your tier. The official SDKs auto-retry, but production traffic needs a deliberate strategy. This page is the working playbook.
HTTP/1.1 429 Too Many Requests
retry-after: 17
anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2026-05-19T03:18:45Z
anthropic-ratelimit-tokens-limit: 400000
anthropic-ratelimit-tokens-remaining: 0
anthropic-ratelimit-tokens-reset: 2026-05-19T03:18:32Z
{"type":"error","error":{"type":"rate_limit_error","message":"Number of request tokens has exceeded your per-minute rate limit"}}
The retry-after header is the authoritative wait time. The anthropic-ratelimit-* headers expose your real-time budget so you can throttle proactively.
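As a rough sketch of how these headers might be read in code, the helper below turns a header mapping into a small snapshot object. The names `RateLimitSnapshot` and `parse_rate_limit_headers` are hypothetical, and the header names and value formats are taken from the sample response above, not from a full specification.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Mapping, Optional


@dataclass
class RateLimitSnapshot:
    """Hypothetical container for the rate-limit headers shown above."""
    retry_after_s: Optional[float]
    requests_remaining: Optional[int]
    tokens_limit: Optional[int]
    tokens_remaining: Optional[int]
    tokens_reset: Optional[datetime]


def _to_int(value: Optional[str]) -> Optional[int]:
    return int(value) if value is not None else None


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitSnapshot:
    # HTTP header names are case-insensitive, so normalise the keys first.
    h = {k.lower(): v for k, v in headers.items()}
    retry_after = h.get("retry-after")
    reset = h.get("anthropic-ratelimit-tokens-reset")
    return RateLimitSnapshot(
        retry_after_s=float(retry_after) if retry_after is not None else None,
        requests_remaining=_to_int(h.get("anthropic-ratelimit-requests-remaining")),
        tokens_limit=_to_int(h.get("anthropic-ratelimit-tokens-limit")),
        tokens_remaining=_to_int(h.get("anthropic-ratelimit-tokens-remaining")),
        # Assumes an ISO 8601 timestamp with a trailing "Z", as in the sample.
        tokens_reset=(
            datetime.fromisoformat(reset.replace("Z", "+00:00")) if reset else None
        ),
    )
```

Missing headers come back as `None`, so callers can tell "no budget left" (`0`) apart from "header not sent".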
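import random
import time

import anthropic

# Note: the SDK also retries on its own (max_retries); pass max_retries=0
# to the client if this loop should be the only retry layer.
client = anthropic.Anthropic()

def call_with_backoff(messages, attempts=6):
    for n in range(attempts):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages,
            )
        except anthropic.RateLimitError as e:
            if n == attempts - 1:
                break  # out of attempts: don't sleep before giving up
            # Honour retry-after when present; otherwise back off exponentially.
            wait = float(e.response.headers.get("retry-after", 2 ** n))
            # Additive jitter (up to +50%), so the wait never undercuts retry-after.
            wait += random.uniform(0, 0.5 * wait)
            time.sleep(wait)
    raise RuntimeError("Exceeded retry budget")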
Three rules: (1) honour retry-after when present, (2) exponential backoff when absent, (3) add jitter so concurrent workers don't synchronise their retries and re-stampede the API.
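One way to make the three rules testable without sleeping is to isolate the wait calculation in a pure function. This is a minimal sketch: `backoff_delay`, `base`, and `cap` are hypothetical names, and the 60-second ceiling on the exponential branch is an assumption added here, not something the API prescribes.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Return the number of seconds to wait before retry number `attempt`."""
    if retry_after is not None:
        # Rule 1: the server's value wins and is never shortened.
        wait = float(retry_after)
    else:
        # Rule 2: exponential backoff, capped (assumed ceiling) so late
        # attempts do not wait for minutes.
        wait = min(cap, base * 2 ** attempt)
    # Rule 3: additive jitter of up to 50% spreads concurrent workers apart.
    return wait + random.uniform(0, 0.5 * wait)
```

Because the jitter is only ever added, the result is always at least `retry_after` when the header is present.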
Reactive 429-handling burns the first request that crosses the line. For high-throughput services, read anthropic-ratelimit-tokens-remaining on every response and pause new requests when it drops below your safety margin (e.g. 10% of limit). This converts 429s from errors into queued waits — a cleaner observability signal.
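A minimal sketch of that check, assuming the header names and timestamp format from the sample response earlier on this page. The function name `seconds_to_pause` and the `safety_fraction` parameter are hypothetical.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional


def seconds_to_pause(
    headers: Mapping[str, str],
    safety_fraction: float = 0.10,
    now: Optional[datetime] = None,
) -> float:
    """Return how long to hold new requests; 0.0 means proceed."""
    try:
        limit = int(headers["anthropic-ratelimit-tokens-limit"])
        remaining = int(headers["anthropic-ratelimit-tokens-remaining"])
        reset = datetime.fromisoformat(
            headers["anthropic-ratelimit-tokens-reset"].replace("Z", "+00:00")
        )
    except (KeyError, ValueError):
        # Headers missing or malformed: do not throttle on guesswork.
        return 0.0
    if remaining >= safety_fraction * limit:
        return 0.0  # still above the safety margin
    now = now or datetime.now(timezone.utc)
    # Below the margin: hold new work until the token window resets.
    return max(0.0, (reset - now).total_seconds())
```

In the Python SDK, one way to get at response headers is the raw-response accessor, for example `raw = client.messages.with_raw_response.create(...)` followed by `raw.headers` and `raw.parse()`; treat that as an assumption to verify against your SDK version.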
| Pattern | What it does | When to reach for it |
|---|---|---|
| Move to Batch API | 50% discount, separate rate-limit pool | Any non-real-time workload >1k requests/day |
| Route low-difficulty traffic to Haiku | Higher TPM ceiling per tier | Classification, extraction, simple summaries |
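| Prompt caching | Cached reads are billed at about 10% of the base input price and, on most current models, do not count toward the input-token rate limit (check the rate-limit documentation for your model); net effect is more usable budget per dollar | Long system prompts, repeated context |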
| Token-aware queueing | Schedule requests against a token-budget rather than a request-count budget | Heterogeneous request sizes |
| Tier-up | Sustained spend automatically promotes the account | Long-term ceiling, not acute spikes |
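To illustrate the token-aware queueing row, here is a minimal token-bucket sketch that schedules against a tokens-per-minute budget instead of a request count. The class `TokenBudget` and its methods are hypothetical, and the per-request token estimate is assumed to come from the caller (for example, a prompt-length heuristic plus `max_tokens`).

```python
import time


class TokenBudget:
    """Token bucket refilled continuously at `tokens_per_minute`."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0  # tokens regained per second
        self.available = self.capacity
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        gained = (now - self.last) * self.rate
        self.available = min(self.capacity, self.available + gained)
        self.last = now

    def wait_time(self, estimated_tokens: int) -> float:
        """Seconds until `estimated_tokens` can be spent; 0.0 if affordable now."""
        self._refill()
        deficit = estimated_tokens - self.available
        return max(0.0, deficit / self.rate)

    def spend(self, estimated_tokens: int) -> bool:
        """Deduct the tokens if affordable and report whether that succeeded."""
        if estimated_tokens > self.capacity:
            raise ValueError("request is larger than the whole per-minute budget")
        self._refill()
        if estimated_tokens > self.available:
            return False
        self.available -= estimated_tokens
        return True
```

A worker would call `wait_time()` before dispatching, sleep for that long, then `spend()`. Large and small requests then share one budget fairly, which a plain request counter cannot do.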
- retry-after value
- anthropic-ratelimit-tokens-remaining on every 2xx response (for trend graphs)

Forecast whether your workload is rate-limit safe with the Claude Cost Calculator, or compare your model choice against tier ceilings on Rate Limits by Tier.