Claude API 429 Rate Limit: Backoff, Retry, and Survival (2026)

Claude API 429 — Rate-Limit Survival Guide

How to handle HTTP 429 from the Claude API in production. Exponential backoff, retry-after, jitter, and the patterns that keep your traffic flowing without DDoSing yourself.

HTTP 429 from the Claude API means you have exceeded a rate-limit window — usually requests-per-minute (RPM), tokens-per-minute (TPM), or tokens-per-day (TPD) for your tier. The official SDKs auto-retry, but production traffic needs a deliberate strategy. This page is the working playbook.

What a 429 response actually contains

HTTP/1.1 429 Too Many Requests retry-after: 17 anthropic-ratelimit-requests-limit: 1000 anthropic-ratelimit-requests-remaining: 0 anthropic-ratelimit-requests-reset: 2026-05-19T03:18:45Z anthropic-ratelimit-tokens-limit: 400000 anthropic-ratelimit-tokens-remaining: 0 anthropic-ratelimit-tokens-reset: 2026-05-19T03:18:32Z {"type":"error","error":{"type":"rate_limit_error","message":"Number of request tokens has exceeded your per-minute rate limit"}}

The retry-after header is the authoritative wait time. The anthropic-ratelimit-* headers expose your real-time budget so you can throttle proactively.

The minimum-correct backoff (Python)

import time, random import anthropic client = anthropic.Anthropic() def call_with_backoff(messages, attempts=6): for n in range(attempts): try: return client.messages.create( model="claude-sonnet-4-6", max_tokens=1024, messages=messages ) except anthropic.RateLimitError as e: wait = int(e.response.headers.get("retry-after", 2 ** n)) wait += random.uniform(0, 0.5 * wait) # full jitter time.sleep(wait) raise RuntimeError("Exceeded retry budget")

Three rules: (1) honour retry-after when present, (2) exponential backoff when absent, (3) add jitter so concurrent workers don't synchronise their retries and re-stampede the API.

Proactive throttling using ratelimit headers

Reactive 429-handling burns the first request that crosses the line. For high-throughput services, read anthropic-ratelimit-tokens-remaining on every response and pause new requests when it drops below your safety margin (e.g. 10% of limit). This converts 429s from errors into queued waits — a cleaner observability signal.

Five patterns that actually fix repeated 429s

Pattern	What it does	When to reach for it
Move to Batch API	50% discount, separate rate-limit pool	Any non-real-time workload >1k requests/day
Route low-difficulty traffic to Haiku	Higher TPM ceiling per tier	Classification, extraction, simple summaries
Prompt caching	Cached reads still count toward TPM but cost 10%; net effect is more usable budget per dollar	Long system prompts, repeated context
Token-aware queueing	Schedule requests against a token-budget rather than a request-count budget	Heterogeneous request sizes
Tier-up	Sustained spend automatically promotes the account	Long-term ceiling, not acute spikes

Pattern

What it does

When to reach for it

Move to Batch API

50% discount, separate rate-limit pool

Any non-real-time workload >1k requests/day

Route low-difficulty traffic to Haiku

Higher TPM ceiling per tier

Classification, extraction, simple summaries

Prompt caching

Cached reads still count toward TPM but cost 10%; net effect is more usable budget per dollar

Long system prompts, repeated context

Token-aware queueing

Schedule requests against a token-budget rather than a request-count budget

Heterogeneous request sizes

Tier-up

Sustained spend automatically promotes the account

Long-term ceiling, not acute spikes

What NOT to do

Observability — what to log

Frequently asked questions

What is the retry-after header on a Claude API 429?

It is the number of seconds the server suggests waiting before retrying. The official Anthropic SDKs honour it automatically. In custom HTTP clients, parse it and use that value (not a fixed sleep) as the base delay before adding jitter.

Should I retry a 429 immediately if there is no retry-after?

No. Use exponential backoff with full jitter: 1s, 2s, 4s, 8s, 16s, capped at ~30s, plus random jitter. Immediate retries from many concurrent workers cause a thundering herd that re-hits the limit.

Will Anthropic ban me for repeated 429s?

No. 429s are a normal signal of demand exceeding budget — they are designed to be retried. What does trigger abuse review is creating multiple API keys to circumvent org-level limits, or sending malformed traffic at high volume.

Does prompt caching help with 429 errors?

Indirectly. Cached reads still consume TPM, but at the cached-read rate. The dollar savings let you provision a higher effective budget per dollar. For pure TPM pressure, moving traffic to Batch API or Haiku is more direct.

How do I check my current Claude API rate-limit budget without making a real request?

Send a tiny messages.create call (one user message, max_tokens=1) and read the anthropic-ratelimit-* response headers. This costs ~$0.00001 and gives you live remaining-budget numbers for proactive throttling.

Free tools