Claude API 429 — Rate-Limit Survival Guide

How to handle HTTP 429 from the Claude API in production. Exponential backoff, retry-after, jitter, and the patterns that keep your traffic flowing without DDoSing yourself.

🔥 Launch tonight — Power Prompts PDF 50p (just 50p tonight)30 battle-tested Claude Code prompts · 8 pages · paste into CLAUDE.md · price reverts to £5

HTTP 429 from the Claude API means you have exceeded a rate-limit window — usually requests-per-minute (RPM), tokens-per-minute (TPM), or tokens-per-day (TPD) for your tier. The official SDKs auto-retry, but production traffic needs a deliberate strategy. This page is the working playbook.

What a 429 response actually contains

HTTP/1.1 429 Too Many Requests
retry-after: 17
anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2026-05-19T03:18:45Z
anthropic-ratelimit-tokens-limit: 400000
anthropic-ratelimit-tokens-remaining: 0
anthropic-ratelimit-tokens-reset: 2026-05-19T03:18:32Z

{"type":"error","error":{"type":"rate_limit_error","message":"Number of request tokens has exceeded your per-minute rate limit"}}

The retry-after header is the authoritative wait time. The anthropic-ratelimit-* headers expose your real-time budget so you can throttle proactively.

The minimum-correct backoff (Python)

import time, random
import anthropic

client = anthropic.Anthropic()

def call_with_backoff(messages, attempts=6):
    for n in range(attempts):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages
            )
        except anthropic.RateLimitError as e:
            wait = int(e.response.headers.get("retry-after", 2 ** n))
            wait += random.uniform(0, 0.5 * wait)  # full jitter
            time.sleep(wait)
    raise RuntimeError("Exceeded retry budget")

Three rules: (1) honour retry-after when present, (2) exponential backoff when absent, (3) add jitter so concurrent workers don't synchronise their retries and re-stampede the API.

Proactive throttling using ratelimit headers

Reactive 429-handling burns the first request that crosses the line. For high-throughput services, read anthropic-ratelimit-tokens-remaining on every response and pause new requests when it drops below your safety margin (e.g. 10% of limit). This converts 429s from errors into queued waits — a cleaner observability signal.

Five patterns that actually fix repeated 429s

PatternWhat it doesWhen to reach for it
Move to Batch API50% discount, separate rate-limit poolAny non-real-time workload >1k requests/day
Route low-difficulty traffic to HaikuHigher TPM ceiling per tierClassification, extraction, simple summaries
Prompt cachingCached reads still count toward TPM but cost 10%; net effect is more usable budget per dollarLong system prompts, repeated context
Token-aware queueingSchedule requests against a token-budget rather than a request-count budgetHeterogeneous request sizes
Tier-upSustained spend automatically promotes the accountLong-term ceiling, not acute spikes

What NOT to do

Observability — what to log

Forecast whether your workload is rate-limit safe with the Claude Cost Calculator, or compare your model choice against tier ceilings on Rate Limits by Tier.

Frequently asked questions

What is the retry-after header on a Claude API 429?
It is the number of seconds the server suggests waiting before retrying. The official Anthropic SDKs honour it automatically. In custom HTTP clients, parse it and use that value (not a fixed sleep) as the base delay before adding jitter.
Should I retry a 429 immediately if there is no retry-after?
No. Use exponential backoff with full jitter: 1s, 2s, 4s, 8s, 16s, capped at ~30s, plus random jitter. Immediate retries from many concurrent workers cause a thundering herd that re-hits the limit.
Will Anthropic ban me for repeated 429s?
No. 429s are a normal signal of demand exceeding budget — they are designed to be retried. What does trigger abuse review is creating multiple API keys to circumvent org-level limits, or sending malformed traffic at high volume.
Does prompt caching help with 429 errors?
Indirectly. Cached reads still consume TPM, but at the cached-read rate. The dollar savings let you provision a higher effective budget per dollar. For pure TPM pressure, moving traffic to Batch API or Haiku is more direct.
How do I check my current Claude API rate-limit budget without making a real request?
Send a tiny messages.create call (one user message, max_tokens=1) and read the anthropic-ratelimit-* response headers. This costs ~$0.00001 and gives you live remaining-budget numbers for proactive throttling.

Free tools

Cost Calculator → Prompt-Pricing Recommender → Diff Summarizer → Skills Browser →

Related

Claude Opus 4.7 vs Sonnet 4.6 Pricing (2026 Comparison)How Much Does Claude Cost? (2026 API Pricing Guide)Claude Prompt Caching: 90% Cost Savings Explained (2026)Claude API Cost Calculator: Estimate Your Anthropic BillClaude vs GPT-4 Pricing: 2026 API Cost Comparison