Anthropic offers two distinct cost-reduction levers on top of base API pricing: prompt caching (a 90% discount on cached portions of a prompt) and the Batch API (a 50% discount on the whole request, with a 24-hour SLA). They are not mutually exclusive — but they apply to very different workloads.
How each works
Prompt caching
You mark a prefix of your prompt (system message, document context, tool definitions) as cacheable.
First request: pay cache-write price (~25% more than input).
Subsequent requests within the cache TTL (5 min or 1 hour): pay cached-read price = 10% of normal input.
Net savings appear after 2–3 reuses.
Batch API
Submit a JSONL file of up to 10,000 requests.
Get results within 24 hours (often much faster).
Pay 50% of standard pricing on input and output.
No real-time response — purely async.
When each saves more
Workload
Better lever
Why
Real-time chat with shared system prompt
Caching
Reuse pays off in 3 requests
Offline classification of 100k docs
Batch
50% off whole job, no SLA constraint
RAG with same retrieved chunks
Caching
Cache the chunks once
Eval suite on a model upgrade
Batch
Save 50% on a one-shot offline run
Long-context document Q&A (sync)
Caching
Document context cached across user questions
Nightly data enrichment pipeline
Batch
Async by definition
Can you stack them?
No — Batch API requests do not use cache. Pick the right lever for the workload. If you need sync latency, use caching. If async is acceptable, Batch usually wins because 50% off the entire request beats 90% off only the cached portion when the cached portion is small.
Can I use prompt caching and the Batch API at the same time?
No. Batch API requests do not benefit from prompt caching. For each workload, pick the lever that fits — caching for sync flows with repeated prefixes, Batch for async workloads with no real-time requirement.
How long does Claude's prompt cache last?
Anthropic offers two TTLs: 5-minute (default, cheaper cache-write price) and 1-hour (~2× cache-write cost but lasts 12× longer). Pick 1-hour for slow-paced chat sessions; 5-minute for high-frequency workloads.
Is the Batch API really 50% off?
Yes. Both input and output tokens are charged at 50% of standard real-time pricing. The only constraint is async delivery — your results land within 24 hours (often within minutes for small batches).