Five production prompt-caching patterns for the Claude API: system-prompt caching, document caching, conversation caching, tool caching, and tiered caching.
Claude's prompt caching prices cached reads at 10% of the input rate — a 90% discount on whatever you can keep stable across requests. The mechanism is simple (cache_control: {type: "ephemeral"} on a content block), but the strategy is where most teams leave money on the table. Here are the five patterns that account for almost all production caching wins, each with the workload it fits and an example.
The default first move. If your system prompt is >1024 tokens (the minimum cacheable size) and reused across requests in a 5-minute window, mark it.
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
system=[
{"type": "text",
"text": LONG_SYSTEM_PROMPT, # 4k+ tokens
"cache_control": {"type": "ephemeral"}}
],
messages=conversation,
)
Typical savings: 60–85% on input tokens for chat apps. Fits: assistants, agents, support bots.
When users ask multiple questions about the same document, cache the document content rather than re-embedding/reseeding it.
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
messages=[
{"role": "user", "content": [
{"type": "document",
"source": {"type": "text", "media_type": "text/plain", "data": doc_text},
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_question}
]}
],
)
Typical savings: 85–95% on a 50k-token PDF asked 20 questions about. Fits: document Q&A, contract review, codebase exploration.
In long multi-turn chats, mark the cache breakpoint at the most recent stable turn. Each subsequent turn reuses the prefix at 10%, paying full price only for the new user message and assistant response.
messages = [
{"role": "user", "content": "Hello, I need help with X"},
{"role": "assistant", "content": prior_response_1},
{"role": "user", "content": "Now what about Y",
"cache_control": {"type": "ephemeral"}}, # last stable point
{"role": "assistant", "content": prior_response_2},
{"role": "user", "content": new_question}, # NEW, uncached
]
Typical savings: 50–70% on long agent traces. Fits: coding agents, support tickets with long history.
Tool-use applications often pass 5–50 tool definitions on every call. Tool blocks support cache_control too — mark the last tool to cache the entire tool array.
tools = [
{"name": "search", "description": "...", "input_schema": {...}},
{"name": "fetch_url", "description": "...", "input_schema": {...}},
# ... 20 more tools
{"name": "last_tool", "description": "...", "input_schema": {...},
"cache_control": {"type": "ephemeral"}}
]
Typical savings: 70–90% on agentic workloads. Fits: agents with rich tool surfaces (research, coding, browser agents).
Claude supports up to 4 cache breakpoints per request. Use this when different prefix portions have different stability:
Each breakpoint creates a new cache layer. The longest-running cache (tools) keeps paying off across all users; the shortest (conversation) just covers the current session.
Beyond the default 5-minute ephemeral cache, Anthropic offers a 1-hour cache at a higher write cost (~50% more than the 5-min cache, vs 25% normally). Worth it when the same prefix is reused over hours but not minutes — e.g. a daily-batch eval pipeline or a low-traffic internal tool.
Every response returns usage.cache_creation_input_tokens and usage.cache_read_input_tokens. Log both. On the first request you'll see cache_creation populate; on subsequent requests within the TTL, cache_read should be non-zero. If it stays zero, your content block changed (whitespace, ordering, anything).
Combine with model routing per cost optimization, and quantify savings in the Cost Calculator.