Claude API Caching Strategies

Five production prompt-caching patterns for the Claude API: system-prompt caching, document caching, conversation caching, tool caching, and tiered caching.

🔥 Launch tonight — Power Prompts PDF 50p (just 50p tonight)30 battle-tested Claude Code prompts · 8 pages · paste into CLAUDE.md · price reverts to £5

Claude's prompt caching prices cached reads at 10% of the input rate — a 90% discount on whatever you can keep stable across requests. The mechanism is simple (cache_control: {type: "ephemeral"} on a content block), but the strategy is where most teams leave money on the table. Here are the five patterns that account for almost all production caching wins, each with the workload it fits and an example.

Pattern 1: System-prompt caching

The default first move. If your system prompt is >1024 tokens (the minimum cacheable size) and reused across requests in a 5-minute window, mark it.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=[
        {"type": "text",
         "text": LONG_SYSTEM_PROMPT,  # 4k+ tokens
         "cache_control": {"type": "ephemeral"}}
    ],
    messages=conversation,
)

Typical savings: 60–85% on input tokens for chat apps. Fits: assistants, agents, support bots.

Pattern 2: Document caching for RAG and Q&A

When users ask multiple questions about the same document, cache the document content rather than re-embedding/reseeding it.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[
        {"role": "user", "content": [
            {"type": "document",
             "source": {"type": "text", "media_type": "text/plain", "data": doc_text},
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": user_question}
        ]}
    ],
)

Typical savings: 85–95% on a 50k-token PDF asked 20 questions about. Fits: document Q&A, contract review, codebase exploration.

Pattern 3: Conversation prefix caching

In long multi-turn chats, mark the cache breakpoint at the most recent stable turn. Each subsequent turn reuses the prefix at 10%, paying full price only for the new user message and assistant response.

messages = [
    {"role": "user", "content": "Hello, I need help with X"},
    {"role": "assistant", "content": prior_response_1},
    {"role": "user", "content": "Now what about Y",
     "cache_control": {"type": "ephemeral"}},  # last stable point
    {"role": "assistant", "content": prior_response_2},
    {"role": "user", "content": new_question},  # NEW, uncached
]

Typical savings: 50–70% on long agent traces. Fits: coding agents, support tickets with long history.

Pattern 4: Tool definition caching

Tool-use applications often pass 5–50 tool definitions on every call. Tool blocks support cache_control too — mark the last tool to cache the entire tool array.

tools = [
    {"name": "search", "description": "...", "input_schema": {...}},
    {"name": "fetch_url", "description": "...", "input_schema": {...}},
    # ... 20 more tools
    {"name": "last_tool", "description": "...", "input_schema": {...},
     "cache_control": {"type": "ephemeral"}}
]

Typical savings: 70–90% on agentic workloads. Fits: agents with rich tool surfaces (research, coding, browser agents).

Pattern 5: Tiered cache breakpoints (up to 4)

Claude supports up to 4 cache breakpoints per request. Use this when different prefix portions have different stability:

  1. Breakpoint 1: tool definitions (changes weekly)
  2. Breakpoint 2: system prompt (changes daily)
  3. Breakpoint 3: documents loaded for this user (changes per session)
  4. Breakpoint 4: conversation history (changes per turn)

Each breakpoint creates a new cache layer. The longest-running cache (tools) keeps paying off across all users; the shortest (conversation) just covers the current session.

The 1-hour extended cache

Beyond the default 5-minute ephemeral cache, Anthropic offers a 1-hour cache at a higher write cost (~50% more than the 5-min cache, vs 25% normally). Worth it when the same prefix is reused over hours but not minutes — e.g. a daily-batch eval pipeline or a low-traffic internal tool.

How to verify it's working

Every response returns usage.cache_creation_input_tokens and usage.cache_read_input_tokens. Log both. On the first request you'll see cache_creation populate; on subsequent requests within the TTL, cache_read should be non-zero. If it stays zero, your content block changed (whitespace, ordering, anything).

Combine with model routing per cost optimization, and quantify savings in the Cost Calculator.

Frequently asked questions

What is the cheapest Claude API caching strategy?
Tiered breakpoints — combining tool, system-prompt, document, and conversation-prefix caching in a single request. Each layer hits independently and stacks. On agentic workloads with rich tools and long history, tiered caching routinely reaches 85–95% effective discounts on input tokens.
How long does Claude prompt caching last?
Two TTLs in 2026: the default 5-minute ephemeral cache (each access resets the timer) and a 1-hour extended cache at a ~25% higher write surcharge. Choose the 5-minute cache for active sessions and the 1-hour cache for low-frequency reuse like batch evals or daily reports.
Why are my Claude cache reads zero?
Almost always because the cached content block changed between requests. Caching is byte-exact: an extra space, a different message ordering, or a single character edit invalidates the cache. Inspect the exact request bytes (not your in-app structure) when debugging — and ensure the cache_control is on the same block in the same position each time.
Does Claude caching work with tool use?
Yes. Tool definitions in the tools array support cache_control on the final tool, which caches the entire tool block. This is especially valuable for agents with 10+ tools where the array dominates the input token count.

Free tools

Cost Calculator → Prompt-Pricing Recommender → Diff Summarizer → Skills Browser →

Related

Claude Opus 4.7 vs Sonnet 4.6 Pricing (2026 Comparison)How Much Does Claude Cost? (2026 API Pricing Guide)Claude Prompt Caching: 90% Cost Savings Explained (2026)Claude API Cost Calculator: Estimate Your Anthropic BillClaude vs GPT-4 Pricing: 2026 API Cost Comparison