Claude API Caching Strategies: 5 Patterns That Cut Costs 70-95%

Five production prompt-caching patterns for the Claude API: system-prompt caching, document caching, conversation caching, tool caching, and tiered caching.

Claude's prompt caching prices cached reads at 10% of the input rate — a 90% discount on whatever you can keep stable across requests. The mechanism is simple (cache_control: {type: "ephemeral"} on a content block), but the strategy is where most teams leave money on the table. Here are the five patterns that account for almost all production caching wins, each with the workload it fits and an example.

Pattern 1: System-prompt caching

The default first move. If your system prompt is >1024 tokens (the minimum cacheable size) and reused across requests in a 5-minute window, mark it.

Typical savings: 60–85% on input tokens for chat apps. Fits: assistants, agents, support bots.

Pattern 2: Document caching for RAG and Q&A

When users ask multiple questions about the same document, cache the document content rather than re-embedding/reseeding it.

Typical savings: 85–95% on a 50k-token PDF asked 20 questions about. Fits: document Q&A, contract review, codebase exploration.

Pattern 3: Conversation prefix caching

In long multi-turn chats, mark the cache breakpoint at the most recent stable turn. Each subsequent turn reuses the prefix at 10%, paying full price only for the new user message and assistant response.

Typical savings: 50–70% on long agent traces. Fits: coding agents, support tickets with long history.

Pattern 4: Tool definition caching

Tool-use applications often pass 5–50 tool definitions on every call. Tool blocks support cache_control too — mark the last tool to cache the entire tool array.

Typical savings: 70–90% on agentic workloads. Fits: agents with rich tool surfaces (research, coding, browser agents).

Pattern 5: Tiered cache breakpoints (up to 4)

Claude supports up to 4 cache breakpoints per request. Use this when different prefix portions have different stability:

Each breakpoint creates a new cache layer. The longest-running cache (tools) keeps paying off across all users; the shortest (conversation) just covers the current session.

The 1-hour extended cache

Beyond the default 5-minute ephemeral cache, Anthropic offers a 1-hour cache at a higher write cost (~50% more than the 5-min cache, vs 25% normally). Worth it when the same prefix is reused over hours but not minutes — e.g. a daily-batch eval pipeline or a low-traffic internal tool.

How to verify it's working

Every response returns usage.cache_creation_input_tokens and usage.cache_read_input_tokens. Log both. On the first request you'll see cache_creation populate; on subsequent requests within the TTL, cache_read should be non-zero. If it stays zero, your content block changed (whitespace, ordering, anything).

Frequently asked questions

What is the cheapest Claude API caching strategy?

Tiered breakpoints — combining tool, system-prompt, document, and conversation-prefix caching in a single request. Each layer hits independently and stacks. On agentic workloads with rich tools and long history, tiered caching routinely reaches 85–95% effective discounts on input tokens.

How long does Claude prompt caching last?

Two TTLs in 2026: the default 5-minute ephemeral cache (each access resets the timer) and a 1-hour extended cache at a ~25% higher write surcharge. Choose the 5-minute cache for active sessions and the 1-hour cache for low-frequency reuse like batch evals or daily reports.

Why are my Claude cache reads zero?

Almost always because the cached content block changed between requests. Caching is byte-exact: an extra space, a different message ordering, or a single character edit invalidates the cache. Inspect the exact request bytes (not your in-app structure) when debugging — and ensure the cache_control is on the same block in the same position each time.

Does Claude caching work with tool use?

Yes. Tool definitions in the tools array support cache_control on the final tool, which caches the entire tool block. This is especially valuable for agents with 10+ tools where the array dominates the input token count.

Free tools