Claude API Cost Optimization Checklist (2026)

A 2026 checklist for cutting Claude API spend: caching, routing, batch, prompt compression, output budgets, and live cost monitoring — with code.

If you are paying more than $0.002 per assistant turn on the Claude API in 2026, you almost certainly have unrealised savings. The pricing structure of Opus 4.7, Sonnet 4.6, and Haiku 4.5 — combined with prompt caching at 10% of input, the Batch API at 50% off, and per-request max_tokens control — gives operators five independent levers to pull. This page walks each one with the order we recommend pulling them in, and ends with a concrete cost model you can paste into a notebook.

1. Cache anything that repeats

Prompt caching is the single largest lever for most production workloads. If your system prompt, few-shot examples, tool definitions, or document context are reused across more than ~4 requests inside a 5-minute window, mark them with cache_control. Cached reads bill at 10% of the input price — a 90% discount on whatever is cached.

Verify it worked: response.usage.cache_read_input_tokens should be non-zero on the second call.

2. Route by complexity

Most production traffic isn't uniformly hard. A classification task and a multi-step refactor don't both need Opus 4.7. Default to Sonnet 4.6, escalate to Opus only when a complexity check signals it, and demote to Haiku 4.5 when the request matches a short-form template. The Prompt-Pricing Recommender automates this. A typical mix of 70% Sonnet / 20% Haiku / 10% Opus is about 4× cheaper than running everything on Opus, with negligible quality loss for non-reasoning tasks.

3. Move offline work to Batch

Anything that doesn't need a sub-second response — evals, bulk extraction, classification backfills, synthetic data generation — should run through the Batch API for a flat 50% discount. Batch and caching stack: a cached prefix on a batch request lands at 5% of the standard input rate.

4. Cap output tokens aggressively

Output tokens are 5× the cost of input on every Claude model. A turn capped at max_tokens=300 instead of max_tokens=4096 costs the same when the model produces 280 tokens — but prevents the runaway 4,000-token responses that wreck monthly bills. Pair tight caps with a stop sequence at logical boundaries (e.g. "\n\nUser:" for chat, "</answer>" for structured tasks).

5. Compress context before sending

Most chat apps over-send conversation history by 2–5×. Summarise turns older than the last 4–6 exchanges with a single Haiku call, replace the raw transcript with the summary, and keep the recent verbatim turns. This reduces input tokens per request linearly without a measurable quality drop on agent tasks.

The cost model (paste this)

Run this against last week's request logs (real input, output, and cache-hit counts) before optimising — the answers are usually surprising. The full breakdown is in Claude Prompt Caching Explained, and you can sanity-check totals in the Cost Calculator.

Frequently asked questions

What is the biggest Claude API cost optimization in 2026?

Prompt caching, by a wide margin, for any workload with a stable system prompt or document context. Cached reads are billed at 10% of input price, so a 12k-token system prompt reused across 500k requests/month drops from $54/mo (Sonnet uncached) to $5.40/mo cached — and the absolute savings scale linearly with reuse.

Does Claude prompt caching work with the Batch API?

Yes — the discounts stack. Batch gives you 50% off the post-cache rate, so a cached prefix inside a batch request bills at 5% of the standard input price (10% cache rate × 50% batch discount).

How do I know if my Claude API costs are too high?

Compute average cost per assistant turn from your billing dashboard and compare to a Sonnet baseline of ~$0.003 (2k input + 500 output tokens, no caching). If you're materially above that, run the cost model in this page against last week's logs and look for high output token counts, no cache hits, or all-Opus routing.

Will model routing hurt response quality?

Rarely, if you route on prompt features rather than randomly. Classification, short-form extraction, and templated rewrites run on Haiku with no measurable quality drop. Save Opus for multi-step reasoning, unfamiliar codebases, or long-context retrieval — about 5–15% of typical traffic.

Free tools

Claude API Cost Optimization in 2026