A 2026 checklist for cutting Claude API spend: caching, routing, batch, prompt compression, output budgets, and live cost monitoring — with code.
If you are paying more than $0.002 per assistant turn on the Claude API in 2026, you almost certainly have unrealised savings. The pricing structure of Opus 4.7, Sonnet 4.6, and Haiku 4.5, combined with prompt caching at 10% of the input price, the Batch API at 50% off, and per-request max_tokens control, gives operators five independent levers: caching, model routing, batching, output caps, and history compression. This page walks through each one in the order we recommend pulling them, and ends with a concrete cost model you can paste into a notebook.
Prompt caching is the single largest lever for most production workloads. If your system prompt, few-shot examples, tool definitions, or document context are reused across more than ~4 requests inside a 5-minute window, mark them with cache_control. Cached reads bill at 10% of the input price, a 90% discount on whatever is cached. The request that writes the cache pays a premium (1.25× the base input price for the default 5-minute cache), so the saving only materialises on later hits.
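import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# LONG_SYSTEM_PROMPT and question are placeholders you define elsewhere.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=[
        # Everything up to and including this block is cached.
        {"type": "text", "text": LONG_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}}
    ],
    messages=[{"role": "user", "content": question}],
)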
Verify it worked: response.usage.cache_read_input_tokens should be non-zero on the second call.
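Whether caching pays off depends on how often the prefix is re-read before the cache entry expires. The sketch below compares sending the same prefix n times with and without caching. It assumes a 1.25× write premium on the first request and a 0.10× read rate afterwards; the function name and the token counts are illustrative, so check the multipliers against current pricing before relying on the numbers.

```python
def prefix_cost(prefix_tokens, n_requests, input_price_per_mtok,
                write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached, cached) dollar cost of sending one prefix n times.

    Assumes the first request writes the cache and every later request
    hits it before the entry expires.
    """
    if n_requests == 0:
        return 0.0, 0.0
    per_token = input_price_per_mtok / 1_000_000
    uncached = n_requests * prefix_tokens * per_token
    cached = prefix_tokens * per_token * (
        write_multiplier + (n_requests - 1) * read_multiplier
    )
    return uncached, cached


# Illustrative: a 10,000-token prefix on a $3/MTok model, reused 20 times.
uncached, cached = prefix_cost(10_000, 20, 3.0)
print(f"uncached=${uncached:.4f}  cached=${cached:.4f}")
```

Under these assumptions a single request is slightly more expensive with caching, and the write premium is recovered on the first cache hit, so a prefix reused even a handful of times comes out well ahead.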
Most production traffic isn't uniformly hard. A classification task and a multi-step refactor don't both need Opus 4.7. Default to Sonnet 4.6, escalate to Opus only when a complexity check signals it, and demote to Haiku 4.5 when the request matches a short-form template. The Prompt-Pricing Recommender automates this. At the input rates in the cost model at the end of this page, a typical mix of 70% Sonnet / 20% Haiku / 10% Opus blends to about $3.80 per million input tokens against $15 for Opus alone, roughly 4× cheaper, with negligible quality loss for non-reasoning tasks.
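A minimal sketch of the routing idea, plus the arithmetic behind the blended price. The heuristics in pick_tier (a template prefix check, a length threshold, a needs_reasoning flag) are hypothetical stand-ins for whatever complexity check you actually use, and the blended figure assumes every tier sees the same token volume per request.

```python
# Input prices in $/MTok, taken from the cost model in this article.
INPUT_PRICE = {"opus": 15.0, "sonnet": 3.0, "haiku": 1.0}


def pick_tier(prompt, needs_reasoning=False,
              short_templates=("classify:", "extract:")):
    """Hypothetical router: pick the cheapest tier that is likely to suffice."""
    text = prompt.strip().lower()
    if needs_reasoning or len(text) > 8_000:
        return "opus"      # escalate only on an explicit complexity signal
    if text.startswith(short_templates):
        return "haiku"     # short-form template, demote
    return "sonnet"        # default


def blended_price(mix, prices):
    """Weighted average price for a traffic mix such as {'sonnet': 0.7, ...}."""
    return sum(share * prices[tier] for tier, share in mix.items())


mix = {"sonnet": 0.7, "haiku": 0.2, "opus": 0.1}
blended = blended_price(mix, INPUT_PRICE)
print(f"blended=${blended:.2f}/MTok, "
      f"{INPUT_PRICE['opus'] / blended:.2f}x cheaper than all-Opus")
```

Logging the chosen tier alongside each request makes it possible to audit the real mix later instead of trusting the target percentages.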
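Anything that doesn't need an interactive response (evals, bulk extraction, classification backfills, synthetic data generation) should run through the Batch API for a flat 50% discount. Batches are processed asynchronously, with results returned within 24 hours rather than in the request cycle. Batch and caching stack: a cached prefix on a batch request lands at 5% of the standard input rate (50% × 10%), though cache hits inside a batch are best-effort rather than guaranteed.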
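As a rough illustration, a batch is a list of ordinary message requests, each tagged with a custom_id so results can be matched back to inputs. The helper names below are illustrative, and submit_batch assumes the anthropic Python SDK is installed and ANTHROPIC_API_KEY is set; it is defined but not called here.

```python
def build_batch_requests(prompts, model="claude-sonnet-4-6", max_tokens=256):
    """Turn {custom_id: prompt} into a request list for the Message Batches API."""
    return [
        {
            "custom_id": custom_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for custom_id, prompt in prompts.items()
    ]


def submit_batch(requests):
    """Submit the batch. Imported lazily so the builder works without the SDK."""
    import anthropic

    client = anthropic.Anthropic()
    return client.messages.batches.create(requests=requests)


def batch_input_multiplier(cached=False, batch_discount=0.5, cache_read=0.10):
    """Fraction of the standard input price paid per token on a batch request."""
    return batch_discount * (cache_read if cached else 1.0)


requests = build_batch_requests({"doc-1": "Extract the invoice total: ...",
                                 "doc-2": "Extract the invoice total: ..."})
print(len(requests), batch_input_multiplier(cached=True))
```

Keeping the request builder separate from the submit call makes it easy to dry-run a backfill and estimate its cost before sending anything.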
Output tokens are 5× the cost of input on every Claude model. You are billed for the tokens actually generated, not for the cap, so a turn capped at max_tokens=300 instead of max_tokens=4096 costs the same when the model produces 280 tokens, but the cap prevents the runaway 4,000-token responses that wreck monthly bills. Pair tight caps with stop sequences (the stop_sequences parameter) at logical boundaries (e.g. "\n\nUser:" for chat, "</answer>" for structured tasks).
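To make the exposure concrete, the sketch below builds request arguments with a tight cap and a stop sequence, and computes the worst-case output cost per turn for a given cap. The helper names are illustrative; the $15/MTok figure is the Sonnet output price used in this article's cost model.

```python
def capped_request(question, max_tokens=300, stop=("</answer>",)):
    """Keyword arguments for client.messages.create with a tight output budget."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": max_tokens,
        "stop_sequences": list(stop),
        "messages": [{"role": "user", "content": question}],
    }


def worst_case_output_cost(max_tokens, output_price_per_mtok):
    """Upper bound on output spend for one turn, in dollars."""
    return max_tokens * output_price_per_mtok / 1_000_000


kwargs = capped_request("Classify this ticket and answer inside <answer> tags.")
for cap in (300, 4096):
    print(cap, f"${worst_case_output_cost(cap, 15.0):.5f}")
```

If a response comes back with stop_reason equal to "max_tokens", the cap truncated the output; tracking how often that happens is a reasonable way to tell whether the budget is too tight.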
Most chat apps over-send conversation history by 2–5×. Summarise turns older than the last 4–6 exchanges with a single Haiku call, replace the raw transcript with the summary, and keep the recent turns verbatim. Input tokens per request then stay roughly flat instead of growing linearly with conversation length, typically without a measurable quality drop on agent tasks.
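One way to sketch the compression step: split the history into an older part and a recent window, summarise the older part, and keep the window verbatim. The function below is illustrative. It assumes each message's content is a plain string, counts keep_last in messages rather than exchanges, and takes the summariser as a callable so a Haiku call can be plugged in; the stub used here only counts lines.

```python
def compress_history(messages, summarize, keep_last=6):
    """Return (summary_or_None, recent_messages).

    Older messages are collapsed into one summary string. The kept window
    is nudged forward so it always starts on a user turn.
    """
    if len(messages) <= keep_last:
        return None, list(messages)
    split = len(messages) - keep_last
    while split < len(messages) and messages[split]["role"] != "user":
        split += 1
    older, recent = messages[:split], messages[split:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    return summarize(transcript), recent


# Stub summariser for demonstration; in production this would call a small model.
def fake_summarize(transcript):
    return f"[summary of {len(transcript.splitlines())} earlier messages]"


history = [{"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
           for i in range(9)]
summary, recent = compress_history(history, fake_summarize, keep_last=4)
print(summary, [m["content"] for m in recent])
```

The summary can then go into the system prompt or a leading user message. Caching the summary until the window next slides avoids paying for a fresh summarisation call on every turn.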
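def turn_cost(in_tok, out_tok, cached_tok=0, model="sonnet"):
    """Dollar cost of one turn. in_tok is total input tokens, including cached_tok."""
    # Prices in USD per million tokens.
    prices = {
        "opus": {"in": 15, "out": 75, "cache_read": 1.50},
        "sonnet": {"in": 3, "out": 15, "cache_read": 0.30},
        "haiku": {"in": 1, "out": 5, "cache_read": 0.10},
    }[model]
    # Note: the API's usage.input_tokens excludes cache reads, so add
    # cache_read_input_tokens to it when computing in_tok from a response.
    # Clamp so a malformed log row cannot produce a negative count.
    fresh = max(in_tok - cached_tok, 0)
    return (fresh * prices["in"] +
            cached_tok * prices["cache_read"] +
            out_tok * prices["out"]) / 1_000_000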
Run this against last week's request logs (real input, output, and cache-hit counts) before optimising — the answers are usually surprising. The full breakdown is in Claude Prompt Caching Explained, and you can sanity-check totals in the Cost Calculator.
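A sketch of that log pass, under a hypothetical log format: one JSON object per line with a model tier plus the usage counts as the API reports them (input_tokens, cache_read_input_tokens, output_tokens). Because the reported input_tokens already excludes cache reads, it is billed at the full input rate directly. Cache-write premiums are left out, matching the cost model above.

```python
import json
from collections import defaultdict

# $/MTok, copied from the cost model above.
PRICES = {
    "opus": {"in": 15.0, "out": 75.0, "cache_read": 1.50},
    "sonnet": {"in": 3.0, "out": 15.0, "cache_read": 0.30},
    "haiku": {"in": 1.0, "out": 5.0, "cache_read": 0.10},
}


def spend_by_model(jsonl_lines):
    """Sum dollar cost per model tier over an iterable of JSONL log lines."""
    totals = defaultdict(float)
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        rec = json.loads(line)
        p = PRICES[rec["model"]]
        totals[rec["model"]] += (
            rec["input_tokens"] * p["in"]
            + rec.get("cache_read_input_tokens", 0) * p["cache_read"]
            + rec["output_tokens"] * p["out"]
        ) / 1_000_000
    return dict(totals)


sample = [
    '{"model": "sonnet", "input_tokens": 1000, '
    '"cache_read_input_tokens": 9000, "output_tokens": 200}',
    '{"model": "haiku", "input_tokens": 500, "output_tokens": 100}',
]
for model, dollars in spend_by_model(sample).items():
    print(model, f"${dollars:.4f}")
```

For a real run, pass an open file handle instead of the sample list, for example spend_by_model(open("requests.jsonl")), where the file name is a placeholder.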