Mark stable prefixes — system prompt, examples, docs — with cache_control. Reads cost 10% of input.
2. Model routing (50–70% savings overall)
Most workloads are not uniformly hard. Route easy traffic to Haiku, ambiguous to Sonnet, hard to Opus. The Prompt-Pricing Recommender automates this.
3. Batch API (50% savings on offline work)
For evals, bulk extraction, classification — anything you don't need a response to in <24h.
4. Output token budget
Set max_tokens conservatively. A model asked for "a one-paragraph summary" with max_tokens=4096 will sometimes use it all. Set 300.
5. Structured output for short responses
JSON schemas constrain the model away from verbose preamble. Same task, fewer output tokens.
6. Stop sequences
For agentic flows, use stop sequences to truncate before the model rambles past a decision boundary.
7. Context pruning
Don't pass the entire conversation; summarize old turns. Most chat apps over-send context by 2–5×.
How much could you save?
Plug your workload into the Claude Cost Calculator and toggle the cache slider — that's usually the biggest lever.
Frequently asked questions
What is the single biggest lever for reducing Claude API costs?
Prompt caching, for most workloads. If your system prompt or document context is long and reused across requests, caching it drops the input cost to 10% of standard rate. On a 10k-token system prompt reused 1M times, that's 90% savings on input.
Does setting max_tokens affect what you pay?
You pay only for tokens actually generated, not for the max_tokens ceiling. However, setting a conservative max_tokens cap prevents runaway responses — a model given 4096 tokens often uses more than it needs. Setting a tight budget reduces output tokens and therefore cost.
How much can I save by routing to a cheaper model?
Haiku 4.5 is 15× cheaper than Opus 4.7 on input and output. Most teams find 60–80% of their traffic can run on Sonnet or Haiku without a meaningful quality drop. An 80% shift from Opus to Sonnet on a $10k/mo bill saves $6,400/mo.