The Claude API supports two response modes: streaming (server-sent events; tokens arrive as they're generated) and non-streaming (single JSON response after generation completes). Both cost the same per token — the choice is about UX and complexity, not price.
Streaming mode
Endpoint: same; pass stream: true.
Tokens arrive in chunks (5–50 tokens at a time, typically).
Time to first token (TTFT): 200–800ms typical.
Time to complete (TTC): proportional to output length.
Non-streaming mode
Default behavior; pass stream: false or omit.
One HTTP response with the full completion.
Latency: equals TTC of streaming mode.
Simpler. No SSE parsing, no chunked handling.
Pick streaming when
The UI shows generated text to a user in real time (chat, code editor, autocomplete).
Long outputs (>500 tokens) where users would otherwise stare at a spinner for 10+ seconds.
You want to detect/cancel runaway generations before they finish.
Pick non-streaming when
You only care about the final output (structured extraction, classification, batch processing).
Output is short (<200 tokens) — the streaming overhead isn't worth the parsing complexity.
You're composing the output into another system that needs the whole answer.
Cost is identical
Streaming does not change per-token pricing. Both modes bill the same input and output tokens. The only "cost" difference is engineering complexity — streaming clients need SSE handling and partial-message logic.
Tool use + streaming
Streaming works with tool calls but adds complexity: you'll receive partial tool-call JSON until it's complete. Most agent frameworks (LangChain, the Anthropic SDK) handle this for you; if you're rolling your own, expect to buffer tool-call deltas.
No — both modes bill the same per token. The only difference is delivery: streaming sends tokens as they're generated via server-sent events; non-streaming waits and returns the full response in one HTTP reply.
Is there a latency benefit to streaming?
Time-to-first-token is much lower with streaming (200–800ms typical) than time-to-complete in non-streaming mode. For long outputs this is the difference between a usable chat UI and an unusable one. Total generation time is identical.
Can I use streaming with tool calls?
Yes. Tool-call arguments arrive as partial JSON deltas that you buffer until the call is complete. Most SDKs (Anthropic's official Python/TypeScript SDKs, LangChain) handle this transparently.