Streaming responses
Advanced AI Workflows & the Claude API
Without streaming, your application sends a request and waits — potentially several seconds — before showing the user anything. With streaming, Claude sends its response token by token as it generates it, and your application displays each chunk immediately. The result feels instant rather than frozen. This chapter covers how streaming works at the API level, how to process the event stream, and the patterns for wiring it into a real application.
Streaming vs Non-Streaming
Non-streaming:
- Single HTTP request → single response
- User sees nothing until the full response is ready
- Latency scales with output length: a 2,000-token response takes noticeably longer than a 200-token one
- Simpler to implement: one call, one result object
- Best for: background processing, batch jobs, pipelines where no human is waiting
Streaming:
- Single HTTP request → stream of server-sent events
- First token typically arrives in ~300–500ms regardless of total length
- User sees text appearing in real time, so perceived latency is dramatically lower
- More complex to implement: you process events in a loop
- Best for: any UI where a human is reading the response as it arrives
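The latency difference can be illustrated without calling the API. The sketch below is a simulation only: `fake_token_stream` is a hypothetical stand-in for the model (not part of any SDK), and the per-token delay is an invented number. It compares how long a user waits for the first visible output in each mode.

```python
import time
from typing import Iterator


def fake_token_stream(n_tokens: int, delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model: emits one token every delay_s seconds."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i} "


def time_to_first_output_non_streaming(n_tokens: int) -> float:
    """Non-streaming: nothing can be shown until every token has been generated."""
    start = time.perf_counter()
    _full_text = "".join(fake_token_stream(n_tokens))
    return time.perf_counter() - start


def time_to_first_output_streaming(n_tokens: int) -> float:
    """Streaming: the first token can be shown as soon as it arrives."""
    start = time.perf_counter()
    stream = fake_token_stream(n_tokens)
    next(stream)  # the first chunk is available to display here
    elapsed = time.perf_counter() - start
    for _ in stream:  # drain the rest; the user is already reading
        pass
    return elapsed


if __name__ == "__main__":
    for n in (10, 50):
        blocking = time_to_first_output_non_streaming(n)
        streaming = time_to_first_output_streaming(n)
        print(f"{n} tokens: non-streaming {blocking:.2f}s, streaming {streaming:.2f}s")
```

In this simulation the non-streaming wait grows with the token count while the streaming wait stays near one token's delay, which is the effect the comparison above describes.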
The Event Stream — What Claude Actually Sends
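Streaming uses Server-Sent Events (SSE). As Claude generates the response, it sends a sequence of events, each carrying a JSON payload, and you process them in the order they arrive. A typical stream opens with message_start, emits content_block_start, one or more content_block_delta events, and content_block_stop for each content block, then ends with message_delta (which carries the stop_reason and usage) and message_stop.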
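To make the wire format concrete, here is a minimal sketch of parsing a raw event stream by hand. `SAMPLE_SSE` is an abbreviated, hand-written sample (not captured API output), and `iter_sse_events` and `collect_text` are hypothetical helper names. In practice the SDK does this parsing for you.

```python
import json
from typing import Iterator, Optional, Tuple

# Abbreviated, hand-written sample of a raw stream (not captured API output).
SAMPLE_SSE = """\
event: message_start
data: {"type": "message_start", "message": {"role": "assistant", "content": []}}

event: content_block_start
data: {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": ", world"}}

event: content_block_stop
data: {"type": "content_block_stop", "index": 0}

event: message_delta
data: {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 4}}

event: message_stop
data: {"type": "message_stop"}

"""


def iter_sse_events(raw: str) -> Iterator[dict]:
    """Yield the JSON payload of each SSE event; blank lines separate events."""
    for block in raw.split("\n\n"):
        data_lines = [
            line[len("data:"):].strip()
            for line in block.splitlines()
            if line.startswith("data:")
        ]
        if data_lines:
            yield json.loads("\n".join(data_lines))


def collect_text(raw: str) -> Tuple[str, Optional[str]]:
    """Concatenate the text deltas and return (text, stop_reason)."""
    parts = []
    stop_reason = None
    for event in iter_sse_events(raw):
        if event["type"] == "content_block_delta" and event["delta"]["type"] == "text_delta":
            parts.append(event["delta"]["text"])
        elif event["type"] == "message_delta":
            stop_reason = event["delta"].get("stop_reason")
    return "".join(parts), stop_reason


if __name__ == "__main__":
    text, stop_reason = collect_text(SAMPLE_SSE)
    print(text, stop_reason)
```

The text the user sees is just the concatenation of the `text_delta` payloads; every other event is bookkeeping around it.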
Basic Streaming — Terminal Output
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The SDK's stream context manager handles the event loop for you
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a short poem about APIs."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # print each chunk immediately

    # Once the text stream is exhausted, the assembled final message is available
    final_message = stream.get_final_message()

print(f"\nTokens used: {final_message.usage.output_tokens}")
The SDK's .stream() context manager and .text_stream iterator abstract away the raw SSE events. You get text chunks directly. For most applications this is all you need.
Streaming in a FastAPI Endpoint
To stream Claude's output to a web client, use FastAPI's StreamingResponse with a generator function. The client receives the response as a stream of chunks and can display them as they arrive:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import anthropic

app = FastAPI()
client = anthropic.Anthropic()


class ChatRequest(BaseModel):
    message: str


def generate_stream(user_message: str):
    """Yield Claude's text chunks as they arrive."""
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": user_message}],
    ) as stream:
        for text in stream.text_stream:
            yield text


@app.post("/chat")
async def chat(request: ChatRequest):
    # Starlette iterates a sync generator in a threadpool, so it does not block the event loop
    return StreamingResponse(
        generate_stream(request.message),
        media_type="text/plain",
    )
Reading the Stream on the Client Side
On the browser side, use the Fetch API with a ReadableStream reader to consume the chunks as they arrive and append them to the UI:
// The wrapper function name is illustrative
async function streamChat(userMessage) {
  const response = await fetch("/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message: userMessage })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const outputEl = document.getElementById("output");

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters that are split across chunks intact
    outputEl.textContent += decoder.decode(value, { stream: true }); // append each chunk
  }
}
Streaming with Tool Use
Tool use and streaming work together — but the interaction requires handling both text deltas and tool input deltas in the event stream. The SDK's lower-level event iterator gives you access to both:
# tools and messages are assumed to be defined already
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=messages,
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            # Print text as it arrives; tool input deltas are skipped here
            if event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
        elif event.type == "message_stop":
            break

    # Get the final assembled message
    # (tool_use blocks are fully assembled even in streaming mode)
    final = stream.get_final_message()

if final.stop_reason == "tool_use":
    # handle tool call as in Chapter 3
    handle_tool_call(final)
Tool inputs arrive as input_json_delta events: partial JSON fragments that are not valid JSON on their own. Rather than assembling them yourself, let the stream finish and call stream.get_final_message(). The SDK assembles the complete tool call object for you, with fully parsed arguments.
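To see why the fragments are awkward to handle yourself, here is a small sketch that accumulates them by hand. The deltas are hand-written in the shape of input_json_delta payloads, the tool arguments are hypothetical, and `try_parse` is an illustrative helper rather than an SDK function.

```python
import json

# Hand-written example deltas in the shape of input_json_delta payloads.
# The tool arguments (city, units) are hypothetical.
deltas = [
    {"type": "input_json_delta", "partial_json": '{"city": "Par'},
    {"type": "input_json_delta", "partial_json": 'is", "uni'},
    {"type": "input_json_delta", "partial_json": 'ts": "celsius"}'},
]


def try_parse(buffer: str):
    """Return the parsed object, or None while the JSON is still incomplete."""
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None


buffer = ""
results = []
for delta in deltas:
    buffer += delta["partial_json"]      # fragments only make sense concatenated
    results.append(try_parse(buffer))    # None until the final fragment arrives

print(results)
```

Only the last parse attempt succeeds, which is why waiting for the assembled final message is usually the simpler choice.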
When to Stream and When Not To
| Scenario | Stream? | Reason |
|---|---|---|
| Chat interface — human reading the response | Yes | Dramatically improves perceived performance. Users can start reading while Claude is still writing. |
| Code generation in an editor | Yes | Seeing code appear incrementally feels fast; waiting for 200 lines feels frozen. |
| Background batch processing | No | No human is waiting. Non-streaming is simpler and uses the same API quota. |
| Structured output you parse (JSON, CSV) | No | You can't parse partial JSON. Wait for the full response before processing. |
| Classification or short answers | No | Responses are so short (~1–3 tokens) that streaming adds complexity with no perceived benefit. |
| Long document generation (>1,000 tokens) | Yes | Without streaming, the user stares at a spinner for 10–20 seconds. With it, they see progress immediately. |
Production Patterns for Streaming
- Markdown syntax such as **bold** or ```code``` spans multiple tokens. If you render incrementally, you'll show half-rendered syntax. Buffer chunks until you detect a complete markdown element, or render raw text and re-parse periodically.
- Check the stop_reason once the stream ends. The stream's get_final_message() gives you this without extra work.
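The buffering approach for markdown can be sketched as follows. This is a deliberately simplified illustration, not a markdown parser: it only tracks bold markers and triple-backtick fences, and `MarkdownBuffer` and `_split_safe` are hypothetical names. Text is released for rendering only once every marker opened so far has been closed.

```python
from typing import Tuple

FENCE = "`" * 3            # triple-backtick code fence
MARKERS = (FENCE, "**")    # the only markers this sketch understands


def _split_safe(text: str) -> Tuple[str, str]:
    """Split text into (safe_to_render, held_back)."""
    i, open_marker, open_at, closed_until = 0, None, 0, 0
    while i < len(text):
        if open_marker is None:
            marker = next((m for m in MARKERS if text.startswith(m, i)), None)
            if marker is None:
                i += 1
            else:
                open_marker, open_at = marker, i
                i += len(marker)
        elif text.startswith(open_marker, i):
            i += len(open_marker)
            open_marker, closed_until = None, i
        else:
            i += 1
    if open_marker is not None:
        # An element is still open: hold back everything from its opener
        return text[:open_at], text[open_at:]
    # Hold back a trailing run of * or ` that may be the start of a marker
    j = len(text)
    while j > closed_until and text[j - 1] in "*`":
        j -= 1
    return text[:j], text[j:]


class MarkdownBuffer:
    """Accumulates streamed chunks and releases only complete elements."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> str:
        """Add a chunk and return whatever is now safe to render."""
        self._pending += chunk
        safe, self._pending = _split_safe(self._pending)
        return safe

    def flush(self) -> str:
        """Call at end of stream: release whatever is left, complete or not."""
        rest, self._pending = self._pending, ""
        return rest


if __name__ == "__main__":
    buf = MarkdownBuffer()
    for chunk in ["Use **bo", "ld** for emphasis"]:
        print(repr(buf.feed(chunk)))
    print(repr(buf.flush()))
```

The trade-off is a small delay: text inside an open element is withheld until its closing marker arrives, so a long code block appears all at once rather than line by line.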