Streaming responses

Advanced AI Workflows & the Claude API

Chapter 4  ·  Streaming Responses

Without streaming, your application sends a request and waits — potentially several seconds — before showing the user anything. With streaming, Claude sends its response token by token as it generates it, and your application displays each chunk immediately. The result feels instant rather than frozen. This chapter covers how streaming works at the API level, how to process the event stream, and the patterns for wiring it into a real application.

Streaming vs Non-Streaming

Non-streaming (default)
  • Single HTTP request → single response
  • User sees nothing until the full response is ready
  • Latency scales with output length — a 2,000-token response takes noticeably longer than a 200-token one
  • Simpler to implement — one call, one result object
  • Best for: background processing, batch jobs, pipelines where no human is waiting
Streaming
  • Single HTTP request → stream of server-sent events
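  • First token typically arrives within a few hundred milliseconds (roughly 300–500 ms, varying with model, prompt size, and load), largely independent of total output length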
  • User sees text appearing in real time — perceived latency is dramatically lower
  • More complex to implement — you process events in a loop
  • Best for: any UI where a human is reading the response as it arrives
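
To check the latency difference on your own setup, a small helper can time any iterator of text chunks. This is a minimal sketch rather than anything from the SDK: measure_stream and slow_chunks are hypothetical names, and slow_chunks only simulates a model with a fixed delay. In a real application you would pass something like the SDK's stream.text_stream instead.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume an iterator of text chunks and record two latencies.

    time_to_first_chunk approximates what a streaming UI feels like;
    total_time is how long a non-streaming caller would have waited.
    """
    start = clock()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = clock() - start  # recorded once, on the first chunk only
        parts.append(chunk)
    return {
        "text": "".join(parts),
        "time_to_first_chunk": first,  # stays None if the stream was empty
        "total_time": clock() - start,
    }


def slow_chunks(words: list[str], delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: one chunk per word."""
    for word in words:
        time.sleep(delay)
        yield word


result = measure_stream(slow_chunks(["Streams ", "feel ", "faster."]))
print(result)
```

The gap between the two numbers grows with output length, which is the effect the comparison above describes.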

The Event Stream — What Claude Actually Sends

Streaming uses Server-Sent Events (SSE). Claude sends a sequence of events, each a JSON payload, as it generates the response. You process them in order:

SSE event sequence — one streaming response
message_start
Response begins. Contains the message ID and initial usage (input tokens).
content_block_start
A content block is opening (type: "text" or "tool_use").
content_block_delta (×N)
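The actual content chunks. For a text block these are text_delta events, one per token or small group of tokens: this is what you display to the user, and concatenating them builds the full response. For a tool_use block they are input_json_delta fragments of the tool arguments.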
content_block_stop
The content block is complete. If type was "tool_use", the tool arguments are now fully assembled.
message_delta
Contains stop_reason and final usage (output tokens). Check stop_reason here.
message_stop
Stream is complete. Safe to close the connection after this event.
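
The sequence above can be sketched as a single loop over decoded event payloads. This is an illustration under stated assumptions, not the SDK's implementation: assemble_message is a hypothetical helper, and SAMPLE_EVENTS is hand-written to mirror the payload shapes of the six event types rather than captured from a real response.

```python
from typing import Iterable


def assemble_message(events: Iterable[dict]) -> dict:
    """Fold a sequence of streaming events into one result dict."""
    result = {
        "id": None, "text": "", "stop_reason": None,
        "input_tokens": None, "output_tokens": None, "done": False,
    }
    for event in events:
        kind = event["type"]
        if kind == "message_start":
            message = event["message"]
            result["id"] = message["id"]
            result["input_tokens"] = message["usage"]["input_tokens"]
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "text_delta":
                result["text"] += delta["text"]  # concatenate deltas in order
        elif kind == "message_delta":
            result["stop_reason"] = event["delta"]["stop_reason"]
            result["output_tokens"] = event["usage"]["output_tokens"]
        elif kind == "message_stop":
            result["done"] = True
        # content_block_start, content_block_stop, and any other event
        # types need no action when all we want is the plain text.
    return result


# Hand-written sample payloads (assumed shapes, hypothetical values).
SAMPLE_EVENTS = [
    {"type": "message_start",
     "message": {"id": "msg_example", "usage": {"input_tokens": 12, "output_tokens": 1}}},
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": ", world"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta",
     "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 6}},
    {"type": "message_stop"},
]

result = assemble_message(SAMPLE_EVENTS)
print(result["text"])  # Hello, world
```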

Basic Streaming — Terminal Output

streaming_basic.py (python)
import anthropic

client = anthropic.Anthropic()

# The SDK's stream context manager handles the event loop for you
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a short poem about APIs."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True) # print each chunk immediately

    # Once text_stream is exhausted, the SDK has assembled the full message.
    # Fetch it while the stream is still open.
    final_message = stream.get_final_message()
    print(f"\nTokens used: {final_message.usage.output_tokens}")

The SDK's .stream() context manager and .text_stream iterator abstract away the raw SSE events. You get text chunks directly. For most applications this is all you need.

Streaming in a FastAPI Endpoint

To stream Claude's output to a web client, use FastAPI's StreamingResponse with a generator function. The client receives the response as a stream of chunks and can display them as they arrive:

streaming_fastapi.py (python)
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import anthropic

app = FastAPI()
client = anthropic.Anthropic()

class ChatRequest(BaseModel):
    message: str

def generate_stream(user_message: str):
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": user_message}]
    ) as stream:
        for text in stream.text_stream:
            yield text

@app.post("/chat")
async def chat(request: ChatRequest):
    return StreamingResponse(
        generate_stream(request.message),
        media_type="text/plain"
    )

Reading the Stream on the Client Side

On the browser side, use the Fetch API with a ReadableStream reader to consume the chunks as they arrive and append them to the UI:

streaming_client.js (javascript)
async function sendMessage(userMessage) {
  const response = await fetch("/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message: userMessage })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const outputEl = document.getElementById("output");

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true buffers a multi-byte character that is split across two chunks
    outputEl.textContent += decoder.decode(value, { stream: true }); // append each chunk
  }
}
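
A chunk boundary can fall in the middle of a multi-byte UTF-8 character, so any client that reads raw bytes needs a decoder that keeps state between chunks. The sketch below shows the same idea for a non-browser client in Python, using the standard library's incremental decoder. decode_chunks is a hypothetical helper name, and the byte chunks are hand-made to force a split character.

```python
import codecs
from typing import Iterable, Iterator


def decode_chunks(byte_chunks: Iterable[bytes]) -> Iterator[str]:
    """Decode a byte stream to text without breaking split characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        text = decoder.decode(chunk)  # incomplete trailing bytes stay buffered
        if text:
            yield text
    # Flush at the end; raises UnicodeDecodeError if the stream
    # stopped in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "é" is two bytes in UTF-8 (0xC3 0xA9); here it straddles two chunks.
chunks = [b"caf\xc3", b"\xa9 ok"]
print("".join(decode_chunks(chunks)))  # café ok
```

Decoding each chunk independently would fail on the first chunk here, because it ends with half a character.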

Streaming with Tool Use

Tool use and streaming work together — but the interaction requires handling both text deltas and tool input deltas in the event stream. The SDK's lower-level event iterator gives you access to both:

streaming_tools.py (python): handling tool calls in a stream
import anthropic

client = anthropic.Anthropic()

# tools, messages, and handle_tool_call are assumed to be defined as in Chapter 3
with client.messages.stream(
    model="claude-sonnet-4-6", max_tokens=1024,
    tools=tools, messages=messages
) as stream:
    for event in stream:
        # Text arrives as text_delta. Tool arguments arrive as input_json_delta,
        # which we skip here and read from the final message instead.
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            print(event.delta.text, end="", flush=True)

    # Once the loop ends, get the final assembled message
    # (tool_use blocks are fully assembled even in streaming mode)
    final = stream.get_final_message()
    if final.stop_reason == "tool_use":
        # handle tool call as in Chapter 3
        handle_tool_call(final)

Use get_final_message() for tool use
When streaming with tools, the tool arguments arrive as input_json_delta events — partial JSON fragments. Rather than assembling them yourself, let the stream finish and call stream.get_final_message(). The SDK assembles the complete tool call object for you, with fully parsed arguments.
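
To make clear what the SDK is doing on your behalf, here is a rough sketch of assembling tool arguments by hand from raw events. It is illustrative only: collect_tool_calls is a hypothetical helper, the sample events are hand-written to mirror the event shapes, and get_weather is an invented tool name. In practice get_final_message() remains the simpler route.

```python
import json
from typing import Iterable


def collect_tool_calls(events: Iterable[dict]) -> list[dict]:
    """Assemble tool_use blocks from raw streaming events."""
    open_blocks: dict[int, dict] = {}  # block index -> partially built call
    finished: list[dict] = []
    for event in events:
        kind = event["type"]
        if kind == "content_block_start" and event["content_block"]["type"] == "tool_use":
            block = event["content_block"]
            open_blocks[event["index"]] = {
                "id": block["id"], "name": block["name"], "json": "",
            }
        elif kind == "content_block_delta" and event["delta"]["type"] == "input_json_delta":
            # Fragments are not valid JSON on their own, so only concatenate.
            open_blocks[event["index"]]["json"] += event["delta"]["partial_json"]
        elif kind == "content_block_stop" and event["index"] in open_blocks:
            call = open_blocks.pop(event["index"])
            # Only now is the JSON complete; an empty string means no arguments.
            call["input"] = json.loads(call.pop("json") or "{}")
            finished.append(call)
    return finished


# Hand-written sample: one text block, then one tool call (hypothetical values).
SAMPLE_TOOL_EVENTS = [
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Let me check."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "toolu_example",
                       "name": "get_weather", "input": {}}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": '{"location": "San Fra'}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": 'ncisco, CA"}'}},
    {"type": "content_block_stop", "index": 1},
]

print(collect_tool_calls(SAMPLE_TOOL_EVENTS))
```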

When to Stream and When Not To

Scenario | Stream? | Reason
Chat interface (human reading the response) | Yes | Dramatically improves perceived performance. Users can start reading while Claude is still writing.
Code generation in an editor | Yes | Seeing code appear incrementally feels fast; waiting for 200 lines feels frozen.
Background batch processing | No | No human is waiting. Non-streaming is simpler and uses the same API quota.
Structured output you parse (JSON, CSV) | No | You can't parse partial JSON. Wait for the full response before processing.
Classification or short answers | No | Responses are so short (~1–3 tokens) that streaming adds complexity with no perceived benefit.
Long document generation (>1,000 tokens) | Yes | Without streaming, the user stares at a spinner for 10–20 seconds. With it, they see progress immediately.
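
The structured-output row is easy to demonstrate. In this small sketch the chunk boundaries and the helper names parse_json_when_complete and can_parse are invented for illustration: a partial chunk fails to parse, while the joined text parses cleanly.

```python
import json
from typing import Any, Iterable


def parse_json_when_complete(chunks: Iterable[str]) -> Any:
    """Accumulate every chunk, then parse exactly once at the end."""
    return json.loads("".join(chunks))


def can_parse(text: str) -> bool:
    """Return True only if text is a complete, valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical chunk boundaries for a short classification result.
chunks = ['{"label": "spa', 'm", "confi', 'dence": 0.93}']

print(can_parse(chunks[0]))              # False
print(parse_json_when_complete(chunks))  # {'label': 'spam', 'confidence': 0.93}
```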

Production Patterns for Streaming

Buffer before rendering markdown
Markdown syntax like **bold** or ```code``` spans multiple tokens. If you render incrementally, you'll show half-rendered syntax. Buffer chunks until you detect a complete markdown element, or render raw text and re-parse periodically.
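
One way to approach the buffering is to hold text back while a bold marker or code fence is still open. The sketch below is a deliberately simple heuristic, not a markdown parser: MarkdownBuffer and render_safely are hypothetical names, and it only tracks bold markers and triple-backtick fences. A real renderer would need to handle more syntax.

```python
from typing import Iterable, Iterator

FENCE = "`" * 3  # triple backtick, built this way to keep a literal fence out of the example
BOLD = "**"


class MarkdownBuffer:
    """Hold chunks back until bold markers and code fences are closed."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> str:
        """Add a chunk; return text that is safe to render now, or ""."""
        self._pending += chunk
        if self._is_balanced(self._pending):
            ready, self._pending = self._pending, ""
            return ready
        return ""

    def flush(self) -> str:
        """Return whatever is left, balanced or not (call at end of stream)."""
        ready, self._pending = self._pending, ""
        return ready

    @staticmethod
    def _is_balanced(text: str) -> bool:
        # A trailing * or ` may be the first half of a marker that
        # continues in the next chunk, so wait for more text.
        if text.endswith(("*", "`")):
            return False
        return text.count(BOLD) % 2 == 0 and text.count(FENCE) % 2 == 0


def render_safely(chunks: Iterable[str]) -> Iterator[str]:
    """Yield only complete pieces, then any remainder at the end."""
    buffer = MarkdownBuffer()
    for chunk in chunks:
        ready = buffer.feed(chunk)
        if ready:
            yield ready
    tail = buffer.flush()
    if tail:
        yield tail


for piece in render_safely(["Use ", "**bo", "ld**", " text."]):
    print(repr(piece))
```

The trade-off is latency: text inside a long code fence is held until the fence closes, so the alternative of rendering raw text and re-parsing periodically may suit code-heavy output better.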
Show a typing indicator first
There's typically a gap of a few hundred milliseconds (roughly 300–500 ms) between the request and the first token. Show a typing indicator the moment the request is sent, before any tokens arrive, and remove it on the first delta. This keeps the UI from looking frozen during the initial latency.
Handle disconnected clients
If the client disconnects mid-stream (page close, network drop), your generator should exit cleanly. Wrap the stream loop in a try/finally and cancel the upstream request if the client is gone to avoid burning tokens on output nobody reads.
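
The try/finally pattern can be sketched without any web framework. relay_chunks and on_close are hypothetical names. Whether and when your framework closes the generator after a client disconnect is an assumption you should verify for your own stack. When the generator body wraps `with client.messages.stream(...)`, closing the generator also exits that with block, which serves the same purpose as the on_close callback here.

```python
from typing import Callable, Iterable, Iterator


def relay_chunks(
    upstream: Iterable[str],
    on_close: Callable[[], None],
) -> Iterator[str]:
    """Yield chunks from upstream and always run on_close exactly once."""
    try:
        for chunk in upstream:
            yield chunk
    finally:
        # Runs on normal completion, on an exception, and when the consumer
        # closes the generator early (GeneratorExit), e.g. after a disconnect.
        on_close()


def fake_upstream(log: list[int]) -> Iterator[str]:
    """Hypothetical stand-in for a model stream that records its progress."""
    for i in range(1000):
        log.append(i)
        yield f"chunk-{i} "


produced: list[int] = []
gen = relay_chunks(fake_upstream(produced), lambda: print("upstream cancelled"))
next(gen)    # the client receives the first chunk
gen.close()  # simulate the framework closing the generator on disconnect
print(len(produced))  # 1
```

Because generators are lazy, closing early also means the remaining 999 chunks are never requested from the upstream source.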
Accumulate the full response
Even when streaming to the UI, accumulate the full text server-side. You'll need it to append to conversation history for the next turn, to log, and to check stop_reason. The stream's get_final_message() gives you this without extra work.
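
If you are relaying plain chunks rather than holding the SDK stream object, the accumulation can live in a thin wrapper. This is a minimal sketch with hypothetical names (stream_and_record, on_complete). It assumes the conversation history is a list of role/content dicts, as used in the messages parameter earlier in this chapter.

```python
from typing import Callable, Iterable, Iterator


def stream_and_record(
    chunks: Iterable[str],
    on_complete: Callable[[str], None],
) -> Iterator[str]:
    """Pass chunks through unchanged while keeping a copy of each one."""
    parts: list[str] = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk
    # Reached only when the stream finishes normally; if the consumer
    # stops early, on_complete is not called.
    on_complete("".join(parts))


history: list[dict] = [{"role": "user", "content": "Say hello."}]


def save_reply(text: str) -> None:
    history.append({"role": "assistant", "content": text})


sent_to_ui = list(stream_and_record(["Hel", "lo", "!"], save_reply))
print(history[-1])  # {'role': 'assistant', 'content': 'Hello!'}
```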
Streaming doesn't reduce token cost
Streaming changes when you receive the response, not how many tokens are generated. You're billed for the same input and output tokens whether you stream or not. If you're trying to reduce cost, look at model selection and prompt length — not streaming mode.
Next — Chapter 5: Building a Simple AI-Powered App
Putting it all together — a FastAPI backend with Claude at the core. System prompt design, conversation state management, streaming responses to the frontend, and the patterns that make an AI feature feel like a product rather than a demo.