Streaming responses

Advanced AI Workflows & the Claude API

Chapter 4 · Streaming Responses

Without streaming, your application sends a request and waits — potentially several seconds — before showing the user anything. With streaming, Claude sends its response token by token as it generates it, and your application displays each chunk immediately. The result feels instant rather than frozen. This chapter covers how streaming works at the API level, how to process the event stream, and the patterns for wiring it into a real application.

Streaming vs Non-Streaming

Non-streaming (default)

Single HTTP request → single response
User sees nothing until the full response is ready
Latency scales with output length — a 2,000-token response takes noticeably longer than a 200-token one
Simpler to implement — one call, one result object
Best for: background processing, batch jobs, pipelines where no human is waiting

Streaming

Single HTTP request → stream of server-sent events
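First token typically arrives within a few hundred milliseconds (roughly 300–500 ms), largely independent of total output length, though it varies with model, prompt size, and load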
User sees text appearing in real time — perceived latency is dramatically lower
More complex to implement — you process events in a loop
Best for: any UI where a human is reading the response as it arrives
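
The difference between the two modes is easiest to see by timing it. The sketch below makes no API call: `fake_model_output` is a hypothetical stand-in that emits one chunk every few milliseconds, and `measure` records when the first chunk arrives versus when the last one does. A non-streaming UI can show nothing until `total_s`; a streaming UI has something to show at `first_chunk_s`.

```python
import time
from typing import Iterator


def fake_model_output(n_chunks: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: one chunk every delay_s seconds."""
    for i in range(n_chunks):
        time.sleep(delay_s)
        yield f"chunk{i} "


def measure(n_chunks: int, delay_s: float = 0.01) -> dict:
    """Consume the generator and record time-to-first-chunk and total time."""
    start = time.perf_counter()
    first_chunk_s = None
    parts = []
    for chunk in fake_model_output(n_chunks, delay_s):
        if first_chunk_s is None:
            # Streaming: this is when the user first sees text
            first_chunk_s = time.perf_counter() - start
        parts.append(chunk)
    # Non-streaming: this is when the user first sees text
    total_s = time.perf_counter() - start
    return {"first_chunk_s": first_chunk_s, "total_s": total_s, "text": "".join(parts)}


timing = measure(20)
print(f"first chunk after {timing['first_chunk_s']:.3f}s, complete after {timing['total_s']:.3f}s")
```

The total generation time is the same in both modes; only the moment at which the user first sees output changes.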

The Event Stream — What Claude Actually Sends

Streaming uses Server-Sent Events (SSE). Claude sends a sequence of events, each a JSON payload, as it generates the response. You process them in order:

SSE event sequence — one streaming response

message_start

Response begins. Contains the message ID and initial usage (input tokens).

content_block_start

A content block is opening (type: "text" or "tool_use").

content_block_delta (×N)

The actual text chunks — one per token or small group of tokens. This is what you display to the user. Concatenate all deltas to build the full response.

content_block_stop

The content block is complete. If type was "tool_use", the tool arguments are now fully assembled.

message_delta

Contains stop_reason and final usage (output tokens). Check stop_reason here.

message_stop

Stream is complete. Safe to close the connection after this event.
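
To make the sequence concrete, here is a minimal sketch that folds a list of events into a single result. The events are hand-written dictionaries that mirror the sequence above (the ID, text, and token counts are made up), and `fold_events` is a hypothetical helper, not part of the SDK.

```python
from typing import Any, Iterable

# Hand-written sample events; the values are illustrative only.
SAMPLE_EVENTS: list[dict[str, Any]] = [
    {"type": "message_start",
     "message": {"id": "msg_example", "usage": {"input_tokens": 12}}},
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": ", world"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta",
     "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 4}},
    {"type": "message_stop"},
]


def fold_events(events: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Walk the events in order and collect text, stop_reason, and usage."""
    state: dict[str, Any] = {"id": None, "text": "", "stop_reason": None,
                             "input_tokens": None, "output_tokens": None,
                             "done": False}
    for event in events:
        kind = event["type"]
        if kind == "message_start":
            state["id"] = event["message"]["id"]
            state["input_tokens"] = event["message"]["usage"]["input_tokens"]
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "text_delta":
                state["text"] += delta["text"]  # concatenate the text deltas
        elif kind == "message_delta":
            state["stop_reason"] = event["delta"].get("stop_reason")
            state["output_tokens"] = event["usage"]["output_tokens"]
        elif kind == "message_stop":
            state["done"] = True
    return state


summary = fold_events(SAMPLE_EVENTS)
print(summary["text"], summary["stop_reason"])
```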

Basic Streaming — Terminal Output

streaming_basic.py (Python)

      import anthropic

      client = anthropic.Anthropic()

      # The SDK's stream context manager handles the event loop for you
      with client.messages.stream(
          model="claude-sonnet-4-6",
          max_tokens=1024,
          messages=[{"role": "user", "content": "Write a short poem about APIs."}]
      ) as stream:
          for text in stream.text_stream:
              print(text, end="", flush=True)  # print each chunk immediately

          # Once the text stream is exhausted, the assembled final message is available
          final_message = stream.get_final_message()

      print(f"\nOutput tokens: {final_message.usage.output_tokens}")

The SDK's .stream() context manager and .text_stream iterator abstract away the raw SSE events. You get text chunks directly. For most applications this is all you need.
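
For a sense of what is being abstracted away, here is a rough sketch of parsing the SSE wire format by hand. It assumes the standard SSE framing (an `event:` line, a `data:` line carrying JSON, and a blank line ending each frame); `parse_sse` and `SAMPLE_WIRE` are hypothetical names, and the sample payloads are made up.

```python
import json
from typing import Iterable, Iterator

# Made-up wire text in SSE framing: "event:" line, "data:" line, blank line.
SAMPLE_WIRE = (
    'event: content_block_delta\n'
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hi"}}\n'
    '\n'
    'event: content_block_delta\n'
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " there"}}\n'
    '\n'
    'event: message_stop\n'
    'data: {"type": "message_stop"}\n'
    '\n'
)


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each SSE frame."""
    data_lines: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current frame
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips one optional space after the colon
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # "event:" lines and ":" comment lines are ignored in this sketch;
        # the payload's own "type" field carries the event name.


events = list(parse_sse(SAMPLE_WIRE.splitlines()))
text = "".join(
    e["delta"]["text"] for e in events if e["type"] == "content_block_delta"
)
print(text)
```

In practice the SDK also handles reconnection details, typed event objects, and message assembly, which is why the higher-level iterator is usually the better choice.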

Streaming in a FastAPI Endpoint

To stream Claude's output to a web client, use FastAPI's StreamingResponse with a generator function. The client receives the response as a stream of chunks and can display them as they arrive:

streaming_fastapi.py (Python)

      from fastapi import FastAPI
      from fastapi.responses import StreamingResponse
      from pydantic import BaseModel
      import anthropic

      app = FastAPI()
      client = anthropic.Anthropic()

      class ChatRequest(BaseModel):
          message: str

      def generate_stream(user_message: str):
          # Yield each text chunk as soon as the SDK produces it
          with client.messages.stream(
              model="claude-sonnet-4-6",
              max_tokens=2048,
              messages=[{"role": "user", "content": user_message}]
          ) as stream:
              for text in stream.text_stream:
                  yield text

      @app.post("/chat")
      async def chat(request: ChatRequest):
          # StreamingResponse forwards chunks to the client as the generator yields them
          return StreamingResponse(
              generate_stream(request.message),
              media_type="text/plain"
          )

Reading the Stream on the Client Side

On the browser side, use the Fetch API with a ReadableStream reader to consume the chunks as they arrive and append them to the UI:

streaming_client.js (JavaScript)

      async function sendMessage(userMessage) {
        const response = await fetch("/chat", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ message: userMessage })
        });
        if (!response.ok) throw new Error(`Request failed: ${response.status}`);

        const reader = response.body.getReader();
        const decoder = new TextDecoder();
        const outputEl = document.getElementById("output");

        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          // stream: true keeps multi-byte characters intact across chunk boundaries
          outputEl.textContent += decoder.decode(value, { stream: true });
        }
        outputEl.textContent += decoder.decode();  // flush any buffered bytes
      }
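
One detail worth watching on any client: network chunks are split on byte boundaries, not character boundaries, so a multi-byte UTF-8 character can be cut in half between two reads. The sketch below shows the same concern in Python using the standard library's incremental decoder; `decode_chunks` is a hypothetical helper, and the chunk boundaries are chosen by hand to split two characters. In the browser, passing `{ stream: true }` to `TextDecoder.decode` serves the same purpose.

```python
import codecs
from typing import Iterable, Iterator


def decode_chunks(chunks: Iterable[bytes]) -> Iterator[str]:
    """Decode UTF-8 byte chunks, carrying incomplete characters over to the next chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush at end of stream; raises if the stream ended mid-character
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


encoded = "café ☕".encode("utf-8")
# Hand-picked boundaries: the first cut lands inside "é", the second inside "☕"
chunks = [encoded[:4], encoded[4:7], encoded[7:]]

decoded = "".join(decode_chunks(chunks))
print(decoded)
```

Decoding each chunk independently (`chunks[0].decode("utf-8")`) would raise `UnicodeDecodeError` here, because the first chunk ends partway through a character.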

Streaming with Tool Use

Tool use and streaming work together — but the interaction requires handling both text deltas and tool input deltas in the event stream. The SDK's lower-level event iterator gives you access to both:

streaming_tools.py: handling tool calls in a stream (Python)

      import anthropic

      client = anthropic.Anthropic()

      # `tools`, `messages`, and `handle_tool_call` are assumed to be defined as in Chapter 3
      with client.messages.stream(
          model="claude-sonnet-4-6", max_tokens=1024,
          tools=tools, messages=messages
      ) as stream:
          for event in stream:
              # Text arrives as text_delta; tool arguments arrive as
              # input_json_delta and are skipped here
              if event.type == "content_block_delta" and event.delta.type == "text_delta":
                  print(event.delta.text, end="", flush=True)

          # Once the loop ends, get the final assembled message
          # (tool_use blocks are fully assembled even in streaming mode)
          final = stream.get_final_message()
          if final.stop_reason == "tool_use":
              # handle tool call as in Chapter 3
              handle_tool_call(final)

Use get_final_message() for tool use

When streaming with tools, the tool arguments arrive as content_block_delta events whose delta type is input_json_delta, each carrying a partial JSON fragment. Rather than assembling them yourself, let the stream finish and call stream.get_final_message(). The SDK assembles the complete tool call object for you, with fully parsed arguments.
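
For comparison, this is roughly what manual assembly would involve. The sketch works on hand-written event dictionaries (the tool name `get_weather`, the ID, and the arguments are made up), and `assemble_tool_calls` is a hypothetical helper: it collects the `partial_json` fragments per content block and parses them only once the block has stopped.

```python
import json
from typing import Any, Iterable

# Hand-written events for one tool_use block; values are illustrative only.
SAMPLE_EVENTS: list[dict[str, Any]] = [
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "tool_use", "id": "toolu_example",
                       "name": "get_weather", "input": {}}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "input_json_delta", "partial_json": '{"city": "Par'}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "input_json_delta", "partial_json": 'is", "unit": "c"}'}},
    {"type": "content_block_stop", "index": 0},
]


def assemble_tool_calls(events: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collect partial_json fragments per block and parse them at content_block_stop."""
    open_blocks: dict[int, dict[str, Any]] = {}
    calls: list[dict[str, Any]] = []
    for event in events:
        kind = event["type"]
        if kind == "content_block_start" and event["content_block"]["type"] == "tool_use":
            block = event["content_block"]
            open_blocks[event["index"]] = {
                "id": block["id"], "name": block["name"], "fragments": [],
            }
        elif kind == "content_block_delta" and event["delta"]["type"] == "input_json_delta":
            open_blocks[event["index"]]["fragments"].append(event["delta"]["partial_json"])
        elif kind == "content_block_stop" and event["index"] in open_blocks:
            block = open_blocks.pop(event["index"])
            raw = "".join(block["fragments"])
            # Only now is the JSON complete enough to parse
            calls.append({"id": block["id"], "name": block["name"],
                          "input": json.loads(raw) if raw else {}})
    return calls


calls = assemble_tool_calls(SAMPLE_EVENTS)
print(calls)
```

Neither fragment is valid JSON on its own, which is the reason to wait for the block to close (or simply to use the final message) before acting on the arguments.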

When to Stream and When Not To

Scenario	Stream?	Reason
Chat interface — human reading the response	Yes	Dramatically improves perceived performance. Users can start reading while Claude is still writing.
Code generation in an editor	Yes	Seeing code appear incrementally feels fast; waiting for 200 lines feels frozen.
Background batch processing	No	No human is waiting. Non-streaming is simpler and uses the same API quota.
Structured output you parse (JSON, CSV)	No	You can't parse partial JSON. Wait for the full response before processing.
Classification or short answers	No	Responses are so short (~1–3 tokens) that streaming adds complexity with no perceived benefit.
Long document generation (>1,000 tokens)	Yes	Without streaming, the user stares at a spinner for 10–20 seconds. With it, they see progress immediately.

Production Patterns for Streaming

Buffer before rendering markdown

Markdown syntax like **bold** or ```code``` spans multiple tokens. If you render incrementally, you'll show half-rendered syntax. Buffer chunks until you detect a complete markdown element, or render raw text and re-parse periodically.
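
A minimal sketch of the buffering approach, under simplifying assumptions: it only tracks `**` and triple-backtick markers, it does not understand nesting (a `**` inside a code fence is treated like any other), and `MarkdownBuffer` is a hypothetical class rather than anything from a library. The idea is to release text only up to the last unclosed marker and hold the rest until the closing marker arrives.

```python
class MarkdownBuffer:
    """Release streamed text only once its ** and ``` markers are balanced."""

    MARKERS = ("```", "**")

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> str:
        """Add a chunk and return the portion that is safe to render now."""
        self._pending += chunk
        cut = len(self._pending)
        for marker in self.MARKERS:
            # An odd count means the last occurrence is an unclosed opener
            if self._pending.count(marker) % 2 == 1:
                cut = min(cut, self._pending.rfind(marker))
        ready = self._trim_partial_marker(self._pending[:cut])
        self._pending = self._pending[len(ready):]
        return ready

    def flush(self) -> str:
        """Return whatever is left; call this when the stream ends."""
        rest, self._pending = self._pending, ""
        return rest

    @staticmethod
    def _trim_partial_marker(text: str) -> str:
        # A trailing "*" or "`" may be the first part of a marker split across chunks
        for char, size in (("`", 3), ("*", 2)):
            run = len(text) - len(text.rstrip(char))
            if run % size:
                return text[: len(text) - run % size]
        return text


buf = MarkdownBuffer()
outputs = [buf.feed(c) for c in ["This is **bo", "ld** and ", "done"]]
outputs.append(buf.flush())
print(outputs)
```

Always call `flush()` at the end of the stream so that text held back by an unmatched marker is still shown.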

Show a typing indicator first

Expect a short gap, often around 300–500 ms, between sending the request and receiving the first token. Show a typing indicator the moment the request is sent, before any tokens arrive, and remove it on the first delta. This keeps the UI from looking frozen during the initial latency.

Handle disconnected clients

If the client disconnects mid-stream (page close, network drop), your generator should exit cleanly. Wrap the stream loop in a try/finally and cancel the upstream request if the client is gone to avoid burning tokens on output nobody reads.
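
The mechanics can be sketched without a web framework. Here `FakeUpstream` is a hypothetical stand-in for the open request to the model, and closing the generator simulates the server dropping it when the client goes away; how a given framework actually signals a disconnect is framework-specific and assumed here. Closing a Python generator raises `GeneratorExit` at the paused `yield`, so the `with` block and the `finally` clause both run.

```python
from typing import Iterator


class FakeUpstream:
    """Hypothetical stand-in for an open streaming request to the model."""

    def __init__(self, chunks: list[str]) -> None:
        self._chunks = chunks
        self.closed = False

    def __enter__(self) -> "FakeUpstream":
        return self

    def __exit__(self, *exc_info) -> bool:
        # A real client would close the HTTP connection here,
        # which stops further generation from being streamed
        self.closed = True
        return False

    def __iter__(self) -> Iterator[str]:
        return iter(self._chunks)


def relay(upstream: FakeUpstream, log: list[str]) -> Iterator[str]:
    """Forward chunks to the client and clean up even if the client leaves early."""
    sent = 0
    try:
        with upstream as stream:
            for chunk in stream:
                sent += 1
                yield chunk
    finally:
        # Runs on normal completion and on early close alike
        log.append(f"stopped after {sent} chunk(s)")


upstream = FakeUpstream(["a", "b", "c"])
log: list[str] = []
gen = relay(upstream, log)
first = next(gen)  # the client received one chunk
gen.close()        # simulate the client disconnecting mid-stream
print(first, upstream.closed, log)
```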

Accumulate the full response

Even when streaming to the UI, accumulate the full text server-side. You'll need it to append to conversation history for the next turn, to log, and to check stop_reason. The stream's get_final_message() gives you this without extra work.
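
If you are relaying chunks through your own generator, the accumulation can live in the same loop. This is a sketch with a plain list standing in for the model's chunks; `stream_and_record` is a hypothetical helper, and the history format follows the role/content messages used elsewhere in this chapter.

```python
from typing import Iterable, Iterator


def stream_and_record(chunks: Iterable[str], history: list[dict]) -> Iterator[str]:
    """Yield each chunk to the caller while keeping a copy for the conversation history."""
    parts: list[str] = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk
    # Reached only when the stream completes normally
    history.append({"role": "assistant", "content": "".join(parts)})


history = [{"role": "user", "content": "Say hi."}]
sent_to_ui = list(stream_and_record(["Hel", "lo", "!"], history))
print(history[-1])
```

Because the append happens after the loop, an interrupted stream leaves the history untouched; whether to store a partial reply in that case is a design decision for your application.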

Streaming doesn't reduce token cost

Streaming changes when you receive the response, not how many tokens are generated. You're billed for the same input and output tokens whether you stream or not. If you're trying to reduce cost, look at model selection and prompt length — not streaming mode.

Next — Chapter 5: Building a Simple AI-Powered App
Putting it all together — a FastAPI backend with Claude at the core. System prompt design, conversation state management, streaming responses to the frontend, and the patterns that make an AI feature feel like a product rather than a demo.