Streaming LLM Responses End to End: Backend to UI
Why SSE won, how the token stream flows from the provider to the browser, and where it breaks
Streaming is the difference between an app that feels alive and one that feels broken. A model that takes eight seconds to answer feels fast if the first words appear in 400ms, and unbearable if you stare at a spinner for all eight. The mechanics are not hard, but the failure modes are sneaky, and most of them live in the seams between the provider, your backend, and the browser. This is the whole path, end to end, and where it tends to break.
Why SSE, not WebSockets
Token streaming is a one-way job. You send a prompt, the server sends tokens back, and the client does not need to interrupt mid-stream with new data. Server-Sent Events were designed for exactly this kind of one-way server push, long before LLMs needed it, and it shows. Every major provider streams over SSE: OpenAI, Anthropic, and Google Gemini all use it, and the Vercel AI SDK uses it under the hood.
SSE rides on a plain HTTP response, so it passes through most proxies and load balancers without a protocol upgrade (buffering is the one thing to watch, covered below), and it does not need a second protocol or a separate connection. The browser's EventSource API also reconnects on its own; if you read the stream with fetch instead, as the POST examples here do, reconnection is yours to handle. WebSockets give you a full duplex channel you do not need here, plus more infrastructure to babysit. Reach for WebSockets when the client genuinely needs to push during the stream (live collaboration, voice). For chat-style token output, SSE is the right default.
The "stream" from OpenAI or Anthropic is itself SSE. So a naive proxy is just reading one SSE stream and writing another. The interesting work is everything you add in between: auth, tool calls, persistence, and turning provider events into your own event shape.
The shape of the path
There are three hops, and each one can drop tokens or stall:
- Provider to your backend. The model API sends SSE chunks as it decodes.
- Your backend to the client. You re-emit those chunks, usually after transforming them, over your own SSE response.
- Client into the UI. The browser parses the event stream and appends text to state as it arrives.
Get any hop wrong and the symptom is the same from the outside: text arrives late, in one lump, or not at all. So you instrument each hop separately.
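One way to instrument a hop is to wrap its stream in a pass-through that records when each chunk arrives. The sketch below is a minimal, standard-library-only illustration; `timed`, `summarize`, and the `stats` dictionary are hypothetical names, not part of any SDK. If a hop reports many chunks but its time to first chunk is almost equal to its time to last chunk, something upstream of that point is buffering.

```python
import asyncio
import time
from typing import AsyncIterator, Dict, List, Optional


async def timed(
    stream: AsyncIterator[str],
    label: str,
    stats: Dict[str, List[float]],
) -> AsyncIterator[str]:
    """Pass chunks through unchanged, recording seconds since the hop started."""
    start = time.monotonic()
    arrivals = stats.setdefault(label, [])
    async for chunk in stream:
        arrivals.append(time.monotonic() - start)
        yield chunk


def summarize(arrivals: List[float]) -> Dict[str, Optional[float]]:
    """Reduce one hop's arrival times to the numbers worth logging."""
    if not arrivals:
        return {"chunks": 0, "time_to_first": None, "time_to_last": None}
    return {
        "chunks": len(arrivals),
        "time_to_first": arrivals[0],
        "time_to_last": arrivals[-1],
    }
```

Wrapping the provider stream as `timed(provider_stream, "provider", stats)` and the outgoing frames as `timed(frames, "backend", stats)` gives you one set of numbers per hop to compare.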
The backend: stream, do not buffer
The single most common bug is buffering. Your framework, a proxy, or a gzip layer collects the whole response before flushing, and your beautiful token stream arrives as one block. The fix is to stream a generator and disable buffering on the response.
Here is a FastAPI endpoint that streams an Anthropic response as SSE:

    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse
    from anthropic import AsyncAnthropic

    app = FastAPI()
    client = AsyncAnthropic()

    @app.post("/chat")
    async def chat(body: dict):
        async def event_stream():
            try:
                async with client.messages.stream(
                    model="claude-sonnet-4-6",
                    max_tokens=1024,
                    messages=[{"role": "user", "content": body["message"]}],
                ) as stream:
                    async for text in stream.text_stream:
                        # SSE frame: "data: <payload>\n\n"
                        yield f"data: {text}\n\n"
                yield "data: [DONE]\n\n"
            except Exception as e:
                # Send the error as an event so the client can show it,
                # instead of the connection just dying silently.
                yield f"event: error\ndata: {str(e)}\n\n"

        return StreamingResponse(
            event_stream(),
            media_type="text/event-stream",
            headers={
                "Cache-Control": "no-cache",
                "Connection": "keep-alive",
                "X-Accel-Buffering": "no",  # tell nginx not to buffer
            },
        )

The X-Accel-Buffering: no header is the one people forget. If you deploy behind nginx and tokens arrive in one lump in production but stream fine locally, that header is usually the fix.
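Do not write raw text into an SSE data: field without handling newlines, whether that text comes from the user or from the model. A single newline ends the data: line, so whatever follows is parsed as a new field and usually dropped, and a blank line ends the event early. The endpoint above sends raw text only to keep the example short. In production, encode each chunk as JSON, or split on newlines and emit one data: line per piece.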
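A minimal sketch of the JSON approach, using only the standard library. `sse_frame` is a hypothetical helper name, and the payload shape is an assumption for illustration. Because `json.dumps` escapes newlines and carriage returns inside strings, the data: field always stays on one line.

```python
import json
from typing import Any, Optional


def sse_frame(payload: Any, event: Optional[str] = None) -> str:
    """Serialise one payload as a single SSE frame with a JSON data field."""
    # json.dumps turns a real newline into the two characters "\" and "n",
    # so the payload can never terminate the data: line or the frame.
    data = json.dumps(payload, ensure_ascii=False)
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {data}")
    # The blank line (double newline) is what ends the frame.
    return "\n".join(lines) + "\n\n"
```

In the endpoint, `yield sse_frame({"type": "text", "value": text})` would replace the raw f-string, and the client would call `JSON.parse` on each data payload before appending it.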
Cancellation: the part everyone skips
Users close tabs, navigate away, and hit stop. If you do not detect that, you keep paying the provider to generate tokens nobody will read. The model API call should be tied to the client connection, so when the client disconnects, you abort the upstream request.
On the client, that means using AbortController and actually calling it:

    const controller = new AbortController();
    const res = await fetch("/chat", {
      method: "POST",
      // FastAPI only parses the body as JSON when the content type says so.
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message }),
      signal: controller.signal,
    });

    // Later, when the user hits "stop" or the component unmounts:
    controller.abort();

On the backend, FastAPI exposes await request.is_disconnected(), and most SDK stream context managers will stop generating when the surrounding task is cancelled. Wire it through and your bill stops when the user leaves.
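On the backend side, the wiring can be sketched as a wrapper that checks for a disconnect between chunks and closes the upstream stream when it sees one. This is a sketch under assumptions: `stream_until_disconnected` is a hypothetical helper, and `is_disconnected` stands in for any async callable that returns a bool, such as FastAPI's `request.is_disconnected`. It only notices a disconnect when the next chunk arrives, which is usually good enough for token streams.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def stream_until_disconnected(
    upstream: AsyncIterator[str],
    is_disconnected: Callable[[], Awaitable[bool]],
) -> AsyncIterator[str]:
    """Relay chunks until the client goes away, then close the upstream."""
    try:
        async for chunk in upstream:
            if await is_disconnected():
                break  # nobody is reading; stop pulling tokens
            yield chunk
    finally:
        # Breaking out of `async for` does not close an async generator,
        # so close it explicitly to release the provider connection.
        aclose = getattr(upstream, "aclose", None)
        if aclose is not None:
            await aclose()
```

Closing the upstream explicitly matters because the provider request is what costs money; dropping the reference and waiting for garbage collection leaves it running for an unpredictable time.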
The client: parsing the stream
You can read the raw stream yourself with the Fetch API and a ReadableStream reader, parsing data: lines as they come:

    async function* readSSE(res: Response) {
      const reader = res.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = "";
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const frames = buffer.split("\n\n");
        buffer = frames.pop() ?? ""; // keep the incomplete frame
        for (const frame of frames) {
          const line = frame.replace(/^data: /, "");
          if (line === "[DONE]") return;
          yield line;
        }
      }
    }

The detail that bites people: chunks do not arrive on tidy frame boundaries. One read() can hand you half an event, or two and a half events. So you buffer, split on the \n\n delimiter, and hold the trailing partial frame until more bytes arrive. Skip that and you will drop or mangle tokens under load.
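The same buffering rule can be sketched in Python, which is handy for exercising the backend from a test without a browser. `SSEParser` and `feed` are hypothetical names, and this is a simplified reading of the SSE format rather than a full implementation (it ignores id: and retry: fields and assumes \n line endings). Unlike the reader above, it also separates out event: names, so an event: error frame is reported as an error instead of being appended as text.

```python
import codecs
from typing import List, Tuple


def _field_value(line: str, name: str) -> str:
    """Return the value after 'name:', dropping at most one leading space."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


class SSEParser:
    """Incremental parser: feed it bytes as they arrive, get complete events."""

    def __init__(self) -> None:
        # The incremental decoder holds back a multi-byte character
        # that was split across two network chunks.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""

    def feed(self, chunk: bytes) -> List[Tuple[str, str]]:
        self._buffer += self._decoder.decode(chunk)
        # Everything before the last "\n\n" is complete; the tail is not.
        *frames, self._buffer = self._buffer.split("\n\n")
        events = []
        for frame in frames:
            event, data_lines = "message", []
            for line in frame.split("\n"):
                if line.startswith("event:"):
                    event = _field_value(line, "event")
                elif line.startswith("data:"):
                    data_lines.append(_field_value(line, "data"))
            if data_lines:
                events.append((event, "\n".join(data_lines)))
        return events
```

Feeding it `b"data: hel"` returns an empty list, and feeding `b"lo\n\n"` afterwards returns the single event `("message", "hello")`, which is the partial-frame behaviour described above.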
Or use the AI SDK and skip the plumbing
If you are in the React and Vercel world, the AI SDK has become the de facto standard for this. The useChat hook on the frontend handles SSE parsing, message state, loading and error states, and cancellation. On the backend you return toUIMessageStreamResponse() from a streamText result, and the two ends speak the same protocol.

    // app/api/chat/route.ts
    import { convertToModelMessages, streamText, type UIMessage } from "ai";
    import { anthropic } from "@ai-sdk/anthropic";

    export async function POST(req: Request) {
      const { messages }: { messages: UIMessage[] } = await req.json();
      const result = streamText({
        model: anthropic("claude-sonnet-4-6"),
        // useChat sends UI messages; convert them to model messages first.
        messages: await convertToModelMessages(messages),
      });
      return result.toUIMessageStreamResponse();
    }

    "use client";
    import { useState } from "react";
    import { useChat } from "@ai-sdk/react";

    // Written against AI SDK 5, the version that pairs with
    // toUIMessageStreamResponse(): the hook no longer manages input state,
    // and message content lives in `parts`.
    export function Chat() {
      const [input, setInput] = useState("");
      const { messages, sendMessage, stop } = useChat();
      return (
        <form
          onSubmit={(e) => {
            e.preventDefault();
            sendMessage({ text: input });
            setInput("");
          }}
        >
          {messages.map((m) => (
            <p key={m.id}>
              {m.parts.map((part, i) =>
                part.type === "text" ? <span key={i}>{part.text}</span> : null
              )}
            </p>
          ))}
          <input value={input} onChange={(e) => setInput(e.target.value)} />
          <button type="button" onClick={stop}>Stop</button>
        </form>
      );
    }

The AI SDK's data stream protocol uses SSE with a start/delta/end pattern and unique IDs per text block, and it adds keep-alive pings and reconnection. If you build your own backend that feeds useChat, set the x-vercel-ai-ui-message-stream: v1 header so the client knows the format. The tradeoff is you adopt their event shape, which is fine until you need something it does not model, at which point you drop down to raw SSE.
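If your backend is Python rather than Next.js, the start/delta/end pattern can be sketched as a generator of frames. Treat the event names below (start, text-start, text-delta, text-end, finish) and their fields as assumptions based on my reading of the AI SDK 5 stream protocol docs, and check them against the version you run. `ui_message_frames` is a hypothetical helper name.

```python
import json
import uuid
from typing import Iterable, Iterator


def ui_message_frames(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap plain text chunks in a start/delta/end sequence of SSE frames."""

    def frame(payload: dict) -> str:
        return f"data: {json.dumps(payload)}\n\n"

    # One id per text block, reused by every delta that belongs to it.
    text_id = uuid.uuid4().hex
    yield frame({"type": "start", "messageId": uuid.uuid4().hex})
    yield frame({"type": "text-start", "id": text_id})
    for chunk in chunks:
        yield frame({"type": "text-delta", "id": text_id, "delta": chunk})
    yield frame({"type": "text-end", "id": text_id})
    yield frame({"type": "finish"})
    yield "data: [DONE]\n\n"
```

You would return these frames from a streaming response that also carries the x-vercel-ai-ui-message-stream: v1 header mentioned above.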
Streaming with tool calls
Real agents do not just stream text, they stream a mix of text, tool calls, and tool results. The clean approach is to give every event a type and let the UI decide how to render each one:

    yield f"data: {json.dumps({'type': 'text', 'value': chunk})}\n\n"
    yield f"data: {json.dumps({'type': 'tool_call', 'name': 'search_web'})}\n\n"
    yield f"data: {json.dumps({'type': 'tool_result', 'name': 'search_web'})}\n\n"

This is exactly the model that agent frameworks expose. Vercel's Eve, for instance, streams NDJSON lifecycle events like actions.requested and action.result alongside the text, so the UI can show "searching the web" while the model works.
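On the receiving side, "let the UI decide" usually means a small reducer that folds each typed event into view state. The sketch below shows that logic in Python for illustration; in a browser the same shape would live in your state store. `ChatView` and `apply_event` are hypothetical names, and the event fields match the three yields above.

```python
import json
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ChatView:
    text: str = ""
    # Tool name -> "running" or "done", enough to render a status line.
    tools: Dict[str, str] = field(default_factory=dict)


def apply_event(view: ChatView, raw: str) -> ChatView:
    """Fold one JSON event payload (the data: field) into the view state."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "text":
        view.text += event["value"]
    elif kind == "tool_call":
        view.tools[event["name"]] = "running"
    elif kind == "tool_result":
        view.tools[event["name"]] = "done"
    # Unknown types are ignored, so the backend can add new events
    # without breaking clients that have not been updated yet.
    return view
```

Keying the dispatch on `type` is what lets the "searching the web" indicator appear and clear without the text path knowing anything about tools.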
A checklist before you ship
- Disable buffering at every layer: app, framework, and proxy (X-Accel-Buffering: no).
- Encode chunks as JSON so newlines and special characters cannot break frames.
- Handle cancellation on both ends and abort the upstream call on disconnect.
- Buffer partial frames on the client and split on \n\n.
- Send errors as stream events, not as a silently dropped connection.
- Send periodic keep-alive pings if your proxy closes idle connections.
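For the last item, one way to add pings is to wrap the frame generator so that it emits an SSE comment line whenever the upstream has been quiet for too long. Lines starting with a colon are comments in the SSE format, so compliant clients ignore them. This is a sketch: `with_keepalive` and the 15-second default are assumptions to tune against your proxy's idle timeout.

```python
import asyncio
from typing import AsyncIterator

_DONE = object()  # sentinel marking the end of the upstream stream


async def _next_or_done(iterator: AsyncIterator[str]) -> object:
    try:
        return await iterator.__anext__()
    except StopAsyncIteration:
        return _DONE


async def with_keepalive(
    frames: AsyncIterator[str],
    interval: float = 15.0,
) -> AsyncIterator[str]:
    """Relay frames, inserting a comment ping after each idle interval."""
    iterator = frames.__aiter__()
    pending = None
    try:
        while True:
            if pending is None:
                pending = asyncio.ensure_future(_next_or_done(iterator))
            # Wait without cancelling: a timeout must not kill the read.
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield ": keep-alive\n\n"
                continue
            result, pending = pending.result(), None
            if result is _DONE:
                return
            yield result
    finally:
        if pending is not None:
            pending.cancel()
```

The read is kept as a pending task across timeouts rather than wrapped in `asyncio.wait_for`, because `wait_for` cancels the read on timeout and that would tear down the upstream generator mid-stream.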
Streaming is mostly about respecting the seams. The model is the easy part. The hops between the model and the user's eyes are where the work is, and where a smooth product is won or lost.
For making that streamed output cheaper and faster without hurting quality, see cutting LLM cost and latency.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.