A FastAPI Backend for LLM Apps: Streaming, Tools, and Sanity
The server layer that sits between your frontend and your models
Your frontend shouldn't talk to OpenAI or Anthropic directly. You need a server in the middle to hold API keys, enforce limits, stream tokens, run tools, and give you one place to add auth and logging. FastAPI is an excellent fit: async-first, typed, and fast. Here's a clean backend for an LLM product.
This is the plumbing layer. It assumes you've already designed your prompts and tools; this article is about serving them reliably.
Project Shape
Keep the API thin and push logic into a service layer. The route handler should orchestrate, not implement.
```
app/
├── main.py          # FastAPI app + routes
├── schemas.py       # Pydantic request/response models
├── services/
│   └── chat.py      # LLM calls, streaming, tools
└── core/
    └── config.py    # settings via pydantic-settings
```

Typed Contracts
Pydantic models are your API contract. Define them once; FastAPI validates requests and generates docs for free.
```python
from pydantic import BaseModel, Field


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    messages: list[Message]
    stream: bool = True
    max_tokens: int = Field(default=1024, le=4096)
```

Streaming with Server-Sent Events
The single biggest UX win for LLM apps is streaming. Don't make users stare at a spinner for ten seconds; stream tokens as they arrive. FastAPI's StreamingResponse plus SSE is the clean way.
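```python
import json

from anthropic import AsyncAnthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

from app.schemas import ChatRequest

app = FastAPI()
client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment


async def token_stream(req: ChatRequest):
    # Open a streaming completion and forward each text delta as one SSE frame.
    async with client.messages.stream(
        model="claude-opus-4-8",
        max_tokens=req.max_tokens,
        messages=[m.model_dump() for m in req.messages],
    ) as stream:
        async for text in stream.text_stream:
            # JSON-encode so newlines in the text cannot break the SSE framing.
            yield f"data: {json.dumps({'delta': text})}\n\n"
    # Sentinel telling the client the stream is complete.
    yield "data: [DONE]\n\n"


@app.post("/chat")
async def chat(req: ChatRequest):
    return StreamingResponse(
        token_stream(req),
        media_type="text/event-stream",
    )
```

On the frontend, read this with a fetch stream reader and append each delta. The browser's EventSource API only issues GET requests, so it cannot call this POST endpoint directly. Perceived latency drops from the full generation time to the time to first token.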
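To make the wire format concrete, here is a minimal sketch of how a consumer might decode the frames that token_stream emits. It is written in Python to match the rest of the article; a browser client would apply the same logic to the chunks it gets from a fetch reader. The function name iter_deltas is hypothetical, and the sketch assumes each frame is a single `data:` line, as in the server code above.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from SSE lines shaped like the token_stream output."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # sentinel: the server has finished
        yield json.loads(payload)["delta"]


frames = 'data: {"delta": "Hel"}\n\ndata: {"delta": "lo"}\n\ndata: [DONE]\n\n'
print("".join(iter_deltas(frames.splitlines())))  # prints "Hello"
```

Because the server JSON-encodes every delta, a newline inside the generated text arrives escaped and never splits a frame.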
Timeouts and Cancellation
LLM calls hang. Networks fail. Without a timeout, a stuck request ties up a worker indefinitely.
```python
import asyncio

from fastapi import HTTPException


async def with_timeout(coro, seconds: float = 60.0):
    # Cancel the upstream call and answer 504 if it exceeds the deadline.
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=504, detail="Upstream model timed out")
```

When the client disconnects mid-stream, stop generating. Check await request.is_disconnected() in your generator and break; otherwise you keep paying for tokens nobody will read.
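One way to combine both ideas for a streaming endpoint is a small wrapper around the token generator. This is a sketch under assumptions, not a definitive implementation: guarded_stream, idle_timeout, and the `error` event name are hypothetical. The wrapper takes any async callable for the disconnect check, so in FastAPI you could pass Starlette's request.is_disconnected. Since the 200 status has already been sent once streaming begins, a timeout here is reported as an SSE event rather than a 504.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def guarded_stream(
    source: AsyncIterator[str],
    is_disconnected: Callable[[], Awaitable[bool]],
    idle_timeout: float = 30.0,  # hypothetical default: max seconds between chunks
) -> AsyncIterator[str]:
    iterator = source.__aiter__()
    try:
        while True:
            # Checked between chunks only; a hung upstream is caught by the timeout.
            if await is_disconnected():
                break
            try:
                chunk = await asyncio.wait_for(iterator.__anext__(), idle_timeout)
            except StopAsyncIteration:
                break  # upstream finished normally
            except asyncio.TimeoutError:
                # Headers are already sent, so signal the failure in-band.
                yield 'event: error\ndata: {"detail": "Upstream model timed out"}\n\n'
                break
            yield chunk
    finally:
        # Close the upstream generator so its own cleanup (e.g. async with) runs.
        aclose = getattr(iterator, "aclose", None)
        if aclose is not None:
            await aclose()
```

In the route this might look like `StreamingResponse(guarded_stream(token_stream(req), request.is_disconnected), media_type="text/event-stream")`, with `request: Request` added to the handler's parameters.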
Error Handling That Doesn't Leak
Map upstream failures to clean HTTP responses. Never surface raw provider errors or stack traces to clients.
```python
from anthropic import APIStatusError, RateLimitError
from fastapi import HTTPException


def to_http_error(err: Exception) -> HTTPException:
    # RateLimitError subclasses APIStatusError, so check it first.
    if isinstance(err, RateLimitError):
        return HTTPException(429, "Rate limited, please retry shortly.")
    if isinstance(err, APIStatusError):
        return HTTPException(502, "The model provider returned an error.")
    # Anything else gets a generic message; log the details server-side.
    return HTTPException(500, "Something went wrong.")
```

A Reliability Checklist
| Concern | Mechanism |
|---|---|
| Latency UX | SSE streaming |
| Stuck requests | asyncio.wait_for timeout |
| Wasted tokens | Client-disconnect detection |
| Abuse | Per-key rate limiting (e.g. slowapi) |
| Debugging | Structured request logging + request IDs |
| Secrets | Keys in env, never in the client |
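To illustrate the "Abuse" row, here is a minimal in-memory token-bucket sketch of per-key rate limiting. It is not slowapi's API; the class name TokenBucketLimiter and its parameters are hypothetical, and it assumes a single process. A multi-worker deployment would need a shared store instead.

```python
import time
from typing import Callable


class TokenBucketLimiter:
    """Allow bursts up to `capacity` requests per key, refilled at `rate` per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._state: dict[str, tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, key: str) -> bool:
        now = self.clock()
        tokens, last = self._state.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at the bucket size.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self._state[key] = (tokens, now)
            return False
        self._state[key] = (tokens - 1.0, now)
        return True
```

In a FastAPI dependency you might call `limiter.allow(api_key)` and raise `HTTPException(429, ...)` when it returns False.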
Get these six right and you have a backend that's pleasant to use and boring to operate, which, for infrastructure, is the highest compliment. Layer your agent or RAG logic on top in the service module, and the API surface stays clean no matter how clever the internals get.
Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.
