A FastAPI Backend for LLM Apps: Streaming, Tools, and Sanity

By Folarin Akinloye

Your frontend shouldn't talk to OpenAI or Anthropic directly. You need a server in the middle to hold API keys, enforce limits, stream tokens, run tools, and give you one place to add auth and logging. FastAPI is an excellent fit: async-first, typed, and fast. Here's a clean backend for an LLM product.

Note

This is the plumbing layer. It assumes you've already designed your prompts and tools; this article is about serving them reliably.

Project Shape#

Keep the API thin and push logic into a service layer. The route handler should orchestrate, not implement.

app/
├── main.py          # FastAPI app + routes
├── schemas.py       # Pydantic request/response models
├── services/
│   └── chat.py      # LLM calls, streaming, tools
└── core/
    └── config.py    # settings via pydantic-settings

Typed Contracts#

Pydantic models are your API contract. Define them once; FastAPI validates requests and generates docs for free.

from pydantic import BaseModel, Field
 
class Message(BaseModel):
    role: str
    content: str
 
class ChatRequest(BaseModel):
    messages: list[Message]
    stream: bool = True
    max_tokens: int = Field(default=1024, le=4096)
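To make "validates requests for free" concrete, here is a minimal sketch of how the contract rejects bad input before any model call happens. The tighter constraints (a Literal for role, non-empty messages, a lower bound on max_tokens) and the validation_problems helper are illustrative assumptions, not part of the models above. It assumes Pydantic v2.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class Message(BaseModel):
    # Assumption: only these two roles are accepted from clients.
    role: Literal["user", "assistant"]
    content: str = Field(min_length=1)


class ChatRequest(BaseModel):
    # Assumption: an empty conversation is treated as invalid.
    messages: list[Message] = Field(min_length=1)
    stream: bool = True
    max_tokens: int = Field(default=1024, ge=1, le=4096)


def validation_problems(payload: dict) -> list[str]:
    """Return the dotted path of every field that failed validation."""
    try:
        ChatRequest.model_validate(payload)
    except ValidationError as exc:
        return [".".join(str(part) for part in err["loc"]) for err in exc.errors()]
    return []
```

In a route, FastAPI runs the same validation automatically and answers with a 422 response, so the handler only ever sees well-formed requests.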

Streaming with Server-Sent Events#

The single biggest UX win for LLM apps is streaming. Don't make users stare at a spinner for ten seconds; stream tokens as they arrive. FastAPI's StreamingResponse plus SSE is the clean way.

import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from anthropic import AsyncAnthropic
from app.schemas import ChatRequest  # the models from Typed Contracts
 
app = FastAPI()
client = AsyncAnthropic()
 
async def token_stream(req: ChatRequest):
    async with client.messages.stream(
        model="claude-opus-4-8",
        max_tokens=req.max_tokens,
        messages=[m.model_dump() for m in req.messages],
    ) as stream:
        async for text in stream.text_stream:
            yield f"data: {json.dumps({'delta': text})}\n\n"
    yield "data: [DONE]\n\n"
 
@app.post("/chat")
async def chat(req: ChatRequest):
    return StreamingResponse(
        token_stream(req),
        media_type="text/event-stream",
    )

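On the frontend you read this with a fetch reader and append each delta. (EventSource only issues GET requests, so it can't call this POST endpoint as written.) The perceived latency drops from the full generation time to the time to first token, which is often well under a second.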
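The wire format is simple enough to parse by hand. As a rough sketch of what the consumer does, here is a small parser for the frames token_stream emits, written in Python to match the rest of the article (a browser client would do the same thing in JavaScript). The iter_deltas name is hypothetical, and it assumes one data line per frame, which holds here because json.dumps escapes newlines.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield each text delta from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank lines separate frames; anything that is not a data field is skipped.
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)["delta"]
```

Because the payload is JSON, a delta that contains a newline survives the trip intact instead of being mistaken for a frame boundary.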

Timeouts and Cancellation#

LLM calls hang. Networks fail. Without a timeout, a stuck request ties up a worker indefinitely.

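import asyncio
 
from fastapi import HTTPException
 
async def with_timeout(coro, seconds: float = 60.0):
    """Await `coro`, turning a timeout into a 504 for the client."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # wait_for has already cancelled `coro` by the time we get here.
        raise HTTPException(status_code=504, detail="Upstream model timed out")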
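with_timeout wraps a single awaitable, so it doesn't directly cover an async generator like token_stream. One possible approach for streams is an idle timeout: bound the wait for each chunk rather than the whole response. The with_idle_timeout helper below is a hypothetical sketch using only the standard library, not something the article's code already includes.

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


async def with_idle_timeout(source: AsyncIterator[T], seconds: float) -> AsyncIterator[T]:
    """Re-yield items from `source`, failing if any single item takes too long."""
    iterator = source.__aiter__()
    while True:
        try:
            # Each chunk gets its own deadline; a slow chunk raises asyncio.TimeoutError.
            item = await asyncio.wait_for(iterator.__anext__(), timeout=seconds)
        except StopAsyncIteration:
            return
        yield item
```

A long answer that keeps producing tokens is never cut off, while a stream that goes silent fails quickly. Once streaming has begun the status code is already sent, so the caller has to handle the timeout inside the stream rather than by raising an HTTPException.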

Tip

When the client disconnects mid-stream, stop generating. Check await request.is_disconnected() in your generator and break; otherwise you keep paying for tokens nobody will read.
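Here is a minimal sketch of that check. To keep it runnable without a web framework, the disconnect probe is passed in as a callable; in a real route you would pass request.is_disconnected from the Request object. The sse_until_disconnect name and its signature are assumptions for illustration.

```python
import json
from typing import AsyncIterator, Awaitable, Callable


async def sse_until_disconnect(
    chunks: AsyncIterator[str],
    is_disconnected: Callable[[], Awaitable[bool]],
) -> AsyncIterator[str]:
    """Frame text chunks as SSE, stopping early once the client is gone."""
    async for text in chunks:
        if await is_disconnected():
            break  # nobody is listening, so stop consuming the upstream
        yield f"data: {json.dumps({'delta': text})}\n\n"
    else:
        # Only send the sentinel when the stream finished normally.
        yield "data: [DONE]\n\n"
```

In token_stream the same check would sit inside the async for loop, so that breaking out also leaves the async with block around the provider stream.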

Error Handling That Doesn't Leak#

Map upstream failures to clean HTTP responses. Never surface raw provider errors or stack traces to clients.

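from anthropic import APIStatusError, RateLimitError
from fastapi import HTTPException
 
def to_http_error(err: Exception) -> HTTPException:
    """Translate an upstream failure into a response that is safe to show clients."""
    # Check RateLimitError first: it is a subclass of APIStatusError.
    if isinstance(err, RateLimitError):
        return HTTPException(429, "Rate limited, please retry shortly.")
    if isinstance(err, APIStatusError):
        return HTTPException(502, "The model provider returned an error.")
    # Anything else is a bug on our side; keep the message generic.
    return HTTPException(500, "Something went wrong.")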
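Status codes only help before the first byte is sent. Once a stream has started, the 200 is already on the wire, so a mid-stream failure has to be reported inside the stream. A rough sketch of one way to do that follows; the safe_sse name, the "error" event name, and the payload shape are all assumptions, and the frontend would need to handle that event.

```python
import json
import logging
from typing import AsyncIterator

logger = logging.getLogger("chat")


async def safe_sse(frames: AsyncIterator[str]) -> AsyncIterator[str]:
    """Pass SSE frames through, replacing any failure with a generic error frame."""
    try:
        async for frame in frames:
            yield frame
    except Exception:
        # The full traceback goes to the server log only.
        logger.exception("stream failed")
        payload = json.dumps({"error": "Something went wrong."})
        yield f"event: error\ndata: {payload}\n\n"
```

The client sees a fixed message, while the provider's raw error text stays in your logs.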

A Reliability Checklist#

Concern	Mechanism
Latency UX	SSE streaming
Stuck requests	`asyncio.wait_for` timeout
Wasted tokens	Client-disconnect detection
Abuse	Per-key rate limiting (e.g. slowapi)
Debugging	Structured request logging + request IDs
Secrets	Keys in env, never in the client
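The "structured request logging + request IDs" row is the one item above without code elsewhere in the article, so here is a minimal standard-library sketch. The names (request_id_var, JsonFormatter, bind_request_id, build_logger) and the logger name are hypothetical, and the log fields are just one reasonable choice.

```python
import contextvars
import json
import logging
import uuid
from typing import Optional

# ID of the request currently being handled; "-" outside of a request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object tagged with the request ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        })


def bind_request_id(incoming: Optional[str] = None) -> str:
    """Reuse an ID supplied by the caller, or mint a new one, and store it."""
    request_id = incoming or uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("chat.requests")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

In the app you would call bind_request_id once per request, for example from an HTTP middleware, and echo the ID back in a response header so a user report can be matched to its log lines.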

Get these six right and you have a backend that's pleasant to use and boring to operate, which, for infrastructure, is the highest compliment. Layer your agent or RAG logic on top in the service module, and the API surface stays clean no matter how clever the internals get.
