Observability for LLM Apps: What to Log, What to Alert On
Your 500s and latency graphs will look fine while the product quietly gives wrong answers. LLM observability is about catching the failures that do not throw.
The scary thing about an LLM app in production is that it can be completely broken while every dashboard is green. No exceptions, no 500s, latency normal, and the model is confidently handing users wrong answers. Traditional observability was built for software that fails loudly. LLM apps fail quietly. So you have to log and alert on different things, and that is what this post is about: what to actually capture, and which signals are worth an alert versus a dashboard you check.
Start with the trace, because aggregates lie
The single most useful thing you can capture is the full trace of one request. For a plain chat app that means the prompt, the response, the model and version, token counts, and latency. For an agent it means the whole tree: every model call, every tool call with its arguments and result, every retrieval step with what it returned. When something goes wrong, the trace is how you find out why, and no amount of aggregate metrics replaces being able to read exactly what happened on the one request that misbehaved.
A trace worth keeping has at least this:
```json
{
  "trace_id": "req_8f2a",
  "user_id": "u_123",
  "tenant_id": "acme",
  "model": "gpt-4.1-mini",
  "input": "...the actual prompt sent...",
  "output": "...the actual response...",
  "prompt_tokens": 1840,
  "completion_tokens": 220,
  "latency_ms": 1430,
  "time_to_first_token_ms": 410,
  "tool_calls": [
    {"name": "search_docs", "args": {"q": "..."}, "latency_ms": 220, "result_size": 8}
  ],
  "retrieved_chunk_ids": ["doc_4#2", "doc_9#0"],
  "error": null,
  "timestamp": "2026-06-24T09:12:03Z"
}
```

The two fields people leave out and later wish they had: the retrieved chunk IDs (so you can tell whether a bad answer was bad retrieval or bad generation) and the exact prompt after templating (so you can reproduce it). Log the real prompt, not the template.
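To make "log the real prompt, not the template" concrete, here is a minimal sketch. The template text, `build_trace`, and the `(chunk_id, text)` shape are all hypothetical, chosen only for illustration; the point is that the trace stores the rendered string and the chunk IDs that fed it.

```python
from string import Template

# Hypothetical prompt template; yours will differ.
PROMPT_TEMPLATE = Template(
    "Answer using only the context.\n\nContext:\n$context\n\nQuestion: $question"
)


def build_trace(question: str, chunks: list[tuple[str, str]]) -> dict:
    """chunks is a list of (chunk_id, text) pairs returned by retrieval."""
    # Render first, then log the rendered string: that is what the model saw.
    prompt = PROMPT_TEMPLATE.substitute(
        context="\n".join(text for _, text in chunks),
        question=question,
    )
    return {
        "input": prompt,
        "retrieved_chunk_ids": [chunk_id for chunk_id, _ in chunks],
    }


trace = build_trace("What is X?", [("doc_4#2", "X is a thing.")])
```

With both fields in the trace you can replay the exact request later and check whether the right chunks were in the context at all.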
Prompts and outputs contain user data, and sometimes secrets. Decide your redaction and retention policy before you turn on full-payload logging, not after. Scrub PII on the way in, and never let one tenant's content land in a log another tenant can read. Observability that leaks data is its own incident.
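As a rough sketch of scrubbing on the way in, something like the following can run before a trace is written. The two patterns (emails and an `sk-` style key) are assumptions for illustration only; regexes catch the obvious cases and are not a substitute for a real PII detection step.

```python
import re

# Hypothetical patterns for illustration; real PII detection needs more than regexes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
API_KEY_RE = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")


def redact(text: str) -> str:
    """Replace obvious emails and key-like strings before the text is logged."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = API_KEY_RE.sub("[SECRET]", text)
    return text


def redact_trace(trace: dict) -> dict:
    """Return a copy of the trace with its free-text fields scrubbed."""
    cleaned = dict(trace)
    for field in ("input", "output"):
        if isinstance(cleaned.get(field), str):
            cleaned[field] = redact(cleaned[field])
    return cleaned
```

Returning a copy rather than mutating in place keeps the unredacted payload out of the logging path while leaving the live request untouched.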
The metrics that matter
On top of traces, you want a handful of metrics rolled up over time. These split into three groups: cost, performance, and quality. The first two are easy and everyone tracks them. The third is the one that actually catches the silent failures, and it is the one most teams skip.
Cost. Input and output tokens per request, and the dollar cost derived from them. Track it per model, per feature, and per tenant or user. The reason to break it down: a single runaway agent or one abusive user can blow your bill, and an aggregate number hides that until the invoice arrives. Output tokens usually cost several times more than input, so watch them separately.
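A minimal sketch of the cost rollup, assuming traces shaped like the record shown earlier. The model name and the per-million-token prices are placeholders, not real rates; substitute your provider's current price sheet.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; substitute your provider's current rates.
PRICES_PER_MTOK = {
    "example-model": {"input": 0.40, "output": 1.60},
}


def request_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one request, pricing input and output tokens separately."""
    price = PRICES_PER_MTOK[model]
    total = prompt_tokens * price["input"] + completion_tokens * price["output"]
    return total / 1_000_000


def cost_by(traces: list[dict], key: str) -> dict[str, float]:
    """Sum request cost grouped by a trace field such as 'tenant_id' or 'model'."""
    totals: dict[str, float] = defaultdict(float)
    for t in traces:
        totals[t[key]] += request_cost_usd(
            t["model"], t["prompt_tokens"], t["completion_tokens"]
        )
    return dict(totals)
```

Calling `cost_by(traces, "tenant_id")` gives the per-tenant breakdown; the same function grouped by a feature field gives the per-feature one, provided you log that field.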
Performance. Latency, but as percentiles, never the average. The mean hides the tail, and the tail is what users feel. Track p50, p95, and p99, and track time to first token separately from total latency, because for a streaming UI the first token is what "feels" fast. Also track throughput and the rate of provider errors and rate-limit responses, which spike right before a bad day.
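To see how much the mean hides, here is a small sketch using nearest-rank percentiles on made-up latencies. Your metrics backend will compute percentiles for you; this only illustrates the gap.

```python
import math
import statistics


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up data: 90 fast requests and 10 very slow ones.
latencies_ms = [400] * 90 + [9000] * 10

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
```

On this data the mean is 1260 ms, the p50 is 400 ms, and the p95 is 9000 ms. The mean describes a request that nobody actually experienced, while the p95 shows the nine-second tail.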
Quality. This is the LLM-specific part. Run lightweight evaluators on a sample of production traffic and turn their scores into metrics: groundedness (is the answer supported by what was retrieved), a relevance or helpfulness judge, format validity (did structured output parse), and refusal or fallback rate. These are the numbers that move when the product degrades without throwing an error. I covered how to wire these up in Evaluating agents with LangSmith; the online-eval half of that is exactly what feeds your quality dashboard.
| Group | Metric | Why |
|---|---|---|
| Cost | tokens in/out, $ per request | Catch runaway spend before the invoice |
| Cost | cost per tenant/feature | Find the one user or feature blowing the budget |
| Performance | latency p50 / p95 / p99 | Averages hide the tail users feel |
| Performance | time to first token | What "fast" means for a streaming UI |
| Performance | provider error / 429 rate | Early warning of a bad day |
| Quality | groundedness score | Catches confident wrong answers |
| Quality | format-valid rate | Catches broken structured output |
| Quality | refusal / fallback rate | Catches over-cautious or failing flows |
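The two cheapest quality metrics, format validity and refusal rate, need no judge model at all. A minimal sketch follows, with the sampling rate, the refusal phrases, and the function names all being assumptions for illustration. Groundedness is left out because it needs an LLM judge or a similar evaluator.

```python
import json
import random

# Hypothetical refusal phrases; tune these to what your model actually says.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


def sample_traces(traces: list, rate: float, seed=None) -> list:
    """Keep roughly `rate` of the traffic for evaluation."""
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rate]


def is_valid_json(output) -> bool:
    """Format validity for a flow that is supposed to return JSON."""
    try:
        json.loads(output)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def is_refusal(output: str) -> bool:
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def fraction_true(flags) -> float:
    """Turn per-trace booleans into a rate you can chart or alert on."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0
```

For example, `fraction_true(is_valid_json(t["output"]) for t in sample_traces(traces, 0.1))` gives a format-valid rate over about a tenth of traffic.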
What to alert on, and what to just watch
This is where teams go wrong in both directions: alerting on everything until the alerts are noise, or alerting only on crashes and missing the quiet failures. The filter I use: alert on things that are both bad and actionable right now. Everything else is a dashboard you look at, not a page that wakes you.
Worth an alert:
- Error and rate-limit spikes. A jump in provider 5xx or 429s means the app is failing now. Page on it.
- Cost anomalies. Spend per hour crossing a threshold, or jumping well above the trailing baseline. This is how you catch a prompt-injection loop or a runaway agent before it costs four figures.
- Latency p95 breaching your SLO. Not a 100ms wobble, a sustained breach of the number you promised.
- Quality dropping. Groundedness or format-valid rate falling below a floor on your sampled traffic. This is the alert traditional monitoring cannot give you, and it is the one that catches the silent failure. It is worth the effort to build.
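Two of these rules are easy to get wrong in the noisy direction, so here is a sketch of the cost-anomaly and sustained-latency checks. The thresholds, the baseline multiple, and the window count are assumptions you would tune; in practice your alerting tool probably expresses the same logic in its own rule language.

```python
def cost_anomaly(hourly_spend_usd: list[float], hard_limit_usd: float,
                 baseline_multiple: float = 3.0) -> bool:
    """hourly_spend_usd is oldest first; the last entry is the current hour."""
    if not hourly_spend_usd:
        return False
    *history, current = hourly_spend_usd
    if current >= hard_limit_usd:
        return True  # absolute threshold crossed
    if not history:
        return False  # nothing to compare against yet
    baseline = sum(history) / len(history)
    return baseline > 0 and current > baseline_multiple * baseline


def sustained_breach(p95_by_window_ms: list[float], slo_ms: float,
                     min_windows: int = 3) -> bool:
    """Fire only when the last `min_windows` windows all exceed the SLO."""
    recent = p95_by_window_ms[-min_windows:]
    return len(recent) == min_windows and all(v > slo_ms for v in recent)
```

Requiring several consecutive windows is what separates a sustained breach from a wobble, and comparing against a trailing baseline lets the cost rule adapt as normal traffic grows.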
Watch on a dashboard, do not page:
- Token usage trends, model mix, per-feature cost breakdowns.
- Individual low-scoring traces (review them in a queue, do not get paged per request).
- Cache hit rates and retrieval stats.
Alerts that only fire on 500s and latency miss the failure mode that actually erodes trust: the app answering smoothly and wrongly. If you build exactly one LLM-specific alert, make it a quality-score floor on sampled production traffic. That is the one that earns its keep.
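A quality-score floor can be very small. This sketch assumes you already have per-trace scores between 0 and 1 (for example groundedness) for the sampled traffic in the current window; the floor and the minimum sample size are placeholders.

```python
def quality_floor_breached(scores: list[float], floor: float,
                           min_samples: int = 50) -> bool:
    """True when the mean score over the window falls below the floor."""
    if len(scores) < min_samples:
        return False  # too few samples to trust the average
    return sum(scores) / len(scores) < floor
```

The minimum sample size matters: without it, three unlucky traces at 3 a.m. will page you. A quiet window should stay silent rather than fire on thin data.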
A pragmatic setup
You do not need to build this from scratch. The landscape splits into three kinds of tools, and most teams use one from the first or second group:
- AI-native tracing and eval platforms (LangSmith, Langfuse, and similar) go deep on the LLM-specific parts: trace trees, token accounting, online evaluators, prompt versioning. This is usually where I start, because the quality metrics are the hard part and these tools give them to you.
- General APM (Datadog, New Relic) now has LLM observability modules that sit next to your existing infra metrics, which is handy if your ops team already lives there.
- AI gateways (Helicone, Portkey, and others) sit between your app and the model provider and give you cost tracking, caching, and routing almost for free, since every call already flows through them.
Whatever you pick, the principle is the same. Instrument once, at a layer every call passes through, so you are not sprinkling logging code everywhere and forgetting half of it.
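If you want a picture of what a boundary decorator does before reaching for a vendor SDK, here is a hand-rolled sketch. `traced` and `TRACE_SINK` are hypothetical names, and a real tool also captures token counts and nests child calls into a tree, which this does not.

```python
import functools
import time

TRACE_SINK: list[dict] = []  # stand-in for a real exporter or log pipeline


def traced(fn):
    """Record name, arguments, output, error, and latency for each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        record = {
            "name": fn.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "error": None,
        }
        try:
            result = fn(*args, **kwargs)
            record["output"] = repr(result)
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise  # never swallow the error; only observe it
        finally:
            record["latency_ms"] = (time.perf_counter() - start) * 1000
            TRACE_SINK.append(record)
    return wrapper
```

Because the record is appended in `finally`, failed calls are traced as well as successful ones, which is the whole point of instrumenting at one layer.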
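```python
# instrument at the boundary, not at every call site
# `observe` stands for the tracing decorator your tool provides;
# `retrieve` and `model` are your own retrieval function and model client
@observe()  # captures inputs, outputs, tokens, latency for everything inside
def handle_request(req):
    docs = retrieve(req.question)  # traced
    answer = model.invoke(...)  # traced, tokens + latency captured
    return answer
```

The short version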
Log the full trace of every request, including the real prompt and the retrieved chunk IDs, with a redaction policy decided up front. Roll up cost (tokens and dollars, broken down by tenant and feature), performance (latency percentiles and time to first token, never averages), and quality (groundedness, format validity, refusal rate from sampled online evals). Alert only on what is bad and actionable now: error spikes, cost anomalies, SLO breaches, and a quality-score floor. Put everything else on a dashboard.
The mindset shift from normal software: you are not just watching whether the app is up. You are watching whether it is right, and right does not throw an exception. Build the quality signal, because nothing else will tell you when the model quietly starts making things up. The cost side of this connects directly to Cutting LLM cost and latency without wrecking quality, and the quality side to Evaluating agents with LangSmith.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.