Cutting LLM Cost and Latency Without Wrecking Quality
Measure first, then reach for caching, routing, smaller models, and the structural fixes that actually move the numbers
Every LLM app starts cheap and fast because nobody is using it. Then traffic arrives, the bill climbs, p95 latency creeps past two seconds, and someone asks why. The good news is that most LLM cost and latency comes from a few specific places, and the fixes are well understood. The trap is optimizing the wrong one and quietly tanking quality to save a few cents. This is the playbook: measure first, then pull the levers that pay off.
Measure before you touch anything
You cannot optimize what you have not split apart. Before any change, get four numbers per request: input tokens, output tokens, time to first token (TTFT), and total time. Those tell you almost everything.
Cost is dominated by tokens, and output tokens usually cost several times more than input tokens, so a chatty model is expensive twice over: you pay more per output token and you wait longer for them. Latency splits into time to first token (mostly prompt processing and queueing) and the per-token generation rate after that. A long prompt hurts TTFT. A long answer hurts total time. Knowing which one is your problem decides which lever to pull.
Log token counts and timings on every request from day one. When the bill spikes, you want to query "which route, which model, what prompt size" in seconds, not add logging after the fact and wait a week for data.
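One way to capture these numbers is sketched below. `RequestMetrics`, `timed_stream`, and `log_line` are hypothetical names, and the stream is modeled as any iterable of text chunks rather than a specific SDK object. Token counts would come from whatever usage data your provider returns.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class RequestMetrics:
    route: str
    model: str
    input_tokens: int
    output_tokens: int
    ttft_s: float   # seconds until the first chunk arrived
    total_s: float  # seconds until the last chunk arrived


def timed_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, float, float]:
    """Consume a stream of text chunks and return (text, ttft_s, total_s)."""
    start = clock()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = clock()  # the first chunk marks time to first token
        parts.append(chunk)
    end = clock()
    ttft = (first if first is not None else end) - start
    return "".join(parts), ttft, end - start


def log_line(metrics: RequestMetrics) -> str:
    """Serialize one request as a JSON line that is easy to query later."""
    return json.dumps(asdict(metrics))
```

Writing one JSON line per request keeps the "which route, which model, what prompt size" query a simple filter over logs.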
Lever 1: prompt caching (the easiest win)
If you send the same large prefix on many requests (a system prompt, a long set of instructions, a few-shot block, or retrieved context that repeats), prompt caching is close to free money. The provider stores the processed prefix and charges a fraction of the normal input price to reuse it.
The economics are real. Anthropic charges cache reads at 0.1x the standard input rate, a 90% discount, with cache writes at 1.25x for a 5-minute time-to-live (TTL) or 2x for a 1-hour TTL. OpenAI does it automatically for prompts of 1,024 tokens or more, with around a 50% discount and no code changes. So the same technique saves more on Anthropic but takes a line of config, and saves less on OpenAI but happens whether you ask or not.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Anthropic: mark the stable prefix as cacheable.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # the same on every call
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": question}],
)
```

The catch worth knowing: the cache matches on an exact prefix. Put your stable content first and the volatile content (the user's question, the timestamp) last, or you bust the cache on every request and pay the write premium for nothing. I go deeper on when this pays off in prompt caching for LLM apps.
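To see where the write premium pays for itself, here is a small worked calculation using the Anthropic multipliers quoted above. It is a sketch, not a billing model: `prefix_cost` is a hypothetical helper, it counts only the shared prefix, and it assumes every request after the first arrives inside the TTL and hits the cache.

```python
CACHE_WRITE_MULT = 1.25  # 5-minute TTL write premium, as quoted in the text
CACHE_READ_MULT = 0.10   # cache read rate, as quoted in the text


def prefix_cost(prefix_tokens: int, requests: int,
                price_per_mtok: float, cached: bool) -> float:
    """Dollar cost of sending the same prefix `requests` times in one TTL window."""
    base = prefix_tokens / 1_000_000 * price_per_mtok
    if not cached or requests == 0:
        return base * requests
    # The first request writes the cache; the remaining ones read from it.
    return base * (CACHE_WRITE_MULT + CACHE_READ_MULT * (requests - 1))


# Assumed workload: a 10,000-token prefix at $3 per million input tokens.
for n in (1, 2, 100):
    plain = prefix_cost(10_000, n, 3.0, cached=False)
    cached = prefix_cost(10_000, n, 3.0, cached=True)
    print(f"{n:>3} requests: uncached ${plain:.4f}, cached ${cached:.4f}")
```

Under these assumptions a single request costs more with caching (1.25x versus 1x), and the second request already puts you ahead (1.35x versus 2x), which is why a prefix that is never reused should not be marked cacheable.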
Lever 2: route to the right model
Most apps send every request to one big model, and most requests do not need it. Classifying intent, extracting a field, or answering a simple FAQ does not require your most expensive model. Routing easy work to a smaller, cheaper, faster model and reserving the big one for genuinely hard requests is often the single largest cost cut available.

```python
def choose_model(task: str) -> str:
    if task in {"classify", "extract", "route", "summarize_short"}:
        return "claude-haiku-4-5"  # cheap and fast
    return "claude-opus-4-8"  # save the big one for hard work
```

A rough sense of the spread at 2026 list prices: Haiku 4.5 is $1 / $5 per million input/output tokens, Sonnet 4.6 is $3 / $15, and Opus 4.8 is $5 / $25. Moving the easy half of your traffic from Opus to Haiku is not a rounding error, it is a multiple. The discipline is to actually measure quality on the cheaper model for each task, not to assume it is fine.
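A quick way to put a number on that is to compute the blended bill for a given traffic split. The sketch below uses the Haiku and Opus list prices quoted above; `request_cost` and `blended_cost` are hypothetical helpers, and the workload (one million requests, 1,000 input and 200 output tokens each) is an invented example.

```python
# (input, output) dollars per million tokens, copied from the prices quoted in the text.
PRICES = {
    "claude-haiku-4-5": (1.0, 5.0),
    "claude-opus-4-8": (5.0, 25.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request on the given model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def blended_cost(requests: int, easy_share: float,
                 input_tokens: int, output_tokens: int,
                 cheap: str = "claude-haiku-4-5",
                 big: str = "claude-opus-4-8") -> float:
    """Total cost when `easy_share` of traffic goes to the cheap model."""
    easy = requests * easy_share
    hard = requests - easy
    return (easy * request_cost(cheap, input_tokens, output_tokens)
            + hard * request_cost(big, input_tokens, output_tokens))


all_big = blended_cost(1_000_000, 0.0, 1_000, 200)
half_routed = blended_cost(1_000_000, 0.5, 1_000, 200)
print(f"all Opus: ${all_big:,.0f}  half routed to Haiku: ${half_routed:,.0f}")
```

For this assumed workload the bill falls from roughly $10,000 to roughly $6,000, because the routed half costs a fifth of what it did before.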
Lever 3: stop generating so many tokens
Output tokens are the expensive, slow ones, so the cheapest token is the one you do not generate. Two habits help. Cap max_tokens to what the task really needs instead of leaving it wide open. And ask for structured, terse output when you do not need prose: if your code only consumes three fields, request three fields, not a paragraph that wraps them.

```python
# If you only need a label, do not let the model write an essay.
resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=20,
    messages=[{"role": "user", "content": f"Classify sentiment: {text}"}],
)
```

This also cuts latency directly, because total time scales with how many tokens come back.
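A back-of-the-envelope model makes the relationship concrete: total time is roughly TTFT plus output tokens divided by the generation rate. The function below is only that approximation, and the 0.4 s TTFT and 50 tokens per second are made-up figures, so substitute your own measurements.

```python
def estimate_total_seconds(ttft_s: float, output_tokens: int,
                           tokens_per_s: float) -> float:
    """Approximate end-to-end latency: time to first token plus generation time."""
    return ttft_s + output_tokens / tokens_per_s


# Assumed figures for illustration only.
essay = estimate_total_seconds(ttft_s=0.4, output_tokens=500, tokens_per_s=50)
label = estimate_total_seconds(ttft_s=0.4, output_tokens=20, tokens_per_s=50)
print(f"500-token answer: {essay:.1f}s, 20-token label: {label:.1f}s")
```

With those assumed figures the 500-token answer takes about 10.4 s and the 20-token label about 0.8 s, so trimming output moves total time far more than shaving TTFT would.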
Lever 4: semantic caching for repeated questions
If users ask the same things in different words ("how do I reset my password" and "I forgot my password"), a semantic cache can skip the model entirely. You embed the query, look for a near-duplicate in a vector store, and if you find one above a similarity threshold, return the cached answer.
It is powerful and it is sharp. Set the threshold too low and you serve a stale or wrong answer to a question that only looked similar. Use it for stable, factual, high-volume queries (support FAQs), and keep it well away from anything personalized or time-sensitive. The embeddings background is in embeddings explained for engineers.
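The lookup itself can be sketched in a few lines. This is a minimal in-memory version, not a production design: `SemanticCache` is a hypothetical class, the `embed` callable stands in for whatever embedding model you use, the linear scan stands in for a vector store, and the 0.92 default threshold is an arbitrary placeholder that you would tune against real queries.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92):
        self.embed = embed          # assumed: maps text to an embedding vector
        self.threshold = threshold  # assumed value; tune on real traffic
        self.entries: List[Tuple[Vector, str]] = []

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        """Return the closest cached answer, or None if nothing clears the threshold."""
        q = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

On a miss, `get` returns `None` and the caller falls through to the model, then stores the fresh answer with `put`. Raising the threshold trades hit rate for safety, which is the right direction to err in.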
Lever 5: parallelize the work, not the tokens
A lot of perceived latency is structural, not per-token. If your agent makes three independent tool calls in sequence, you wait for all three back to back. Fire them concurrently and you wait for the slowest one:

```python
import asyncio


async def gather_context(query: str):
    docs, profile, history = await asyncio.gather(
        search_docs(query),
        fetch_user_profile(),
        fetch_recent_history(),
    )
    return docs, profile, history
```

Same idea at the retrieval layer: a reranker lets you retrieve broadly and then trim to a small, high-quality context, which both improves answers and cuts the tokens you feed the model. I covered that in reranking in RAG. And streaming, while it does not lower true latency, makes the wait feel far shorter by showing tokens as they arrive, which I walk through in streaming LLM responses end to end.
Lever 6: batch the offline work
If the work is not interactive (overnight enrichment, bulk classification, evals), use the provider's batch API. Anthropic and OpenAI both offer roughly 50% off for batched jobs that can tolerate a delayed return. There is no reason to pay real-time prices for work no human is waiting on.
Do not wreck quality chasing the number
Every lever here can hurt quality if you over-apply it: too-aggressive routing sends hard questions to a weak model, a loose semantic cache serves wrong answers, and a tight max_tokens truncates a real answer. So the rule that keeps you honest is to keep a small evaluation set and re-run it after every optimization. If cost drops 40% and your eval score drops 1%, ship it. If the eval score drops 15%, you did not optimize, you degraded. The point is cheaper and faster at the same quality, and the only way to know is to measure both.
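That rule can be turned into a simple gate that runs after each optimization. The sketch below assumes you already have a cost figure and an eval score before and after the change; `should_ship` is a hypothetical helper, and the 2% tolerance is an assumption rather than a recommendation, so pick a value your product can live with.

```python
def should_ship(cost_before: float, cost_after: float,
                eval_before: float, eval_after: float,
                max_eval_drop: float = 0.02) -> bool:
    """Accept a change only if it saves money and the eval score roughly holds."""
    cost_drop = (cost_before - cost_after) / cost_before
    eval_drop = (eval_before - eval_after) / eval_before
    return cost_drop > 0 and eval_drop <= max_eval_drop


# The two cases from the paragraph above: cost down 40% in both.
print(should_ship(100.0, 60.0, eval_before=0.90, eval_after=0.891))  # eval down 1%
print(should_ship(100.0, 60.0, eval_before=0.90, eval_after=0.765))  # eval down 15%
```

The first call accepts the change and the second rejects it, matching the "ship it" and "you degraded" cases described above.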
| Lever | Cuts cost | Cuts latency | Risk if overdone |
|---|---|---|---|
| Prompt caching | Yes | Yes (TTFT) | Low |
| Model routing | Yes | Yes | Weak model on hard tasks |
| Fewer output tokens | Yes | Yes | Truncated answers |
| Semantic cache | Yes | Yes | Wrong cached answer |
| Parallelism | No | Yes | Complexity |
| Batch API | Yes | No (offline) | Not for interactive use |
Start with caching and routing, because they are the highest return for the least risk. Measure quality after each change. The teams that keep LLM bills sane are not the ones with a clever trick, they are the ones who measure relentlessly and never trade away the quality they cannot see.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.