Caching Agent Tool Calls (Not Just Prompts), Folarin Akinloye

Prompt caching gets all the attention, and it is genuinely good: cache the static front of your prompt and you cut input token cost and time to first token. I wrote about it in Prompt caching for LLM apps. But for an agent, the prompt is rarely the slow part. The slow part is the tool call: the web search that takes two seconds, the database query that scans a million rows, the third-party API that rate-limits you and charges per request. Caching those results is where the real latency and cost wins are, and almost nobody does it, because it is harder than caching prompts.

The reason it is harder: a tool can have side effects, and its result can go stale. Cache get_account_balance for an hour and you will show someone the wrong number. Cache send_email at all and you will either send twice or not send when you should. So this is not "wrap everything in a cache". It is "decide, per tool, what is safe to cache and for how long".

First, sort your tools into three buckets

Before writing any caching code, classify every tool the agent can call. This classification decides everything.

Pure and stable. Same input, same output, and the output does not change for a long time. Unit conversions, parsing, static reference lookups, embedding a fixed string. These are safe to cache aggressively, even forever.

Read-only but time-sensitive. No side effects, but the answer drifts. A stock price, a weather lookup, a search over a corpus that gets updated, today's calendar. Cacheable, but with a TTL that matches how fast the data moves. A weather forecast can be cached for an hour; a stock price for seconds.

Stateful or side-effecting. The call changes something or depends on changing state: writes, payments, sending messages, anything that mutates the world. Do not cache the result. You can sometimes cache around them with idempotency keys (more below), but you never serve a cached "success" for a write that did not happen.

TOOL_CACHE_POLICY = {
    "convert_units":      {"cache": True,  "ttl": None},      # pure, forever
    "lookup_zip_code":    {"cache": True,  "ttl": 86400},     # stable-ish, a day
    "search_docs":        {"cache": True,  "ttl": 600},       # drifts, 10 min
    "get_weather":        {"cache": True,  "ttl": 3600},      # drifts, an hour
    "get_account_balance":{"cache": False, "ttl": 0},         # too sensitive
    "send_email":         {"cache": False, "ttl": 0},         # side effect, never
}

Important

The default for a tool you have not classified is "do not cache". Caching is opt-in per tool. The cost of a cache miss is a slow call. The cost of wrongly caching a stateful tool is a correctness bug that is very hard to reproduce.

Exact-match caching: the workhorse

For most read tools, an exact-match cache on the arguments gets you 80 percent of the benefit for very little code. The key is the tool name plus its normalized arguments.

import json, hashlib, time
 
_cache: dict[str, tuple[float, object]] = {}
 
def cache_key(tool_name: str, args: dict) -> str:
    # sort_keys is not optional: {"a":1,"b":2} and {"b":2,"a":1}
    # must produce the SAME key, or you never get a hit.
    normalized = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return f"{tool_name}:{hashlib.sha256(normalized.encode()).hexdigest()}"
 
def cached_call(tool_name: str, args: dict, fn):
    policy = TOOL_CACHE_POLICY.get(tool_name, {"cache": False})
    if not policy["cache"]:
        return fn(**args)
 
    key = cache_key(tool_name, args)
    now = time.time()
    if key in _cache:
        expires, value = _cache[key]
        if policy["ttl"] is None or now < expires:
            return value   # hit
 
    value = fn(**args)     # miss
    ttl = policy["ttl"]
    _cache[key] = (float("inf") if ttl is None else now + ttl, value)
    return value

The one detail that quietly breaks this: argument ordering. If you serialize args without sorting keys, the same logical call produces different cache keys depending on dict order, and your hit rate craters. The same goes for the tool schema you send to the model: non-deterministic ordering there can break prompt caching upstream. Sort everything that gets hashed.
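To make the ordering point concrete, here is a minimal sketch of key normalization. The `normalize_args` helper is hypothetical and goes a little beyond sorting: it also collapses whitespace and drops `None`-valued arguments, which is only safe if the tool treats an omitted argument and an explicit `None` the same way.

```python
import hashlib
import json


def normalize_args(args: dict) -> dict:
    """Hypothetical normalizer: tidy strings and drop None-valued optional args."""
    cleaned = {}
    for name, value in args.items():
        if value is None:
            continue  # assumption: omitted and explicit None mean the same thing
        if isinstance(value, str):
            value = " ".join(value.split())  # trim and collapse runs of whitespace
        cleaned[name] = value
    return cleaned


def cache_key(tool_name: str, args: dict) -> str:
    # sort_keys makes the serialized form independent of dict insertion order.
    payload = json.dumps(normalize_args(args), sort_keys=True, separators=(",", ":"))
    return f"{tool_name}:{hashlib.sha256(payload.encode()).hexdigest()}"


# Same logical call, different ordering and spacing: one key.
a = cache_key("search_docs", {"query": "reset  password ", "limit": 5})
b = cache_key("search_docs", {"limit": 5, "query": "reset password", "lang": None})

# Without sort_keys the serialized forms differ, so the hashes would too.
unsorted_a = json.dumps({"query": "x", "limit": 5})
unsorted_b = json.dumps({"limit": 5, "query": "x"})
```

Here `a` and `b` are equal while `unsorted_a` and `unsorted_b` are not. How aggressive the normalization should be is a per-tool decision: lowercasing a search query may be fine, lowercasing a case-sensitive ID is not.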

In production, swap the in-process dict for Redis so the cache is shared across workers and survives restarts. The logic is identical; only the storage is remote, which means cached values have to be serializable.
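A rough sketch of what that swap could look like. It is written against any client that exposes `get(name)` and `set(name, value, ex=seconds)`, which is the shape of the redis-py `Redis` client; the `InMemoryStore` class is a hypothetical stand-in so the example runs without a server, and it assumes tool results are JSON-serializable.

```python
import json
import time
from typing import Any, Callable, Optional


class InMemoryStore:
    """Hypothetical stand-in that mimics the two client calls used below."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, name: str) -> Optional[bytes]:
        entry = self._data.get(name)
        if entry is None:
            return None
        expires, raw = entry
        if expires is not None and time.time() >= expires:
            del self._data[name]  # expired: behave as if the key is gone
            return None
        return raw

    def set(self, name: str, value: str, ex: Optional[int] = None) -> bool:
        expires = None if ex is None else time.time() + ex
        self._data[name] = (expires, value.encode())
        return True


def cached_call_remote(store: Any, key: str, ttl: Optional[int],
                       fn: Callable[[], Any]) -> Any:
    raw = store.get(key)
    if raw is not None:
        return json.loads(raw)  # hit: the store handles expiry for us
    value = fn()  # miss
    store.set(key, json.dumps(value), ex=ttl)  # ex=None means no expiry
    return value
```

With a real deployment you would pass a `redis.Redis(...)` instance as `store`. Letting the store expire keys means the TTL check disappears from your own code, and expired entries no longer pile up in memory.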

Semantic caching: for when the inputs are fuzzy

Exact-match caching fails when the "same" call is worded differently. search_docs("how do I reset my password") and search_docs("password reset steps") are the same intent and almost certainly want the same results, but they hash to different keys. Semantic caching fixes this: embed the input, and on a new call do a vector similarity search against past inputs. If something is close enough (above a similarity threshold you set), return its cached result.

def semantic_cached_search(query: str, threshold: float = 0.95):
    q_emb = embed(query)
    hit = cache_index.query(vector=q_emb, top_k=1)
    if hit and hit[0]["score"] >= threshold:
        return hit[0]["metadata"]["result"]   # close enough, reuse
    result = search_docs(query)
    cache_index.upsert(vector=q_emb, metadata={"result": result, "query": query})
    return result

Semantic caching is powerful and dangerous for the same reason: it treats different inputs as the same call. Set the threshold too low and you serve the result for a different question, which is worse than a cache miss because it is silently wrong. I keep the threshold high (0.95+ for cosine similarity) and only use semantic caching on genuinely read-only tools where a near-miss is tolerable. It does not belong anywhere near a stateful tool.

Warning

Semantic caching assumes a stateless input-to-output mapping. It does not apply to tool calls that depend on or change state. "Close enough" is fine for retrieval; it is a bug for anything that mutates the world or reads fast-moving data.
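For readers who want to see the threshold check without a vector database, here is a self-contained sketch. The `SemanticCache` class and the `toy_embed` function are illustrative assumptions: a real setup would use an actual embedding model and a vector index rather than a linear scan over a list.

```python
import math
from typing import Callable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Sequence[float]],
                 threshold: float = 0.95) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: list = []  # (embedding, original query, result)

    def lookup(self, query: str) -> Optional[tuple]:
        """Return (matched_query, result) for the nearest entry, or None."""
        if not self.entries:
            return None
        q = self.embed(query)
        score, matched_query, result = max(
            ((cosine(q, emb), text, res) for emb, text, res in self.entries),
            key=lambda item: item[0],
        )
        return (matched_query, result) if score >= self.threshold else None

    def store(self, query: str, result: object) -> None:
        self.entries.append((self.embed(query), query, result))


# Toy embedding for demonstration only: word counts over a tiny vocabulary.
VOCAB = ["password", "reset", "invoice", "refund"]


def toy_embed(text: str) -> list:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]
```

Returning the matched query alongside the result is a deliberate choice: logging which stored query served a hit is the easiest way to audit whether the threshold is letting through answers to the wrong question.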

Stateful tools: idempotency, not result caching

You cannot cache the result of a write, but you can stop a retried write from happening twice. That is idempotency, and it is the right tool for the side-effecting bucket. Give each logical operation a key, and have the tool (or the service behind it) refuse to perform the same operation twice for the same key.

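# `tool`, `invoices`, and `billing_api` stand in for your agent framework's
# decorator, a durable key -> invoice store, and your billing client.
@tool
def create_invoice(customer_id: str, amount: int, idempotency_key: str) -> dict:
    """Create an invoice. Safe to retry with the same idempotency_key."""
    if existing := invoices.get_by_key(idempotency_key):
        return existing            # already done, return the original result
    # The key is also passed downstream so that, assuming the billing service
    # honors it, a crash between create() and save() cannot double-bill on retry.
    invoice = billing_api.create(customer_id, amount, idempotency_key)
    invoices.save(idempotency_key, invoice)
    return invoice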

This matters for agents specifically because agents retry. A network blip, a re-planned step, a resumed run after human approval, all of these can replay a tool call. Without idempotency, "the agent retried the step" becomes "the customer was charged twice". The idempotency key turns a dangerous replay into a safe no-op.
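The key only helps if a retry produces the same key as the original attempt. One way to get that, sketched here under the assumption that your agent runtime exposes a stable run ID and step ID (both names are hypothetical), is to derive the key from the logical operation rather than generating a fresh random value per attempt.

```python
import hashlib
import json
from typing import Any, Callable


def idempotency_key(run_id: str, step_id: str, tool_name: str, args: dict) -> str:
    """Derive a key from what the operation is, not from when it was attempted."""
    payload = json.dumps(
        {"run": run_id, "step": step_id, "tool": tool_name, "args": args},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def run_once(ledger: dict, key: str, operation: Callable[[], Any]) -> Any:
    """Execute operation at most once per key; replays return the first result.

    The dict is a stand-in for a durable store; an in-process dict would not
    survive a restart or protect against concurrent workers.
    """
    if key in ledger:
        return ledger[key]
    ledger[key] = operation()
    return ledger[key]
```

Including the arguments in the key means a re-planned step with a different amount counts as a new operation, while a plain retry of the same step does not. Whether that is the right boundary depends on how your planner reuses step IDs.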

Invalidation: the hard half

Caching is easy until something changes. If search_docs is cached for ten minutes and a user uploads a new doc, they will not see it for ten minutes, which can look like a bug. Two ways to handle it:

TTL only: accept that data can be up to the TTL stale, and pick the TTL to make that acceptable. Simplest, and fine for most read tools.
Event-based invalidation: when the underlying data changes, delete the affected cache keys. More correct, more work. Worth it when staleness is user-visible and annoying.
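Event-based invalidation needs some way to find "the affected cache keys" when data changes. A common approach, sketched here with a hypothetical `TaggedCache` class, is to tag each entry with the data it depends on and delete by tag when a change event arrives.

```python
from collections import defaultdict
from typing import Any, Iterable


class TaggedCache:
    """In-memory sketch: each entry records which data sources it depends on."""

    def __init__(self) -> None:
        self._values: dict = {}
        self._keys_by_tag: defaultdict = defaultdict(set)

    def put(self, key: str, value: Any, tags: Iterable[str]) -> None:
        self._values[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key: str, default: Any = None) -> Any:
        return self._values.get(key, default)

    def invalidate(self, tag: str) -> int:
        """Drop every entry tagged with `tag`; return how many keys were tagged."""
        keys = self._keys_by_tag.pop(tag, set())
        for key in keys:
            self._values.pop(key, None)
        return len(keys)


cache = TaggedCache()
cache.put("search_docs:abc", ["doc-1"], tags=["corpus:help-center"])
cache.put("lookup_zip_code:xyz", {"city": "Lagos"}, tags=["zip-table"])

# A document upload handler would call this when the corpus changes.
removed = cache.invalidate("corpus:help-center")
```

After the call, the search entry is gone and the zip lookup is untouched. The tag names are made up; the hard part in practice is choosing tags coarse enough to be cheap and fine enough not to flush everything on every write.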

For agents with long, multi-step sessions, there is a subtler version: a write partway through the session can invalidate a result that was cached earlier in the same session. If step 2 writes a record that step 5's read would return, step 5 must not be served a value cached before the write. The practical fix is to scope cache lifetime to the operation, and bust the relevant keys when a write in the same session touches that data.
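One way to implement the in-session bust without tracking individual keys is to put a per-resource version number in the cache key and bump it on every write, so older entries simply become unreachable. This is a sketch under that assumption; the `SessionCache` class and the `resource` labels are hypothetical, and one instance is meant to live for one session.

```python
import hashlib
import json
from collections import defaultdict
from typing import Any, Callable


class SessionCache:
    def __init__(self) -> None:
        self._versions: defaultdict = defaultdict(int)  # resource -> write count
        self._values: dict = {}

    def _key(self, tool_name: str, args: dict, resource: str) -> str:
        payload = json.dumps(args, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(payload.encode()).hexdigest()
        return f"{tool_name}:{resource}@v{self._versions[resource]}:{digest}"

    def read(self, tool_name: str, args: dict, resource: str,
             fn: Callable[..., Any]) -> Any:
        key = self._key(tool_name, args, resource)
        if key not in self._values:
            self._values[key] = fn(**args)
        return self._values[key]

    def write(self, resource: str, fn: Callable[..., Any], **args: Any) -> Any:
        try:
            return fn(**args)  # never cached
        finally:
            # Bump even if the write raised: it may have partially applied.
            self._versions[resource] += 1


records: list = []


def list_records() -> list:
    return list(records)


def add_record(name: str) -> str:
    records.append(name)
    return name
```

Stale entries stay in memory until the session object is discarded, which is acceptable for a session-scoped cache but would need eviction in a long-lived one.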

Does it actually pay off?

Caching has a cost: every miss is now slightly slower (you check the cache, then write to it), and you are running infrastructure to hold it. The win only shows up when reads repeat. So measure it. Track hit rate per tool, and the latency and cost saved.

# the only numbers that matter
hit_rate = hits / (hits + misses)
cost_saved = hits * avg_tool_cost
latency_saved_p50 = cache_miss_p50 - cache_hit_p50

If a tool's hit rate is near zero, caching it is pure overhead; turn it off. The tools worth caching are the ones called repeatedly with the same or similar arguments: shared reference lookups, popular searches, anything fronting a slow or metered API. For a fleet of users asking overlapping questions, a shared semantic cache on retrieval can cut both your bill and your p95 noticeably.
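A minimal sketch of per-tool tracking for those numbers, kept in process memory. The `ToolStats` class and `record` function are illustrative names; in a real system these values would more likely go to your metrics backend.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import median


@dataclass
class ToolStats:
    hit_latencies: list = field(default_factory=list)   # seconds per cache hit
    miss_latencies: list = field(default_factory=list)  # seconds per cache miss

    @property
    def hit_rate(self) -> float:
        total = len(self.hit_latencies) + len(self.miss_latencies)
        return len(self.hit_latencies) / total if total else 0.0

    def latency_saved_p50(self) -> float:
        # Needs at least one hit and one miss to compare medians.
        if not self.hit_latencies or not self.miss_latencies:
            return 0.0
        return median(self.miss_latencies) - median(self.hit_latencies)

    def cost_saved(self, avg_tool_cost: float) -> float:
        return len(self.hit_latencies) * avg_tool_cost


stats: defaultdict = defaultdict(ToolStats)


def record(tool_name: str, hit: bool, seconds: float) -> None:
    bucket = stats[tool_name].hit_latencies if hit else stats[tool_name].miss_latencies
    bucket.append(seconds)
```

Call `record` once per tool invocation from inside the caching wrapper, then review `hit_rate` per tool periodically. Guarding the zero-call case matters because a freshly deployed tool has no data yet.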

The short version

Classify every tool into pure, read-but-drifting, or stateful. Cache the first two with exact-match keys (sort your arguments) and TTLs that match how fast the data moves. Reach for semantic caching only on read tools, with a high threshold. For stateful tools, do not cache results; use idempotency keys so retries are safe. Then measure hit rate and turn off anything that is not earning its keep.

This sits alongside the other cost and latency levers in Cutting LLM cost and latency without wrecking quality. Prompt caching trims the model bill; tool caching trims the part of the agent that is usually slower and more expensive than the model. Do both.
