Agent Memory: Short-Term vs Long-Term, and How to Wire It Up
Threads are short-term memory. Stores are long-term memory. Most agent memory bugs come from confusing the two.
"Add memory to the agent" is one request that hides two completely different jobs. One is remembering this conversation. The other is remembering this user across all conversations. They use different mechanisms, fail in different ways, and most memory bugs I have seen come from treating them as one thing. This post draws the line clearly and shows how to wire both up in LangGraph.
This follows on from context engineering for agents. Context engineering is about what goes into the prompt; memory is about where that material comes from across time.
The two kinds of memory#
Short-term memory is scoped to a thread. A thread is one conversation, like an email chain. Short-term memory is the running message history and any working state for that conversation: uploaded files, retrieved documents, intermediate results. It lives only as long as the thread is relevant.
Long-term memory is scoped to a namespace and shared across threads. This is what lets an agent remember your name, your preferences, and what you told it last week in a totally separate chat. It is recalled at any time, in any thread, because it is not tied to one conversation.
The cleanest way to hold the distinction: short-term memory answers "what were we just talking about?" Long-term memory answers "what do I know about this user?"
Short-term memory: checkpointers#
In LangGraph, short-term memory is the graph's state, persisted by a checkpointer. You pass a thread_id, and the checkpointer saves the state after each step and reloads it at the start of the next one. Same thread id, the conversation continues. New thread id, a fresh conversation with no history.
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph
checkpointer = PostgresSaver.from_conn_string(DB_URL)
graph = build_graph().compile(checkpointer=checkpointer)
# same thread_id resumes the conversation
config = {"configurable": {"thread_id": "user-123-session-7"}}
graph.invoke({"messages": [user_msg]}, config)Use an in-memory saver while developing and a database-backed one (Postgres, Redis, and similar) in production, so threads survive restarts.
The real problem: long threads#
The hard part of short-term memory is not saving it, it is that conversations grow past what the model can use well. Even with a huge context window, models get distracted by stale, off-topic history, and you pay for every token of it on every turn. So you have to actively manage the history rather than just letting it pile up.
Two common moves. Trim: keep the last N messages (or the last N tokens) and drop the rest. Summarize: when history gets long, replace the old turns with a running summary and keep recent turns verbatim.
def manage_history(messages, keep_recent=10):
if len(messages) <= keep_recent:
return messages
old, recent = messages[:-keep_recent], messages[-keep_recent:]
summary = summarize(old) # one LLM call
return [SystemMessage(content=f"Summary so far: {summary}")] + recentTrimming naively can drop the tool call but keep the tool result (or the reverse), which confuses the model and some APIs reject it. When you trim, keep call-and-result pairs together. This is a sharp edge that breaks agents in subtle ways.
Long-term memory: stores#
Long-term memory lives in a store: LangGraph stores memories as JSON documents organized by a namespace (like a folder, often keyed by user or org id) and a key (like a filename). You put memories in, and you search them out, optionally by vector similarity if you configure embeddings.
from langgraph.store.memory import InMemoryStore
# in production use a DB-backed store; embeddings enable semantic search
store = InMemoryStore(index={"embed": embed_fn, "dims": 1536})
namespace = ("user-123", "preferences")
store.put(namespace, "comm-style", {
"rules": ["Likes short, direct answers", "Prefers Python examples"],
})
# later, in any thread, recall by meaning
hits = store.search(namespace, query="how should I format replies?")Note that "semantic memory" and "semantic search" are different words for different things. Semantic memory means storing facts (a psychology term). Semantic search means retrieving by meaning using embeddings. You often use semantic search to retrieve semantic memories, which is probably why the names collide.
Three flavours of long-term memory#
The useful breakdown (it maps onto how humans remember) is semantic, episodic, and procedural.
| Type | Stores | Agent example |
|---|---|---|
| Semantic | Facts | "This user works in finance and prefers Python" |
| Episodic | Experiences | Past successful task runs, used as few-shot examples |
| Procedural | Instructions | The agent's own system prompt, refined over time |
Semantic memory is facts about the user or domain. You can hold these as a single profile document you keep updating, or as a growing collection of small memory documents. A profile is easy to feed to the model but gets error-prone to update as it grows. A collection is easier to extend and tends to recall better, but you have to manage updating and deleting items so it does not drift. Pick based on whether your facts are a tidy fixed set (profile) or open-ended and growing (collection).
Episodic memory is past experiences, usually surfaced as few-shot examples. "Here is how you handled a similar request before." Sometimes it is easier to show the model a good past example than to describe the rule.
Procedural memory is the agent's instructions. The interesting version is an agent that refines its own system prompt from feedback: you prompt it with its current instructions plus what went wrong, and it rewrites them. This is "reflection," and it is genuinely useful for tasks where you cannot specify the perfect instructions up front.
When do you write a memory?#
Two strategies, and the tradeoff is latency versus freshness.
In the hot path: the agent decides what to remember during the conversation, before it replies. Memories are available immediately and you can show the user "I'll remember that." The cost is latency and complexity: the agent is now multitasking between answering and curating memory.
In the background: a separate job extracts memories after the fact, asynchronously. No latency hit on the main response, cleaner separation, but new memories are not instantly available and you have to decide when the job runs (after each session, on a schedule, on a trigger).
For most apps I start with background writing on session end. It keeps the live path fast, and memories being a minute stale almost never matters.
Putting it together#
A working agent uses both at once. On each turn: load the thread's short-term state (the checkpointer does this), pull relevant long-term memories for this user from the store, build the prompt from both, respond, and let a background job decide what new long-term memories to write.
def agent_turn(state, store, *, user_id):
# long-term: recall facts about this user
memories = store.search((user_id, "facts"), query=state["messages"][-1].content)
profile = "\n".join(m.value["text"] for m in memories)
# short-term comes from state (loaded by the checkpointer)
prompt = build_prompt(system=BASE + profile, history=state["messages"])
reply = llm.invoke(prompt)
return {"messages": [reply]}Do not store entire conversations as long-term memory. That is just a slower, more expensive version of short-term memory. Long-term memory should be distilled facts and lessons, not raw transcripts.
Wrapping up#
Short-term memory is thread-scoped working state, persisted by a checkpointer, and its real challenge is managing long histories without drowning the model. Long-term memory is cross-session knowledge in a namespaced store, split into semantic facts, episodic experiences, and procedural instructions, written either in the hot path or in the background. Keep the two clearly separate and most agent memory problems stop being mysterious.
If you are building agents from the ground up, this fits alongside what AI agents and multi-agent systems are and evaluating agents with LangSmith.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.