LangGraph State, Checkpointing, and Persistence Explained
How a graph remembers: state channels, checkpointers, threads, and the time travel you get for free
The first time you give a LangGraph agent memory, it feels like magic, and then it breaks in a way you do not understand because you do not actually know what is being saved or when. State, checkpoints, and threads are the three concepts that make a graph remember things across turns, and once they click, human-in-the-loop and resume-after-crash stop feeling like features and start feeling like consequences of the same mechanism. This is what each one is and how they fit together.
State is the data flowing through the graph
A LangGraph graph passes a single state object from node to node. You define its shape, and each node returns updates to it. The important part is how updates get merged, which you control with a reducer per field. The classic example is messages: you do not want each node to overwrite the conversation, you want to append to it.

    from typing import Annotated, TypedDict

    from langgraph.graph import StateGraph
    from langgraph.graph.message import add_messages


    class State(TypedDict):
        # add_messages is a reducer: node updates are appended, not replaced.
        messages: Annotated[list, add_messages]
        # No reducer here, so a node update replaces the value.
        next_step: str

Without add_messages, every node that touched messages would clobber the history. The reducer is what makes the conversation accumulate. That distinction, append versus replace, is the most common source of "why did my history disappear" bugs.
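If the append-versus-replace rule still feels abstract, here is a minimal sketch of the merge step in plain Python. It is a simplified model, not LangGraph's actual implementation: `apply_update` is a hypothetical helper, and `operator.add` stands in for `add_messages` (which does more than concatenate, for example matching messages by ID).

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class State(TypedDict):
    # Reducer attached: an update is combined with the existing value.
    messages: Annotated[list, operator.add]
    # No reducer: an update replaces the existing value.
    next_step: str


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into the state, field by field."""
    hints = get_type_hints(State, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated[...] types carry their extras in __metadata__; plain types do not.
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            reducer = metadata[0]
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["hi"], "next_step": "plan"}
state = apply_update(state, {"messages": ["hello!"], "next_step": "act"})
print(state)  # {'messages': ['hi', 'hello!'], 'next_step': 'act'}
```

The node only returned the new message, yet the history kept the old one, while next_step was simply overwritten. Drop the reducer from messages in this sketch and the first message disappears, which is the bug described above.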
A checkpointer saves state at every step
On its own, the state lives only for one invocation and then it is gone. A checkpointer changes that. When you attach one, LangGraph saves a checkpoint of the full graph state at every super-step, which is one round of the graph's execution: the nodes scheduled for that round run (possibly in parallel) and their updates are merged into the state. That single behavior is what unlocks everything else:
- Memory between turns: the next message resumes from the saved state.
- Resume after failure: a crash mid-run does not lose progress, you re-run from the last checkpoint.
- Human-in-the-loop: the graph can pause, persist, and wait for a human, then continue.
- Time travel: you can rewind to an earlier checkpoint and branch from it.
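To see why a snapshot per step is enough for resume-after-failure, here is a toy model in plain Python. Nothing in it is LangGraph API: `run`, `make_step`, and the `checkpoints` list are hypothetical stand-ins, and the "crash" is a simulated exception.

```python
import copy


def run(steps, state, checkpoints, fail_at=None):
    """Run steps in order, saving a snapshot of the state after each one."""
    if checkpoints:
        # Resume: start from the last saved snapshot instead of the input.
        state = copy.deepcopy(checkpoints[-1])
    for index in range(state["step"], len(steps)):
        if index == fail_at:
            raise RuntimeError(f"simulated crash before step {index}")
        state = steps[index](state)
        state["step"] = index + 1
        checkpoints.append(copy.deepcopy(state))
    return state


def make_step(name):
    """Build a step that records its name in the state's log."""
    return lambda s: {**s, "log": s["log"] + [name]}


steps = [make_step("fetch"), make_step("summarize"), make_step("reply")]
checkpoints = []  # hypothetical stand-in for a checkpointer's storage
initial = {"step": 0, "log": []}

try:
    run(steps, initial, checkpoints, fail_at=2)  # dies before the third step
except RuntimeError:
    pass

final = run(steps, initial, checkpoints)  # picks up at step 2, not step 0
print(final["log"])      # ['fetch', 'summarize', 'reply']
print(len(checkpoints))  # 3
```

The second call never repeats the first two steps, because the last snapshot already contains their results. Pausing for a human and rewinding to an earlier point are the same trick with a different reason for stopping.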
You add it by compiling the graph with a checkpointer:

    from langgraph.checkpoint.memory import MemorySaver

    # builder is your StateGraph(State), with nodes and edges already added.
    graph = builder.compile(checkpointer=MemorySaver())

That is the whole wiring. Everything below is about which checkpointer to use and how to address the saved state.
Threads keep conversations separate
A thread is one conversation's timeline. Each thread has its own thread_id and its own set of checkpoints, so two users (or two sessions) never see each other's state. You pass the thread_id in the config on every call, and the checkpointer uses it to load and save the right history.

    config = {"configurable": {"thread_id": "user-123-session-7"}}

    graph.invoke({"messages": [("user", "My name is Fola.")]}, config)
    graph.invoke({"messages": [("user", "What's my name?")]}, config)
    # Second call remembers, because it shares the thread_id.

If you generate a new thread_id on each call, your agent will look like it has no memory even though the checkpointer is working perfectly. (Leaving it out altogether is usually easier to spot, because a graph compiled with a checkpointer will generally refuse to run without one.) "It forgot everything" is almost always a thread_id problem, not a checkpointer problem.
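One way to avoid the accidental-new-thread bug is to derive the thread_id from identifiers you already have, rather than generating it at call time. A minimal sketch follows; both helper names are hypothetical, and the ID format just mirrors the example above.

```python
import uuid


def thread_config(user_id: str, session_id: str) -> dict:
    """Build a config whose thread_id is stable for a given user and session."""
    return {"configurable": {"thread_id": f"user-{user_id}-session-{session_id}"}}


def fresh_thread_config() -> dict:
    """Anti-pattern for follow-up turns: a brand-new thread_id on every call."""
    return {"configurable": {"thread_id": str(uuid.uuid4())}}


first_turn = thread_config("123", "7")
second_turn = thread_config("123", "7")
print(first_turn["configurable"]["thread_id"])  # user-123-session-7
print(first_turn == second_turn)                # True
```

Two turns built with `thread_config` address the same saved timeline. Two turns built with `fresh_thread_config` each start from an empty one, which is exactly the "it forgot everything" symptom. A random ID is still fine when you genuinely want a new conversation, as long as you store it and reuse it for the follow-up turns.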
Picking a checkpointer
The checkpointer is just an interface, and which implementation you choose decides where state lives and whether it survives a restart.
| Checkpointer | Where state lives | Survives restart | Use it for |
|---|---|---|---|
| MemorySaver | Process memory | No | Local dev, tests, notebooks |
| SqliteSaver | A SQLite file | Yes | Single-node apps, prototypes, local tools |
| PostgresSaver | Postgres | Yes | Production, multiple workers |
MemorySaver loses everything when the process exits, so it is a development tool only. For anything that needs to survive a restart, move to SQLite locally and Postgres in production. The async variants (AsyncSqliteSaver, AsyncPostgresSaver) matter once you are serving real traffic, because a blocking checkpoint write on a hot path will hurt throughput.

    # Local: survives restarts, single file.
    from langgraph.checkpoint.sqlite import SqliteSaver

    with SqliteSaver.from_conn_string("checkpoints.db") as saver:
        graph = builder.compile(checkpointer=saver)
        # Invoke the graph inside this block, while the connection is open.

    # Production: shared across workers.
    from langgraph.checkpoint.postgres import PostgresSaver

    # POSTGRES_URL is your connection string.
    with PostgresSaver.from_conn_string(POSTGRES_URL) as saver:
        saver.setup()  # creates the tables on first run
        graph = builder.compile(checkpointer=saver)

The checkpointer backends ship as separate packages (the langgraph-checkpoint family, for example langgraph-checkpoint-sqlite and langgraph-checkpoint-postgres), so you only install the backend you use.
Human-in-the-loop falls out of persistence
Because state is saved at every step, the graph can stop at a step, persist, and wait for a human, then resume from exactly there when the answer arrives. You mark where it should pause with an interrupt, and you resume by invoking the same thread again.

    from langgraph.types import interrupt, Command


    def approval_node(state: State):
        # Assumes State also declares `amount` and `approved` fields.
        decision = interrupt({"action": "refund", "amount": state["amount"]})
        return {"approved": decision == "yes"}


    # First run pauses at the interrupt and persists.
    graph.invoke({"messages": [...]}, config)

    # A human reviews, then you resume the same thread with their answer.
    graph.invoke(Command(resume="yes"), config)

The resume works because the checkpoint holds the full state at the pause point, so the graph does not start over; it continues. This is the same machinery I leaned on in human in the loop with Deep Agents; the underlying idea is identical: persist the state and wait.
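One subtlety is worth modeling. LangGraph's documentation describes the interrupted node as running again from its start on resume, with interrupt() returning the human's answer the second time instead of pausing. The toy runner below imitates that behavior in plain Python. `ToyRunner` and `Interrupt` are hypothetical names and this is a sketch of the idea, not the library's internals.

```python
class Interrupt(Exception):
    """Raised to pause the run; carries the payload shown to the human."""

    def __init__(self, payload):
        super().__init__(payload)
        self.payload = payload


class ToyRunner:
    def __init__(self, node):
        self.node = node
        self.saved_state = None  # stand-in for the persisted checkpoint
        self.resume_value = None
        self.has_resume = False

    def interrupt(self, payload):
        # On resume, hand back the human's answer instead of pausing again.
        if self.has_resume:
            return self.resume_value
        raise Interrupt(payload)

    def start(self, state):
        self.saved_state = dict(state)
        return self._run()

    def resume(self, value):
        self.resume_value, self.has_resume = value, True
        return self._run()

    def _run(self):
        try:
            update = self.node(dict(self.saved_state), self.interrupt)
        except Interrupt as pause:
            return {"paused": pause.payload}
        return {**self.saved_state, **update}


calls = []


def approval_node(state, interrupt):
    calls.append("entered")  # runs on the first pass and again on resume
    decision = interrupt({"action": "refund", "amount": state["amount"]})
    return {"approved": decision == "yes"}


runner = ToyRunner(approval_node)
paused = runner.start({"amount": 40})
resumed = runner.resume("yes")
print(paused)   # {'paused': {'action': 'refund', 'amount': 40}}
print(resumed)  # {'amount': 40, 'approved': True}
print(calls)    # ['entered', 'entered']
```

The practical takeaway, if the real library behaves like this model, is to keep side effects (sending an email, charging a card) after the interrupt call or in a later node, so they do not fire twice.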
Time travel: inspect and branch history
Every checkpoint is addressable, so you can list a thread's history, pick an earlier checkpoint, and run forward from it with a different input. That is genuinely useful for debugging ("what was the state right before it went wrong?") and for letting users edit an earlier message and regenerate.

    # Walk the saved checkpoints for a thread.
    for snapshot in graph.get_state_history(config):
        print(snapshot.config["configurable"]["checkpoint_id"], snapshot.values)

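For the debugging case ("what was the state right before it went wrong?"), the history walk can be turned into a small search. The sketch below runs against a fake history so it works without a graph. `Snapshot`, `latest_matching`, and `cfg` are hypothetical names, and the newest-first ordering is an assumption you should confirm for your LangGraph version.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Snapshot:
    """Hypothetical stand-in with the two attributes used in the history walk."""
    config: dict
    values: dict


def latest_matching(history: Iterable[Snapshot],
                    predicate: Callable[[dict], bool]) -> Optional[dict]:
    """Return the config of the first snapshot whose values satisfy predicate.

    Assumes history is ordered newest first, so the first match is the most
    recent checkpoint that still looks healthy.
    """
    for snapshot in history:
        if predicate(snapshot.values):
            return snapshot.config
    return None


def cfg(checkpoint_id: str) -> dict:
    """Build a config addressing one checkpoint of the example thread."""
    return {"configurable": {"thread_id": "user-123-session-7",
                             "checkpoint_id": checkpoint_id}}


history = [  # newest first
    Snapshot(cfg("c3"), {"messages": ["hi", "refund?", "error"]}),
    Snapshot(cfg("c2"), {"messages": ["hi", "refund?"]}),
    Snapshot(cfg("c1"), {"messages": ["hi"]}),
]

target = latest_matching(history, lambda v: "error" not in v["messages"])
print(target["configurable"]["checkpoint_id"])  # c2
```

With a real graph, the iterable would be graph.get_state_history(config), and the returned config already carries both the thread_id and the checkpoint_id of the snapshot you picked.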

To resume from a specific earlier checkpoint, pass its checkpoint_id in the config.
Short-term state is not long-term memory
One distinction that trips people up: checkpointers give you short-term, thread-scoped memory, the state of one conversation. They are not where you store durable facts about a user that should outlive the thread, like preferences or learned context. That is long-term memory, which is a separate store keyed by user rather than thread. I drew that line in detail in agent memory: short-term vs long-term. Use checkpointers for the conversation, and a separate memory store for the things you want to remember across conversations.
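The difference in keys is the whole point, and a toy model makes it visible. Both dictionaries below are hypothetical in-memory stand-ins (a real app would use a checkpointer for the first and a separate memory store for the second), and the stored preference is made-up example data.

```python
from collections import defaultdict

thread_state = defaultdict(list)  # short-term: keyed by thread_id
user_memory = defaultdict(dict)   # long-term: keyed by user_id


def handle_turn(user_id: str, thread_id: str, message: str) -> dict:
    """Record a message in its thread and return what the agent can see."""
    thread_state[thread_id].append(message)
    return {
        "conversation": list(thread_state[thread_id]),
        "preferences": dict(user_memory[user_id]),
    }


user_memory["user-123"]["units"] = "metric"
handle_turn("user-123", "session-7", "My name is Fola.")
view = handle_turn("user-123", "session-8", "Hello again")
print(view)
# {'conversation': ['Hello again'], 'preferences': {'units': 'metric'}}
```

The new session starts with an empty conversation, because that state belongs to the old thread, while the preference follows the user into it.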
The mental model to keep
State is the data moving through the graph, shaped by reducers. A checkpointer saves that state at every step. A thread_id decides which saved timeline you are reading and writing. Get those three right and memory, resume, human-in-the-loop, and time travel are not four features you build, they are four views of one persistence layer. Reach for MemorySaver while you build, switch to a real backend before you ship, and always, always pass the thread_id.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.