Evaluating Agents with LangSmith: A Complete Guide
Why "it worked when I tried it" is not evaluation, and how to measure agents properly with tracing, datasets, evaluators, and experiments
You cannot improve an agent you cannot measure, and "it worked when I tried it" is not measurement. It is the most common way teams fool themselves. You build an agent, run it on three prompts you made up, it looks great, you ship it, and then real users find the dozen cases you never thought to try. This post is about closing that gap with LangSmith: tracing what your agent does, building datasets of cases that matter, scoring them automatically, and tracking whether each change actually makes things better.
This is the fifth post in the series. We have built agents, made retrieval agentic, given them tools, and shipped MCP servers. Now we make sure all of it actually works, and keeps working.
Why agents are genuinely hard to evaluate
Normal software is easy to test: same input, same output, assert equality. Agents break every part of that.
They are non-deterministic. Run the same prompt twice and you can get two different answers, both valid. Exact-match assertions are useless.
There is rarely one right answer. "Summarise this" or "plan my trip" has a range of good responses and a range of bad ones, and the line between them is fuzzy.
They are multi-step. An agent does not just produce an answer, it takes a path: which tools it called, in what order, with what arguments. The final answer can be right for the wrong reasons, or wrong because one step in the middle failed.
So evaluating an agent is not "did it return the expected string." It is "was the answer good, and did it get there sensibly," asked across enough cases that you can trust the result. That is the problem LangSmith is built for.
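To make the exact-match problem concrete, here is a minimal Python sketch. The two sample answers and the `mentions_all` helper are made up for illustration; the only point is that a check on required content survives rewording where string equality does not.

```python
def exact_match(output: str, reference: str) -> bool:
    """The classic software assertion: the strings must be identical."""
    return output == reference


def mentions_all(output: str, required_terms: list[str]) -> bool:
    """A looser check: every required term appears, ignoring case."""
    text = output.lower()
    return all(term.lower() in text for term in required_terms)


# Two differently worded answers to the same prompt (hypothetical outputs).
run_1 = "Yes, Deloitte is a licensed sponsor for UK visas."
run_2 = "Deloitte does sponsor UK visas for eligible roles."
reference = "Yes, Deloitte sponsors UK visas."

for answer in (run_1, run_2):
    print(exact_match(answer, reference), mentions_all(answer, ["Deloitte", "sponsor"]))
```

Both runs fail `exact_match` and pass `mentions_all`. A keyword check like this is still crude, which is why the evaluators in Step 3 go further.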
What LangSmith actually is
LangSmith is an observability and evaluation platform for LLM apps. Strip away the marketing and it is four things that work together:
Tracing records every run of your app in detail: each model call, each tool call, the inputs, the outputs, the timing. This is your x-ray view of what the agent actually did.
Datasets are collections of examples, each with inputs and optionally a reference output. This is your test set.
Evaluators are functions that score an output. They can be simple rules, an LLM acting as a judge, or a human.
Experiments are what you get when you run a dataset through your app and score it with evaluators. Run one per change and you can see, with numbers, whether you are improving.
The workflow ties them together: trace production to see what really happens, curate the interesting cases into a dataset, write evaluators for what "good" means, and run experiments every time you change something.
Step 1: turn on tracing
Everything starts with tracing, and the nice part is it is almost free to enable. Set two environment variables and LangSmith captures your runs.
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=<your-api-key>
# older aliases LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY still work

If you build with LangChain or LangGraph, that is all you need. Every chain, agent, and tool call is traced automatically, with no code changes. If you are calling a model directly, wrap the client and decorate your functions so LangSmith can see inside them.
from langsmith import traceable, wrappers
from openai import OpenAI

# Wrap the client to trace every model call.
client = wrappers.wrap_openai(OpenAI())

# Decorate any function you want to see as a step in the trace.
@traceable
def classify(text: str) -> str:
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": text}],
    )
    return result.choices[0].message.content

Now go run your agent and open LangSmith. You will see the full tree of what happened: the prompt, every tool call, every model response, how long each took. This alone is worth the setup, because most "why did the agent do that" questions are answered just by reading the trace. Evaluation builds on top of it.
Step 2: build a dataset
A dataset is your test set: the cases you want to be sure your agent handles. Each example has inputs, and optionally a reference output to compare against.
The trap is inventing examples off the top of your head. The best datasets come from reality: trace production, find the runs that were interesting or went wrong, and save them as examples. LangSmith lets you add a trace to a dataset in a click, which is the single most useful habit in this whole post. Your test set should look like your real traffic, not like what you imagined your traffic would be.
You can also create a dataset in code:
from langsmith import Client

client = Client()

# (question, reference answer) pairs
examples = [
    ("Does Deloitte sponsor UK visas?", "yes"),
    ("What's the capital of the moon?", "unknown"),
    ("Find Python jobs in Berlin over 70k EUR", "search_jobs"),
]

dataset = client.create_dataset(dataset_name="agent-regression-set")
client.create_examples(
    inputs=[{"question": q} for q, _ in examples],
    outputs=[{"answer": a} for _, a in examples],
    dataset_id=dataset.id,
)

Start small. Twenty good examples drawn from real failures beat two hundred made-up ones. You grow the set every time a new failure shows up, which turns every bug into a permanent regression test.
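Because examples pulled from traces tend to repeat, it can help to deduplicate before uploading. The helper below is a hypothetical sketch, not part of the LangSmith SDK; it only builds the parallel `inputs` and `outputs` lists in the shape passed to `create_examples` above.

```python
def to_example_payload(cases: list[tuple[str, str]]) -> tuple[list[dict], list[dict]]:
    """Drop repeated questions and split (question, answer) pairs into
    parallel inputs/outputs lists."""
    seen: set[str] = set()
    inputs: list[dict] = []
    outputs: list[dict] = []
    for question, answer in cases:
        key = " ".join(question.lower().split())  # normalise case and whitespace
        if key in seen:
            continue  # keep only the first occurrence
        seen.add(key)
        inputs.append({"question": question})
        outputs.append({"answer": answer})
    return inputs, outputs


cases = [
    ("Does Deloitte sponsor UK visas?", "yes"),
    ("does deloitte sponsor  UK visas?", "yes"),  # near-duplicate from a second trace
    ("What's the capital of the moon?", "unknown"),
]
inputs, outputs = to_example_payload(cases)
print(len(inputs), "unique examples")
```

Normalising only case and whitespace is a deliberately conservative choice: it removes obvious repeats without merging questions that merely look similar.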
Step 3: write evaluators
An evaluator scores one example. It receives the example inputs, your agent's actual output, and the reference output if there is one, and returns a score. There are three kinds, and you will use all of them.
Heuristics are plain code. Is the output valid JSON? Is it non-empty? Does it match the reference exactly? Cheap, fast, and perfect for the things that have a clear right answer.
import json

def is_valid_json(outputs: dict) -> bool:
    try:
        json.loads(outputs["answer"])
        return True
    except (TypeError, ValueError):
        # ValueError covers malformed JSON; TypeError covers a non-string answer.
        return False

LLM-as-judge uses a model to score things code cannot, like "is this answer faithful to the retrieved context" or "is this rude." This is how you grade open-ended output at scale. LangSmith ships a library of off-the-shelf judge evaluators with tuned prompts, so you do not have to write the correctness or hallucination judge from scratch, but you can always write your own.
Human evaluation is you, or a teammate, scoring runs in the UI. Slowest, most trusted, and the thing you use to check that your automatic evaluators actually agree with human judgement. A common pattern is to bootstrap labels with an LLM judge, then correct a sample by hand.
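One simple way to check that an automatic evaluator agrees with human judgement is to compute the agreement rate on the runs both have labelled. This is a sketch under assumed data: the run IDs and labels are invented, and `agreement_rate` is an illustrative helper rather than a LangSmith function.

```python
def agreement_rate(judge_labels: dict[str, bool], human_labels: dict[str, bool]) -> float:
    """Fraction of jointly labelled runs where the judge and the human agree."""
    shared = judge_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no runs were labelled by both the judge and a human")
    agreed = sum(judge_labels[run_id] == human_labels[run_id] for run_id in shared)
    return agreed / len(shared)


# Hypothetical labels keyed by run ID; True means "good answer".
judge_sample = {"run-1": True, "run-2": True, "run-3": False, "run-4": True}
human_sample = {"run-1": True, "run-2": False, "run-3": False}  # hand-checked subset

print(f"agreement: {agreement_rate(judge_sample, human_sample):.2f}")
```

If agreement is low, the judge prompt probably needs work before its scores are worth tracking.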
A ground-truth evaluator with a reference looks like this:
def correct(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"] == reference_outputs["answer"]

Do not over-engineer your first evaluator. A binary correct/incorrect plus one LLM judge for "is this answer good" gets you most of the value. Add sharper metrics once you know what actually goes wrong.
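For the "is this answer good" judge, a hand-rolled version might look like the sketch below. Several things here are assumptions: the prompt wording, the `make_judge` factory, and the stand-in `fake_model` are all invented, and the returned dict with `key`, `score`, and `comment` reflects my understanding of a result shape LangSmith accepts, so check it against your SDK version. The model call is injected as a plain function so the sketch runs offline.

```python
from typing import Callable

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Reply with exactly one word: GOOD or BAD."""


def make_judge(ask_model: Callable[[str], str]) -> Callable[[dict, dict], dict]:
    """Build an evaluator from any function that sends a prompt to a model
    and returns the model's text reply."""

    def answer_quality(inputs: dict, outputs: dict) -> dict:
        prompt = JUDGE_PROMPT.format(question=inputs["question"], answer=outputs["answer"])
        verdict = ask_model(prompt).strip().upper()
        # Anything other than a clean GOOD counts as a failure.
        score = 1 if verdict == "GOOD" else 0
        return {"key": "answer_quality", "score": score, "comment": verdict}

    return answer_quality


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your own client."""
    return "GOOD" if "unknown" in prompt else "BAD"


quality_judge = make_judge(fake_model)
print(quality_judge({"question": "What's the capital of the moon?"}, {"answer": "unknown"}))
```

Forcing a one-word verdict keeps parsing trivial; a graded scale is possible, but binary judgements tend to be easier to check against human labels.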
Step 4: run an experiment
An experiment runs your whole dataset through your agent and scores every output. The evaluate() function ties it together: give it your app, the dataset, and a list of evaluators.
from langsmith import evaluate

def run_agent(inputs: dict) -> dict:
    # my_agent is your own agent, defined elsewhere.
    answer = my_agent.invoke(inputs["question"])
    return {"answer": answer}

results = evaluate(
    run_agent,
    data="agent-regression-set",
    evaluators=[correct, is_valid_json],
    experiment_prefix="v2-new-prompt",  # name it so you can compare later
    description="Testing the rewritten system prompt.",
)

For larger jobs, use aevaluate(), the async version, which runs examples concurrently and can be much faster. Either way, you get an experiment in LangSmith with a score per example and an aggregate. The real power is the comparison view: run the same dataset against your old prompt and your new one, and LangSmith shows them side by side so you can see exactly which examples got better and which regressed. That is how you stop guessing whether a change helped.
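The logic behind a side-by-side comparison is simple enough to sketch in a few lines. This is not how LangSmith implements its comparison view; it is an illustration over invented per-example scores, and `compare_experiments` is a hypothetical helper.

```python
def compare_experiments(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, list[str]]:
    """Bucket the examples scored in both experiments by how their score moved."""
    report: dict[str, list[str]] = {"improved": [], "regressed": [], "unchanged": []}
    for example_id in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[example_id] - baseline[example_id]
        if delta > 0:
            report["improved"].append(example_id)
        elif delta < 0:
            report["regressed"].append(example_id)
        else:
            report["unchanged"].append(example_id)
    return report


# Hypothetical per-example correctness scores from two experiments.
old_prompt = {"ex-1": 1.0, "ex-2": 0.0, "ex-3": 1.0}
new_prompt = {"ex-1": 1.0, "ex-2": 1.0, "ex-3": 0.0}

print(compare_experiments(old_prompt, new_prompt))
```

In this made-up data the average score is identical in both runs, yet one example improved and another regressed. An aggregate alone would have hidden that.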
What makes agent evaluation different
Everything above applies to any LLM app. Agents add one crucial wrinkle: you are not just grading the final answer, you are grading the path. LangSmith frames agent evaluation at three levels, and you want all three.
Final response. Did the agent end up with a good answer? This is the outcome, and it is what users feel. It is necessary but not sufficient, because an agent can luck into a right answer through a terrible process that will fail next time.
Trajectory. Did it take a sensible path? For an agent, this means the tool calls: did it call the right tools, in a reasonable order, with sensible arguments, without looping or wandering? Because LangSmith traces every tool call, you can write evaluators that inspect the trajectory, not just the final text. This is where you catch the agent that returns the right answer but called your expensive API nine times to get it.
Single step. Did one specific decision go right? Sometimes you want to grade just the router, or just the tool-selection step, in isolation. This is how you find which part of a multi-step agent is the weak link.
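A trajectory check can be written in the same style as the evaluators from Step 3. The sketch below makes several assumptions: that your run function returns its tool calls under a `tool_calls` key, that the dataset stores an `expected_tools` list as the reference, and that five calls is a sensible budget. Apart from `search_jobs`, the tool names are invented.

```python
MAX_TOOL_CALLS = 5  # assumed budget; tune it for your agent


def trajectory_ok(outputs: dict, reference_outputs: dict) -> bool:
    """Pass if the expected tools appear in order and the agent stayed
    within the call budget. Extra calls in between are tolerated."""
    called = [step["tool"] for step in outputs["tool_calls"]]
    if len(called) > MAX_TOOL_CALLS:
        return False  # catches the agent that calls one API nine times
    remaining = iter(called)
    # Subsequence check: each expected tool must occur after the previous one.
    return all(tool in remaining for tool in reference_outputs["expected_tools"])


reference = {"expected_tools": ["search_jobs", "filter_by_salary"]}
sensible = {"tool_calls": [
    {"tool": "search_jobs"},
    {"tool": "convert_currency"},
    {"tool": "filter_by_salary"},
]}
looping = {"tool_calls": [{"tool": "search_jobs"}] * 9}

print(trajectory_ok(sensible, reference), trajectory_ok(looping, reference))
```

Matching on a subsequence rather than an exact list is a design choice: it accepts harmless extra steps while still failing wrong orderings and runaway loops.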
The lesson from the earlier posts pays off here: because you gave your tools clear names and your agent a clean structure, the trace is readable, and a readable trace is an evaluable one.
Offline and online: testing and monitoring
So far this is offline evaluation: run a fixed dataset before you ship. That catches regressions. But agents meet inputs you never put in your dataset, so you also want online evaluation: evaluators that run automatically on live production traces. The same LLM-as-judge that scores your dataset can score real traffic as it happens, flagging low-scoring runs for you to review and, ideally, add to your dataset.
That closes the loop. Production traffic surfaces new failures, online evaluators flag them, you curate the worst into your dataset, and your offline test set keeps getting more realistic. The agent improves because the measurement improves.
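The triage step in that loop can be sketched as a small filter. The run records, the 0.5 threshold, and the `flag_for_review` helper are all assumptions for illustration; in practice the scores would come from your online evaluators.

```python
def flag_for_review(scored_runs: list[dict], threshold: float = 0.5, limit: int = 10) -> list[dict]:
    """Return runs scoring below the threshold, worst first, capped at `limit`."""
    flagged = [run for run in scored_runs if run["score"] < threshold]
    return sorted(flagged, key=lambda run: run["score"])[:limit]


# Hypothetical judge scores attached to live production runs.
live_runs = [
    {"run_id": "a1", "question": "Find Python jobs in Berlin over 70k EUR", "score": 0.9},
    {"run_id": "b2", "question": "Which of these roles are remote?", "score": 0.2},
    {"run_id": "c3", "question": "Cancel my application", "score": 0.4},
]

for run in flag_for_review(live_runs):
    print(run["run_id"], run["score"])
```

Capping the list keeps the review queue small enough that someone actually reads it, which matters more than catching every borderline run.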
How I actually use it
In practice my loop is simple. I trace everything from day one, because the cost is near zero and the payoff is huge the first time something breaks. When a run goes wrong, I read the trace, fix it, and add that case to my dataset so it can never silently break again. Before any meaningful change to a prompt, model, or tool, I run the dataset as an experiment and compare it to the last one. If the numbers go up and nothing important regresses, I ship. If a few examples got worse, the comparison view tells me exactly which, and I decide if the trade is worth it.
It is not glamorous, and that is the point. Evaluation is the boring discipline that separates an agent demo from an agent product.
Wrapping up
You cannot improve what you cannot measure, and with agents the measuring is the hard part. LangSmith gives you the four pieces that make it tractable: tracing to see what happened, datasets built from real cases, evaluators that score both the final answer and the path it took, and experiments that tell you whether each change actually helped. Turn on tracing today, curate a small dataset from your real traffic, write one heuristic and one judge, and run an experiment before your next change. That habit is what turns a clever prototype into something you can trust in front of users.
We have come a long way in this series, from what an agent is to measuring one in production, and there is more still to come. If you build something from any of it, I would love to hear what you made.
Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.