Structured Outputs and Function Calling, In Depth
How to make an LLM return data that always fits your schema, and how function calling is the same idea wearing a different hat
If you have ever parsed an LLM response with a regex and a prayer, structured outputs is the feature you have been waiting for. It makes the model return data that is guaranteed to match a schema you define, not "usually valid JSON if you ask nicely." This post covers how it works under the hood, why function calling is really the same mechanism, and the patterns that survive contact with production.
This builds on giving agents tools and MCP. Tool calling is where structured outputs matters most, because a tool call is only useful if the arguments are shaped correctly.
Three levels of "give me JSON"
There are three distinct things people lump together, and the differences matter.
Prompt-and-pray. You ask the model for JSON in the prompt and hope. It works most of the time and fails exactly when you are not watching: a trailing comma, a markdown code fence wrapped around the JSON, a chatty "Sure, here is your JSON:" prefix. Do not build on this.
JSON mode. The model is constrained to emit syntactically valid JSON. Better, but valid JSON is not your JSON. It guarantees the braces match; it does not guarantee the fields you need exist or have the right types. JSON mode is legacy at this point; reach for the next level.
Structured outputs (strict schema). You provide a JSON Schema and the model's output is guaranteed to conform to it. Not validated after the fact, but constrained during generation. The decoder physically cannot emit a token that would violate the schema, because the guarantee is enforced at the sampling layer. If your schema says age is an integer and is required, you will get an integer named age, every time.
That last one is the one to use. Everything below is about it.
How the guarantee actually works
This is the part worth understanding, because it explains the constraints. When the model generates a response, at each step it produces a probability distribution over the next token. Structured outputs masks that distribution: any token that would make the output stop matching the schema gets its probability set to zero before sampling. So the model can only ever pick tokens that keep the output valid.
That is why it is a hard guarantee and not a "we retry until it parses" trick. It is also why there are rules about what your schema can look like. The engine has to compile your schema into something it can enforce token by token, which means some JSON Schema features are restricted.
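To make the masking step concrete, here is a toy sketch in plain Python. It is not how any provider implements it: the five-entry vocabulary, the hard-coded logits, and the `VALID_OUTPUTS` list standing in for a compiled schema are all assumptions made up for illustration. The point is only the order of operations: mask first, then turn logits into probabilities, then pick.

```python
import math

# Toy vocabulary (assumption); a real tokenizer has tens of thousands of entries.
VOCAB = ['{"age":', ' 42', ' "forty-two"', '}', 'Sure, here is your JSON:']

# Stand-in for a compiled schema (assumption): the complete outputs it accepts.
VALID_OUTPUTS = ['{"age": 42}']


def keeps_output_valid(prefix: str, token: str) -> bool:
    """True if prefix + token can still be extended into a valid output."""
    candidate = prefix + token
    return any(valid.startswith(candidate) for valid in VALID_OUTPUTS)


def mask_logits(prefix: str, logits: list[float]) -> list[float]:
    """Set the logit of every schema-breaking token to -inf."""
    return [
        logit if keeps_output_valid(prefix, token) else -math.inf
        for token, logit in zip(VOCAB, logits)
    ]


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(logits_per_step: list[list[float]]) -> str:
    """Pick the most likely allowed token at each step."""
    output = ""
    for logits in logits_per_step:
        probs = softmax(mask_logits(output, logits))
        output += VOCAB[probs.index(max(probs))]
    return output


# Raw logits that would prefer the chatty prefix, then a string for age.
STEPS = [
    [1.0, 0.5, 0.2, 0.1, 3.0],
    [0.1, 1.0, 2.5, 0.3, 0.0],
    [0.0, 0.0, 0.0, 1.0, 2.0],
]
print(greedy_decode(STEPS))
```

Even though the unmasked logits favour the chatty prefix and the string value, neither can be chosen, because their probability is zero by the time a token is picked.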
Using it with OpenAI
You attach a schema and set strict mode. Here is the shape with the Python SDK, using a Pydantic model so you get a typed object back:
    from openai import OpenAI
    from pydantic import BaseModel

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    class Invoice(BaseModel):
        vendor: str
        amount_cents: int
        currency: str
        line_items: list[str]

    # Placeholder input; in practice this is the text of the invoice to parse.
    raw_invoice_text = "..."

    resp = client.responses.parse(
        model="gpt-5",
        input=[
            {"role": "system", "content": "Extract the invoice fields."},
            {"role": "user", "content": raw_invoice_text},
        ],
        text_format=Invoice,
    )
    invoice = resp.output_parsed  # a typed Invoice, guaranteed to fit

The SDK turns your Pydantic model into a JSON Schema with strict mode on and parses the result back into the object. If you write the schema by hand instead, the strict-mode rules are the part to get right.
In strict mode, every field in properties must be listed as required, and every object must set additionalProperties: false. There is no concept of an optional field. To express "this might be missing," make the field required but allow null as a type, then treat null as absent in your code. This trips up everyone the first time.
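Written out by hand, those two rules are easy to check mechanically. The sketch below is a hypothetical helper, not part of any SDK: `strict_violations` and `CONTACT_SCHEMA` are names invented here, and it only checks the two rules described above (every property listed in `required`, `additionalProperties` set to false), not the full set of strict-mode restrictions.

```python
def strict_violations(schema: dict, path: str = "$") -> list[str]:
    """List where an object schema breaks the two strict-mode rules above."""
    problems = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {missing}")
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        # Nested objects have to follow the same rules.
        for name, sub_schema in props.items():
            problems += strict_violations(sub_schema, f"{path}.{name}")
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems += strict_violations(schema["items"], f"{path}[]")
    return problems


# "Optional" email: required in the schema, but null is an allowed type.
CONTACT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": ["string", "null"]},
    },
    "required": ["name", "email"],
    "additionalProperties": False,
}

print(strict_violations(CONTACT_SCHEMA))
```

Running a check like this in a unit test catches a forgotten `required` entry before the API rejects the schema at request time.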
    class Contact(BaseModel):
        name: str
        email: str | None  # required in the schema, but may be null

Function calling is the same idea
Here is the thing that clicks once you see it: function calling is structured outputs pointed at a tool's arguments.
When you give a model a tool, you describe the tool's parameters as a JSON Schema. When the model decides to call the tool, it has to produce arguments that match that schema. That is structured outputs doing its job. Turn on strict mode for the tool and the arguments are guaranteed to fit, which means your tool-handling code never has to defend against a missing field or a string where it wanted a number.
    tools = [{
        "type": "function",
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city", "unit"],
            "additionalProperties": False,
        },
    }]

So the mental model is: "structured outputs" usually means "make the final answer fit a schema," and "function calling" means "make the tool's arguments fit a schema so I can actually run the tool." Same machinery, two jobs.
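On the receiving side, a tool call arrives as a tool name plus the arguments as a JSON string, so the handling code can be a small dispatcher. This is a minimal sketch under assumptions: `HANDLERS`, `run_tool_call`, and the stub `get_weather` body are hypothetical, and pulling the name and arguments out of the API response is left out.

```python
import json


def get_weather(city: str, unit: str) -> dict:
    # Hypothetical stand-in; a real handler would call a weather service.
    return {"city": city, "unit": unit, "temperature": None}


# Map each tool name declared in the schema to the function that runs it.
HANDLERS = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Run one tool call and return a JSON string to send back to the model."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    # With strict mode on, the arguments match the declared parameters,
    # so they can be passed straight through as keyword arguments.
    arguments = json.loads(arguments_json)
    return json.dumps(handler(**arguments))


print(run_tool_call("get_weather", '{"city": "Lagos", "unit": "celsius"}'))
```

The dispatcher has no defensive checks on the argument shape because the schema is doing that work; the unknown-tool branch stays because it guards against a bug in your own wiring, not in the model's output.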
Anthropic does it through tools
Claude's approach is worth a note because it is slightly different in shape but the same in spirit. Claude does not have a separate "response format" knob in the same way; you get structured data by defining a tool whose input schema is the structure you want, then reading the tool call's input. If you want a guaranteed Invoice object out of Claude, you define an Invoice tool and force the model to call it. The result is the same: typed, schema-shaped data instead of free text you have to parse.
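As a rough sketch of that pattern, the code below builds the request arguments and reads the tool input back out of a response body. Assumptions to flag: `record_invoice`, `build_request`, and `extract_tool_input` are hypothetical names, the model is left as a parameter, `max_tokens=1024` is arbitrary, and the extraction works on the raw JSON body as a dict rather than on SDK objects.

```python
from typing import Optional

# The tool's input schema is the structure we want back.
INVOICE_TOOL = {
    "name": "record_invoice",  # hypothetical tool name
    "description": "Record the fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "amount_cents": {"type": "integer"},
            "currency": {"type": "string"},
            "line_items": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["vendor", "amount_cents", "currency", "line_items"],
    },
}


def build_request(raw_invoice_text: str, model: str) -> dict:
    """Arguments for a Messages API call that forces the invoice tool."""
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": [INVOICE_TOOL],
        # Forcing this specific tool means the reply is a tool call, not prose.
        "tool_choice": {"type": "tool", "name": INVOICE_TOOL["name"]},
        "messages": [{"role": "user", "content": raw_invoice_text}],
    }


def extract_tool_input(response_body: dict, tool_name: str) -> Optional[dict]:
    """Return the input of the first matching tool_use block, if there is one."""
    for block in response_body.get("content", []):
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    return None
```

The tool never has to do anything; it exists only so that its input carries the structured data.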
Patterns that hold up in production
A few things I have learned the hard way:
Handle refusals as a real case. The model can refuse to produce the structure (safety, or it genuinely cannot extract the data). A refusal is not malformed output, it is a different branch. Check for it explicitly rather than letting it fall through your parsing code.
Keep schemas flat and shallow. Deeply nested schemas with many optional-via-null fields are slower to generate and give the model more places to put the wrong value. Flatten where you can. If you are extracting twenty fields, consider whether they all really need to come from one call.
Mind the output token limit. Structured outputs still has to fit in the output token budget. For large extractions (long lists, many records), you can hit the ceiling. Batch the input or stream records rather than asking for one giant object.
Validate the values, not just the shape. Structured outputs guarantees types and presence. It does not guarantee the amount_cents the model extracted is correct, or that the email is a real email. Schema conformance is not data correctness. Keep your business-rule validation.
Pin the model version when structured outputs is load-bearing. Schema-following behaviour can shift subtly between model versions, and you want that change to happen when you choose to test it, not silently in production.
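Two of these patterns, the refusal branch and value validation, fit in a few lines. The sketch below assumes an OpenAI Responses-style JSON body handled as a plain dict, where a refusal shows up as a content part of type `refusal`. The helper names and the `ALLOWED_CURRENCIES` rule are hypothetical; swap in your own business rules.

```python
from typing import Optional


def find_refusal(response_body: dict) -> Optional[str]:
    """Return the refusal text if the model declined, otherwise None."""
    for item in response_body.get("output", []):
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            if part.get("type") == "refusal":
                return part.get("refusal", "")
    return None


ALLOWED_CURRENCIES = {"GBP", "USD", "EUR"}  # hypothetical business rule


def value_problems(invoice: dict) -> list[str]:
    """Checks the schema cannot express: is the data plausible, not just typed?"""
    problems = []
    if invoice["amount_cents"] <= 0:
        problems.append("amount_cents must be positive")
    if invoice["currency"] not in ALLOWED_CURRENCIES:
        problems.append(f"unexpected currency: {invoice['currency']}")
    if not invoice["line_items"]:
        problems.append("no line items extracted")
    return problems
```

Check `find_refusal` before touching the parsed object, and run `value_problems` after: the first decides which branch you are in, the second decides whether the data is fit to store.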
Wrapping up
Structured outputs constrains generation so the model can only emit tokens that keep the output matching your schema, which turns "parse the LLM's text and hope" into a hard guarantee. Function calling is the same mechanism applied to tool arguments. Use strict mode, remember that optional means required-but-nullable, handle refusals as their own case, and keep validating the actual values because schema conformance is not correctness.
If you are wiring this into an agent, the next thing to get right is where the agent keeps what it learns: agent memory, short-term vs long-term.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.