Program-Aided Language Models (PAL), Explained with Code, Folarin Akinloye

Chain-of-Thought asks a language model to both reason and calculate in plain text. The reasoning is usually fine. The calculation is where it quietly breaks, because a model predicting tokens is not actually doing arithmetic, it is guessing what the arithmetic looks like. PAL fixes this with a clean split: the model reasons in language, but the moment there is something to compute, it writes code and hands it to a Python interpreter. The interpreter does not guess. It runs.

This is a standalone deep-dive in my prompting-techniques thread. It builds straight on Chain-of-Thought prompting, so if CoT is not fresh, skim that first.

Why CoT alone drops the ball on computation#

CoT works because writing out intermediate steps gives the model room to reason. But those steps are still generated tokens. When a step is "so 17 * 349 = ...", the model is predicting the most likely continuation, not multiplying. On multi-step arithmetic, date math, or anything with a lot of bookkeeping, small errors creep in and then compound through the rest of the chain.

Gao et al. (2022) proposed Program-Aided Language Models to fix the weak link. PAL keeps CoT's strength (decompose the problem in natural language) but changes the output: instead of free-form text that ends in a number the model asserts, the reasoning steps are a program, and the final answer comes from running that program in a real runtime like a Python interpreter.

The one-line version: CoT reasons to an answer. PAL reasons to a program, then executes it for the answer.

A worked example: date reasoning#

Date arithmetic is a perfect PAL target. It is fiddly, error-prone for a token predictor, and trivial for a library. Here is the pattern from the original work, using an LLM plus Python's dateutil.

You give the model few-shot examples where the "reasoning" is Python code, then ask your question:

from datetime import datetime
from dateutil.relativedelta import relativedelta
 
question = "Today is 27 February 2023. I was born exactly 25 years ago. What is the date I was born in MM/DD/YYYY?"
 
# The prompt shows several examples where each answer is written AS CODE, e.g.:
#
#   # Q: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?
#   # If 2015 is coming in 36 hours, then today is 36 hours before.
#   today = datetime(2015, 1, 1) - relativedelta(hours=36)
#   one_week_from_today = today + relativedelta(weeks=1)
#   one_week_from_today.strftime('%m/%d/%Y')
#
# ...then it ends with "# Q: {question}" for the model to continue.

The model does not answer with a date. It answers with code:

# If today is 27 February 2023 and I was born exactly 25 years ago, then I was born 25 years before.
today = datetime(2023, 2, 27)
# I was born 25 years before,
born = today - relativedelta(years=25)
# The answer formatted with %m/%d/%Y is
born.strftime('%m/%d/%Y')

Then you run it:

exec(llm_out)     # llm_out is the code above
print(born)       # -> 02/27/1998

The model figured out what to compute (subtract 25 years). Python figured out the actual date. Notice the comments in the generated code: that is the natural-language reasoning, sitting right alongside the executable steps. You get CoT's interpretability and correct computation at the same time.

Warning

That exec(llm_out) is running model-generated code. In a demo, fine. In production, never exec raw model output in your main process. Run it in a sandbox with no network, no filesystem, and a timeout. Model-written code plus exec is a remote code execution hole waiting to happen. This connects to the broader point in guardrails and safety for agents in production.

What PAL is really doing#

Two ideas are worth pulling out.

First, separation of concerns applied to reasoning. Language models are good at understanding a problem and structuring an approach. They are bad at executing precise computation. PAL routes each part to the thing that is good at it. That is the same instinct as ART, which pauses generation to call tools. PAL is the special case where the tool is a code interpreter and the "tool call" is the whole answer.

Second, the interpreter is ground truth. Once the model has written correct code, the answer is not a probabilistic guess anymore. It is deterministic. Run it twice, get the same number. That reliability is the entire selling point, and it is why PAL beat plain CoT on math word problems in the paper: the reasoning was comparable, but the execution stopped leaking errors.

When to reach for PAL#

PAL earns its keep whenever the hard part is computation rather than judgment:

Math word problems and multi-step arithmetic.
Date and time reasoning.
Unit conversions, financial calculations, anything with rounding rules.
Table and data manipulation where you would otherwise trust the model to "eyeball" a sum.
Symbolic work you can hand to a library (combinatorics, simple algebra).

It is the wrong tool when the task is about language, judgment, or open-ended reasoning with no crisp computation at the center. Writing, summarizing, classification, and fuzzy decisions do not get more correct by wrapping them in code.

PAL in 2026: you are already using its descendants#

You will rarely hand-build the few-shot PAL prompt today, because the pattern got absorbed into the tools you already use:

Code interpreter / tool use. When you ask a model to "write and run Python to compute this", that is PAL with native tool calling doing the pause-run-resume for you.
Reasoning models with tools. As I noted in prompting reasoning models, models like o3 will spin up code to nail a calculation as part of their reasoning. Same idea, built in.

The lasting lesson survives all of that: do not ask a language model to be a calculator. Ask it to write the calculation and let a real runtime do the arithmetic. If a number matters and it is more than trivial, get it out of token space and into an interpreter. That is PAL in one sentence, and it is as true now as it was in 2022.

This is a standalone deep-dive in my prompting-techniques thread. It builds straight on Chain-of-Thought prompting, so if CoT is not fresh, skim that first.

Why CoT alone drops the ball on computation#

The one-line version: CoT reasons to an answer. PAL reasons to a program, then executes it for the answer.

A worked example: date reasoning#

Date arithmetic is a perfect PAL target. It is fiddly, error-prone for a token predictor, and trivial for a library. Here is the pattern from the original work, using an LLM plus Python's dateutil.

You give the model few-shot examples where the "reasoning" is Python code, then ask your question:

from datetime import datetime
from dateutil.relativedelta import relativedelta
 
question = "Today is 27 February 2023. I was born exactly 25 years ago. What is the date I was born in MM/DD/YYYY?"
 
# The prompt shows several examples where each answer is written AS CODE, e.g.:
#
#   # Q: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?
#   # If 2015 is coming in 36 hours, then today is 36 hours before.
#   today = datetime(2015, 1, 1) - relativedelta(hours=36)
#   one_week_from_today = today + relativedelta(weeks=1)
#   one_week_from_today.strftime('%m/%d/%Y')
#
# ...then it ends with "# Q: {question}" for the model to continue.

The model does not answer with a date. It answers with code:

# If today is 27 February 2023 and I was born exactly 25 years ago, then I was born 25 years before.
today = datetime(2023, 2, 27)
# I was born 25 years before,
born = today - relativedelta(years=25)
# The answer formatted with %m/%d/%Y is
born.strftime('%m/%d/%Y')

Then you run it:

exec(llm_out)     # llm_out is the code above
print(born)       # -> 02/27/1998

Warning

What PAL is really doing#

Two ideas are worth pulling out.

When to reach for PAL#

PAL earns its keep whenever the hard part is computation rather than judgment:

Math word problems and multi-step arithmetic.
Date and time reasoning.
Unit conversions, financial calculations, anything with rounding rules.
Table and data manipulation where you would otherwise trust the model to "eyeball" a sum.
Symbolic work you can hand to a library (combinatorics, simple algebra).

PAL in 2026: you are already using its descendants#

You will rarely hand-build the few-shot PAL prompt today, because the pattern got absorbed into the tools you already use:

Code interpreter / tool use. When you ask a model to "write and run Python to compute this", that is PAL with native tool calling doing the pause-run-resume for you.
Reasoning models with tools. As I noted in prompting reasoning models, models like o3 will spin up code to nail a calculation as part of their reasoning. Same idea, built in.

PAL: Let the Model Reason in Words, but Let Python Do the Math

Why CoT alone drops the ball on computation#

A worked example: date reasoning#

What PAL is really doing#

When to reach for PAL#

PAL in 2026: you are already using its descendants#

Related articles

Active-Prompt: Stop Guessing Which Few-Shot Examples to Annotate

ART: Let the Model Write Its Own Tool-Using Reasoning

Structured Outputs and Function Calling, In Depth

PAL: Let the Model Reason in Words, but Let Python Do the Math

Why CoT alone drops the ball on computation#

A worked example: date reasoning#

What PAL is really doing#

When to reach for PAL#

PAL in 2026: you are already using its descendants#

Related articles

Active-Prompt: Stop Guessing Which Few-Shot Examples to Annotate

ART: Let the Model Write Its Own Tool-Using Reasoning

Structured Outputs and Function Calling, In Depth