Multimodal Chain-of-Thought: Reason Over the Picture, Then Answer
A two-stage framework that generates a rationale from text and image first, then infers the answer from that rationale
Chain-of-Thought taught models to reason step by step, but only over text. The moment the problem includes a diagram, a chart, or a photo, plain CoT has no way to reason about what the image shows. Multimodal CoT extends the idea to text and images together, and it does so with one design choice that turns out to matter a lot: reason first, answer second, in two separate stages.
This is a standalone deep-dive in my prompting-techniques thread. It assumes you know Chain-of-Thought prompting. If you do, the multimodal jump is small and the interesting part is the architecture.
The problem: reasoning that ignores half the input
Think about a physics question with a diagram, or a biology question with a labeled figure. A text-only CoT model reads the words and reasons about them, but the picture carries information the words assume you can see. The model is reasoning with one eye closed.
The naive fix is to just feed a multimodal model the image and the question and ask for an answer. That works for easy cases and falls over on anything that needs actual reasoning across the two modalities, because the model tries to jump straight to an answer while juggling perception and reasoning at once.
Zhang et al. (2023) proposed Multimodal CoT to handle this properly. Traditional CoT lives entirely in the language modality. Multimodal CoT brings vision in and, crucially, splits the work into two stages.
The two-stage framework
This is the whole idea, and it is worth being precise about it.
Stage 1: rationale generation. Feed the model the text and the image. Its job here is not to answer. It is to produce a rationale, a reasoning chain that pulls together what the question says and what the image shows.
Stage 2: answer inference. Feed the model the original input plus the rationale it just generated. Now it produces the final answer, grounded in that rationale.
Stage 1 (rationale generation)
text + image ──────────────────────────► rationale
                                             │
Stage 2 (answer inference)                   │
text + image + rationale ───────────────► final answer
The split is the point. By generating the rationale as its own step, the model gets to fuse the visual and textual information into an explicit chain before it commits to an answer. Stage 2 then reasons over a clean, informative rationale instead of trying to perceive and conclude in a single pass. Separating "understand the scene" from "decide the answer" gives you noticeably better answers on problems that actually need both.
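The data flow in the diagram can be sketched in a few lines of Python. This is a prompt-level approximation, not the paper's setup, which trains models for each stage. The `VisionModel` callable, the prompt wording, and the `StagedResult` type are all hypothetical names introduced here for illustration; swap in whatever vision-language client you actually use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface (an assumption, not a real library API):
# takes a text prompt and raw image bytes, returns the model's text output.
VisionModel = Callable[[str, bytes], str]


@dataclass
class StagedResult:
    rationale: str
    answer: str


def multimodal_cot(question: str, image: bytes, model: VisionModel) -> StagedResult:
    # Stage 1: rationale generation. Ask for reasoning only, not the answer.
    rationale_prompt = (
        f"Question: {question}\n"
        "Explain step by step what the image and the question tell you. "
        "Do not state the final answer."
    )
    rationale = model(rationale_prompt, image)

    # Stage 2: answer inference. Original input plus the stage 1 rationale.
    answer_prompt = (
        f"Question: {question}\n"
        f"Rationale: {rationale}\n"
        "Using the rationale and the image, give the final answer."
    )
    answer = model(answer_prompt, image)
    return StagedResult(rationale=rationale, answer=answer)
```

Keeping the rationale as a separate return value, rather than discarding it after stage 2, is what lets you inspect it later when an answer turns out to be wrong.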
The result that made people notice
The headline from the paper: a Multimodal CoT model with fewer than 1 billion parameters outperformed GPT-3.5 on the ScienceQA benchmark, which is a set of multimodal science questions. A sub-1B model beating a much larger text-centric model was a real "wait, what" moment in 2023.
The reason is not that the small model is secretly smarter. It is that the architecture matched the problem. ScienceQA questions genuinely need the image, and the two-stage rationale-then-answer design used the image well. Structure beat raw scale on a task where structure was what mattered.
This is a recurring theme across this whole series of techniques: a good reasoning structure can outrun a bigger model on the right task. PAL does it by offloading math, ART by using tools, Multimodal CoT by staging perception before conclusion. The prompt or the framework is doing real work, not just decoration.
Why the two-stage split still matters
Today's frontier vision-language models (Gemini 2.5, o3, Claude with vision) can reason over images natively, so you rarely wire up the literal two-model pipeline from the paper. But the design lesson is very much alive, and you can apply it by hand in a prompt:
- Ask for the observation before the answer. "First describe what the chart shows, then answer the question" is Multimodal CoT in miniature. You are forcing stage 1 before stage 2 inside a single model.
- Separate perception errors from reasoning errors. When a vision model gets something wrong, a visible rationale tells you whether it misread the image or reasoned badly about a correct reading. Those are different bugs with different fixes.
- Ground the answer in stated evidence. Making the model write down what it saw before concluding reduces the "confidently wrong about the picture" failure mode.
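The perception-versus-reasoning distinction in the list above can be turned into a small triage helper for an evaluation run. This is a minimal sketch that assumes a human reviewer has already labeled, for each example, whether the model's stated observation matched the image and whether the final answer was correct. The `Failure` enum, `triage`, and `tally` are hypothetical names, not part of any library.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Tuple


class Failure(Enum):
    NONE = "none"
    PERCEPTION = "perception"  # the model misread the image
    REASONING = "reasoning"    # correct reading, wrong conclusion


def triage(observation_correct: bool, answer_correct: bool) -> Failure:
    # A wrong observation is flagged even when the answer happens to be right,
    # because that answer is not grounded in what the image actually shows.
    if not observation_correct:
        return Failure.PERCEPTION
    if not answer_correct:
        return Failure.REASONING
    return Failure.NONE


def tally(labels: Iterable[Tuple[bool, bool]]) -> Counter:
    # labels: (observation_correct, answer_correct) pairs from manual review.
    return Counter(triage(obs, ans) for obs, ans in labels)
```

A run dominated by `PERCEPTION` failures points at image quality, resolution, or the observation prompt; one dominated by `REASONING` failures points at the reasoning prompt or the model itself.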
As I mentioned in my post on prompting reasoning models, the newest reasoning models can even manipulate images inside their chain-of-thought (zoom, crop, rotate) as part of stage 1. That is the same rationale-first instinct, just with tools attached and the whole thing folded into one model.
A practical prompt pattern
You do not need two models to get the benefit. On any capable vision model, structure the prompt to force the stages:
You will answer a question about the attached image.
Step 1 - Observe: Describe only what is actually visible in the image
that is relevant to the question. Do not answer yet.
Step 2 - Reason: Using your observation and the question, work through
the reasoning.
Step 3 - Answer: Give the final answer.
That "do not answer yet" in step 1 is doing real work. It stops the model from racing to a conclusion and then rationalizing backward, which is exactly the failure the two-stage design was built to prevent.
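If you want to log or inspect the stages separately, the prompt pattern above pairs naturally with a small parser. The sketch below assumes the model echoes the "Step N - Label:" headers in its reply, which is common but not guaranteed, so treat a missing key as a signal to retry or fall back. `build_prompt`, `split_steps`, and the trailing "Question:" line are illustrative additions, not part of the original prompt.

```python
import re

PROMPT = (
    "You will answer a question about the attached image.\n"
    "Step 1 - Observe: Describe only what is actually visible in the image\n"
    "that is relevant to the question. Do not answer yet.\n"
    "Step 2 - Reason: Using your observation and the question, work through\n"
    "the reasoning.\n"
    "Step 3 - Answer: Give the final answer.\n"
)

# Matches "Step N - Label: body" up to the next step header or end of text.
STEP_PATTERN = re.compile(
    r"Step\s*\d\s*-\s*(\w+):\s*(.*?)(?=\n\s*Step\s*\d\s*-|\Z)",
    re.DOTALL,
)


def build_prompt(question: str) -> str:
    # Appending the question is an assumption about how you pass it in.
    return PROMPT + "Question: " + question


def split_steps(response: str) -> dict[str, str]:
    # Returns e.g. {"observe": ..., "reason": ..., "answer": ...},
    # keyed by the lowercased label; absent steps are simply missing.
    return {
        label.lower(): body.strip()
        for label, body in STEP_PATTERN.findall(response)
    }
```

With the observation stored under its own key, you can check it against the image independently of the final answer.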
The takeaway
Multimodal CoT is Chain-of-Thought that can finally see, and its lasting contribution is the two-stage split: build a rationale from text and image first, then answer from the rationale. Even with modern models that fuse vision and reasoning natively, forcing "observe, then reason, then answer" tends to make answers on vision tasks more accurate and far easier to debug. When an image is part of the question, do not let the model skip straight to the answer. Make it look first.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.