Directional Stimulus Prompting (DSP), Explained for Engineers, Folarin Akinloye

Most prompt techniques change what you write. Directional Stimulus Prompting changes who writes it. Instead of you hand-tuning a prompt for a big frozen model you cannot retrain, you train a small model to generate a hint that steers the big one. The big model stays a black box. The small model learns, through reinforcement learning, to whisper exactly the right nudge.

This is a standalone deep-dive in my prompting-techniques thread. It is one of the more research-flavored techniques, so I will keep the framing practical: what it is, why the shape of it matters, and when the idea is useful even if you never train a policy model yourself.

The setup: you can't touch the big model's weights

Plenty of the strongest models are behind an API. You send text, you get text. You cannot fine-tune them, and even if you could, it would be expensive and slow to iterate on. So the only lever you have is the prompt.

Li et al. (2023) introduced Directional Stimulus Prompting for exactly this situation, originally to get better summaries out of a frozen LLM. The insight: you do not need to change the big model to steer it. You need to change its input in a learned, targeted way. So put a small, trainable model in front of it whose only job is to produce a good hint.

The two-model architecture

DSP is two models with a clear division of labor:

The policy LM. Small, tuneable, cheap. For each input, it generates a "directional stimulus": a hint or set of keywords that points the big model at the right answer.
The frozen LLM. Big, powerful, untouchable. It reads your input plus the hint and produces the final output.

                 ┌────────────────────┐
   input ───────>│  policy LM (small) │───> hint / stimulus
     │           │   trained via RL   │        │
     │           └────────────────────┘        │
     │                                          v
     └──────────────────────────────>  ┌──────────────────┐
                                        │ frozen big LLM   │───> output
                                        │  (black box)     │
                                        └──────────────────┘

For a summarization task, the stimulus might be the key entities and facts the summary should hit. Standard prompting would just say "summarize this article". DSP adds a learned hint: "summarize this article; make sure to mention: [Bosch, Q3 earnings, EV battery plant, 12% revenue drop]". Those keywords are not hand-written. The policy model learned to produce them.
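To make the division of labor concrete, here is a minimal sketch of the two-stage call. Everything in it is an assumption for illustration: `build_prompt`, `dsp_generate`, and the two stub functions are hypothetical names, and the prompt wording is mine rather than the paper's template. In a real system `policy_lm` would be a small fine-tuned model and `frozen_llm` would wrap an API client.

```python
from typing import Callable, Sequence


def build_prompt(article: str, hint: Sequence[str] = ()) -> str:
    """Compose the text sent to the frozen LLM, with or without a stimulus."""
    instruction = "Summarize this article."
    if hint:
        instruction += " Make sure to mention: " + "; ".join(hint) + "."
    return f"{article}\n\n{instruction}"


def dsp_generate(
    article: str,
    policy_lm: Callable[[str], Sequence[str]],
    frozen_llm: Callable[[str], str],
) -> str:
    """Two-stage pipeline: the small model writes the hint, the big model writes the output."""
    hint = policy_lm(article)             # small, trainable model
    prompt = build_prompt(article, hint)  # input plus stimulus
    return frozen_llm(prompt)             # black-box call; weights are never touched


# Stubs so the sketch runs without any model or network access.
def fake_policy(article: str) -> list[str]:
    return ["Bosch", "Q3 earnings", "EV battery plant", "12% revenue drop"]


def echo_llm(prompt: str) -> str:
    return prompt  # a real client would return the model's completion


if __name__ == "__main__":
    print(dsp_generate("<article text>", fake_policy, echo_llm))
```

Passing the two models in as plain callables keeps the frozen model swappable: the pipeline only ever sees text in and text out, which is exactly the constraint DSP is designed around.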

Where reinforcement learning comes in

The reason DSP is interesting, and the reason it keeps getting cited as an early example of a bigger trend, is how the policy model learns. You cannot backprop through the frozen black-box LLM. There are no gradients to follow. So DSP trains the policy model with reinforcement learning: the hint is the action, a score of the big model's output (for summarization, a metric such as ROUGE against a reference) is the reward, and the policy model gets better at producing hints that make the frozen model score well. In the paper, the policy model first gets a supervised warm start on labeled data and is then refined with RL.

That is a neat inversion. Normally RL-tuning an LLM means tuning the LLM. Here the giant model is fixed and treated as part of the environment. The thing you optimize is a small model that manipulates the input. You are doing RL against a black box you do not own.
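A toy sketch of that loop, under loud assumptions: the "policy" here is just a softmax over three hand-written candidate hints (a real policy LM generates a hint conditioned on each input), `frozen_llm` is a stub that writes about whatever it is hinted, `reward` is a crude word-overlap score standing in for ROUGE, and the update is plain REINFORCE with a moving-average baseline, chosen for brevity rather than taken from the paper. The point is only that the update uses the reward and the policy's own probabilities, never a gradient from the big model.

```python
import math
import random

REFERENCE = "Bosch reported a 12% revenue drop in Q3 and paused its EV battery plant."

# Hypothetical candidate hints; a real policy LM would generate these per input.
CANDIDATE_HINTS = [
    ["weather", "sports"],
    ["Bosch", "Q3", "12% revenue drop", "EV battery plant"],
    ["Bosch", "earnings"],
]


def frozen_llm(article: str, hint: list[str]) -> str:
    """Stand-in for the black-box API call: it writes about the hinted keywords."""
    return "Summary mentioning " + ", ".join(hint) + "."


def reward(output: str, reference: str) -> float:
    """Crude stand-in for ROUGE: fraction of reference words found in the output."""
    ref_words = set(reference.lower().replace(".", "").split())
    out_words = set(output.lower().replace(".", "").replace(",", "").split())
    return len(ref_words & out_words) / len(ref_words)


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(article: str, reference: str, steps: int = 2000,
                 lr: float = 0.5, seed: int = 0) -> list[float]:
    """REINFORCE against a black box: sample a hint, score the output, nudge the logits."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATE_HINTS)
    baseline = 0.0  # moving average of rewards, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        r = reward(frozen_llm(article, CANDIDATE_HINTS[action]), reference)
        advantage = r - baseline
        # Gradient of log pi(action) w.r.t. logit j is 1[j == action] - probs[j].
        for j in range(len(logits)):
            indicator = 1.0 if j == action else 0.0
            logits[j] += lr * advantage * (indicator - probs[j])
        baseline += 0.1 * (r - baseline)
    return softmax(logits)


if __name__ == "__main__":
    final_probs = train_policy("<article text>", REFERENCE)
    print([round(p, 3) for p in final_probs])
```

Notice that `frozen_llm` is only ever called, never differentiated. Swap the stub for a real API call and the training loop does not change shape, only its cost per step.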

Note

This is the pattern the technique is really famous for. Using a small trained model (or an optimizer) to shape the prompt for a frozen model is now a recognizable family of methods. DSP was an early, clean instance: RL to optimize how you talk to a model you cannot change.

How it compares to what you already know

It helps to place DSP against the other prompt-optimization ideas in this series:

Approach	Who does the work	Needs model weights?
Manual prompt tuning	You, by hand	No
Automatic Prompt Engineer	An LLM proposes and scores prompts	No
Fine-tuning	Gradient updates to the model	Yes
Directional Stimulus Prompting	A small RL-trained policy model generates per-input hints	No (big model stays frozen)

The distinguishing feature is that DSP produces a hint per input, learned, rather than one static prompt for the whole task. It adapts the steer to each example, which is why summarization was the launch use case: the right hint for one article is the wrong hint for another.

Should you build this?

Honestly, for most application work, no, not the full RL setup. Training a policy model with RL against a black box is real engineering effort, and it only pays off when you have a high-volume, well-defined task where small quality gains matter a lot and you genuinely cannot fine-tune the target model. Think large-scale summarization, controlled generation with hard content requirements, or a product surface where output shape has to be tightly steered.

But the idea generalizes, and this is the part worth carrying around:

A hint is a lever. When you cannot change a model, changing its input in a structured, learned way is the next best thing.
You can put a cheap model in front of an expensive one. The small model does the steering; the big one does the heavy lifting. That is a good cost pattern in general, and it rhymes with routing work I touched on in cutting LLM cost and latency.
Optimization against a black box is legitimate. You do not need gradients through the target to improve results. Reward signal plus a trainable front-end is enough.

Even a hand-built, non-RL version of DSP is useful: precompute per-input hints (extract key entities, retrieve relevant facts, classify the intent) and feed them alongside the raw input. You lose the "learned" part, but you keep the core move of steering with a targeted stimulus. That is often enough, and it connects straight to context engineering for agents, which is the same instinct applied to what you put in front of the model.
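As a rough sketch of that hand-built variant, the snippet below extracts a per-input hint with a regex heuristic (capitalized tokens and numeric figures) and appends it to the instruction. The function names, the stopword list, and the prompt wording are all assumptions for illustration; a production version would more likely use a proper entity extractor or a retrieval step.

```python
import re

# Tiny, hypothetical stopword list to drop common sentence-initial capitals.
STOPWORDS = {"The", "A", "An", "It", "This", "That", "In", "On"}


def extract_hint(text: str, max_items: int = 5) -> list[str]:
    """Heuristic stimulus: capitalized words or acronyms, plus numbers (optionally with %)."""
    candidates = re.findall(r"\b[A-Z][A-Za-z0-9]*\b|\b\d+(?:\.\d+)?%?", text)
    seen: set[str] = set()
    hint: list[str] = []
    for token in candidates:
        if token in STOPWORDS or token in seen:
            continue  # skip filler words and duplicates, keep first-seen order
        seen.add(token)
        hint.append(token)
    return hint[:max_items]


def hinted_prompt(text: str) -> str:
    """Attach the precomputed hint to the instruction sent to the frozen model."""
    hint = extract_hint(text)
    instruction = "Summarize this article."
    if hint:
        instruction += " Make sure to mention: " + "; ".join(hint) + "."
    return f"{text}\n\n{instruction}"


if __name__ == "__main__":
    article = "The company Bosch said Q3 revenue fell 12% after it paused an EV battery plant."
    print(extract_hint(article))  # ['Bosch', 'Q3', '12%', 'EV']
    print(hinted_prompt(article))
```

Nothing here is learned, so the hint is only as good as the heuristic. It is still a per-input stimulus, which is the part of DSP that carries over without any training.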

The takeaway

Directional Stimulus Prompting answers a real constraint: how do you steer a model you are not allowed to change? Its answer, a small RL-trained policy model that emits per-input hints, is more machinery than most projects need. But the shape of it is worth internalizing. When the big model is frozen, optimize the thing feeding it. That principle outlives the specific paper, and you will see it again every time someone builds a small model to make a big one behave.
