Directional Stimulus Prompting: Train a Tiny Model to Whisper Hints to a Big One
A small tuneable policy model generates hints that steer a frozen black-box LLM, no access to its weights required
Most prompt techniques change what you write. Directional Stimulus Prompting changes who writes it. Instead of hand-tuning a prompt for a big frozen model you cannot retrain, you train a small model to generate a hint that steers the big one. The big model stays a black box. The small model learns, through reinforcement learning, to whisper exactly the right nudge.
This is a standalone deep-dive in my prompting-techniques thread. It is one of the more research-flavored techniques, so I will keep the framing practical: what it is, why the shape of it matters, and when the idea is useful even if you never train a policy model yourself.
The setup: you can't touch the big model's weights
Plenty of the strongest models are behind an API. You send text, you get text. You cannot fine-tune them, and even if you could, it would be expensive and slow to iterate on. So the only lever you have is the prompt.
Li et al. (2023) introduced Directional Stimulus Prompting for exactly this situation, originally to get better summaries out of a frozen LLM. The insight: you do not need to change the big model to steer it. You need to change its input in a learned, targeted way. So put a small, trainable model in front of it whose only job is to produce a good hint.
The two-model architecture
DSP is two models with a clear division of labor:
- The policy LM. Small, tuneable, cheap. For each input, it generates a "directional stimulus": a hint or set of keywords that points the big model at the right answer.
- The frozen LLM. Big, powerful, untouchable. It reads your input plus the hint and produces the final output.
                    ┌────────────────────┐
    input ─┬───────>│ policy LM (small)  │───> hint / stimulus
           │        │ trained via RL     │            │
           │        └────────────────────┘            │
           │                                          v
           └──────────────────────────────> ┌──────────────────┐
                                            │ frozen big LLM   │───> output
                                            │ (black box)      │
                                            └──────────────────┘

For a summarization task, the stimulus might be the key entities and facts the summary should hit. Standard prompting would just say "summarize this article". DSP adds a learned hint: "summarize this article; make sure to mention: [Bosch, Q3 earnings, EV battery plant, 12% revenue drop]". Those keywords are not hand-written. The policy model learned to produce them.
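To make the data flow concrete, here is a minimal sketch of one DSP forward pass. Everything in it is my own illustration rather than code from the paper: `build_prompt`, `dsp_generate`, and the two stub functions are hypothetical names, and the stubs stand in for a real policy model and a real API call so the example runs on its own.

```python
from typing import Callable, List


def build_prompt(article: str, hints: List[str]) -> str:
    """Combine the task instruction, the per-input hint, and the raw input."""
    instruction = "Summarize this article"
    if hints:
        instruction += "; make sure to mention: [" + ", ".join(hints) + "]"
    return instruction + ".\n\nArticle:\n" + article


def dsp_generate(
    article: str,
    policy_lm: Callable[[str], List[str]],
    frozen_llm: Callable[[str], str],
) -> str:
    """One forward pass: policy LM -> hint -> frozen LLM -> output."""
    hints = policy_lm(article)             # small, trainable model
    prompt = build_prompt(article, hints)  # the hint rides along with the input
    return frozen_llm(prompt)              # big model, called as a black box


# Hypothetical stand-ins so the sketch runs without any model or API key.
def stub_policy_lm(article: str) -> List[str]:
    return ["Bosch", "Q3 earnings", "EV battery plant", "12% revenue drop"]


def stub_frozen_llm(prompt: str) -> str:
    return "(frozen LLM output would go here) prompt was:\n" + prompt


if __name__ == "__main__":
    print(dsp_generate("<article text>", stub_policy_lm, stub_frozen_llm))
```

The point of passing both models in as plain callables is that the frozen LLM never needs to expose anything beyond text in, text out. Only `policy_lm` would ever be trained.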
Where reinforcement learning comes in
The reason DSP is interesting, and the reason it keeps getting cited as an early example of a bigger trend, is how the policy model learns. You cannot backprop through the frozen black-box LLM: an API returns text, not gradients. So DSP trains the policy model with reinforcement learning. The hint is the action, a quality score on the big model's output (a summarization metric, for example) is the reward, and the policy model gets better at producing hints that make the frozen model score well. In the paper, the policy model is first warmed up with supervised fine-tuning on a small labelled set and then refined with RL.
That is a neat inversion. Normally RL-tuning an LLM means tuning the LLM. Here the giant model is fixed and treated as part of the environment. The thing you optimize is a small model that manipulates the input. You are doing RL against a black box you do not own.
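Here is a toy sketch of that loop, under heavy assumptions. The "policy model" is just one logit per candidate keyword rather than a language model conditioned on the input, the frozen LLM is a stub that echoes the hint, the reward is a made-up coverage score, and the update rule is plain REINFORCE with a running baseline, which is a simpler algorithm than a production RL setup would use. All names are hypothetical. What it does show faithfully is the shape: sample a hint, call the black box, score the output, and nudge only the policy.

```python
import math
import random
from typing import Dict, List

CANDIDATES = ["Bosch", "Q3 earnings", "EV battery plant", "weather", "football"]
REFERENCE = {"Bosch", "Q3 earnings", "EV battery plant"}  # what a good summary covers


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def black_box_llm(hint: List[str]) -> str:
    """Stub for the frozen LLM: text in, text out, no gradients exposed."""
    return "Summary covering: " + ", ".join(hint)


def reward(output: str) -> float:
    """Toy quality score: +1 per reference item covered, -1 per off-topic item."""
    score = 0.0
    for kw in CANDIDATES:
        if kw in output:
            score += 1.0 if kw in REFERENCE else -1.0
    return score


def train(steps: int = 3000, lr: float = 0.1, seed: int = 0) -> Dict[str, float]:
    """Return the learned inclusion probability for each candidate keyword."""
    rng = random.Random(seed)
    logits = {kw: 0.0 for kw in CANDIDATES}  # the entire toy "policy model"
    baseline = 0.0                           # running mean reward, reduces variance
    for _ in range(steps):
        probs = {kw: sigmoid(v) for kw, v in logits.items()}
        # Action: sample which keywords go into the hint.
        action = {kw: 1 if rng.random() < probs[kw] else 0 for kw in CANDIDATES}
        hint = [kw for kw in CANDIDATES if action[kw]]
        r = reward(black_box_llm(hint))      # environment step through the black box
        advantage = r - baseline
        # REINFORCE: gradient of log Bernoulli(a; p) w.r.t. the logit is (a - p).
        for kw in CANDIDATES:
            logits[kw] += lr * advantage * (action[kw] - probs[kw])
        baseline += 0.05 * (r - baseline)
    return {kw: sigmoid(v) for kw, v in logits.items()}


if __name__ == "__main__":
    for kw, p in train().items():
        print(f"{kw}: {p:.2f}")
```

Notice that `black_box_llm` and `reward` are only ever called, never differentiated. That is the whole trick: the gradient flows through the policy's own log-probabilities, weighted by a score that came back from a system you cannot see inside.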
This is the pattern the technique is best known for. Using a small trained model (or an optimizer) to shape the prompt for a frozen model is now a recognizable family of methods. DSP was an early, clean instance: RL to optimize how you talk to a model you cannot change.
How it compares to what you already know
It helps to place DSP against the other prompt-optimization ideas in this series:
| Approach | Who does the work | Needs model weights? |
|---|---|---|
| Manual prompt tuning | You, by hand | No |
| Automatic Prompt Engineer | An LLM proposes and scores prompts | No |
| Fine-tuning | Gradient updates to the model | Yes |
| Directional Stimulus Prompting | A small RL-trained policy model generates per-input hints | No (big model stays frozen) |
The distinguishing feature is that DSP produces a hint per input, learned, rather than one static prompt for the whole task. It adapts the steer to each example, which is why summarization was the launch use case: the right hint for one article is the wrong hint for another.
Should you build this?
Honestly, for most application work, no, not the full RL setup. Training a policy model with RL against a black box is real engineering effort, and it only pays off when you have a high-volume, well-defined task where small quality gains matter a lot and you genuinely cannot fine-tune the target model. Think large-scale summarization, controlled generation with hard content requirements, or a product surface where output shape has to be tightly steered.
But the idea generalizes, and this is the part worth carrying around:
- A hint is a lever. When you cannot change a model, changing its input in a structured, learned way is the next best thing.
- You can put a cheap model in front of an expensive one. The small model does the steering; the big one does the heavy lifting. That is a good cost pattern in general, and it rhymes with the routing work I touched on in my piece on cutting LLM cost and latency.
- Optimization against a black box is legitimate. You do not need gradients through the target to improve results. Reward signal plus a trainable front-end is enough.
Even a hand-built, non-RL version of DSP is useful: precompute per-input hints (extract key entities, retrieve relevant facts, classify the intent) and feed them alongside the raw input. You lose the "learned" part, but you keep the core move of steering with a targeted stimulus. That is often enough, and it connects straight to context engineering for agents, which is the same instinct applied to what you put in front of the model.
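A hand-built version can be as small as this sketch. It is deliberately crude and entirely my own illustration: the regex, the stopword list, and the function names `extract_hints` and `prompt_with_hints` are assumptions, and the sample sentence in the usage note is made up. In practice you would likely swap the regex for a proper entity extractor or a retrieval step.

```python
import re
from typing import List

# Crude, hypothetical stopword list for capitalized sentence openers.
STOPWORDS = {"The", "A", "An", "It", "This", "That"}

HINT_PATTERN = re.compile(
    r"\d+(?:\.\d+)?%"                                    # percentages, e.g. 12%
    r"|\b[A-Z][A-Za-z0-9]*(?:\s+[A-Z][A-Za-z0-9]*)*\b"   # runs of capitalized words
)


def extract_hints(text: str, max_hints: int = 6) -> List[str]:
    """Heuristic stand-in for the policy model: names and numeric facts."""
    hints: List[str] = []
    for match in HINT_PATTERN.finditer(text):
        candidate = match.group(0)
        if candidate in STOPWORDS or candidate in hints:
            continue  # skip sentence openers and duplicates
        hints.append(candidate)
        if len(hints) == max_hints:
            break
    return hints


def prompt_with_hints(text: str) -> str:
    """Build the steered prompt: instruction, extracted hints, then the raw input."""
    hints = extract_hints(text)
    return (
        "Summarize this article; make sure to mention: ["
        + ", ".join(hints)
        + "]\n\n"
        + text
    )
```

Calling `prompt_with_hints("Bosch reported a 12% revenue drop in Q3. The company still plans a new EV battery plant.")` yields a prompt whose hint list is built from the names and figures the regex can find. It will miss lowercase phrases such as "battery plant", which is exactly the kind of gap a learned policy is meant to close.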
The takeaway
Directional Stimulus Prompting answers a real constraint: how do you steer a model you are not allowed to change? Its answer, a small RL-trained policy model that emits per-input hints, is more machinery than most projects need. But the shape of it is worth internalizing. When the big model is frozen, optimize the thing feeding it. That principle outlives the specific paper, and you will see it again every time someone builds a small model to make a big one behave.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.