A Real Prompt Engineering Case Study: 65.6 to 91.7 F1 on Job Classification
What a production classification system teaches about which prompt tweaks actually move the needle
Most prompt engineering advice is vibes. This post is about the rare counterexample: a production system where every prompt modification was measured, one at a time, on a real task with real stakes. The gap between the naive prompt and the engineered one was 65.6 to 91.7 F1. Same model, same task, same data. The only thing that changed was the prompt.
The case study is Clavié et al. (2023), summarized in the Prompt Engineering Guide. The task: classify whether a job posting is a true entry-level job, suitable for a recent graduate. A binary classification with messy, real-world text, running in production at a recruitment company. They tested GPT-3.5 (gpt-3.5-turbo) against strong supervised baselines including DeBERTa-V3, and the LLM won, but only after the prompt work. That last clause is the entire story.
The headline numbers#
They tested fourteen modifications and measured precision, recall, F1, and "template stickiness" (how often the model answered in the requested format) for each combination. The full progression:
| Configuration | F1 | Stickiness |
|---|---|---|
| Baseline (just ask) | 65.6 | 79% |
| Few-shot CoT examples | 78.4 | 87% |
| Zero-shot "think step by step" | 81.4 | 65% |
| + role and task instructions (both messages) | 87.5 | 71% |
| + mock acknowledgement dialogue | 88.8 | 74% |
| + repeating key instructions | 89.3 | 75% |
| + "reach the right conclusion" | 89.6 | 77% |
| + info addressing common failures | 90.3 | 77% |
| + giving the model a name | 90.9 | 79% |
| + positive feedback before querying | 91.7 | 81% |
A 26-point F1 improvement without touching the model. If a fine-tuning run promised that, you would clear your quarter for it.
The findings that should change how you work#
Few-shot examples made things worse. For this task, where no expert knowledge is required, few-shot CoT (78.4) lost to zero-shot step-by-step reasoning (81.4) in every experiment. Examples anchor the model, and on tasks the model already understands, anchors mostly constrain. I wrote about when examples help and hurt in zero-shot vs few-shot prompting, and this study is the best evidence I know for the "start zero-shot" default.
Instructions do the heavy lifting. The single biggest jump came from properly telling the model its role and task, split across the system message (role) and user message (task). That plus repeating the key points carried most of the 26 points. Boring, unglamorous, decisive.
Forcing a strict output template cost accuracy. Asking the model to strictly follow an answer template pushed stickiness to 98% but dropped F1 from 89.3 to 86.3. Format pressure competes with task pressure. In 2023 you had to choose; today structured outputs give you the format guarantee without spending prompt budget on it, so you can have both (see structured outputs and function calling). But the underlying lesson stands: every constraint in your prompt taxes the model. Spend the budget on the task.
The silly-sounding stuff measured positive. Giving the model a human name and referring to it by that name: +0.6 F1. Offering positive feedback before the query: another measurable gain. Asking it to "reach the right conclusion": same. I would not build a strategy on these, and modern models likely respond differently, but the meta-lesson is real. Small framing changes have outsized, unpredictable effects, which is exactly why you measure instead of guessing.
None of these numbers transfer directly to your task or your model. The finding that transfers is the method: change one thing, measure, keep or revert. Prompt engineering without an eval set is just superstition with extra steps.
The method is the product#
Here is what the workflow actually looked like, and what I copy for my own classification work:
- Build a labeled test set first. A few hundred examples is enough to rank prompt variants.
- Establish the naive baseline. Just ask the model to classify. This is the number every change must beat.
- Change exactly one thing per experiment. Their tables read like an ablation study because that is what disciplined prompting is.
- Track format compliance separately from accuracy. A model that is right but unparseable is still a production failure. Their "template stickiness" metric is worth stealing verbatim.
- Accept task-specific results. Their few-shot finding contradicts the general folk wisdom. Both are right, on different tasks.
The paper also noted that gpt-3.5-turbo needed extra output parsing because it held to templates worse than older GPT-3 variants, and that the template-hurts-accuracy effect disappeared in early GPT-4 testing. Model-specific quirks like these are exactly why the eval loop matters more than any specific trick: the tricks expire, the loop does not.
Would an LLM still be the right call in 2026?#
Worth asking. For a fixed, high-volume binary classification like this, a fine-tuned small model trained on LLM-labeled data is often cheaper at scale now, and I have written about that tradeoff in fine-tuning vs RAG vs prompting. But the economics usually play out in sequence: prompt-engineer an LLM to 90+ F1 first (days of work), ship it, then use it to label training data for the small model if volume justifies it. The case study's approach is still step one of that sequence, three years later.
The next time someone tells you prompt engineering is dead, show them the table above and ask which of their systems would turn down 26 points of F1 for a week of disciplined experiments.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.