What Died, What Survived: A Field Report
Honest accounting of the 2023-era toolbox, tested against modern instruction-tuned and reasoning models:
Mostly dead
- Emotional manipulation — "this is very important to my career", tip offers, threats. Measurable effect on current frontier models: negligible.
- Magic incantations — "take a deep breath", "think step by step" appended to everything. Reasoning-class models plan internally; the phrase neither helps nor hurts much. (On small non-reasoning models it can still help.)
- Role-play as a quality lever — "you are the world's best copywriter" adds little beyond what a concrete description of the task and audience adds. Roles still matter for voice and perspective, not for competence.
- Prompt-hacking verbosity — repeating instructions three times, decorating with special tokens. Modern models follow clean instructions better than incantation-stuffed ones.
Very much alive
- Being specific about the task — the model cannot read the requirements doc in your head. Concrete inputs, outputs, constraints, and edge-case behavior remain the highest-leverage words you write.
- Examples (few-shot) — still the single most reliable way to communicate format, tone, and judgment calls that are hard to describe.
- Output contracts — explicit schemas, delimiters, and "if X, output exactly Y" rules; the backbone of anything programmatic.
- Structure and ordering — separating instructions from data, putting stable content first (also what prompt caching rewards).
- Giving the model an out — "if the answer is not in the document, say so" remains a potent hallucination reducer.
The pattern behind the pattern: tricks aimed at the model's psychology died; techniques that add information survived. Modern prompt engineering is requirements writing, not spellcasting.
The Six-Part Prompt: A Structure That Scales
Production prompts converge on a recognizable anatomy. Not every prompt needs all six parts, but knowing the slots turns prompt writing from improvisation into checklist work:
1. ROLE / CONTEXT Who is speaking, to whom, in what situation
2. TASK The one thing to do, stated imperatively
3. CONSTRAINTS Length, tone, what to avoid, what to never do
4. EXAMPLES 1–5 input→output pairs showing judgment calls
5. INPUT DATA The actual material, clearly delimited
6. OUTPUT FORMAT Exact structure of the response (schema, sections)
Rules that repeatedly prove their worth across teams:
- Separate instructions from data with delimiters. XML-style tags, triple quotes, or markdown fences — so "summarize this email" never bleeds into obeying instructions inside the email (the root of prompt injection).
- State constraints positively where possible. "Write in plain prose" beats a list of ten forbidden words; models over-attend to negations and sometimes do the opposite.
- Put the request after the data for long contexts. With big documents, models respond measurably better when the question follows the material rather than preceding it by 50,000 tokens.
- One prompt, one job. A prompt that classifies AND extracts AND rewrites does all three worse than three focused prompts chained. Decomposition remains the most underused technique in the field.
- Version prompts like code. They are code — store them in git, diff them, review changes. "Someone edited the prompt in the dashboard and quality dropped" is 2026's "someone edited prod directly".
Few-Shot Examples: Still the Heavyweight Champion
When output quality disappoints, adding two or three well-chosen examples fixes it more often than any other single intervention. Examples transmit what instructions struggle to: tone, level of detail, how to handle ambiguity, what "good" looks like.
What the research and practice agree on about choosing them:
- Quality and representativeness beat quantity. Two examples spanning your input diversity beat eight near-duplicates. Beyond ~5, gains typically flatten while cost rises.
- Include a hard case. An example showing the correct handling of an ambiguous or edge input teaches more than three easy ones. If your task has a "reject / cannot answer" branch, always show it — otherwise the model will force an answer.
- Format is contagious — ruthlessly so. Models copy example formatting more faithfully than instructions describe it. If your examples use bullet fragments, output will too, whatever the instructions say. This cuts both ways: fixing example formatting fixes output formatting.
- Mind the label distribution. In classification, models skew toward labels overrepresented in the examples; balance them.
- For reasoning-class models, show input→output, not worked reasoning. Demonstrating your chain of thought can actually constrain a reasoning model into shallower paths than it finds on its own; give it the destination, let it drive.
The workflow that produces good example sets is unglamorous: collect real failures from production or testing, write the ideal output for each, and promote the most instructive ones into the prompt. Your examples become a distilled spec of every lesson learned.
Structured Output: From "Please Return JSON" to Contracts
Anything downstream of an LLM that parses its output needs the output to be reliable — and this area matured from begging to engineering:
- Use native structured output where available. Major APIs now accept a JSON Schema and constrain generation to it (OpenAI structured outputs, tool/function calling on Anthropic and others). This eliminates the parse-failure class entirely — no more regexing JSON out of markdown fences.
- When you must prompt for structure: show the exact schema with a filled example, specify behavior for missing data ("use null, never omit keys"), and say "output only the JSON, no preamble". Then still parse defensively.
- Design schemas for the model, not just the parser. Field names carry meaning: extracting "customer_complaint_summary" outperforms extracting "field_3". Descriptions in the schema are read as instructions.
- Order fields to exploit generation order. Because output is produced left to right, putting a "reasoning" or "evidence" field before the "verdict" field lets the model commit to analysis before conclusion — a schema-level chain of thought that measurably improves verdict accuracy on non-reasoning models.
- Enumerate, don't describe. A field defined as an enum of five allowed values beats "categorize appropriately" every time. Free-text fields are where inconsistency lives.
The mental shift: treat the model's output as an API response with a contract, test the contract on adversarial inputs, and monitor contract violations in production like any error rate.
From Craft to Engineering: Evals, Iteration, and Maintenance
The difference between prompt writing and prompt engineering is the feedback loop. The craft's mature workflow, which every serious team converges on:
- 1. Build a test set before optimizing. 20–100 real inputs with expected outputs (or grading criteria). Without it, prompt iteration is a random walk guided by the last example you looked at.
- 2. Change one thing at a time and measure. Prompt edits interact unpredictably; bundle five changes and you learn nothing from the result.
- 3. Test on the messy inputs. Prompts tuned on clean examples shatter on real user input — typos, mixed languages, half-questions, hostile content. Your eval set should look like production, not like documentation.
- 4. Use LLM-as-judge carefully. Grading outputs with a second model scales evaluation, but calibrate the judge against human ratings first, and never let the judged model grade itself on style questions.
- 5. Re-run evals on every model upgrade. Model updates silently shift behavior; the prompt that was optimal on last quarter's model may underperform on this quarter's. Pin model versions in production and upgrade deliberately, eval report in hand.
- 6. Write prompts for the next maintainer. Six months from now someone (probably you) must modify the prompt without breaking eleven implicit behaviors. Comments in the template, a linked eval set, and a changelog turn prompts from haunted artifacts into maintainable assets.
Prompt engineering did not die — it professionalized. The incantations went away; the requirements writing, example curation, output contracts, and regression testing remain. Which is to say: it became software engineering.