Prompt optimization is the practice of improving a prompt against measurable goals (accuracy, cost, latency) instead of editing it by feel. Where prompt engineering is manual craft, prompt optimization is measurement: you test prompts against your own data, score each one, and keep the version that nothing else beats. This glossary defines the vocabulary that shows up once you take that seriously.
The language around large language models has split into three overlapping crafts: writing instructions by hand (prompt engineering), measuring and automating their improvement (prompt optimization), and designing the entire context a model sees (context engineering). The terms below span all three. Read the glossary top to bottom for a tour, or jump to the section you need.
Jump to a section
- Foundations: prompts, tokens, and the settings that shape every call
- Prompting techniques: the moves that change how a model answers
- Evaluation and data: how you measure whether a prompt is any good
- Cost, latency, and the frontier: the trade-offs optimization actually navigates
- Automatic prompt optimization: letting algorithms search for better prompts
- Reasoning and the 2026 frontier: the newest vocabulary worth knowing
- Reliability and failure modes: the ways prompts go wrong
- FAQ
Foundations
Prompt
The input you give a model to get an output — instructions, questions, examples, context, or any mix of them. Everything else in this glossary is about making that input work harder.
Prompt template
A reusable prompt with named placeholders you fill in at runtime, so one structure serves many inputs instead of being rewritten for each.
Fill the {{VARIABLE}} markers with real values at call time:
Summarize this {{DOCUMENT}} in {{N}} bullet points.
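Filling a template can be sketched in a few lines. This is a minimal illustration, not a particular library's API: the `render` helper and the sample values are hypothetical, and it assumes placeholders are written as `{{NAME}}` with word characters only.

```python
import re

def render(template: str, values: dict) -> str:
    """Replace each {{NAME}} marker with values["NAME"]; fail loudly if one is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return str(values[name])
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

template = "Summarize this {{DOCUMENT}} in {{N}} bullet points."
# Sample values are invented for illustration.
prompt = render(template, {"DOCUMENT": "quarterly report", "N": 3})
print(prompt)  # Summarize this quarterly report in 3 bullet points.
```

Raising on a missing value, rather than leaving the marker in place, keeps a half-filled template from silently reaching the model.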
Prompt engineering
The manual craft of writing and refining prompts to get better results, without changing the model's weights. It is skilled work, but it is judged by feel: you change some wording, eyeball a few outputs, and decide it looks better.
Prompt optimization
The systematic, often automated version of prompt engineering: you define objectives and a set of test cases, then search for the prompt that scores best against them. Engineering produces a prompt; optimization tells you how good it is and finds a better one.
System prompt
A high-priority instruction that sets a model's role, rules, and tone for an entire conversation, separate from the user's individual messages. Providers handle message layers differently, but the rule is the same: keep stable policy here, and put changing facts in retrieval, tools, or user context.
Token
The atomic unit of text a model reads and writes. "Roughly four characters" is a useful English rule of thumb, but tokenization varies by model and language, code tokenizes differently from prose, and hidden reasoning or tool-call payloads add tokens you never see. Pricing, context limits, and latency are all counted in tokens, which is why a shorter prompt at equal quality is almost always the better one.
Context window
The maximum number of tokens a model can consider at once, counting input, generated output, and sometimes internal reasoning or tool traces. Bigger windows help, but they do not remove the need to curate context: stale history, noisy retrievals, and duplicated examples still dilute the signal.
Temperature
A sampling setting, usually between 0 and 2, that controls randomness. Low temperature makes output focused and repeatable; high temperature makes it more varied. For evaluation and optimization you usually want it low, so a prompt's score reflects the prompt and not the dice.
Top-p and top-k
Alternative ways to control randomness. Top-k limits the model to its k most likely next tokens; top-p (nucleus sampling) limits it to the smallest set of tokens whose cumulative probability reaches at least p. Both are levers for the same trade-off temperature governs: focus versus variety.
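How the three settings interact is easier to see on a toy distribution. The sketch below is an assumption-laden illustration, not any provider's implementation: it applies temperature first, then top-k, then top-p, and the four-token vocabulary and logits are invented.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature scaling, top-k, then top-p."""
    if temperature <= 0:
        raise ValueError("this sketch requires a positive temperature")
    # Temperature divides the logits: lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:max(top_k, 1)]  # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:  # smallest prefix whose mass reaches top_p
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    norm = sum(prob for _, prob in ranked)  # renormalize the survivors
    return {tok: prob / norm for tok, prob in ranked}

# Invented logits for a four-token vocabulary.
logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
focused = next_token_distribution(logits, temperature=0.5)
nucleus = next_token_distribution(logits, top_p=0.8)  # keeps only "a" and "b"
```

A sampler would then draw the next token from the returned probabilities; the filtering step is where the focus-versus-variety trade-off is actually set.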
Max tokens
A cap on how many tokens a model may generate in a single response. It bounds cost and latency, but set it too low and the model gets cut off mid-answer.
Prompting techniques
Zero-shot prompting
Asking a model to do a task with no examples, relying only on your instructions and what it learned in training. It is the simplest prompt and usually the right baseline to measure everything else against.
Few-shot prompting
Including a handful of worked examples in the prompt to show the model the pattern you want; one example is one-shot, several is few-shot. Examples often lift quality sharply, but each one adds tokens, so they cost more on every call.
In-context learning
The underlying ability that makes few-shot work: a model picking up a task from examples in the prompt alone, with no change to its weights. The lesson lasts only for that one request.
| Technique | Examples in the prompt | Best for |
|---|---|---|
| Zero-shot | None | Simple, well-known tasks; the cheapest option and the usual baseline |
| One-shot | One | Pinning down an exact output format |
| Few-shot | Several | Steering tone, structure, or tricky edge cases |
Chain-of-thought (CoT)
Prompting the model to work through intermediate steps before giving a final answer, which can improve accuracy on math, logic, and multi-step problems. On modern reasoning models the raw chain-of-thought is often hidden, so production prompts usually ask for a concise rationale or just the final answer. The cost is more reasoning or output tokens, and therefore more money and latency.
Self-consistency
Running several independent reasoning attempts and taking the majority, best-scored, or judge-approved answer. It trades extra calls for reliability on hard problems where one reasoning path is brittle, but it only works well when the answer can be compared or verified.
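For answers that can be compared directly, the majority-vote variant reduces to a few lines. A minimal sketch, assuming the final answers have already been extracted from each attempt as short strings; the sample values are invented.

```python
from collections import Counter

def majority_answer(answers):
    """Return (winning answer, share of votes) after light normalization."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [answer.strip().lower() for answer in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)

# Five hypothetical reasoning attempts at the same question.
samples = ["42", "42 ", "41", "42", "40"]
answer, agreement = majority_answer(samples)
print(answer, agreement)  # 42 0.6
```

The agreement share is worth logging: a low value signals a question where the vote itself is unreliable.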
ReAct
A pattern that interleaves reasoning with actions: the model thinks, calls a tool or runs a search, reads the result, and continues. Modern agent APIs often expose this as tool calls plus observations, so the tool schema and permission boundary matter as much as the prompt wording.
Tool calling / function calling
Letting a model choose an external function, emit structured arguments, and continue from the result. Use it when the model needs data or actions outside its weights; use structured output when you only need the final answer to match a schema. In production, validate arguments, keep tools scoped, and treat tool results as untrusted context.
Role prompting
Telling the model who to be — "you are a senior tax accountant" — to steer vocabulary, tone, and rigor. Useful, but a persona is not a substitute for clear instructions or examples.
Meta-prompting
Using one LLM to write or improve prompts for another. It is the manual seed of automatic prompt optimization, and a fast way to get a stronger first draft than you would write by hand.
Prompt chaining
Breaking a task into a sequence of smaller prompts, each handling one step and feeding the next. Chaining makes complex workflows debuggable, because you can see and fix the exact step that failed.
Structured output
Constraining a model to return machine-readable output that follows a schema — often JSON — so downstream code can parse it reliably. Many providers now enforce schemas directly, but support is still provider- and model-dependent, and schema-constrained final answers are different from tool-call arguments.
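Even when a provider enforces a schema, downstream code often re-validates what it parses. The following is a minimal standard-library sketch, assuming a hypothetical two-field schema (`category`, `confidence`) that does not come from any real API.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
TICKET_SCHEMA = {"category": str, "confidence": float}

def parse_structured(raw: str, schema: dict) -> dict:
    """Parse a JSON object and check it has exactly the expected fields and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = set(schema) - set(data)
    extra = set(data) - set(schema)
    if missing or extra:
        raise ValueError(f"missing: {sorted(missing)}, unexpected: {sorted(extra)}")
    for field, expected_type in schema.items():
        if not isinstance(data[field], expected_type):
            raise ValueError(f"{field} should be of type {expected_type.__name__}")
    return data

ticket = parse_structured('{"category": "billing", "confidence": 0.92}', TICKET_SCHEMA)
```

In a real project a schema library would likely replace this hand-rolled check; the point is only that parsing and validation happen before the output is trusted.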
Evaluation and data
Eval
Short for evaluation: the systematic measurement of how well a model or prompt performs, using fixed datasets, rubrics, human ratings, or other models as graders. Evals are what turn "the prompt seems fine" into a number you can defend.
Golden dataset
A curated, expert-checked collection of inputs paired with their correct outputs, used as the benchmark you score every prompt against. A good golden set is small but ruthlessly clean, versioned, and split into optimization and held-out slices — its quality is the ceiling on everything you measure.
Ground truth
The canonical correct answer for an example — the label a prediction is graded against. When ground truth is wrong or ambiguous, your scores measure the data, not the prompt.
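Eval, golden dataset, and ground truth combine into a very small scoring loop. The sketch below is illustrative only: the four examples are invented, and `keyword_baseline` is a hypothetical stand-in for a real model call.

```python
def exact_match_accuracy(predict, golden):
    """Fraction of golden examples where predict(input) equals the ground truth."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(
        1 for example in golden
        if predict(example["input"]).strip().lower() == example["expected"].strip().lower()
    )
    return hits / len(golden)

# Invented golden set: each input is paired with its ground-truth label.
golden = [
    {"input": "I was charged twice", "expected": "billing"},
    {"input": "App crashes on login", "expected": "technical"},
    {"input": "How do I cancel?", "expected": "account"},
    {"input": "Refund my order", "expected": "billing"},
]

def keyword_baseline(text):
    """Hypothetical stand-in for a prompt plus model call."""
    lowered = text.lower()
    if "charged" in lowered or "refund" in lowered:
        return "billing"
    if "crash" in lowered:
        return "technical"
    return "other"

accuracy = exact_match_accuracy(keyword_baseline, golden)
print(accuracy)  # 0.75
```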
LLM-as-a-judge
Using a language model to grade another model's outputs against a rubric, which scales evaluation to thousands of cases too slow to review by hand. It is powerful and now standard, but judges carry known biases — position, verbosity, style, and self-preference — so serious setups use reference answers, swapped order, calibration sets, and human spot checks.
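The swapped-order check for position bias can be sketched as follows. It assumes a hypothetical `judge(question, first, second)` callable that returns "A" if it prefers the first answer shown and "B" otherwise; the two toy judges are invented to show both outcomes.

```python
def pairwise_verdict(judge, question, answer_1, answer_2):
    """Ask the judge twice with the answers swapped; accept only a consistent winner."""
    first_pass = judge(question, answer_1, answer_2)
    second_pass = judge(question, answer_2, answer_1)  # same pair, order swapped
    if first_pass == "A" and second_pass == "B":
        return "answer_1"
    if first_pass == "B" and second_pass == "A":
        return "answer_2"
    return "tie"  # the verdict followed the position, not the content

def always_first(question, first, second):
    """Toy judge with pure position bias."""
    return "A"

def content_judge(question, first, second):
    """Toy judge that actually looks at the content."""
    return "A" if "Paris" in first else "B"

question = "What is the capital of France?"
print(pairwise_verdict(always_first, question, "Paris", "Lyon"))   # tie
print(pairwise_verdict(content_judge, question, "Paris", "Lyon"))  # answer_1
```

Treating inconsistent verdicts as ties is one design choice among several; they could also be routed to a human spot check.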
Synthetic data generation
Using an LLM to manufacture test or training examples, often to cover edge cases real data misses or to avoid exposing sensitive records. Synthetic eval sets still need checking: a generated label can be wrong in the same confident way a model output can.
Eval-driven development
Building and changing an LLM system against an explicit eval set rather than by impression — write the eval first, then let the score decide whether a prompt, model, or setting change ships. It is the discipline that makes prompt optimization possible at all.
Benchmark
A fixed, standardized test set used to compare models or prompts on equal footing. Public benchmarks measure general ability, but they can be overfit, contaminated, or irrelevant to your workflow; your own eval set measures the only thing that pays the bills — performance on your task.
Eval leakage
When examples, labels, rubrics, or judge preferences from the eval set leak into optimization, making a prompt look better than it really is. The defense is a blind hold-out set, versioned data splits, and logs showing which candidates saw which examples.
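One way to keep a split stable across runs is to derive it from a hash of each example's ID instead of a random shuffle. This is a minimal sketch under that assumption; the `assign_split` helper, the ID format, and the 20% hold-out share are all hypothetical choices.

```python
import hashlib

def assign_split(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Deterministically map an example ID to 'holdout' or 'optimization'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "holdout" if bucket < holdout_fraction else "optimization"

# Invented IDs; the same ID always lands in the same split.
splits = {f"ticket-{n}": assign_split(f"ticket-{n}") for n in range(1000)}
holdout_share = sum(1 for split in splits.values() if split == "holdout") / len(splits)
```

Because the assignment depends only on the ID, adding new examples later never moves an existing example from the hold-out set into the optimization set.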
Cost, latency, and the frontier
Multi-objective optimization
Optimizing for several goals that pull against each other — most often quality versus cost, or quality versus latency. Because a single solution rarely wins on every goal at once, the output is not one answer but a set of trade-offs.
Pareto frontier
Borrowed from economics, the Pareto frontier is the set of options representing the best possible trade-offs: on the frontier, you cannot improve quality without paying more, or cut cost without losing quality. Every option off the frontier is beaten by something on it. The practical takeaway is that there is no single best prompt — only a frontier of best trade-offs, and you pick the point that fits your priorities. You can watch this play out on a real support-routing task, where the frontier offered one prompt more accurate than production and another about 40% cheaper.
Dominated solution
A prompt is dominated when another prompt is at least as good on every objective and strictly better on at least one, for example the same quality at a lower cost. Dominated prompts are the easy discards; the real decisions all live on the frontier, among the options nothing dominates.
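Dominance and the frontier translate directly into code. The sketch below assumes two objectives, quality (higher is better) and cost (lower is better); the candidate names and numbers are invented for illustration and are not measurements from any real task.

```python
def dominates(a, b):
    """True if a is at least as good as b on both objectives and strictly better on one."""
    no_worse = a["quality"] >= b["quality"] and a["cost"] <= b["cost"]
    strictly_better = a["quality"] > b["quality"] or a["cost"] < b["cost"]
    return no_worse and strictly_better

def pareto_frontier(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Invented scores: quality as accuracy, cost relative to the production prompt.
candidates = [
    {"name": "production", "quality": 0.86, "cost": 1.00},
    {"name": "accurate", "quality": 0.91, "cost": 1.20},
    {"name": "lean", "quality": 0.86, "cost": 0.60},
    {"name": "verbose", "quality": 0.84, "cost": 1.40},
]
frontier = pareto_frontier(candidates)
print([c["name"] for c in frontier])  # ['accurate', 'lean']
```

Here "production" drops out because "lean" matches its quality at lower cost, and "verbose" drops out because "production" beats it on both objectives.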
Cost per token
The price of a model call, quoted per million input and output tokens, with output usually costing several times more than input. Reasoning tokens, cached tokens, tool-call payloads, and batch discounts can all have different economics, so measure the actual usage object instead of estimating from prompt length alone.
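Pricing each token category separately is simple arithmetic once the usage counts are in hand. In this sketch the category names and per-million prices are hypothetical placeholders, not any provider's real usage fields or rates.

```python
def call_cost_usd(usage: dict, price_per_million: dict) -> float:
    """Sum tokens * price for every token category reported in `usage`."""
    total = 0.0
    for kind, tokens in usage.items():
        if kind not in price_per_million:
            raise KeyError(f"no price configured for token type {kind!r}")
        total += tokens * price_per_million[kind] / 1_000_000
    return total

# Hypothetical prices in USD per million tokens.
PRICES = {"input": 3.00, "cached_input": 0.30, "output": 15.00}
# Hypothetical usage counts for one call.
usage = {"input": 1200, "cached_input": 800, "output": 300}
cost = call_cost_usd(usage, PRICES)
# 0.0036 (input) + 0.00024 (cached) + 0.0045 (output) = 0.00834 USD
```

Note that 300 output tokens cost more here than 1,200 input tokens, which is the asymmetry the definition describes.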
Prompt caching
Reusing a stable prompt prefix — system instructions, long documents, examples, or tool definitions — so repeated calls are faster or cheaper. It works best when static content appears before dynamic user content, and cache hits usually require exact or provider-defined prefix matches.
Latency and p95 latency
Latency is how long a model takes to respond. The average hides the tail, so teams track p95 latency, the response time that 95% of calls finish within, because the slowest 5% beyond it are what users actually feel. Bigger models, longer prompts, reasoning tokens, retries, and tool calls all push it up.
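Computing the mean and the p95 side by side shows how differently they describe the same traffic. A minimal sketch using the nearest-rank definition of a percentile (one of several conventions); the latency samples are invented.

```python
import math

def percentile_nearest_rank(values, pct: int):
    """Smallest sample such that at least pct% of samples are at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Invented latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [120] * 18 + [900, 2500]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p95_ms = percentile_nearest_rank(latencies_ms, 95)
print(mean_ms, p50_ms, p95_ms)  # 278.0 120 900
```

The typical call takes 120 ms, but one call in twenty takes 900 ms or longer, and the mean reports neither of those facts.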
Model routing
A layer that inspects each request and sends it to the cheapest setup that can still handle it: easy queries to a small fast model, hard ones to a frontier model, or different reasoning effort, tool access, and retry policy. Routing applies the quality-versus-cost trade-off per request, but it needs eval-backed fallback rules or savings can turn into silent quality loss.
Throughput
How many tokens a system can generate per second — the speed metric for serving at scale, as opposed to the latency of any single request.
Automatic prompt optimization
Automatic prompt optimization (APO)
A family of methods that improve prompts automatically: generate candidate instructions, examples, or prompt programs; score them against an eval set; keep the winners; and repeat — all without touching model weights. It turns prompt writing from craft into search, and a 2025 survey catalogs the fast-growing zoo of approaches.
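The generate, score, keep, repeat loop can be sketched in miniature. Everything below is a toy under stated assumptions: `propose` stands in for an LLM that rewrites prompts, `toy_score` stands in for running a prompt over an eval set, and the `EDITS` list and scoring weights are invented.

```python
def optimize_prompt(seed, propose, score, rounds=3, beam=2):
    """Generate candidates, score them, keep the top `beam`, and repeat."""
    scored = {seed: score(seed)}
    survivors = [seed]
    for _ in range(rounds):
        for parent in survivors:
            for candidate in propose(parent):
                if candidate not in scored:  # score each prompt only once
                    scored[candidate] = score(candidate)
        survivors = sorted(scored, key=scored.get, reverse=True)[:beam]
    best = survivors[0]
    return best, scored[best]

# Hypothetical edit operations; a real system would ask an LLM for rewrites.
EDITS = [" Answer with one word.",
         " Choose from: billing, technical, account.",
         " Think carefully."]

def propose(prompt):
    return [prompt + edit for edit in EDITS if edit not in prompt]

def toy_score(prompt):
    """Invented scoring rule standing in for accuracy on an eval set."""
    value = 0.5
    if "one word" in prompt:
        value += 0.2
    if "Choose from" in prompt:
        value += 0.25
    return value - 0.0005 * len(prompt)  # small penalty for longer prompts

seed = "Classify the support ticket."
best_prompt, best_score = optimize_prompt(seed, propose, toy_score)
```

With this toy scorer the search keeps the two edits that raise the score and drops the one that only adds length, which is the behavior the loop is meant to illustrate.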
Automatic Prompt Engineer (APE)
An early automatic method that has an LLM propose many candidate instructions from examples or input-output pairs, then keeps the ones that score best on the task. It is the "an LLM writes the prompt, the data picks the winner" idea in its simplest form.
OPRO
Short for Optimization by PROmpting: a method that uses the LLM itself as a black-box optimizer, feeding it past prompts and their scores so it proposes better ones each round. The optimization history becomes part of the meta-prompt, which makes context management part of the method.
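A rough sketch of how the optimization history might be folded into a meta-prompt follows. The wording of the meta-prompt, the `build_meta_prompt` helper, the choice to list entries from lowest to highest score, and the sample history are all assumptions of this illustration rather than the method's exact format.

```python
def build_meta_prompt(history, task_description, keep=3):
    """Format the best `keep` (prompt, score) pairs, lowest score first, plus a request."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Keeping only the top entries bounds how much context the history consumes.
    top = sorted(history, key=lambda pair: pair[1])[-keep:]
    lines = [f"Task: {task_description}", "", "Previous instructions and their scores:"]
    for text, score in top:
        lines.append(f"text: {text}")
        lines.append(f"score: {score:.2f}")
    lines.append("")
    lines.append("Write a new instruction that differs from those above and scores higher.")
    return "\n".join(lines)

# Invented history of (prompt, score) pairs from earlier rounds.
history = [
    ("Classify the ticket.", 0.61),
    ("Classify the ticket in one word.", 0.74),
    ("Pick billing, technical, or account.", 0.82),
    ("Help.", 0.20),
]
meta_prompt = build_meta_prompt(history, "route support tickets", keep=3)
```

The returned string would be sent to the optimizer model, and its proposal would be scored and appended to `history` for the next round.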
MIPRO
A DSPy optimizer that tunes both instructions and in-prompt demonstrations across multi-step LLM programs, using the downstream metric rather than module-level labels. It matters because complex pipelines often fail from interactions between steps, not one bad instruction.
DSPy
A framework that lets you write an LLM pipeline as declarative code with metrics, then compiles prompts, demonstrations, and sometimes other parameters for you. It popularized treating prompts as artifacts a compiler optimizes rather than strings you hand-tune.
TextGrad and ProTeGi
Methods that treat natural-language feedback as a gradient metaphor: the model critiques a prompt's failures in words, and those critiques drive the next revision. ProTeGi combines textual critiques with beam search and bandit selection; TextGrad generalizes textual feedback across computation graphs. The gradient is linguistic, not a literal derivative.
Prompt tuning
Training a small set of continuous embedding vectors (a "soft prompt") that are prepended to the input and learned from labeled data through gradient descent while the model's own weights stay frozen. Despite the name, it is closer to fine-tuning than to prompt engineering, and it needs model access most API users do not have.
| Approach | What changes | Needs training data | Model access |
|---|---|---|---|
| Prompt engineering | The prompt text | No | An API key is enough |
| Prompt tuning | A learned soft-prompt vector | Yes | Weights or embeddings |
| Fine-tuning | The model's own weights | Yes, more of it | Full training access |
Reasoning and the 2026 frontier
Reasoning models
Models trained or served to spend extra internal compute on hard problems before answering. They lift performance on coding, math, planning, and multi-step tool use, but their raw reasoning is often hidden and their thinking still consumes billed tokens and context budget.
Test-time compute
The idea that spending more computation at inference — higher reasoning effort, multiple samples, verifiers, retries, or tool loops — can buy better answers without changing the model. It reframes quality as something you can dial up per request, at a price.
Reasoning effort
A developer-facing knob that controls how much a reasoning model thinks before answering. The names are provider- and model-dependent — low/medium/high, minimal, xhigh, thinking budgets, or adaptive thinking — but the trade-off is the same: higher effort can help hard tasks and waste money on easy ones.
Reasoning tokens
Internal tokens a reasoning model uses to plan, check, or continue between tool calls before producing visible output. They may not be exposed as raw text, but they can count toward billing, output limits, and context-window pressure, so leave room for them when setting max tokens.
Context engineering
The successor framing to prompt engineering that took hold as agent systems matured: designing the whole payload a model sees — instructions, retrieved documents, tool outputs, history, memory, summaries, and permissions — as a managed resource. As systems became agents, the context, not one clever instruction, became the hard part.
Context compaction
Compressing or trimming old conversation history, tool results, and retrieved material so an agent keeps the facts that matter without dragging every token forward. Good compaction preserves commitments, decisions, and unresolved tasks; bad compaction creates expensive amnesia.
Agentic prompting
Prompting patterns for autonomous, multi-step agents that plan, call tools, observe results, and decide what to do next. The unit of design shifts from a single response to a loop, including permissions, stop conditions, escalation rules, and how state is carried forward.
Model Context Protocol (MCP)
An open protocol for connecting AI applications to external tools, resources, and prompt templates through a standard client-server interface. For prompt optimization, MCP matters because the "prompt" increasingly includes tool descriptions, schemas, and fetched resources, not just natural-language instructions.
Retrieval-augmented generation (RAG)
Fetching relevant documents at query time and inserting them into the prompt, so the model answers from current, specific facts instead of memory. It is the standard fix for stale knowledge, but retrieval quality, source attribution, and grounding checks decide whether it reduces hallucination or just gives wrong answers citations.
Reliability and failure modes
Hallucination
A confident, fluent output that is wrong or fabricated. Hallucination is the failure mode that makes evals non-negotiable: fluency is not accuracy, and only measurement tells them apart.
Grounding
Tying a model's output to verifiable source material — retrieved documents, databases, citations — so claims can be checked rather than trusted. Grounding is the main defense against hallucination, but a citation is only useful if the answer is actually supported by the cited text.
Prompt injection
An attack that hides malicious instructions inside content a model reads, hijacking its behavior. Direct injection comes from the user; indirect injection comes from web pages, documents, emails, or tool outputs. OWASP treats it as a top LLM-app risk because every untrusted text source can become an instruction source unless tools and data are separated carefully.
Guardrails
Checks around a model that constrain its inputs and outputs — blocking unsafe requests, filtering responses, enforcing formats, validating tool arguments — independently of the prompt itself. Use deterministic checks for hard constraints and semantic classifiers for fuzzy ones; a prompt is not a security boundary.
Knowledge cutoff
The date after which a model has no training knowledge, usually tied to a specific model snapshot. Anything newer, private, or fast-changing has to be supplied at run time through retrieval or tools, or the model may guess.
Fine-tuning
Updating a model's weights by training it further on task-specific data, as opposed to changing only the prompt. It can bake in style, domain behavior, or narrow formats that prompting cannot reliably reach, but it is the wrong tool for fresh facts, permissions, or weak evals — which is why most teams optimize the prompt first.
Frequently asked questions
Is prompt optimization the same as prompt engineering?
No. Prompt engineering is the manual craft of writing and tweaking prompts. Prompt optimization is the measured, often automated version: you define objectives and an eval set, then search for the prompt that scores best. Engineering produces a prompt; optimization tells you how good it is and finds a better one.
How do you measure prompt quality?
You build an eval set — inputs paired with correct outputs — and score the prompt's outputs against it, by exact match, a rubric, or an LLM-as-a-judge. The score is only as trustworthy as the labels and split hygiene, so a clean golden dataset and blind hold-out come first.
Can prompts really be optimized automatically?
Yes. Methods like APE, OPRO, and MIPRO, and frameworks like DSPy, generate instructions or examples, score them against your data, and keep the winners — with no model retraining. Automatic prompt optimization routinely finds wordings a person would not think to try.
How do you reduce LLM costs without losing quality?
Find the quality-versus-cost frontier: often a shorter prompt, prompt caching, lower reasoning effort, a cheaper model, or per-request routing holds the same accuracy for a fraction of the price. The only way to know is to score the cheaper option against your eval set instead of assuming it is worse.
Prompt engineering, prompt tuning, or fine-tuning — what is the difference?
Prompt engineering changes the words and needs no training. Prompt tuning trains a small soft-prompt vector and needs labeled data plus model access. Fine-tuning updates the model's own weights. Cost and access requirements rise in that order, which is why prompting is where most teams start.
Sources and further reading
- A Systematic Survey of Automatic Prompt Optimization: the academic map of APE, OPRO, MIPRO, and more
- DSPy: declarative LLM pipelines and prompt compilation
- MIPRO: multi-stage prompt optimization for LLM programs
- TextGrad: textual feedback across AI computation graphs
- ProTeGi: prompt optimization with textual critiques and beam search
- OpenAI reasoning models: reasoning effort, reasoning tokens, and summaries
- Structured Outputs: schema-constrained model responses
- Function calling: tool calls and structured arguments
- Prompt caching: reusable prompt prefixes for cost and latency
- Gemini thinking: provider-specific thinking budgets and token accounting
- OWASP Top 10 for LLM Applications: prompt injection and other production security risks
- Judging LLM-as-a-Judge: position, verbosity, and self-preference bias
- Model Context Protocol architecture: tools, resources, and prompt templates as context primitives
About EigenPrompt
This glossary is published by EigenPrompt, which does the optimization half of the vocabulary above for you. You give it a starting prompt template, a set of labeled examples, and the model you plan to run in production. It generates and tests prompt after prompt against your data, scores each on quality and on dollar cost, and returns the Pareto frontier — the set where nothing is cheaper without being worse, or better without being dearer. You pick the point that fits.
Two details make the result trustworthy: it optimizes for the specific model you name, not models in general, and it scores on held-out data the prompts never saw. It also flags the mislabeled and ambiguous examples capping your score, because the ceiling on any prompt is the quality of the data underneath it.