Evaluator-optimizer
Description
A pattern that pairs a generator (which produces candidate output) with an evaluator (which critiques it) in a loop: generate → critique → revise → critique → revise, until a quality threshold is reached or the iteration budget is exhausted. The structural value is that quality compounds across iterations: each round’s critique informs the next round’s generation, so the trajectory is upward even if each individual round is incremental. The diagnostic question, “is the critique getting acted on, or just produced?”, separates real evaluator-optimizer loops from ritualized review. A loop that produces critique but doesn’t revise (or revises without consulting the critique) is performative; the concept requires that the evaluator’s output structurally shapes the generator’s next output.
Triggers
- User-initiated: User describes critique-revision cycles, asks “how do we improve output quality?”, or proposes adding a judge to an existing generation pipeline. Vocabulary cues: “judge,” “critic,” “LLM as judge,” “evaluator-optimizer,” “critique and revise.”
- Agent-initiated: Agent notices a generation system that’s running open-loop (no critique mechanism) when iteration could improve quality. Candidate inference: “would adding an evaluator-and-revision loop compound quality here?”
- Situation-shape signals: Quality-improvement projects where the current output is acceptable but not great. Discussions of LLM-as-judge architectures. Tasks where ground-truth correctness can be evaluated but not directly generated.
Exclusions
- Single-shot tasks where iteration adds no value — generative tasks with no quality gradient or where iteration produces drift, not improvement.
- No measurable quality criterion — if the evaluator can’t articulate what “better” means, the loop becomes critique-theater rather than optimizer.
- Ground-truth-cheaper-than-evaluation — when checking against ground truth is faster than evaluating quality heuristically, skip the evaluator and check directly.
- Adversarial-evaluator misalignment — if the evaluator’s quality criterion diverges from the generator’s actual goal, the loop optimizes for the wrong thing (a hoist-by-own-petard risk).
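The first exclusion (iteration that drifts instead of improving) can be checked from the evaluator's own scores. The sketch below is one hypothetical heuristic, not part of the pattern's definition: `is_drifting`, `patience`, and `min_gain` are assumed names and defaults, and it presumes the evaluator emits one numeric score per round.

```python
def is_drifting(scores: list[float], patience: int = 2, min_gain: float = 0.01) -> bool:
    """Return True when the last `patience` rounds failed to beat the
    best earlier score by at least `min_gain` (all names are assumptions)."""
    if len(scores) <= patience:
        return False  # too few rounds to judge a trend
    best_before = max(scores[:-patience])   # best score before the recent window
    best_recent = max(scores[-patience:])   # best score inside the recent window
    return best_recent < best_before + min_gain


improving = is_drifting([0.5, 0.6, 0.7, 0.8])    # False: scores keep rising
stalled = is_drifting([0.5, 0.7, 0.69, 0.7])     # True: no gain in the last two rounds
```

A check like this only makes sense when the second exclusion does not apply: without a measurable criterion there is no score series to inspect.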
Structure
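One way to picture the loop from the Description is the sketch below. It is an illustration under assumptions, not a reference implementation: `Critique`, `evaluator_optimizer`, `generate`, `evaluate`, `threshold`, and `max_rounds` are hypothetical names, the 0-to-1 score scale is assumed, and the toy generator and evaluator stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Critique:
    score: float  # quality estimate in [0, 1]; the scale is an assumption
    notes: str    # feedback the generator is expected to act on


def evaluator_optimizer(
    task: str,
    generate: Callable[[str, Optional[str], Optional[Critique]], str],
    evaluate: Callable[[str, str], Critique],
    threshold: float = 0.9,
    max_rounds: int = 5,
) -> tuple[str, list[Critique]]:
    """Generate, critique, and revise until the threshold or budget is hit."""
    draft = generate(task, None, None)  # first candidate, no critique yet
    critique = evaluate(task, draft)
    history = [critique]
    rounds = 0
    while critique.score < threshold and rounds < max_rounds:
        # The previous critique is an input to the revision; this is what
        # keeps the loop from producing critique that is never acted on.
        draft = generate(task, draft, critique)
        critique = evaluate(task, draft)
        history.append(critique)
        rounds += 1
    return draft, history


# Toy stand-ins (hypothetical): the evaluator checks for required words,
# and the generator fixes one missing word per round.
REQUIRED = ("threshold", "budget")


def toy_evaluate(task: str, draft: str) -> Critique:
    missing = [word for word in REQUIRED if word not in draft]
    return Critique(1 - len(missing) / len(REQUIRED), ",".join(missing))


def toy_generate(task: str, draft: Optional[str], critique: Optional[Critique]) -> str:
    if draft is None or critique is None:
        return task
    return f"{draft} {critique.notes.split(',')[0]}"


final, history = evaluator_optimizer("loop", toy_generate, toy_evaluate, threshold=1.0)
print(final, [c.score for c in history])
```

Passing the critique as an argument to `generate` makes it structurally available to every revision, and returning the critique history gives a trace for checking whether scores actually rise from round to round.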
Relationships
- feedback-loop — evaluator-optimizer is a specific feedback-loop (structured critique signal + revision response).
- reflection — when generator and evaluator collapse into one actor critiquing its own output.
- prompt-chaining — contrast: chaining is sequential (each stage feeds next); evaluator-optimizer is iterative (revise-and-re-evaluate).
- orchestrator-workers — the orchestrator may dispatch generators and evaluators as workers; orchestrator-workers can wrap an evaluator-optimizer at task scope.
- cadence — the iteration cadence is load-bearing for quality compounding; too-tight loops over-fit, too-loose loops lose signal.
Examples
Anthropic engineering blog, "Building Effective Agents" (2024) — evaluator-optimizer pattern. · computer-science
Editor-author cycles in human writing · languages-and-literature
Design critique iteration · visual-arts
**GAN-style training** (machine learning) — generator + disc · computer-science
Goodfellow et al. (2014), "Generative Adversarial Networks" — adversarial generator-evaluator at the weight-update level · computer-science
LLM agent patterns (Anthropic, OpenAI cookbook) · computer-science
Peer review with revisions in academic publishing · languages-and-literature
Zheng et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023 Datasets and Benchmarks Track (arXiv:2306.05685). · computer-science