computer-science languages-and-literature visual-arts

Evaluator optimizer

Description

A pattern that pairs a generator (which produces candidate output) with an evaluator (which critiques it) in a loop: generate → critique → revise → critique → revise, until a quality threshold is reached or the iteration budget is exhausted. The structural value is that quality compounds across iterations — each round’s critique informs the next round’s generation, so the trajectory is upward even if each individual round is incremental. The diagnostic question — “is the critique getting acted on, or just produced?” — separates real evaluator-optimizer loops from ritualized review. A loop that produces critique but doesn’t revise (or revises without consulting the critique) is performative; the concept requires that the evaluator’s output structurally shapes the generator’s next output.
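The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `Critique`, `evaluator_optimizer`, `toy_generate`, and `toy_evaluate`, the 0-to-1 score scale, and the default threshold and budget are all assumptions introduced here. In a real system the two callables would wrap model calls; the toy stand-ins below only make the control flow runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Critique:
    score: float   # quality estimate in [0, 1]; the scale is an assumption
    feedback: str  # actionable notes handed back to the generator


def evaluator_optimizer(
    task: str,
    generate: Callable[[str, Optional[str], Optional[Critique]], str],
    evaluate: Callable[[str, str], Critique],
    threshold: float = 0.9,
    max_rounds: int = 5,
) -> Tuple[str, List[Critique]]:
    """Generate, critique, and revise until the threshold or the budget is hit."""
    draft = generate(task, None, None)  # first draft: no prior draft, no critique
    history: List[Critique] = []
    for round_index in range(max_rounds):
        critique = evaluate(task, draft)
        history.append(critique)
        # Stop on acceptance, or when the iteration budget is exhausted.
        if critique.score >= threshold or round_index == max_rounds - 1:
            break
        # The critique is an explicit input to the next generation step;
        # without this hand-off the loop would be performative.
        draft = generate(task, draft, critique)
    return draft, history


# Toy stand-ins (hypothetical): the evaluator checks for required terms,
# and the generator fixes one reported gap per round.
REQUIRED = ("generator", "evaluator", "budget")


def toy_evaluate(task: str, draft: str) -> Critique:
    missing = [word for word in REQUIRED if word not in draft]
    return Critique(score=1 - len(missing) / len(REQUIRED), feedback=",".join(missing))


def toy_generate(task: str, draft: Optional[str], critique: Optional[Critique]) -> str:
    if draft is None or critique is None:
        return task
    return f"{draft} {critique.feedback.split(',')[0]}"


if __name__ == "__main__":
    final, history = evaluator_optimizer("loop", toy_generate, toy_evaluate)
    print(final, [round(c.score, 2) for c in history])
```

Returning the critique history alongside the final draft makes the diagnostic question checkable: each revision can be compared against the feedback that preceded it.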

Triggers

User-initiated: User describes critique-revision cycles, asks “how do we improve output quality?”, or proposes adding a judge to an existing generation pipeline. Vocabulary cues: “judge,” “critic,” “LLM as judge,” “evaluator-optimizer,” “critique and revise.”
Agent-initiated: Agent notices a generation system that’s running open-loop (no critique mechanism) when iteration could improve quality. Candidate inference: “would adding an evaluator-and-revision loop compound quality here?”
Situation-shape signals: Quality-improvement projects where the current output is acceptable but not great. Discussions of LLM-as-judge architectures. Tasks where ground-truth correctness can be evaluated but not directly generated.

Exclusions

Single-shot tasks where iteration adds no value — generative tasks with no quality gradient or where iteration produces drift, not improvement.
No measurable quality criterion — if the evaluator can’t articulate what “better” means, the loop becomes critique-theater rather than optimizer.
Ground-truth-cheaper-than-evaluation — when checking against ground truth is faster than evaluating quality heuristically, skip the evaluator and check directly.
Adversarial-evaluator misalignment — if the evaluator’s quality criterion diverges from the generator’s actual goal, the loop optimizes for the wrong thing (a hoist-by-own-petard risk).
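The "no measurable quality criterion" exclusion is easier to test when "better" is written down as an explicit rubric. The sketch below is one hypothetical way to do that: the `Rubric` type, the criterion names, and the weights are illustrative assumptions, not part of any published pattern. If no such rubric can be written for a task, that is a signal the loop would have nothing to optimize.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical rubric shape: criterion name -> (weight, pass/fail check).
Rubric = Dict[str, Tuple[float, Callable[[str], bool]]]


def score_against_rubric(draft: str, rubric: Rubric) -> Tuple[float, List[str]]:
    """Return the weighted fraction of criteria met and the names of those missed."""
    total = sum(weight for weight, _ in rubric.values())
    earned = 0.0
    failed: List[str] = []
    for name, (weight, check) in rubric.items():
        if check(draft):
            earned += weight
        else:
            failed.append(name)  # the failed names double as actionable feedback
    return earned / total, failed


# Illustrative criteria only; a real rubric would encode the actual goal.
example_rubric: Rubric = {
    "has_summary": (2.0, lambda d: d.lower().startswith("summary:")),
    "under_50_words": (1.0, lambda d: len(d.split()) <= 50),
}

if __name__ == "__main__":
    print(score_against_rubric("Summary: the loop converged.", example_rubric))
    print(score_against_rubric("The loop converged.", example_rubric))
```

The list of failed criteria is what gives the generator something concrete to revise against, rather than a bare score.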

Structure

Relationships

Relationship neighborhood of evaluator-optimizer: a graph of the concepts it connects to and the concepts it is a part of.

feedback-loop — evaluator-optimizer is a specific feedback-loop (structured critique signal + revision response).
reflection — when generator and evaluator collapse into one actor critiquing its own output.
prompt-chaining — contrast: chaining is sequential (each stage feeds next); evaluator-optimizer is iterative (revise-and-re-evaluate).
orchestrator-workers — the orchestrator may dispatch generators and evaluators as workers; orchestrator-workers can wrap an evaluator-optimizer at task scope.
cadence — the iteration cadence is load-bearing for quality compounding; too-tight loops over-fit, too-loose loops lose signal.

Examples

Anthropic engineering blog, "Building Effective Agents" (2024) — evaluator-optimizer pattern. · computer-science

Anthropic’s “Building Effective Agents” post catalogs evaluator-optimizer as one of its canonical multi-step workflow patterns: one LLM call (or sub-agent) generates a candidate output; a separate evaluator call critiques it against stated criteria; the generator revises in light of the critique; the loop repeats until acceptance or budget. The LLM-as-judge literature (Zheng et al., 2023) provides empirical grounding for using LLMs themselves in the evaluator role.

Inference: The pattern’s portability is high because it captures a structure that long predates LLMs: editor-and-author revision cycles, peer-review-with-revisions in academic publishing, code review with re-submission, and design-critique-iteration in studio practice. The agent-architecture instance is the same shape rendered in tokens.

The load-bearing design choice is the explicit separation of the evaluator role from the generator (different prompts, sometimes different models) rather than expecting a single chain-of-thought pass to critique itself implicitly. A generator critiquing itself shares the failure modes that produced the original output; a structurally separate evaluator brings independent failure modes. This is the same insight academic publishing institutionalized as blind peer review and that software engineering institutionalized as code review by someone other than the author. Where the pattern fails, in any domain, is when the evaluator is structurally weaker than the generator, or when the loop budget is exhausted before convergence; the LLM literature has empirically rediscovered both failure modes.

Editor-author cycles in human writing · languages-and-literature

Same shape as the generator-evaluator loop: the editor’s critique drives the author’s revision.

Design critique iteration · visual-arts

Designer plus critic: the critique shapes the next iteration. A canonical creative-discipline pattern.

GAN-style training (machine learning) · computer-science

Generator plus discriminator: the same structural shape at the weight-update level.

Goodfellow et al. (2014), "Generative Adversarial Networks" — adversarial generator-evaluator at the weight-update level · computer-science

The original GAN paper: an adversarial generator-evaluator pairing at the weight-update level.

LLM agent patterns (Anthropic, OpenAI cookbook) · computer-science

Across published agent pattern catalogs (Anthropic’s “Building Effective Agents,” OpenAI’s cookbook entries for self-critique workflows, and adjacent open-source frameworks), evaluator-optimizer recurs as a named architecture: one LLM call generates, another call (or the same model with a different prompt) judges the output against criteria, and the generator revises in light of the judgment.

Inference: The catalog-of-patterns framing matters because evaluator-optimizer is presented as a distinct workflow choice rather than a default, implying both that it has identifiable advantages (compounding quality across rounds; auditable evaluator role) and that it has identifiable costs (round-trip latency; risk of evaluator-and-generator drift). The pattern’s persistence across independent practitioner catalogs is itself evidence that the shape is genuine architecture, not just a quirk of any one ecosystem.

Peer review with revisions in academic publishing · languages-and-literature

Reviewers’ critique structurally informs the manuscript’s next version.

Zheng et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023 Datasets and Benchmarks Track (arXiv:2306.05685). · computer-science

The evaluator-optimizer loop only compounds quality if the critique in the loop is trustworthy, and the obvious objection is that an automated evaluator is itself fallible, so its critiques could be noise dressed up as judgment. Zheng et al. supply the empirical grounding for the evaluator role specifically: they tested whether a strong LLM (GPT-4) used as a judge agrees with human preferences when scoring chat-assistant responses, and found agreement of over 80%, the same level of agreement found between humans. That is the result that licenses putting an LLM in the evaluator seat: on those benchmarks, its verdicts are about as reliable as a second human reviewer’s.

Crucially, the paper does not treat the evaluator as infallible. It documents three systematic biases the evaluator carries into the loop: position bias (favoring a response because of where it is shown, most often the first position, sometimes flipping the verdict when the order is swapped), verbosity bias (favoring longer responses even when they are not better), and self-enhancement bias (favoring answers the judge model itself generated). These are exactly the failure modes that turn an evaluator-optimizer loop into critique-theater: the generator gets optimized toward longer, first-positioned, judge-flattering output rather than genuinely better output.

Inference: When standing up an automated evaluator-optimizer loop, the load-bearing question is not “can the model critique?” but “does the critique track the quality I actually care about?” Zheng et al.’s >80%-agreement result says the evaluator can be reliable enough to act on; their bias catalog says you must control for the evaluator’s own systematic distortions (randomize or swap position, normalize for length, avoid same-family judge-generator pairs) or the loop will faithfully optimize the generator toward the evaluator’s blind spots rather than toward the goal.
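One of the controls mentioned above, guarding against position bias, can be sketched as a swap-and-compare wrapper around a pairwise judge: ask twice with the order reversed and accept a winner only when both verdicts agree. This is a minimal sketch under stated assumptions. The `Judge` signature, the `"A"`/`"B"`/`"tie"` verdict labels, and the function name `debiased_preference` are hypothetical, and a real judge would wrap a model call rather than the toy functions shown.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_shown, second_shown) -> verdict,
# where the verdict is "A" (first shown wins), "B" (second shown wins), or "tie".
Judge = Callable[[str, str, str], str]


def debiased_preference(question: str, response_1: str, response_2: str, judge: Judge) -> str:
    """Query the judge in both orders; declare a winner only if the verdicts agree."""
    forward = judge(question, response_1, response_2)
    backward = judge(question, response_2, response_1)
    # Translate the swapped-order verdict back into the forward labeling.
    backward_mapped = {"A": "B", "B": "A"}.get(backward, "tie")
    if forward == backward_mapped and forward in ("A", "B"):
        return "response_1" if forward == "A" else "response_2"
    return "tie"  # inconsistent or undecided verdicts are treated as a tie


# Toy judges for illustration only.
def always_first(question: str, a: str, b: str) -> str:
    return "A"  # pure position bias: whatever is shown first wins


def prefers_reasoned(question: str, a: str, b: str) -> str:
    if "because" in a:
        return "A"
    return "B" if "because" in b else "tie"


if __name__ == "__main__":
    print(debiased_preference("q", "x", "y because z", always_first))      # position-only judge
    print(debiased_preference("q", "x", "y because z", prefers_reasoned))  # content-based judge
```

A judge that only tracks position can never produce a consistent winner under this wrapper, so its verdicts collapse to ties instead of steering the generator. The cost is two judge calls per comparison.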
