computer-science law mathematics medicine-and-health philosophy

Chain of thought

Description

An explicit reasoning trace is exposed as part of the output, not just consumed internally. The agent “thinks out loud” — produces intermediate steps that lead to the conclusion — and the trace is itself part of the deliverable, not just scaffolding. The structural property: making reasoning visible changes the reasoning itself, by forcing structure on what might otherwise be quick-pattern-matching. The diagnostic question — “is the trace doing work, or is it decoration?” — separates real chain-of-thought from “showing the right answer with intermediate steps that don’t actually constrain it.” Real CoT has the property that following the trace produces the conclusion; decoration has steps that could be replaced or removed without changing the answer.

Triggers

  • User-initiated: the user asks for reasoning steps, “show your work,” “think step by step,” or describes opacity in agent outputs. Vocabulary cues: “chain of thought,” “CoT,” “think step by step,” “show your work,” “reasoning trace.”
  • Agent-initiated: the agent recognizes that exposing intermediate reasoning will improve its own output or its auditability. Candidate inference: “this task is reasoning-heavy; produce a chain-of-thought before concluding.”
  • Situation-shape signals: tasks where the agent could pattern-match to a wrong answer; multi-step reasoning where one intermediate error invalidates the whole conclusion; outputs that need to be auditable; debugging conversations where the path-to-error matters more than the error itself.
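
The vocabulary cues listed above can be checked mechanically. The following is a minimal sketch, assuming a plain case-insensitive phrase match is good enough; the function name `find_cues` is hypothetical, and a real agent would likely combine this with the situation-shape signals rather than rely on keywords alone.

```python
import re

# Cue phrases copied from the vocabulary list above.
CUES = (
    "chain of thought",
    "cot",
    "think step by step",
    "show your work",
    "reasoning trace",
)

# Word boundaries keep "cot" from matching inside words such as "cotton".
_CUE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(cue) for cue in CUES) + r")\b",
    re.IGNORECASE,
)


def find_cues(message: str) -> list[str]:
    """Return the cue phrases found in a message, lowercased, in order of appearance."""
    return [match.group(1).lower() for match in _CUE_PATTERN.finditer(message)]


print(find_cues("Please show your work and think step by step."))
# → ['show your work', 'think step by step']
```

A keyword match of this kind will still fire on unrelated uses of a cue (for example "cot" meaning a bed), so it is better treated as a hint than as a decision.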

Exclusions

  • Trivial tasks — asking for chain-of-thought on “what’s 2+2” produces ritual without value.
  • Time-critical decisions — when latency budgets don’t allow extended reasoning, CoT’s iteration cost exceeds its quality benefit.
  • Pattern-matchable tasks where intuition is reliable — some tasks are better solved by direct pattern recognition; forcing CoT can degrade performance.
  • Adversarial settings — exposing reasoning trace can leak information (e.g., revealing how a security system makes decisions).
  • Decorative CoT — when steps don’t actually constrain the answer, CoT becomes performance not substance; the agent could replace the trace with random plausible steps and still produce the same conclusion.

Structure

Internal structure of chain-of-thought: a table of its component slots and the concepts that fill them.

Relationships

Relationship neighborhood of chain-of-thought: a graph of the concepts it connects to and the concepts it is a part of.
  • reflection — chain-of-thought + reflection = think-then-reflect-on-thinking; the trace becomes the object of self-critique.
  • loop-completion — CoT makes gaps in reasoning visible; without the trace, missing steps are hard to catch.
  • doctrine — many CoT prompts are explicit doctrines (“when answering math problems: show work; check answer; verify against constraints”).
  • evaluator-optimizer — CoT makes the generator’s output more critique-able for the evaluator; the trace gives the evaluator something to assess beyond the final answer.
  • load-bearing — diagnostic: which steps in the trace are load-bearing? Removing decorative steps tightens the chain.
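
The load-bearing diagnostic can be illustrated with a toy ablation test: drop one step at a time and see whether the conclusion changes. This is only a sketch under strong assumptions. The trace is modeled as a list of deterministic functions over an integer, and the names `run` and `load_bearing` are hypothetical. For a language-model trace, the equivalent check would mean re-running the model with a step removed, which this example does not attempt.

```python
from typing import Callable, List, Tuple

# A step is a label plus a transformation of the running value.
Step = Tuple[str, Callable[[int], int]]


def run(steps: List[Step], start: int) -> int:
    """Follow the trace from the starting value to the conclusion."""
    value = start
    for _, transform in steps:
        value = transform(value)
    return value


def load_bearing(steps: List[Step], start: int) -> List[Tuple[str, bool]]:
    """Mark each step True if removing it changes the conclusion."""
    baseline = run(steps, start)
    report = []
    for i, (label, _) in enumerate(steps):
        ablated = steps[:i] + steps[i + 1:]
        report.append((label, run(ablated, start) != baseline))
    return report


trace: List[Step] = [
    ("take 20 off", lambda v: v - 20),
    ("restate the price", lambda v: v),  # decorative: changes nothing
    ("add 5 shipping", lambda v: v + 5),
]

print(load_bearing(trace, 100))
# → [('take 20 off', True), ('restate the price', False), ('add 5 shipping', True)]
```

Steps marked False are candidates for removal; steps marked True are the ones the conclusion actually depends on.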

Examples

LLM prompting (Wei et al. 2022) · computer-science

few-shot prompting with worked step-by-step exemplars, which improves performance on reasoning tasks in sufficiently large models.

Mathematical proofs · mathematics

the steps are the proof; the conclusion alone is just a claim.

Gawande’s Checklist Manifesto documents the structural value of forced explicit reasoning across domains where intuitive expertise was assumed to be sufficient: aviation, surgery, construction, finance. The book’s core observation is that high-skill practitioners regularly skip steps they “know”, and that the skipped steps are disproportionately the ones whose failures are costly. Boeing’s pilot checklists were the first large-scale institutional response (after the B-17 prototype crash of 1935 was traced to a forgotten gust-lock); the WHO Surgical Safety Checklist, whose development Gawande led (2008-09), reduced major surgical complications by ~36% and mortality by ~47% across a study of eight hospitals worldwide.

The structural shape (an explicit step-by-step reasoning trace exposed to the practitioner as part of the work, not just held internally) is chain-of-thought institutionalized into doctrine. Checklists work for the same reasons CoT prompting works for language models: forcing the reasoning to be visible (a) ensures each step is actually executed rather than skimmed, (b) makes omissions and errors auditable, and (c) constrains pattern-completion to follow the prescribed sequence rather than jumping to a familiar-looking conclusion.

Inference: The cross-domain transfer sharpens both directions. CoT prompting can be read as importing the checklist pattern into LLM behavior; the WHO surgical checklist can be read as exporting LLM-style trace-exposure to human-expert workflows. The discipline that makes both work is the same: keep the steps load-bearing (not decorative) and visible (not internal), and the failure rate drops.

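
Properties (a) and (b) above can be sketched in a few lines: every item is executed in the prescribed order, and the outcome of each is recorded. This is a minimal illustration, not a model of any real checklist. The `CheckItem` type, the `run_checklist` function, and the item names are all hypothetical, with the gust-lock item chosen only to echo the B-17 example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CheckItem:
    name: str
    check: Callable[[], bool]  # returns True when the item is satisfied


def run_checklist(items: List[CheckItem]) -> Tuple[bool, List[Tuple[str, bool]]]:
    """Run items in order, logging each result; stop at the first failure."""
    log: List[Tuple[str, bool]] = []
    for item in items:
        ok = bool(item.check())
        log.append((item.name, ok))  # the log is the auditable trace
        if not ok:
            return False, log
    return True, log


# Illustrative state only; not an actual pre-flight checklist.
state = {"flaps_set": True, "gust_lock_released": False, "fuel_checked": True}

items = [
    CheckItem("flaps set", lambda: state["flaps_set"]),
    CheckItem("gust lock released", lambda: state["gust_lock_released"]),
    CheckItem("fuel checked", lambda: state["fuel_checked"]),
]

passed, log = run_checklist(items)
print(passed, log)
# → False [('flaps set', True), ('gust lock released', False)]
```

Stopping at the first failure is a design choice of this sketch; a variant that runs every item and reports all failures would make the full set of omissions visible at once.
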
the reviewer’s reasoning trace is itself the artifact; the recommendation alone is opaque.

Engineering design documents (ADRs, or Architecture Decision Records; Google’s design-doc culture; RFC processes; Amazon’s six-page narratives) exist precisely so the reasoning trace behind a technical choice is exposed as part of the deliverable, not just consumed internally by the team that made the decision. A design doc that captures only “we chose React” without the alternatives considered, the constraints that ruled them out, and the load-bearing tradeoff that broke the tie has lost the chain-of-thought; future maintainers see the conclusion but cannot reconstruct or audit the reasoning.

The format itself is a chain-of-thought template institutionalized for engineering judgment. ADR conventions standardize the trace: Context → Options → Decision → Consequences. Amazon’s six-pager forces the writer to expose their reasoning at sufficient depth that the document substitutes for a live discussion. The structural value isn’t the document (most are read at most twice) but the act of writing that forces the implicit reasoning into explicit form, which surfaces gaps that a quick verbal pitch would have papered over.

Inference: The diagnostic for whether an engineering team’s design-doc practice is load-bearing or cargo-cult is to read the docs and ask: “if the conclusion section had been swapped with a different reasonable conclusion, would the trace still support it?” If yes, the trace is decoration and the document isn’t doing CoT work. If no, the trace genuinely constrains the conclusion and the doc is earning its keep.

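
The Context → Options → Decision → Consequences shape can be written down as a small record type with a completeness check. This is a sketch under assumptions: the `ADR` class and the `missing_trace` function are hypothetical names, and the check only catches a structurally missing trace (such as a bare “we chose React”). It cannot tell whether the recorded reasoning actually constrains the decision.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ADR:
    context: str
    options: List[str]
    decision: str
    consequences: str


def missing_trace(adr: ADR) -> List[str]:
    """List the structural gaps that leave the decision without a visible trace."""
    problems: List[str] = []
    if not adr.context.strip():
        problems.append("context is empty")
    if len(adr.options) < 2:
        problems.append("fewer than two options recorded")
    if adr.decision not in adr.options:
        problems.append("decision is not one of the recorded options")
    if not adr.consequences.strip():
        problems.append("consequences are empty")
    return problems


bare = ADR(context="", options=["React"], decision="React", consequences="")
print(missing_trace(bare))
# → ['context is empty', 'fewer than two options recorded', 'consequences are empty']
```
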

Kojima et al.’s “Large Language Models are Zero-Shot Reasoners” (NeurIPS 2022) is one of the cleanest demonstrations of chain-of-thought as a structural intervention. The finding is almost startlingly simple: appending the single instruction “Let’s think step by step” to a question prompts a large language model to emit an explicit reasoning trace before its final answer, and this alone sharply improves accuracy on arithmetic and symbolic-reasoning benchmarks, without any worked examples in the prompt. Where Wei et al. (2022) had elicited chain-of-thought by supplying hand-written few-shot exemplars, Kojima et al. showed the behavior could be triggered zero-shot by one task-agnostic phrase.

This instantiates chain-of-thought precisely: reasoning that is exposed as part of the output rather than consumed silently, as in “think out loud, then answer.” The paper’s striking implication is that making the reasoning explicit does not merely make it auditable; it improves the answer itself. The model already possessed latent reasoning capacity, but absent the instruction to externalize its steps, it tended to leap to an answer and get it wrong. Forcing the intermediate trace to surface changed the computation, not just its visibility. That distinguishes chain-of-thought from mere logging: the act of laying the steps out is part of what produces a better result, which is why the trace is load-bearing rather than decorative.

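
The zero-shot trigger described above amounts to a small change in how the prompt is built. The sketch below assumes a hypothetical `complete` callable that takes a prompt string and returns the model’s continuation; no real client library is implied. The second call, which asks for the final answer after the trace, and the exact “Q:/A:” layout are assumptions of this sketch rather than something stated in the text above.

```python
from typing import Callable, Tuple

TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Append the task-agnostic trigger phrase; no worked examples are included."""
    return f"Q: {question}\nA: {TRIGGER}"


def answer_with_trace(
    question: str, complete: Callable[[str], str]
) -> Tuple[str, str]:
    """Elicit the reasoning trace first, then ask for the answer given that trace."""
    reasoning_prompt = zero_shot_cot_prompt(question)
    trace = complete(reasoning_prompt)
    # Assumed second stage: feed the trace back and request the conclusion.
    answer_prompt = f"{reasoning_prompt} {trace}\nTherefore, the answer is"
    answer = complete(answer_prompt).strip()
    return trace, answer


# A stand-in for a model, used only to show the call pattern.
def fake_complete(prompt: str) -> str:
    if prompt.endswith("Therefore, the answer is"):
        return " 4"
    return "2 plus 2 is 4."


trace, answer = answer_with_trace("What is 2+2?", fake_complete)
print(trace, "|", answer)
# → 2 plus 2 is 4. | 4
```

Returning the trace alongside the answer keeps it part of the deliverable, so a caller or evaluator can inspect the steps rather than only the conclusion.
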
methods + results + discussion sections are the trace; the abstract is the conclusion.
Software engineering: design docs (Amazon’s narrative-memo culture; Google’s design-doc tradition).
the rationale for each item is shown so the team understands why the order matters.