computer-science law mathematics medicine-and-health philosophy

Chain of thought

Description

An explicit reasoning trace is exposed as part of the output, not just consumed internally. The agent “thinks out loud” — produces intermediate steps that lead to the conclusion — and the trace is itself part of the deliverable, not just scaffolding. The structural property: making reasoning visible changes the reasoning itself, by forcing structure on what might otherwise be quick-pattern-matching. The diagnostic question — “is the trace doing work, or is it decoration?” — separates real chain-of-thought from “showing the right answer with intermediate steps that don’t actually constrain it.” Real CoT has the property that following the trace produces the conclusion; decoration has steps that could be replaced or removed without changing the answer.

Triggers

User-initiated: User asks for reasoning steps, “show your work,” “think step by step,” or describes opacity in agent outputs. Vocabulary cues: “chain of thought,” “CoT,” “think step by step,” “show your work,” “reasoning trace.”
Agent-initiated: Agent recognizes that exposing intermediate reasoning will improve its own output or auditability. Candidate inference: “this task is reasoning-heavy; produce a chain-of-thought before concluding.”
Situation-shape signals: Tasks where the agent could pattern-match to a wrong answer; multi-step reasoning where one intermediate error invalidates the whole conclusion; outputs that need to be auditable; debugging conversations where the path-to-error matters more than the error itself.

Exclusions

Trivial tasks — asking for chain-of-thought on “what’s 2+2” produces ritual without value.
Time-critical decisions — when latency budgets don’t allow extended reasoning, CoT’s iteration cost exceeds its quality benefit.
Pattern-matchable tasks where intuition is reliable — some tasks are better solved by direct pattern recognition; forcing CoT can degrade performance.
Adversarial settings — exposing reasoning trace can leak information (e.g., revealing how a security system makes decisions).
Decorative CoT — when steps don’t actually constrain the answer, CoT becomes performance not substance; the agent could replace the trace with random plausible steps and still produce the same conclusion.

Structure

Relationships

Relationship neighborhood of chain-of-thought: a graph of the concepts it connects to and the concepts it is a part of.

reflection — chain-of-thought + reflection = think-then-reflect-on-thinking; the trace becomes the object of self-critique.
loop-completion — CoT makes gaps in reasoning visible; without the trace, missing steps are hard to catch.
doctrine — many CoT prompts are explicit doctrines (“when answering math problems: show work; check answer; verify against constraints”).
evaluator-optimizer — CoT makes the generator’s output more critique-able for the evaluator; the trace gives the evaluator something to assess beyond the final answer.
load-bearing — diagnostic: which steps in the trace are load-bearing? Removing decorative steps tightens the chain.
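
The load-bearing diagnostic can be sketched as an ablation test: drop each step in turn and see whether the conclusion changes. This is a minimal illustration, not a standard tool. The names `load_bearing_steps` and `toy_conclude` are hypothetical, and the toy reasoner stands in for whatever process actually derives a conclusion from a trace.

```python
from typing import Callable, Sequence


def load_bearing_steps(
    steps: Sequence[str],
    conclude: Callable[[Sequence[str]], str],
) -> list[bool]:
    """For each step, report whether removing it changes the conclusion."""
    baseline = conclude(steps)
    flags = []
    for i in range(len(steps)):
        ablated = list(steps[:i]) + list(steps[i + 1:])
        flags.append(conclude(ablated) != baseline)
    return flags


# Hypothetical toy reasoner: applies "add N" / "multiply N" steps in order
# and silently ignores any step it cannot parse.
def toy_conclude(steps: Sequence[str]) -> str:
    total = 0
    for step in steps:
        op, _, arg = step.partition(" ")
        if op == "add" and arg.isdigit():
            total += int(arg)
        elif op == "multiply" and arg.isdigit():
            total *= int(arg)
    return str(total)


trace = ["add 3", "note that the problem looks familiar", "multiply 4", "add 0"]
flags = load_bearing_steps(trace, toy_conclude)
for step, is_load_bearing in zip(trace, flags):
    print("load-bearing" if is_load_bearing else "decorative  ", "|", step)
```

In this toy trace the second and fourth steps are flagged as decorative: removing either leaves the conclusion unchanged. A single-step ablation like this will miss steps that are only redundant in combination, so it is a first-pass check rather than a proof.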

Examples

LLM prompting (Wei et al. 2022) · computer-science

few-shot prompting with worked, step-by-step exemplars, which improves performance on multi-step reasoning tasks, most clearly in the largest models.

Mathematical proofs · mathematics

the steps are the proof; the conclusion alone is just a claim.

Atul Gawande (2009), *The Checklist Manifesto* — checklists as institutionalized chain-of-thought. · medicine-and-health

Gawande’s Checklist Manifesto documents the structural value of forced explicit reasoning across domains where intuitive expertise was assumed to be sufficient: aviation, surgery, construction, finance. The book’s core observation is that high-skill practitioners regularly skip steps they “know,” and that the skipped steps are disproportionately the ones whose failures are costly. Boeing’s pilot checklists were the first large-scale institutional response (after the B-17 prototype crash of 1935 was traced to a forgotten gust-lock); the WHO Surgical Safety Checklist, whose development Gawande led (2008-09), reduced major surgical complications by ~36% and mortality by ~47% in relative terms across a study of eight hospitals worldwide.

The structural shape (an explicit step-by-step reasoning trace exposed to the practitioner as part of the work, not just held internally) is chain-of-thought institutionalized into doctrine. Checklists work for the same reasons CoT prompting works for language models: forcing the reasoning to be visible (a) ensures each step is actually executed rather than skimmed, (b) makes omissions and errors auditable, and (c) constrains pattern-completion to follow the prescribed sequence rather than jumping to a familiar-looking conclusion.

Inference: The cross-domain transfer sharpens both directions. CoT prompting can be read as importing the checklist pattern into LLM behavior; the WHO surgical checklist can be read as exporting LLM-style trace-exposure to human-expert workflows. The discipline that makes both work is the same: keep the steps load-bearing (not decorative) and visible (not internal), and the failure rate drops.

Code review discussions · computer-science

the reviewer’s reasoning trace is itself the artifact; the recommendation alone is opaque.

Engineering design docs · computer-science

Engineering design documents (ADRs, or Architecture Decision Records; Google’s design-doc culture; RFC processes; Amazon’s six-page narratives) exist precisely so the reasoning trace behind a technical choice is exposed as part of the deliverable, not just consumed internally by the team that made the decision. A design doc that captures only “we chose React” without the alternatives considered, the constraints that ruled them out, and the load-bearing tradeoff that broke the tie has lost the chain-of-thought; future maintainers see the conclusion but cannot reconstruct or audit the reasoning.

The format itself is a chain-of-thought template institutionalized for engineering judgment. ADR conventions standardize the trace: Context → Options → Decision → Consequences. Amazon’s six-pager forces the writer to expose their reasoning at sufficient depth that the document substitutes for a live discussion. The structural value isn’t the document (most are read at most twice) but the act of writing, which forces the implicit reasoning into explicit form and surfaces gaps that a quick verbal pitch would have papered over.

Inference: The diagnostic for whether an engineering team’s design-doc practice is load-bearing or cargo-cult is to read the docs and ask: “if the conclusion section had been swapped with a different reasonable conclusion, would the trace still support it?” If yes, the trace is decoration and the document isn’t doing CoT work. If no, the trace genuinely constrains the conclusion and the doc is earning its keep.
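
One way to picture the Context → Options → Decision → Consequences shape is as a small record type that can report where its own trace is missing. This is a sketch under assumptions: the `DecisionRecord` and `Option` classes, their field names, and the `gaps` check are hypothetical and do not come from any ADR tool, and the example reasons are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    rejected_because: str = ""  # left empty for the option that was chosen


@dataclass
class DecisionRecord:
    context: str
    options: list[Option]
    decision: str
    consequences: str

    def gaps(self) -> list[str]:
        """List the places where the reasoning trace is absent."""
        problems = []
        if not self.context.strip():
            problems.append("missing context")
        if self.decision not in [o.name for o in self.options]:
            problems.append("decision is not one of the listed options")
        if len(self.options) < 2:
            problems.append("no alternatives considered")
        for option in self.options:
            if option.name != self.decision and not option.rejected_because.strip():
                problems.append(f"no reason given for rejecting {option.name}")
        if not self.consequences.strip():
            problems.append("missing consequences")
        return problems


# The bare "we chose React" record: a conclusion with no trace behind it.
bare = DecisionRecord(context="", options=[Option("React")],
                      decision="React", consequences="")

# A record whose trace is filled in (contents are invented for illustration).
full = DecisionRecord(
    context="Team of four; existing component library is written for React.",
    options=[Option("React"),
             Option("Vue", rejected_because="would require rewriting the component library")],
    decision="React",
    consequences="New hires need React onboarding; library reuse is preserved.",
)

print(bare.gaps())
print(full.gaps())
```

A check like this only confirms that the slots are filled. Whether the stated reasons actually constrain the decision still takes the swap test described above.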

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). "Large Language Models are Zero-Shot Reasoners." *Advances in Neural Information Processing Systems* (NeurIPS) 35 — zero-shot chain-of-thought via "Let's think step by step." · computer-science

Kojima et al.’s “Large Language Models are Zero-Shot Reasoners” (NeurIPS 2022) is one of the cleanest demonstrations of chain-of-thought as a structural intervention. The finding is almost startlingly simple: appending the single instruction “Let’s think step by step” to a question prompts a large language model to emit an explicit reasoning trace before its final answer, and this alone sharply improves accuracy on arithmetic and symbolic-reasoning benchmarks, without any worked examples in the prompt. Where Wei et al. (2022) had elicited chain-of-thought by supplying hand-written few-shot exemplars, Kojima et al. showed the behavior could be triggered zero-shot by one task-agnostic phrase.

This instantiates chain-of-thought precisely: reasoning that is exposed as part of the output rather than consumed silently (“think out loud, then answer”). The paper’s striking implication is that making the reasoning explicit does not merely make it auditable; it improves the answer itself. The model already possessed latent reasoning capacity, but absent the instruction to externalize its steps, it tended to leap to an answer and get it wrong. Forcing the intermediate trace to surface changed the computation, not just its visibility. That distinguishes chain-of-thought from mere logging: the act of laying the steps out is part of what produces a better result, which is why the trace is load-bearing rather than decorative.
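
A minimal sketch of the zero-shot pattern, assuming a generic text-completion callable: `complete` is a hypothetical parameter, not a real library API, and `scripted_model` is a canned stand-in so the example runs offline. The first call uses the trigger phrase to elicit the trace; the second feeds the trace back to extract a final answer. The wording of the second prompt (`ANSWER_TRIGGER`) is an assumption chosen for illustration.

```python
from typing import Callable

REASONING_TRIGGER = "Let's think step by step."
# Assumed wording for the answer-extraction prompt (illustrative only).
ANSWER_TRIGGER = "Therefore, the answer is"


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> tuple[str, str]:
    """Return (trace, answer) using two calls to a text-completion function."""
    # Stage 1: the trigger phrase asks for a reasoning trace, with no exemplars.
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    trace = complete(reasoning_prompt).strip()
    # Stage 2: feed the trace back so the final answer is conditioned on it.
    answer_prompt = f"{reasoning_prompt} {trace}\n{ANSWER_TRIGGER}"
    answer = complete(answer_prompt).strip()
    return trace, answer


# Hypothetical stand-in for a model; it returns canned text and records prompts.
prompts_seen: list[str] = []


def scripted_model(prompt: str) -> str:
    prompts_seen.append(prompt)
    if prompt.endswith(ANSWER_TRIGGER):
        return " 12"
    return " There are 3 boxes with 4 books each, so 3 * 4 = 12 books."


trace, answer = zero_shot_cot(
    "A shelf holds 3 boxes with 4 books in each box. How many books are on the shelf?",
    scripted_model,
)
print(trace)
print(answer)
```

Because the second prompt contains the trace, the answer is produced with the reasoning in context, which is the sense in which the trace does work rather than merely being logged.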

Legal briefs and judicial opinions · law

the reasoning is the contribution; a bare holding gives later courts little guidance on how far it extends.

Scientific papers · philosophy

methods + results + discussion sections are the trace; the abstract is the conclusion.

Software engineering: design docs (Amazon's narrative-memo culture; Google's design-doc tradition). · computer-science


Surgical checklists explained · medicine-and-health

the rationale for each item is shown so the team understands why the order matters.

Wei et al. (2022), "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" — https://arxiv.org/abs/2201.11903; Kojima et al. (2022), "Large Language Models are Zero-Shot Reasoners"; broader lineage in mathematical proofs, legal briefs, and scientific papers. · computer-science

Wei et al.’s 2022 paper (“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv:2201.11903) demonstrated that prompting large language models to produce intermediate reasoning steps before a final answer, rather than emitting only the answer, substantially improves performance on multi-step arithmetic, commonsense, and symbolic reasoning benchmarks. The improvement was largest for the largest models, suggesting an emergent capability that scale unlocks rather than a generic prompting trick.

The paper introduced “chain-of-thought” as the now-standard term for the technique. Kojima et al. (2022) extended the result to zero-shot settings with “Let’s think step by step” as the canonical trigger phrase. Together, these papers established that exposing the reasoning trace, that is, making the process visible, both improves the answer and makes it auditable, which is the load-bearing structural property the catalog’s chain-of-thought primitive names.
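
The difference between answer-only and chain-of-thought few-shot prompts can be sketched as two prompt builders that differ only in whether each exemplar carries its worked steps. The `Exemplar` type, the function names, the "Q:/A:" layout, and the sample questions are assumptions made for illustration, not the paper's exact prompts.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Exemplar:
    question: str
    reasoning: str  # the worked intermediate steps
    answer: str


def answer_only_prompt(exemplars: Sequence[Exemplar], question: str) -> str:
    """Standard few-shot prompt: each exemplar shows only the final answer."""
    shots = [f"Q: {ex.question}\nA: The answer is {ex.answer}." for ex in exemplars]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])


def chain_of_thought_prompt(exemplars: Sequence[Exemplar], question: str) -> str:
    """Few-shot CoT prompt: each exemplar shows its steps before the answer."""
    shots = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])


# Invented exemplar and question, used only to show the prompt shapes.
exemplars = [
    Exemplar(
        question="A crate holds 6 jars. How many jars are in 4 crates?",
        reasoning="Each crate holds 6 jars. 4 crates hold 4 * 6 = 24 jars.",
        answer="24",
    )
]
new_question = "A tray holds 8 cups. How many cups are on 5 trays?"

baseline_prompt = answer_only_prompt(exemplars, new_question)
cot_prompt = chain_of_thought_prompt(exemplars, new_question)
print(cot_prompt)
```

Both prompts end at an open "A:", so the only variable is whether the exemplars model the intermediate steps the completion is expected to imitate.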
