computer-science engineering-and-technology medicine-and-health

Root cause analysis

Description

Trace the symptom backward through its causal chain (“why did this happen?”, “and why did THAT?”, “and why did THAT?”) until you reach the deepest fixable node, and put the countermeasure there. The shape’s load-bearing claim is that surface symptoms are almost always downstream of a deeper cause; addressing the symptom without addressing the cause leaves the cause live, and the symptom re-emerges in the next instance.
The canonical operationalization is Sakichi Toyoda’s “5 Whys”, formalized in the Toyota Production System by Taiichi Ohno: ask “why?” repeatedly until the chain bottoms out at something that, once changed, prevents recurrence. The Toyota-canonical illustration walks down a machine-stop: fuse blew → bearing not lubricated → pump not pumping → pump shaft worn → no strainer, metal scrap got in. Replacing the fuse “fixes” the symptom; fitting the strainer is the countermeasure that collapses the whole tree. The “five” is a heuristic, not a magic number: sometimes three “why”s reach the root, sometimes seven, and a key part of the discipline is recognizing when you’ve actually arrived.
The structural shape: surface symptom + causal chain + fixable origin. The search geometry is vertical: depth-first descent along one chain. This distinguishes root-cause-analysis from its diagnostic sibling differential-diagnosis, whose geometry is lateral: breadth-first enumeration of candidate causes at a single level, narrowed by discriminating tests. RCA asks “and why that?”; DDx asks “and what else could it be?”. They compose (each link in an RCA chain may be identified via a mini-DDx, and a completed DDx may prompt an RCA into the winning candidate’s mechanism), but the unit operation is distinct, and choosing the wrong geometry for the problem is one of the recurrent ways diagnostic work goes wrong.
RCA has several named variants and adjacent disciplines: the Ishikawa “fishbone” diagram (Kaoru Ishikawa, 1960s) organizes the chain by category (people, process, equipment, materials, environment, measurement) and was Toyota-adjacent quality-control practice; fault-tree analysis (Bell Labs, 1962, originally for the Minuteman missile system) builds out the chain as a Boolean tree with AND/OR gates and is the dominant form in aerospace and nuclear safety; blameless post-mortem (Allspaw 2012, popularized via the Google SRE Book 2016) is the software/SRE adaptation, with the explicit discipline that “operator error” is treated as a symptom whose root lies deeper in the system, not as a stopping point. The concept’s exclusions are load-bearing for not over-applying it. RCA assumes a single chain with a terminal root; when the causal structure is a cycle (predator-prey, deadlock), when it’s a confluence of necessary-but-not-sufficient factors (Bhopal, modern cloud outages, complex socio-technical systems), or when the deepest discovered node is unfixable, the shape category-mismatches. The Resilience Engineering school (Hollnagel, Dekker, Woods, Allspaw) explicitly critiques single-root framing for complex socio-technical systems and advocates for “contributing factors” or “confluence of conditions” instead.
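The Boolean AND/OR structure of a fault tree, mentioned above, can be made concrete with a minimal sketch. The class names (`Basic`, `Gate`), the `occurs` function, and the example tree with its event names are all hypothetical and chosen only for illustration; real fault-tree tooling also handles probabilities and minimal cut sets, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class Basic:
    """A leaf event that either is or is not present."""
    name: str


@dataclass
class Gate:
    """An intermediate event that combines its children with AND or OR."""
    kind: str
    children: List[Union["Gate", Basic]]


def occurs(node: Union[Gate, Basic], present: Set[str]) -> bool:
    """Evaluate whether the event at `node` occurs, given the basic events present."""
    if isinstance(node, Basic):
        return node.name in present
    results = [occurs(child, present) for child in node.children]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate kind: {node.kind!r}")


# Hypothetical tree: the machine stops on power loss, OR when a worn pump
# shaft coincides with a missing low-oil alarm (both needed, hence AND).
machine_stops = Gate("OR", [
    Basic("power_loss"),
    Gate("AND", [Basic("pump_shaft_worn"), Basic("no_low_oil_alarm")]),
])

print(occurs(machine_stops, {"pump_shaft_worn"}))                      # False
print(occurs(machine_stops, {"pump_shaft_worn", "no_low_oil_alarm"}))  # True
print(occurs(machine_stops, {"power_loss"}))                           # True
```

The AND gate is where a fault tree departs from a single why-chain: one branch of the tree can require several conditions at once, which a linear chain cannot express.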

Aliases

RCA is the standard abbreviation across industrial quality, aerospace, healthcare, and software incident response. 5 Whys is the Toyota-canonical instance: strictly a subset of RCA (the iterative-why version), but often used loosely as a synonym. The catalog admits both as aliases because both name the same vertical-chain-tracing shape; the difference is granularity of formalization, not structural distinctness.
This section also covers the named variants that are close enough to be confusable. Ishikawa fishbone (Kaoru Ishikawa, 1960s) is a category-organized RCA: the chain is built out as a six-branch diagram (people / process / equipment / materials / environment / measurement) before being traced. Fault-tree analysis (FTA, Bell Labs 1962) is the aerospace/nuclear variant, with explicit Boolean AND/OR gates representing how multiple causes combine. These are not aliases for RCA; they are RCA’s specializations, with their own structural commitments, but they share the vertical-descent shape.
Etymologically, “root cause” is a metaphor from botany: the root is the part of the plant below the ground, the source from which everything visible grows; cutting the visible plant leaves the root alive to regrow. The metaphor captures the discipline’s central claim: the symptom you see is downstream of an unseen origin; fixing what’s visible without addressing the root leaves the system in the same recurrence-prone state.

Triggers

User-initiated: User asks “why did this happen?” and follows up with “and why did THAT?” in repeated why chains. Or says “we need to find the root cause,” “treat the cause not the symptom,” “let’s do an RCA / 5 Whys / fishbone / fault tree on this,” “the underlying issue is…”. Or describes shipping a fix that didn’t stick (the symptom came back) and asks how to make it stick this time.
Agent-initiated: Agent notices that a proposed fix targets the surface symptom rather than the cause, and the symptom is the kind that will recur (recurring bug, recurring outage, repeated patient presentation, repeated production defect). Candidate inference: “we should trace this back: what’s the deepest fixable node, and how do we know we’ve arrived?”
Situation-shape signals: A recurring failure that earlier fixes didn’t eliminate. A post-mortem or incident review. A machine, system, or process that stopped working with an identifiable proximate cause but unknown deeper history. A patient whose symptoms keep recurring under symptomatic treatment. A quality defect that escaped earlier inspection. A test that fails and the fix is to suppress the test rather than understand the failure.

Exclusions

Differential diagnosis (lateral discrimination among candidates) — the load-bearing exclusion. Differential-diagnosis fans OUT at one level, enumerating multiple plausible candidates and applying tests that discriminate among them. Root-cause-analysis descends DOWN one chain, asking “and why that?” link by link until it reaches a fixable origin. The search geometry is different: lateral fan-out for DDx, vertical descent for RCA. They compose — at each RCA link you may run a mini-DDx — but the unit operation is distinct. Mistaking a multi-candidate puzzle for an RCA-shaped problem produces premature commitment to one chain and misses the better candidate; mistaking an RCA-shaped problem for a DDx produces shallow horses-not-zebras narrowing that stops at the proximate symptom.
Circular causality (A causes B causes A) — no terminal root to find. When the system’s causal structure is a cycle rather than a chain (predator-prey dynamics; deadlock between mutually-waiting processes; the spiraling feedback between debt and interest), there’s no deepest fixable node — every node is “caused by” the next and the chain wraps. The right move is to identify and break the cycle (feedback-loop terminology), not to keep descending. RCA’s implicit causal-chain assumption fails here.
Multi-causal systems with no single root — the “perfect storm” shape. Complex socio-technical failures (Bhopal, Three Mile Island, modern cloud outages) are a CONFLUENCE of multiple necessary-but-not-sufficient factors: bad deploy AND CI gap AND monitoring miss AND runbook error AND staffing-state, all required. Removing one factor might prevent this particular instance but leaves the others live. Allspaw and the Resilience Engineering school (Hollnagel, Dekker, Woods) explicitly critique single-root framing in these systems and advocate for “contributing factors.” When the causal graph is a confluence not a chain, RCA’s shape category-mismatches; treating contributing factors as if one were “the” root produces a false closure.
Proximate cause / catalyst vs. root. The proximate cause is the last link before the symptom (the spark); the root is the deepest fixable origin (the leaking gas line that let the spark be catastrophic). RCA’s discipline includes refusing to stop at the catalyst — the trigger or accelerant is usually not the originator. Cargo-cult RCA stops at the proximate cause (the operator who pressed the button) and ships a countermeasure aimed at the catalyst rather than the originating node; the symptom returns under the next operator. Toyota’s “fix” vs “countermeasure” distinction codifies this: a fix patches the symptom, a countermeasure addresses the root.
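The exclusions above amount to a shape check that can run before committing to a descent. The sketch below is one way to express it, assuming the causal structure has already been written down as a mapping from each effect to the conditions that jointly produce it. The function name `classify_shape`, that encoding, and the example graphs are illustrative assumptions, not an established method.

```python
from typing import Dict, List


def classify_shape(causes: Dict[str, List[str]], symptom: str) -> str:
    """Classify the causal structure behind `symptom` as 'chain', 'cycle' or 'confluence'.

    `causes[effect]` lists the conditions assumed to be jointly required for
    `effect`; a node with no entry is treated as an origin.
    """
    on_path = set()   # nodes on the current descent path
    done = set()      # nodes whose ancestry has been fully explored
    joint = False     # set once any node needs more than one cause

    def walk(node: str) -> bool:
        nonlocal joint
        if node in on_path:
            return True            # descent wrapped around: a cycle
        if node in done:
            return False
        on_path.add(node)
        parents = causes.get(node, [])
        if len(parents) > 1:
            joint = True
        for parent in parents:
            if walk(parent):
                return True
        on_path.discard(node)
        done.add(node)
        return False

    if walk(symptom):
        return "cycle"
    return "confluence" if joint else "chain"


machine = {"fuse_blown": ["bearing_dry"], "bearing_dry": ["pump_weak"],
           "pump_weak": ["shaft_worn"], "shaft_worn": ["no_strainer"]}
deadlock = {"a_waits": ["b_waits"], "b_waits": ["a_waits"]}
outage = {"outage": ["bad_deploy", "ci_gap", "monitoring_miss",
                     "runbook_error", "staffing_state"]}

print(classify_shape(machine, "fuse_blown"))  # chain
print(classify_shape(deadlock, "a_waits"))    # cycle
print(classify_shape(outage, "outage"))       # confluence
```

Only the first result fits RCA's single-chain assumption; the other two signal that breaking a loop or listing contributing factors is the better-matched operation.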

Structure

Relationships

Relationship neighborhood of root-cause-analysis: a graph of the concepts it connects to and the concepts it is a part of.

differential-diagnosis — sibling diagnostic shape on the same goal (find the true cause), different search geometry (vertical descent vs lateral discrimination). The pair surfaces a useful axis: the diagnostic-shape question is not “what kind of inference?” but “lateral or vertical?”, and competent diagnostic work composes them.
schema-anomaly — typical RCA entry point. The recognized anomaly (the machine stopped, the test failed, the symptom appeared) is what motivates the descent. Schema-anomaly is upstream of both RCA and find-the-game; the choice between them depends on whether you want to trace the anomaly’s cause or treat the anomaly itself as load-bearing.
load-bearing — the root is the load-bearing cause: removing it collapses the downstream symptom tree; removing a non-root link doesn’t. Load-bearing gives RCA its operational test for “have I arrived at the fixable origin?”
cargo-cult — the failure mode. Cargo-cult RCA stops at the proximate cause and ships a fix shaped like a root-fix without the load-bearing force-dynamic. Toyota’s “fix” vs “countermeasure” distinction names exactly this asymmetry.
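The load-bearing test described in the list above ("removing it collapses the downstream symptom tree; removing a non-root link doesn't") can be sketched on the Toyota machine-stop chain. The model is deliberately simplified and its assumptions are mine: a one-off repair of a non-origin node is regenerated as long as its cause persists, and downstream damage is assumed to be cleaned up once the origin is gone. The names `CAUSED_BY`, `persists`, and `load_bearing_nodes` are hypothetical.

```python
from typing import Dict, List

# effect -> its direct cause, following the machine-stop chain
CAUSED_BY: Dict[str, str] = {
    "fuse_blown": "bearing_dry",
    "bearing_dry": "pump_weak",
    "pump_weak": "shaft_worn",
    "shaft_worn": "no_strainer",
}


def persists(node: str, repaired: str) -> bool:
    """Return True if `node` is still present (or comes back) after `repaired` is fixed."""
    cause = CAUSED_BY.get(node)
    if cause is None:
        # An origin node has nothing upstream to regenerate it.
        return node != repaired
    # A non-origin node comes back whenever its cause persists,
    # even if the node itself was the thing repaired.
    return persists(cause, repaired)


def load_bearing_nodes(symptom: str) -> List[str]:
    """Nodes whose repair stops the symptom from recurring."""
    nodes = [symptom] + list(CAUSED_BY.values())
    return [n for n in nodes if not persists(symptom, n)]


print(persists("fuse_blown", "fuse_blown"))   # True: replacing the fuse does not hold
print(persists("fuse_blown", "no_strainer"))  # False: fitting the strainer does
print(load_bearing_nodes("fuse_blown"))       # ['no_strainer']
```

Under these assumptions only the origin passes the test, which is the asymmetry the "fix" versus "countermeasure" vocabulary points at.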

Examples

Ohno, T. (1988). *Toyota Production System: Beyond Large-Scale Production*. Productivity Press. (Original Japanese edition 1978.) Page 17. · engineering-and-technology

The canonical exposition of the “5 Whys” technique. Taiichi Ohno credits Sakichi Toyoda (the founder of Toyota Industries and inventor of the automatic loom) with the practice of “observing with a blank mind” and asking “why?” repeatedly to find the source of mechanical failures. Ohno codified it as a formal management tool inside the Toyota Production System and set it out on page 17 of Toyota Production System: Beyond Large-Scale Production.

The book’s worked illustration of a stopped machine:

Why did the machine stop? There was an overload and the fuse blew.
Why was there an overload? The bearing was not sufficiently lubricated.
Why was it not lubricated sufficiently? The lubrication pump was not pumping sufficiently.
Why was it not pumping sufficiently? The shaft of the pump was worn and rattling.
Why was the shaft worn out? There was no strainer attached and metal scrap got in.

Replacing the fuse “fixes” the symptom, and the same failure returns in a few months. Fitting a strainer is the countermeasure that prevents recurrence. Ohno’s terminology codifies the distinction: a “fix” is a patch on the symptom; a “countermeasure” is the change at the root that collapses the whole symptom tree.

Inference: three pieces of Toyota-canonical doctrine travel with the technique and are part of the concept’s structural shape. (1) The five is heuristic, not magical: sometimes three “why”s reach the root, sometimes seven, and the discipline is recognizing arrival, not counting iterations. (2) The analysis must be done at the gemba, the actual place where the work happens, not from reports or in a conference room, or the chain reaches the wrong root. (3) A known limitation is the linearity trap: the technique presumes a single chain and a single root, and category-mismatches against the multi-causal confluence shape that drives complex socio-technical failures.
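The descent itself can be sketched as a loop with no built-in count, which is one way to see why the "five" is heuristic: the loop ends when no deeper answer is available, not after a fixed number of iterations. The dictionary below paraphrases the answers in the worked illustration; the names `WHY` and `descend` are hypothetical, and in practice each answer comes from observation at the gemba, not from a lookup table.

```python
from typing import Dict, List

# Each observed condition mapped to the answer to "why?" (paraphrased).
WHY: Dict[str, str] = {
    "machine stopped": "overload blew the fuse",
    "overload blew the fuse": "bearing insufficiently lubricated",
    "bearing insufficiently lubricated": "lubrication pump not pumping sufficiently",
    "lubrication pump not pumping sufficiently": "pump shaft worn and rattling",
    "pump shaft worn and rattling": "no strainer attached, metal scrap got in",
}


def descend(symptom: str, why: Dict[str, str]) -> List[str]:
    """Follow "why?" answers from the symptom until no deeper answer exists."""
    chain = [symptom]
    visited = {symptom}
    while chain[-1] in why:
        cause = why[chain[-1]]
        if cause in visited:
            # The chain wrapped around: this is a cycle, not an RCA-shaped problem.
            raise ValueError(f"causal cycle detected at {cause!r}")
        chain.append(cause)
        visited.add(cause)
    return chain


chain = descend("machine stopped", WHY)
print(len(chain) - 1)  # 5 whys asked
print(chain[-1])       # no strainer attached, metal scrap got in
```

A shorter or longer table yields a different count with the same code; what stays fixed is the stopping rule.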

Standard clinical medicine pedagogy. See *Harrison's Principles of Internal Medicine* (Part 2 on cardinal manifestations vs. Part 5 on infectious diseases) for the symptomatic-vs-etiologic distinction; Roy Porter, *The Greatest Benefit to Mankind: A Medical History of Humanity* (1997) for the Koch-postulates inflection. · medicine-and-health

Clinical medicine codifies the symptom-vs-root distinction directly into its therapeutic vocabulary. The canonical strata, from surface to depth:

Symptomatic (supportive, palliative) treatment — relieves the patient’s discomfort or distress without affecting the disease process. Acetaminophen for fever, oxygen for hypoxia, antitussives for cough.
Pathogenetic treatment — intervenes in the mechanism of the disease, somewhere between symptom and ultimate cause. ACE inhibitors blocking the hormonal cascade in heart failure; PPIs blocking gastric acid production in GERD.
Etiologic (causal, specific) treatment — targets the originating agent or condition. Antibiotics for the causative organism in bacterial infection; surgical correction for a hiatal hernia driving reflux; removal of a tumor causing chronic headache.

Robert Koch’s postulates in the 1880s made the symptomatic-to-etiologic shift operational. Pre-Koch, medicine was largely symptomatological: diseases were defined by their surface presentation (the word “dropsy” simply named visible swelling; “consumption” named wasting), and the treatment vocabulary stopped at managing the visible manifestation, with no recognized deeper node to descend to. Koch’s framework (identify a specific microbe, demonstrate it causes a specific syndrome, eliminate it from the patient) installed the chain-tracing discipline by providing, for the first time, an actually-fixable causal node below the symptom. “Pneumonia” went from being a description of fever-and-cough to a specific diagnosis, such as Streptococcus pneumoniae infection, with a target for intervention.

The pedagogical exemplar is bacterial pneumonia. Symptomatic-only treatment (oxygen, antipyretics, cough suppressants) makes the patient feel better while the bacteria continue to multiply: the symptoms recur, sepsis is risked, and the suppression of the cough can actively worsen the etiologic course. Antibiotics targeting the organism are the etiologic intervention; they collapse the symptom tree because they remove the cause that was generating the surface presentation. The “red flag” discipline trained into clinicians (when a chronic headache might not be tension but a tumor; when chest pain might not be costochondritis but coronary occlusion) exists precisely because symptomatic stop-points miss roots that recur with greater downstream cost.

Inference: the medical vocabulary makes explicit a depth-axis that other RCA traditions leave implicit. Symptomatic / pathogenetic / etiologic isn’t binary “symptom vs root”; it’s a graded chain, and the choice of stopping point is a clinical judgment about (a) whether a deeper node is actually fixable in this patient, (b) whether the symptom relief is itself disease-modifying (as in some chronic-pain conditions where the symptom becomes the cause of further sensitization), and (c) what time horizon the treatment is for (palliative-by-design vs. curative-by-design). This refines RCA’s “fixable origin” criterion: the root isn’t always the deepest node; it’s the deepest node where intervention is both effective and actionable, and “actionable” is itself a context-bound judgment.

Presidential Commission on the Space Shuttle Challenger Accident (Rogers Commission). *Report to the President*, June 1986. Volume 1, Chapters IV–VII. Plus Richard P. Feynman, "Personal Observations on the Reliability of the Shuttle" (Appendix F to the Rogers report, reprinted in *What Do You Care What Other People Think?*, 1988). · engineering-and-technology

The investigation of the January 28, 1986 Space Shuttle Challenger explosion is a textbook root-cause trace through a deep causal chain. The Commission opened with a lateral candidate enumeration (sabotage, external-tank failure, engine malfunction) and ruled those candidates out. Then the work shifted to a vertical descent through the surviving candidate.

The chain the main report (Volume 1) traced:

The vehicle broke up 73 seconds after liftoff. Why? A plume of hot gas escaped from the aft field joint of the right Solid Rocket Booster and burned through the external tank’s hydrogen section.
Why did the joint leak? The two O-ring seals in that joint failed to seat against the gap that opens during ignition pressure (“joint rotation”), and combustion gases blew past them.
Why did the O-rings fail to seat? The ambient temperature at launch (~36 °F, the coldest of any shuttle launch) cold-stiffened the rubber and destroyed its resilience — the material could no longer rebound fast enough to fill the rotating gap.
Why was the joint flying in a temperature regime its sealing material couldn’t handle? The joint had a “faulty design” known to be sensitive to temperature, with design concerns documented since 1977 and O-ring erosion and blow-by observed on earlier flights, and the launch decision proceeded despite Morton Thiokol engineers’ objections during the eve-of-launch teleconference.

The report sums this up in Chapter VI’s now-famous framing: “an accident rooted in history.” The vertical chain reached a fixable origin at the joint-design level, with countermeasures aimed at redesign and at the management process that had normalized the prior O-ring erosion as acceptable.

Feynman’s Appendix F (added over Chairman Rogers’ objection, after Feynman threatened to remove his name from the entire report) ran a parallel chain into the organizational substrate. The divergence in probabilistic risk estimates between engineers (~1 in 100 failure) and management (~1 in 100,000) was traced not to honest disagreement but to the management chain’s use of failure-rate figures as PR rather than as engineering reality, and to structural suppression of dissenting engineering channels. His televised ice-water demonstration on February 11, 1986, using a C-clamp and a glass of ice water to show the cold O-ring’s loss of resilience, provided the visceral proof of the technical root and bypassed engineering jargon for the public.

Inference: the Challenger investigation illustrates the parallel-chain extension of RCA: the technical chain reached the O-ring joint design; a second RCA chain descended into the organizational decision process and reached different fixable origins (risk-estimate methodology, dissent channels, schedule-vs-safety culture). When the failure spans both technical and socio-technical substrates, a single chain misses the latter; running RCA on both yields the actionable countermeasures the technical chain alone would have left untouched. This is also Diane Vaughan’s later refinement (The Challenger Launch Decision, 1996): “normalization of deviance” names the specific organizational failure mode at the root of the second chain.
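The size of the gap between the two risk estimates quoted above is easier to appreciate with a little arithmetic. The sketch below treats flights as independent with a constant per-flight loss probability, which is a simplifying assumption of mine, and uses a 25-flight horizon as an illustrative figure roughly matching the programme's length at the time; the function names are hypothetical.

```python
def p_at_least_one_loss(p_per_flight: float, flights: int) -> float:
    """Probability of one or more losses in `flights` independent flights."""
    return 1.0 - (1.0 - p_per_flight) ** flights


def years_of_daily_launches_per_loss(p_per_flight: float) -> float:
    """Expected years of once-a-day launches per loss (mean of 1/p flights)."""
    return (1.0 / p_per_flight) / 365.25


FLIGHTS = 25  # illustrative horizon, an assumption for this sketch

for label, p in [("engineers, ~1 in 100", 1 / 100),
                 ("management, ~1 in 100,000", 1 / 100_000)]:
    print(label)
    print("  P(at least one loss in", FLIGHTS, "flights):",
          round(p_at_least_one_loss(p, FLIGHTS), 5))
    print("  years of daily launches per expected loss:",
          round(years_of_daily_launches_per_loss(p), 1))
```

Under these assumptions the engineers' figure implies roughly a one-in-five chance of losing a vehicle within 25 flights, while the management figure implies centuries of daily launches per loss. A discrepancy of three orders of magnitude is hard to read as ordinary estimation noise, which is why it invited a second causal descent.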

Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (Eds.). (2016). *Site Reliability Engineering: How Google Runs Production Systems*. O'Reilly. Chapter 15, "Postmortem Culture." Plus John Allspaw, "Blameless PostMortems and a Just Culture" (Etsy Code as Craft blog, 2012) and *Etsy Debriefing Facilitation Guide* (2016); Sidney Dekker, *The Field Guide to Understanding 'Human Error'* (2nd ed., 2006). · computer-science

The software industry’s main adaptation of RCA is the blameless post-mortem, formalized at Etsy by John Allspaw in 2012 and codified across the industry by the Google SRE Book in 2016. The Google SRE Book’s Chapter 15 (“Postmortem Culture”) defines a post-mortem as a record of the incident, “the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions”, using “root cause” as the standard terminology, but specifically pluralized and paired with the discipline that “for a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual.”

Blamelessness is exactly the discipline of treating operator error as a symptom, not a stopping point. When an outage is triggered by an engineer pressing the wrong button or running the wrong command, the surface temptation is to stop the chain at “operator error” and ship a countermeasure aimed at the operator (training, reprimand, removed access). The blameless discipline forces the descent to continue: why did the UI / CLI / runbook allow that catastrophic action to be one keystroke away? Why did the staging environment fail to catch it? Why did the alerting fail to give the operator enough warning? The chain typically reaches a missing safeguard, a brittleness in the system, or a procedural drift: fixable origins whose countermeasures prevent recurrence under the next operator. Sidney Dekker’s framing (“underneath every simple, obvious story about ‘human error,’ there is a deeper, more complex story about the organization”) is the explicit articulation of treating-operator-error-as-symptom inside the SRE / safety-engineering tradition.

The framing is, however, contested. John Allspaw himself, the Resilience Engineering school (Hollnagel, Woods, Cook), and Dekker’s later work explicitly critique single-root framing for complex socio-technical systems. The critique: incidents in these systems are typically a confluence of multiple necessary-but-not-sufficient factors (bad deploy AND CI gap AND monitoring miss AND runbook error AND staffing-state, all required), no one of which collapses the symptom tree if removed alone. Allspaw’s “Etsy Debriefing Facilitation Guide” advocates for “contributing factors” and “confluence of conditions” rather than “the root cause.” This is the same critique Erik Hollnagel applies to the 5 Whys generally: the linearity trap. The community currently holds both framings in tension: “root cause” stays as the entry vocabulary that practitioners recognize, but the actual analysis pushes toward multi-factor confluence in mature post-mortem culture.

Inference: the contested status is itself part of the concept’s shape. RCA’s clean single-chain-with-a-root form works well for mechanical failures (Toyota’s machine-stop) and for system failures with a single dominant chain (Challenger’s O-ring technical chain). For complex socio-technical systems with deeply-entangled multi-factor causation, the shape category-mismatches, and the practitioner-recognized response is to broaden from “the root” to “contributing factors,” which is structurally a different operation (more like a partial causal graph than a chain). That exclusion is load-bearing for the concept; ignoring it produces false closure (“we found THE root cause” when actually one of five necessary factors was named) and a countermeasure tree that misses four-fifths of the system’s recurrence surface.
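The false-closure problem can be illustrated with a small counterfactual check on the five-factor confluence named above. The sketch assumes the incident occurs exactly when all five factors are present together, which is a modelling simplification; the identifiers (`REQUIRED`, `incident_occurs`, `counterfactual_roots`, `still_live_after_fixing`) are hypothetical.

```python
from typing import Iterable, List

# The five necessary-but-not-sufficient factors from the text.
REQUIRED = frozenset({
    "bad_deploy", "ci_gap", "monitoring_miss", "runbook_error", "staffing_state",
})


def incident_occurs(present: Iterable[str]) -> bool:
    """Assumed model: the incident needs every required factor at once."""
    return REQUIRED <= set(present)


def counterfactual_roots(present: Iterable[str]) -> List[str]:
    """Factors whose removal alone would have prevented this instance."""
    present = set(present)
    if not incident_occurs(present):
        return []
    return sorted(f for f in present if not incident_occurs(present - {f}))


def still_live_after_fixing(present: Iterable[str], fixed: str) -> List[str]:
    """Factors left in place once a single 'root cause' has been fixed."""
    return sorted(set(present) - {fixed})


print(counterfactual_roots(REQUIRED))
# ['bad_deploy', 'ci_gap', 'monitoring_miss', 'runbook_error', 'staffing_state']
print(still_live_after_fixing(REQUIRED, "bad_deploy"))
# ['ci_gap', 'monitoring_miss', 'runbook_error', 'staffing_state']
```

Every factor passes the "remove it and this incident would not have happened" test, so that test cannot single out one root; and fixing whichever factor gets named leaves four of the five in place, ready to combine with the next trigger.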
