medicine-and-health statistics

Simpson’s paradox

Description

A statistical phenomenon in which an aggregate trend across a population reverses direction when the population is stratified by a third variable. The same numeric data supports opposite conclusions depending on the aggregation level: “Treatment A beats Treatment B in the overall population” can coexist with “Treatment B beats Treatment A in every subgroup.” The arithmetic is consistent and the reversal is real; the paradox lies in the interpretation, not the data.

The first move is the diagnostic question: “is there a third variable correlated with both the exposure and the outcome that might reverse the relationship when conditioned on?” The second move, which the arithmetic alone cannot answer, is “which view is the right answer to the question I am actually asking?” That requires the causal diagram. If the stratifier is a confounder (a common cause of exposure and outcome), the stratified view is correct and the aggregate is misleading. If the stratifier is a mediator or downstream collider, the aggregate may be the correct answer and stratification introduces a different bias.

This is the load-bearing subtlety: Simpsons-paradox is not “always trust the stratified view.” It is “the data alone cannot tell you which view to trust; you have to supply the causal story.” Pearl’s causal-inference framework gives the explicit tools (do-calculus, backdoor criterion); the structural primitive in this catalog is the general shape that motivates needing those tools.

Triggers

User-initiated: User describes a pattern where the conclusion changes depending on how the data is aggregated, or notices a comparison whose direction reverses when stratified. Vocabulary cues: “Simpson’s paradox,” “aggregate reverses,” “pooled vs stratified,” “lurking variable,” “Berkeley admissions,” “kidney stone.”

Agent-initiated: Agent notices an aggregate comparison being drawn without acknowledgment of stratification effects, or a stratified comparison being used to override an aggregate without acknowledgment of causal structure. Candidate inference: “what is the causal diagram here? Is there a third variable that could reverse this when conditioned on, and is the question being asked best answered by the aggregate or the stratified view?”

Situation-shape signals: Disagreements over policy effects where one side cites aggregate data and the other cites subgroup data; meta-analyses that combine studies across populations; A/B test interpretations that mix new and returning users; fairness audits on aggregate ML metrics; clinical trial reports that emphasize overall efficacy without subgroup detail. The signal is strongest when the question being answered (causal claim, descriptive claim, predictive claim) is left implicit.

Exclusions

  • No reversal under stratification — when the aggregate and stratified views agree in direction, there is no paradox; the situation might still have confounding worth conditioning on, but the failure-of-interpretation that Simpsons-paradox names is absent.
  • Stratifier is a mediator, not a confounder — if the third variable lies on the causal pathway from exposure to outcome, conditioning on it removes part of the effect being measured. The stratified view is wrong in this case; the aggregate is correct. The arithmetic looks identical to the confounder case; the causal diagram is the difference.
  • Stratifier is a collider — conditioning on a common consequence of exposure and outcome introduces selection bias and can produce a spurious reversal. Not Simpsons-paradox proper, but commonly mistaken for it; the corrective is the same (draw the causal diagram).
  • Sparse strata where the stratified view is just noisy — within-stratum estimates can disagree with the aggregate purely from sampling variance, not from any causal-structure reversal. The paradox requires the stratified pattern to be robust, not an artifact of underpowered subgroup estimates.
  • Question genuinely is the aggregate — sometimes the policy question is “what is the population-level effect?” rather than “what is the individual-level effect?” In those cases the aggregate view is the right answer; using subgroup detail to override it imports a different question than the one being asked.
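The collider exclusion above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model of any real dataset: the exposure and outcome are generated as independent standard normal variables, and the selection rule (keep a record only when exposure plus outcome exceeds 1.0) is a hypothetical stand-in for conditioning on a common consequence of both. The names `exposure`, `outcome`, `selected`, and `pearson` are illustrative.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: exposure and outcome are causally unrelated by construction.
exposure = [rng.gauss(0, 1) for _ in range(20_000)]
outcome = [rng.gauss(0, 1) for _ in range(20_000)]

# Hypothetical collider: a record is kept only if exposure + outcome is large.
selected = [(x, y) for x, y in zip(exposure, outcome) if x + y > 1.0]

corr_all = pearson(exposure, outcome)
corr_selected = pearson(*zip(*selected))

print(f"correlation in full population:  {corr_all:+.3f}")
print(f"correlation within selected set: {corr_selected:+.3f}")
```

In the full population the correlation should sit near zero, while inside the selected subset it turns clearly negative. No causal link was added; the association comes entirely from conditioning on the collider, which is why the list above treats this case separately from confounding.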

Structure

Internal structure of simpsons-paradox: a table of its component slots and the concepts that fill them.

Relationships

Relationship neighborhood of simpsons-paradox: a graph of the concepts it connects to and the concepts it is a part of.
  • confounding — mechanism-level partner. Confounding is what produces the reversal; Simpsons-paradox is the manifest reversal. Reading them together gives the structural shape (confounding) plus the empirical signature (paradox).
  • selection-bias — common driver. Berkeley admissions is selection-into-stratum producing the paradox; cohort-composition drift in product analytics is selection-by-time producing the paradox. The pair captures that “selection into stratum” is one of the most reliable real-world generators.
  • wisdom-of-crowds — naïve aggregation can be wrong, not just noisy. Wisdom-of-crowds requires the right independence/exchangeability conditions; Simpsons-paradox is the failure mode when those conditions break.
  • doctrine — causal-inference doctrines (DAGs, do-calculus, “draw the diagram before pooling,” potential-outcomes framework) exist as structural counter-pressure. Each installs the question the arithmetic cannot answer.
  • red-herring — the apparent main effect can be a red-herring at the aggregate level (the aggregate looks load-bearing for the question being asked, but the stratified view is the real answer). The contrast clarifies that internal data structure can mislead as effectively as external misdirection.
  • reframe — the resolution often requires reframing the question: “which treatment is better overall?” vs “which treatment is better for this patient?” produce different right answers from the same data. Simpsons-paradox forces the reframe to be made explicit.

Examples

Berkeley graduate admissions (Bickel, Hammel & O'Connell 1975) · statistics

Aggregate data showed men admitted at higher rates than women; stratified by department, women were admitted at slightly higher rates in many departments. The reversal arose because women applied disproportionately to highly selective departments. The canonical real-world demonstration; led to the dismissal of the apparent admissions-bias allegation.

Kidney stone treatment study (Charig et al. 1986) · medicine-and-health

Treatment B showed higher overall success rate than Treatment A; within each subgroup (small stones, large stones), Treatment A showed higher success rate. Doctors had assigned the harder cases to Treatment A. Frequently used as the medical-decision-making teaching case.
Player A can have a higher batting average than Player B in every individual season while having a lower career batting average, because the at-bat distributions across seasons differ. Sabermetrics literature treats this as foundational.
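The batting-average reversal is easy to reproduce with made-up numbers. The sketch below uses hypothetical hit and at-bat counts (not real players or seasons) chosen so that the at-bat distribution differs sharply between the two players; `seasons`, `average`, and `career_average` are illustrative names.

```python
# Hypothetical (hits, at_bats) per season; these are not real statistics.
seasons = {
    "Player A": {"season 1": (4, 10), "season 2": (25, 100)},
    "Player B": {"season 1": (35, 100), "season 2": (2, 10)},
}


def average(hits, at_bats):
    """Batting average for one (hits, at_bats) pair."""
    return hits / at_bats


def career_average(player):
    """Pool hits and at-bats across seasons, then divide once."""
    total_hits = sum(hits for hits, _ in seasons[player].values())
    total_at_bats = sum(at_bats for _, at_bats in seasons[player].values())
    return average(total_hits, total_at_bats)


for player, by_season in seasons.items():
    per_season = ", ".join(
        f"{name}: {average(*pair):.3f}" for name, pair in by_season.items()
    )
    print(f"{player}: {per_season}, career: {career_average(player):.3f}")
```

With these counts Player A leads in both seasons (.400 vs .350, then .250 vs .200) yet trails over the career (29/110 vs 37/110), because most of Player A's at-bats fall in the weaker season and most of Player B's fall in the stronger one.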
Bickel, Hammel, and O’Connell’s 1975 Science paper analyzed UC Berkeley’s 1973 graduate-admissions data, which showed at the university level a substantially higher admit rate for male applicants than for female applicants, a pattern that looked, on its face, like discrimination. When the authors stratified by department, the university-level disparity disappeared and in many individual departments women’s admit rates were higher than men’s. The aggregate-vs-stratified reversal was driven by an uneven distribution of applicants: women applied disproportionately to departments with low admit rates overall (humanities programs), and men disproportionately to departments with high admit rates (sciences and engineering). The aggregate admit rate by gender therefore mixed in program-level admit rates, because the choice of program varied with applicant gender.
The case became the canonical real-world illustration of Simpson’s paradox because the conclusion that follows depends entirely on which causal interpretation one assigns to the stratifier. If departmental selectivity is the mechanism through which gender produces the disparity (for instance, because earlier socialization channels women into less-funded fields), then the aggregate is the relevant view. If selectivity is a property of departments independent of who applies, then the stratified view exonerates each department individually. The arithmetic is identical; the verdict turns on causal structure.
Inference: When an aggregate disparity reverses under stratification, the analytic move is to ask whether the stratifier is a mediator (the mechanism through which the cause operates), a confounder (a separate cause influencing both exposure and outcome), or a collider (a common effect of both). The choice determines which view is interpretable as the causal estimate; statistics alone cannot decide.
The Charig et al. (1986) kidney-stone study is the canonical medical instance of the paradox because the underlying counts are documented and the lurking variable has an obvious clinical reason to exist. Comparing open surgery (Treatment A) against percutaneous nephrolithotomy (PNL, Treatment B), the overall success rates favored PNL: 78% (273/350) for open surgery versus 83% (289/350) for PNL. Stratified by stone size, the direction reverses in both strata: for small stones (<2 cm), open surgery succeeds 93% (81/87) against PNL’s 87% (234/270); for large stones, open surgery succeeds 73% (192/263) against PNL’s 69% (55/80). Open surgery is better on small stones and better on large stones, yet worse overall.
The reversal is structural, not arithmetic noise: it is driven by a confound with a clear causal story. Surgeons assigned the harder cases (large stones, 263 of open surgery’s 350) to the more invasive open procedure, and the easier cases (small stones, 270 of PNL’s 350) to the less invasive PNL. PNL’s aggregate average was inflated by being tested mostly on easy cases; open surgery’s was depressed by being tested mostly on hard ones. The third variable, stone size (a proxy for case severity), is a genuine confounder: it influenced which treatment was assigned and it independently affects the outcome, so it is a common cause of both rather than a step on the causal path from treatment to outcome. The stratified view is therefore the one that answers the clinical question “which treatment should I choose for a stone of this size?”
Inference: When an aggregate comparison drives a real decision (which treatment, which vendor, which policy), check whether assignment to the compared groups was correlated with case difficulty. If the better-looking group got the easier cases, the aggregate is measuring the case mix, not the treatment, and the correct comparison is within strata of the severity variable, provided that variable is a confounder (a common cause of assignment and outcome) and not a mediator on the causal pathway.
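The reversal can be checked directly from the counts quoted above. The following is a minimal sketch that recomputes the per-stratum and pooled success rates; the data layout and the names `counts`, `rate`, and `pooled` are illustrative choices, not part of the original study.

```python
# (successes, patients) per treatment and stone size, as quoted in the text.
counts = {
    "open surgery": {"small": (81, 87), "large": (192, 263)},
    "PNL": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, patients):
    """Success rate for one (successes, patients) pair."""
    return successes / patients


def pooled(strata):
    """Add up successes and patients across stone sizes for one treatment."""
    successes = sum(s for s, _ in strata.values())
    patients = sum(n for _, n in strata.values())
    return successes, patients


for treatment, strata in counts.items():
    per_stratum = ", ".join(
        f"{size}: {rate(*pair):.0%}" for size, pair in strata.items()
    )
    s, n = pooled(strata)
    print(f"{treatment:12s} {per_stratum}, overall: {rate(s, n):.0%} ({s}/{n})")

# Which treatment has the higher rate within each stratum, and overall?
stratum_winner = {
    size: max(counts, key=lambda t: rate(*counts[t][size]))
    for size in ("small", "large")
}
overall_winner = max(counts, key=lambda t: rate(*pooled(counts[t])))
print("winner by stratum:", stratum_winner)
print("winner overall:   ", overall_winner)
```

Open surgery wins in both strata while PNL wins on the pooled counts. The pooled figures are not wrong; they weight each treatment's strata by how many patients of each stone size that treatment happened to receive.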
Early in the COVID-19 pandemic, aggregate national rates reversed when conditioned on age structure (countries with older populations showed worse aggregate outcomes; age-adjusted, the picture shifted substantially). Demographic stratification is the load-bearing move.
District-level achievement comparisons frequently reverse direction when stratified by demographic, income, or school-type variables. Policy debates rooted in aggregate data are vulnerable to this exact pattern.
Model accuracy can be high overall while being systematically lower for specific subgroups, or accuracy can show a reverse pattern when conditioned on certain protected attributes. The fairness literature explicitly engages the Simpsons-paradox structure.
Judea Pearl’s Causality (2nd edition, 2009) is the standard reference for the modern causal-inference framework that puts Simpson’s paradox in its proper place: as a statement about which conditional distributions one happens to compute, not as a statement about reality. Pearl’s central move was to formalize the difference between observing a value of a variable and intervening to set it. Once that distinction is made precise via the do-calculus, the paradox dissolves: the apparently contradictory aggregate and stratified estimates are answering different causal questions, and only one of them corresponds to the question being asked in any given application.
The framework provides the constructive remedy: draw the causal DAG, identify the role of the stratifying variable (confounder, mediator, or collider), and use the do-calculus rules to identify which set of variables must be conditioned on to recover the causal estimate. The arithmetic that produces the paradoxical reversal is the same in all cases; the interpretation depends entirely on the causal structure the analyst posits.
Inference: Any analysis that confronts Simpson’s-paradox-style reversals needs to make causal assumptions explicit. Refusing to draw a DAG does not avoid making assumptions; it merely hides them. The discipline of explicit causal modeling is what converts “which view do I trust?” into a question with a defensible answer, at the cost of forcing the analyst to commit to a model of how the variables actually relate.
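As a concrete instance of conditioning on the right set of variables, the sketch below applies the backdoor adjustment formula, P(success | do(T)) = Σ_z P(success | T, z) · P(z), to the Charig et al. kidney-stone counts quoted elsewhere in this entry. It rests on a strong assumption that should be read as part of the example: stone size is treated as the only confounder of treatment and outcome. The names `counts`, `p_size`, `naive_rate`, and `adjusted_rate` are illustrative.

```python
# (successes, patients) per treatment and stone size (Charig et al. 1986 counts).
counts = {
    "open surgery": {"small": (81, 87), "large": (192, 263)},
    "PNL": {"small": (234, 270), "large": (55, 80)},
}
sizes = ("small", "large")

# Marginal distribution of stone size over all patients, ignoring treatment.
total_patients = sum(n for strata in counts.values() for _, n in strata.values())
p_size = {
    z: sum(counts[t][z][1] for t in counts) / total_patients for z in sizes
}


def naive_rate(treatment):
    """Pooled success rate: strata weighted by this treatment's own case mix."""
    successes = sum(counts[treatment][z][0] for z in sizes)
    patients = sum(counts[treatment][z][1] for z in sizes)
    return successes / patients


def adjusted_rate(treatment):
    """Backdoor adjustment, ASSUMING stone size is the only confounder:
    stratum-specific rates weighted by the population-wide size distribution."""
    return sum(
        counts[treatment][z][0] / counts[treatment][z][1] * p_size[z]
        for z in sizes
    )


for t in counts:
    print(f"{t:12s} naive: {naive_rate(t):.3f}  adjusted: {adjusted_rate(t):.3f}")
```

Under that assumption the adjusted rates come out at roughly 0.83 for open surgery and 0.78 for PNL, the opposite ordering from the naive pooled rates. If stone size were instead a mediator or a collider, the same arithmetic would run but the adjusted number would no longer be the causal estimate.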
Pearl’s 2014 American Statistician comment “Understanding Simpson’s Paradox” directly took on the rhetorical move made by statisticians who treat the paradox as a purely arithmetic curiosity (“see, you have to be careful when you aggregate!”) and argued that the standard pedagogical framing obscures the real lesson: the paradox is symptomatic of the larger fact that statistical reasoning alone cannot answer causal questions. Two datasets with identical contingency tables can demand opposite conclusions depending on the causal structure that generated them, and no purely statistical procedure can recover the correct conclusion without external causal assumptions.
The piece sharpens the methodological point Pearl made at length in Causality: the question “should I report the aggregate or the stratified estimate?” is not answerable from the data alone. The answer is forced by the causal role of the stratifier, which is a fact about the world and not a fact about the dataset. Refusing this insight by trying to find a “statistical” criterion for which view to trust leads to confidently wrong conclusions across many real-world cases.
Inference: Treating Simpson’s paradox as a teaching example of “be careful with averages” is a pedagogical anti-pattern, because it obscures the structural lesson. The diagnostic value of the paradox is that it surfaces the necessity of explicit causal reasoning. Any data-science curriculum should treat the paradox as the gateway to causal-inference literacy, not as a cautionary tale to be memorized in isolation.
Overall engagement metrics can increase quarter-over-quarter while every individual cohort’s engagement is decreasing, which is explained by changing cohort composition (newer cohorts being larger and more engaged at start). Common in growth-product analytics; the corrective is cohort-conditioned analysis.
E. H. Simpson’s 1951 Journal of the Royal Statistical Society paper “The interpretation of interaction in contingency tables” is the source of the phenomenon’s now-canonical name and treatment. Simpson constructed a small numerical example showing that the within-stratum association between two attributes could be of one sign while the aggregate association across strata was of the opposite sign, and laid out the conditions under which this reversal can happen: specifically, when the stratifying variable is unevenly distributed across the categories being compared.
The paper’s contribution was less the construction (Yule had observed a related phenomenon in 1903) than the precision with which Simpson framed the analytic question: when do we have reason to trust the aggregate, and when do we have reason to trust the stratified view? Simpson’s answer was that the question is unanswerable on purely arithmetic grounds; the analyst has to know how the data was generated and what relationship is being measured. The clarity of this framing established the paradox as a methodological touchstone rather than a numerical curiosity.
Inference: Simpson’s original framing is worth reading directly because it makes the load-bearing point (the arithmetic does not decide) without the later causal-DAG vocabulary. For audiences who balk at causal-inference formalism, the 1951 paper provides the same conclusion through pure statistical reasoning: when the stratification is uneven and the within-stratum effects are real, the aggregate is a different quantity than the average within-stratum effect.
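The statement that the aggregate is a different quantity than the average within-stratum effect can be written out with the law of total probability. The notation here is an assumption made for illustration (X is the attribute being compared, Y the outcome, Z the stratifier) and is not Simpson's own.

```latex
% Aggregate rate: stratum-specific rates weighted by the stratum mix WITHIN group x
P(Y = 1 \mid X = x) \;=\; \sum_{z} P(Y = 1 \mid X = x, Z = z)\, P(Z = z \mid X = x)

% Standardized rate: the same stratum-specific rates weighted by a COMMON stratum mix
\tilde{P}_x \;=\; \sum_{z} P(Y = 1 \mid X = x, Z = z)\, P(Z = z)
```

The two expressions share the same stratum-specific rates and differ only in their weights. When Z is unevenly distributed across the categories of X, the weights P(Z = z | X = x) change with x, so the aggregate comparison can point the opposite way from every within-stratum comparison. Which weighting answers the question at hand is the part the arithmetic does not decide.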
The average treatment effect can mask subgroup effects of opposite sign; the literature on heterogeneous treatment effects, conditional average treatment effects (CATE), and individual treatment effects (ITE) is built around recognizing and characterizing this.
George Udny Yule’s 1903 Biometrika paper “Notes on the theory of association of attributes in statistics” observed the same arithmetic phenomenon Simpson would later popularize: that the partial associations between two attributes, when pooled across a third attribute’s categories, can yield an overall association of opposite sign. Yule developed the result in the context of his ongoing methodological program for analyzing categorical data (the same program that produced the Yule’s Q and Yule’s Y association measures).
The paper is historically important because it shows that the structural phenomenon was recognized in the statistical literature half a century before Simpson’s paper, but never crystallized into the canonical pedagogical example. Two factors explain this: Yule’s work was embedded in a denser methodological apparatus that made the implications harder to extract, and the causal-inference framework that gives the phenomenon its sharp interpretation did not yet exist. The reversal was a curiosity until causal reasoning gave it operational consequences.
Inference: Statistical phenomena often wait for the right interpretive framework before they can be taken seriously by practitioners. Yule’s 1903 observation was technically complete but not actionable; Simpson’s 1951 treatment made it more pedagogically tractable; Pearl’s 2009 causal-DAG framework made it generative. When a researcher reports a “well-known but not-acted-upon” issue, the limiting factor is often the conceptual framework, not the observation itself.