Simpson's paradox
Description
A statistical phenomenon in which an aggregate trend across a population reverses direction when the population is stratified by a third variable. The same numeric data supports opposite conclusions depending on the aggregation level: “Treatment A beats Treatment B in the overall population” coexists with “Treatment B beats Treatment A in every subgroup.” The arithmetic is consistent and the reversal is real; the paradox is in the interpretation, not the data.

The first move is the diagnostic question: “is there a third variable correlated with both the exposure and the outcome that might reverse the relationship when conditioned on?” The second move, which the arithmetic alone cannot answer, is “which view is the right answer to the question I am actually asking?” That requires the causal diagram. If the stratifier is a confounder (a common cause of exposure and outcome), the stratified view is correct and the aggregate is misleading. If the stratifier is a mediator or downstream collider, the aggregate may be the correct answer and stratification introduces a different bias.

This is the load-bearing subtlety: Simpsons-paradox is not “always trust the stratified view.” It is “the data alone cannot tell you which view to trust; you have to supply the causal story.” Pearl’s causal-inference framework gives the explicit tools (do-calculus, backdoor criterion); the structural primitive in this catalog is the general shape that motivates needing those tools.

Triggers
- User-initiated: User describes a pattern where the conclusion changes depending on how the data is aggregated, or notices a comparison whose direction reverses when stratified. Vocabulary cues: “Simpson’s paradox,” “aggregate reverses,” “pooled vs stratified,” “lurking variable,” “Berkeley admissions,” “kidney stone.”
- Agent-initiated: Agent notices an aggregate comparison being drawn without acknowledgment of stratification effects, or a stratified comparison being used to override an aggregate without acknowledgment of causal structure. Candidate inference: “what is the causal diagram here? Is there a third variable that could reverse this when conditioned on, and is the question being asked best answered by the aggregate or the stratified view?”
- Situation-shape signals: Disagreements over policy effects where one side cites aggregate data and the other cites subgroup data; meta-analyses that combine studies across populations; A/B test interpretations that mix new and returning users; fairness audits on aggregate ML metrics; clinical trial reports that emphasize overall efficacy without subgroup detail. The signal is strongest when the question being answered (causal claim, descriptive claim, predictive claim) is left implicit.
Exclusions
- No reversal under stratification — when the aggregate and stratified views agree in direction, there is no paradox; the situation might still have confounding worth conditioning on, but the failure-of-interpretation that Simpsons-paradox names is absent.
- Stratifier is a mediator, not a confounder — if the third variable lies on the causal pathway from exposure to outcome, conditioning on it removes part of the effect being measured. The stratified view is wrong in this case; the aggregate is correct. The arithmetic looks identical to the confounder case; the causal diagram is the difference.
- Stratifier is a collider — conditioning on a common consequence of exposure and outcome introduces selection bias and can produce a spurious reversal. Not Simpsons-paradox proper, but commonly mistaken for it; the corrective is the same (draw the causal diagram).
- Sparse strata where the stratified view is just noisy — within-stratum estimates can disagree with the aggregate purely from sampling variance, not from any causal-structure reversal. The paradox requires the stratified pattern to be robust, not an artifact of underpowered subgroup estimates.
- Question genuinely is the aggregate — sometimes the policy question is “what is the population-level effect?” rather than “what is the individual-level effect?” In those cases the aggregate view is the right answer; using subgroup detail to override it imports a different question than the one being asked.
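A minimal sketch of the check these exclusions presuppose: do the pooled comparison and every stratum actually disagree in direction? It uses the success counts commonly reported from the kidney stone study (Charig et al. 1986). The names `COUNTS`, `pooled_rate`, and `reversal_report` are illustrative and not part of this catalog. The check is purely arithmetic: it ignores sampling noise and causal structure, so a positive result only flags a candidate for the causal-diagram question.

```python
from fractions import Fraction

# (successes, patients) by treatment and stone size, as commonly
# reported from Charig et al. (1986). A = open surgery, B = PCNL.
COUNTS = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def pooled_rate(strata):
    """Success rate after summing counts over all strata."""
    successes = sum(s for s, _ in strata.values())
    patients = sum(n for _, n in strata.values())
    return Fraction(successes, patients)


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def reversal_report(counts, a="A", b="B"):
    """Compare a vs b pooled and within each stratum.

    Returns (pooled_diff, stratum_diffs, reverses). `reverses` is True
    only when every stratum favours the same arm and the pooled
    comparison favours the other arm.
    """
    pooled_diff = pooled_rate(counts[a]) - pooled_rate(counts[b])
    stratum_diffs = {
        z: Fraction(*counts[a][z]) - Fraction(*counts[b][z])
        for z in counts[a]
    }
    signs = {sign(d) for d in stratum_diffs.values()}
    unanimous = len(signs) == 1 and 0 not in signs
    reverses = unanimous and sign(pooled_diff) == -next(iter(signs))
    return pooled_diff, stratum_diffs, reverses


if __name__ == "__main__":
    pooled, by_stratum, reverses = reversal_report(COUNTS)
    print(f"pooled A - B: {float(pooled):+.3f}")
    for z, d in by_stratum.items():
        print(f"{z:>5} A - B: {float(d):+.3f}")
    print("reversal:", reverses)
```

With these counts the pooled difference is negative (B looks better) while both stratum differences are positive (A looks better), so the reversal flag is set. Which of the two comparisons answers the question being asked still depends on the exclusions listed above.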
Structure
Relationships
- confounding — mechanism-level partner. Confounding is what produces the reversal; Simpsons-paradox is the manifest reversal. Reading them together gives the structural shape (confounding) plus the empirical signature (paradox).
- selection-bias — common driver. Berkeley admissions is selection-into-stratum producing the paradox; cohort-composition drift in product analytics is selection-by-time producing the paradox. The pair captures that “selection into stratum” is one of the most reliable real-world generators.
- wisdom-of-crowds — naïve aggregation can be wrong, not just noisy. Wisdom-of-crowds requires the right independence/exchangeability conditions; Simpsons-paradox is the failure mode when those conditions break.
- doctrine — causal-inference doctrines (DAGs, do-calculus, “draw the diagram before pooling,” potential-outcomes framework) exist as structural counter-pressure. Each installs the question the arithmetic cannot answer.
- red-herring — the apparent main effect can be a red-herring at the aggregate level (the aggregate looks load-bearing for the question being asked, but the stratified view is the real answer). The contrast clarifies that internal data structure can mislead as effectively as external misdirection.
- reframe — the resolution often requires reframing the question: “which treatment is better overall?” vs “which treatment is better for this patient?” produce different right answers from the same data. Simpsons-paradox forces the reframe to be made explicit.
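For the confounder case named in the confounding and doctrine entries, one way to make the adjusted comparison concrete is backdoor adjustment by standardization: weight each stratum-specific success rate by that stratum's share of the whole population. The sketch below assumes stone size is a confounder in the kidney stone data, and the only one. That assumption comes from a causal story, not from the counts. The counts are those commonly reported from Charig et al. (1986); the function names are illustrative.

```python
from fractions import Fraction

# (successes, patients) by treatment and stone size, as commonly
# reported from Charig et al. (1986). A = open surgery, B = PCNL.
COUNTS = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def stratum_weights(counts):
    """P(z): each stratum's share of all patients, ignoring treatment."""
    totals = {}
    for strata in counts.values():
        for z, (_, patients) in strata.items():
            totals[z] = totals.get(z, 0) + patients
    grand_total = sum(totals.values())
    return {z: Fraction(n, grand_total) for z, n in totals.items()}


def adjusted_rate(counts, treatment):
    """Standardized rate: sum over z of P(success | treatment, z) * P(z).

    Equals the backdoor-adjusted success rate only if the stratifier
    is a sufficient adjustment set (assumed here, not verified).
    """
    weights = stratum_weights(counts)
    return sum(
        Fraction(successes, patients) * weights[z]
        for z, (successes, patients) in counts[treatment].items()
    )


if __name__ == "__main__":
    for t in ("A", "B"):
        print(f"adjusted success rate, treatment {t}: "
              f"{float(adjusted_rate(COUNTS, t)):.3f}")
```

Under that assumption the standardized rate for A exceeds the one for B, matching the within-stratum direction. If the stratifier were instead a mediator or collider, the same arithmetic would run without error and produce a biased number, which is why the diagram has to come first.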
Examples
Berkeley graduate admissions (Bickel, Hammel & O'Connell 1975) · statistics
Kidney stone treatment study (Charig et al. 1986) · medicine-and-health
Batting averages across seasons · statistics
Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). "Sex bias in graduate admissions: Data from Berkeley." *Science · statistics
Charig, C. R., Webb, D. R., Payne, S. R., & Wickham, J. E. (1986). "Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy." *British Medical Journal*, 292(6524), 879-882. · medicine-and-health
COVID case-fatality rates across countries · statistics
Education outcomes pooled across districts · statistics
ML fairness across demographic groups · statistics
Pearl, J. (2009). *Causality: Models, Reasoning, and Inference* (2nd ed.) — the modern causal-inference treatment. · statistics
Pearl, J. (2014). "Comment: Understanding Simpson's paradox." *American Statistician*, 68(1), 8-13 — explicit treatment from the causal-DAG perspective. · statistics
Product cohort analytics · statistics
Simpson, E. H. (1951), "The interpretation of interaction in contingency tables," Journal of the Royal Statistical Society Series B 13(2) — the founding paper. · statistics
Treatment effect heterogeneity in clinical trials · medicine-and-health
Yule, G. U. (1903). "Notes on the theory of association of attributes in statistics." *Biometrika*, 2(2), 121-134 — earl · statistics