Selection bias
Description
The structural failure mode in which the sample being analyzed does not represent the population the conclusion is meant to apply to, because the selection mechanism is correlated with the variables being studied. The bias is in the inference, not the data: the sample is what it is, and selection-bias is the move of treating the sample’s properties as the target population’s properties.
The practical test is the diagnostic question: “Who or what is missing from this sample, and why? Is the selection mechanism correlated with the variables I am studying?” The first half (who is missing) requires actively imagining the invisible non-sampled members; the second half (why) requires understanding the selection mechanism well enough to know whether it could plausibly correlate with the outcomes of interest.
Selection-bias has several common sub-shapes that share the same structural property:
- Survivorship bias: only the survivors of some process are observable; the failures are invisible. Examples: Wald’s planes; mutual-fund track records (failed funds get closed and removed from databases); successful-startup studies that omit the graveyard.
- Response bias / non-response bias: the sample consists of those who chose to respond, and their views differ systematically from those of non-respondents. Examples: voluntary online polls, opt-in surveys, voluntary product feedback.
- Self-selection: participants choose whether to be exposed to the treatment being studied, and the choice correlates with the outcome. Examples: voluntary program evaluations; gym-membership-and-health studies; observational studies of any opt-in intervention.
- Berkson’s bias / collider bias: conditioning on a common consequence of two variables creates a spurious association where none existed in the underlying population. Examples: hospital-based case-control studies; ML training on customer-support tickets; selection of high performers from any noisy evaluation.
- Attrition / loss-to-follow-up: sample members drop out during the study, and the drop-out is correlated with the outcome. Examples: longitudinal studies; clinical trials; multi-year cohort analyses.
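The survivorship sub-shape can be sketched with the mutual-fund example from the list above. Everything in this sketch is an illustrative assumption, not real fund data: the function name `simulate_funds`, the zero-skill return distribution, and the closure rule (a fund leaves the database once its running total return falls below -25%). Under those assumptions it suggests how funds with no true skill can still show a positive average return when only the survivors are observed.

```python
import random
from itertools import accumulate


def simulate_funds(n_funds=5_000, n_years=10, seed=2):
    """Compare the average return of all funds with that of surviving funds.

    Hypothetical setup: every fund has zero true skill, with yearly
    returns drawn from a normal distribution (mean 0, sd 0.15).
    """
    rng = random.Random(seed)
    all_means = []
    survivor_means = []
    for _ in range(n_funds):
        yearly = [rng.gauss(0.0, 0.15) for _ in range(n_years)]
        mean_annual = sum(yearly) / n_years
        all_means.append(mean_annual)
        # Assumed closure rule: the fund is closed (and dropped from the
        # database) if its running total return ever falls below -25%.
        if min(accumulate(yearly)) >= -0.25:
            survivor_means.append(mean_annual)
    all_mean = sum(all_means) / len(all_means)
    survivor_mean = sum(survivor_means) / len(survivor_means)
    return all_mean, survivor_mean, len(survivor_means)


if __name__ == "__main__":
    all_mean, survivor_mean, n_survivors = simulate_funds()
    print(f"all funds:      mean annual return {all_mean:+.3f}")
    print(f"survivors only: mean annual return {survivor_mean:+.3f} "
          f"({n_survivors} funds remain)")
```

The full-population average stays near zero while the survivors-only average is positive. The gap comes entirely from the closure rule, which is the selection mechanism the diagnostic question asks about.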
Triggers
User-initiated: User describes a conclusion drawn from a sample and asks whether it generalizes to a broader population, or notices that a claim rests on a non-random sample. Vocabulary cues: “selection bias,” “survivorship bias,” “self-selection,” “who is missing,” “non-response,” “representative sample,” “the planes that returned.”
Agent-initiated: Agent notices an inference being made from a sample to a broader population without acknowledgment of the selection mechanism. Candidate inference: “what is the selection mechanism here, and is it plausibly correlated with the variables being studied? Who is invisible in this dataset?”
Situation-shape signals: Studies of successful examples without paired study of failures; survey-based claims with unreported response rates; cohort analyses with unaddressed attrition; observational claims about treatments with self-selected enrollment; security-statistics claims based on detected events; historical generalizations from winners; ML training-data discussions that do not address representativeness.
Exclusions
- Random sampling with high response rate from a defined frame — when the sampling design is properly random and response is near-complete, selection-bias does not operate. The bias requires either non-random selection or differential response correlated with outcome.
- Conclusion is restricted to the sample, not generalized to a broader population — sometimes the analysis really is about the sample (this cohort, this set of customers, this incident set); generalization is not being claimed, so the inference gap selection-bias names does not exist. Diagnostic: “does the conclusion make a claim about the broader population, or only about the data in hand?”
- Census or near-census data — when the sample is the population, there is no selection gap to bias the inference. Caveat: even census data is selection-biased toward the time it was taken (excludes the dead and the unborn).
- Selection correlated with covariates but not with the outcome conditional on the covariates — when the selection mechanism is associated with measured variables but not with the outcome given those variables, statistical adjustment can resolve the bias. The phenomenon still exists but does not bias the conclusion after appropriate weighting/conditioning.
- Randomized post-selection — when within a selected sample, treatment is randomly assigned, the within-sample treatment-effect estimate is unbiased; generalization to the broader population may still be biased, but the internal-validity inference is clean.
- Differences in measurement quality rather than sample composition — measurement bias is a different failure mode (the data is wrong) whereas selection-bias is about who is in the sample. Often co-occur in practice but conceptually distinct.
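The exclusion for selection that depends only on measured covariates can be sketched with inverse-probability weighting. This is a minimal illustration under stated assumptions, not a recommended estimator: the groups, outcome rates, and response rates are hypothetical, and the selection probabilities are treated as known, whereas in practice they would usually have to be estimated. Selection here depends on the group only, never on the outcome within a group.

```python
import random


def simulate_weighting(n=200_000, seed=1):
    """Return (true_mean, naive_mean, weighted_mean) for a binary outcome."""
    rng = random.Random(seed)
    # Hypothetical rates: the outcome and the chance of responding both
    # depend on the group, but responding does not depend on the outcome
    # once the group is known.
    outcome_rate = {"young": 0.30, "old": 0.70}
    response_rate = {"young": 0.60, "old": 0.10}

    population = []
    sample = []
    for _ in range(n):
        group = "young" if rng.random() < 0.5 else "old"
        y = 1 if rng.random() < outcome_rate[group] else 0
        population.append((group, y))
        if rng.random() < response_rate[group]:
            sample.append((group, y))

    true_mean = sum(y for _, y in population) / len(population)
    naive_mean = sum(y for _, y in sample) / len(sample)

    # Inverse-probability weighting: each sampled unit counts for
    # 1 / P(selected | group) population members.
    weights = [1.0 / response_rate[group] for group, _ in sample]
    weighted_mean = (
        sum(w * y for w, (_, y) in zip(weights, sample)) / sum(weights)
    )
    return true_mean, naive_mean, weighted_mean


if __name__ == "__main__":
    true_mean, naive_mean, weighted_mean = simulate_weighting()
    print(f"population mean: {true_mean:.3f}")
    print(f"naive sample:    {naive_mean:.3f}")
    print(f"weighted sample: {weighted_mean:.3f}")
```

With these assumed rates the naive sample mean falls well below the population mean because the low-outcome group responds more often, while the weighted mean lands close to it. If response also depended on the outcome within a group, the same weights would no longer remove the bias.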
Structure
Relationships
- collider-bias — collider-bias is the specific mechanism (conditioning on a common effect) behind one family of selection bias; selection-bias is the general sample-not-equal-population family.
- confounding — sibling causal-inference failure mode; together they cover the two big families of non-causal observational associations. Confounding works through common parents; selection-bias through conditioning on common children. Both require causal-structural analysis to detect.
- simpsons-paradox — selection-into-stratum is one of the most common drivers. The Berkeley case, the cohort-composition drift in product analytics — each is selection-bias producing the paradox.
- wisdom-of-crowds — load-bearing failure mode. Wisdom-of-crowds requires representative sampling of the underlying population; selection-bias is the corruption of that assumption.
- doctrine — survey methodology, clinical-trial design, epidemiological case-control matching, ML evaluation set discipline. Each is structural counter-pressure against selection-bias in its domain.
- red-herring — both produce wrong conclusions about what is load-bearing; the correctives differ. Red-herring is external; selection-bias is structural. Distinguishing them is part of analytic diagnosis.
- cargo-cult — selection-bias is one of the engines that lets cargo-cult survive: studying only the survivors of a strategy and inferring the strategy caused the survival.
- reframe — the corrective often requires reframing the question from “what is true in this sample?” to “what is the selection mechanism here, and how does the answer change once we account for it?” The reframe surfaces the inference gap that the naïve analysis ignored.
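The collider-bias and confounding entries above describe selection as conditioning on a common child. A small simulation can make that concrete in the spirit of the hospital-based case. All numbers and names are assumptions chosen for illustration: two conditions `a` and `b` occur independently, and each raises the (hypothetical) probability of hospital admission.

```python
import random


def correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
    return cov / (var_x * var_y) ** 0.5


def simulate_berkson(n=100_000, seed=0):
    """Return (population_correlation, hospitalized_correlation)."""
    rng = random.Random(seed)
    population = []
    hospitalized = []
    for _ in range(n):
        # Two conditions drawn independently, each with 10% prevalence.
        a = 1 if rng.random() < 0.10 else 0
        b = 1 if rng.random() < 0.10 else 0
        population.append((a, b))
        # Admission is the collider: a common consequence of a and b.
        p_admit = 0.02 + 0.40 * a + 0.40 * b
        if rng.random() < p_admit:
            hospitalized.append((a, b))
    return correlation(population), correlation(hospitalized)


if __name__ == "__main__":
    pop_corr, hosp_corr = simulate_berkson()
    print(f"correlation in population:       {pop_corr:+.3f}")
    print(f"correlation among hospitalized:  {hosp_corr:+.3f}")
```

In the full population the two conditions are essentially uncorrelated; among admitted patients they appear negatively associated, even though neither affects the other. The association is produced by restricting the analysis to the selected group.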
Examples
Wald's wartime aircraft-armor analysis (1943) · statistics
Successful-startup / "Good to Great" studies · business
Berkeley graduate admissions (1973) · statistics
Berkson, J. (1946). "Limitations of the application of fourfold table analysis to hospital data." *Biometrics Bulletin* 2(3) 47-53 — the canonical statement of what is now called Berkson's bias / collider stratification bias. · medicine-and-health
Clinical-trial attrition · medicine-and-health
Customer-support ticket data as ML training set · computer-science
Ellenberg, J. (2014). *How Not to Be Wrong: The Power of Mathematical Thinking* — Wald case popularization. · statistics
Evolutionary biology: only surviving lineages observable · biology
Heckman, J. J. (1979). "Sample selection bias as a specification error." *Econometrica*, 47(1), 153-161 — econometric correction (Heckman selection model). · statistics
Hernán, M. A., Hernández-Díaz, S., & Robins, J. M. (2004). "A structural approach to selection bias." *Epidemiology*, 15 · medicine-and-health
Mutual-fund track-record databases · economics
Rosenzweig, P. (2007). *The Halo Effect: ... and the Eight Other Business Delusions That Deceive Managers* — business-st · business
Security incident statistics · computer-science
Voluntary online polls · statistics
Wald, A. (1943, declassified 1980s). Statistical Research Group memorandum on aircraft armor — the canonical survivorship-bias case. · statistics