computer-science engineering-and-technology psychology

Calibration

Description

Calibration is the operation that makes an instrument’s or a judgment’s outputs trustworthy at face value by aligning them against a reference standard taken to be true. A measuring device produces raw indications; calibration establishes the relation between those indications and the true quantity values supplied by a standard, and quantifies the residual uncertainty that remains. The same shape appears wherever outputs must be trusted: a classifier’s predicted probabilities are calibrated when a “0.7” really does come true 70% of the time, and a forecaster is calibrated when their stated confidences match realized event frequencies over many trials. The diagnostic question — “against what trusted reference is this output aligned, and how big is the residual error?” — separates a calibrated source (trust the number) from an uncalibrated one (the number needs correction first). The reference is always external: internal consistency (a device that repeats the same reading) is precision, not calibration. A precise-but-uncalibrated instrument is reliably wrong.
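The precision-versus-calibration distinction above can be made concrete with a minimal sketch. It assumes repeated readings of a single known reference; the function name `bias_and_spread` and the scale data are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev

def bias_and_spread(readings, reference_value):
    """Return (bias, spread) for repeated readings of one known reference.

    bias   : mean reading minus the reference value (the calibration error)
    spread : sample standard deviation of the readings (the precision)
    """
    bias = mean(readings) - reference_value
    spread = stdev(readings)
    return bias, spread

# Hypothetical data: a scale weighing a 100.00 g reference mass five times.
readings = [102.01, 102.00, 101.99, 102.00, 102.00]
bias, spread = bias_and_spread(readings, reference_value=100.00)
print(f"bias = {bias:+.2f} g, spread = {spread:.3f} g")
```

In this made-up case the spread is under 0.01 g while the bias is about 2 g: the scale is precise but reliably wrong, and only the comparison with the external reference mass reveals it.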

Triggers

  • User-initiated: The user asks whether a measurement, score, model output, or confidence can be trusted as-is, or describes aligning a tool against a known standard (“calibrate the sensor,” “the model is overconfident,” “what’s our ground truth”).
  • Agent-initiated: The agent notices outputs are being trusted at face value without a named reference, or that stated confidences don’t match observed outcomes.
  • Candidate inference: “what reference is this aligned against, and when was it last checked?”
  • Situation-shape signals: A device or judge emitting values consumed downstream as truth; a probability/confidence stream whose reliability can be checked against realized frequencies; a traceability chain back to a master standard.

Exclusions

  • Drift: losing alignment over time is the failure that calibration corrects, not calibration itself. The corrective-versus-failure-mode pair is the sharpest boundary; see drift.
  • Error-correction: error-correction reconstructs corrupted content from in-band redundancy; calibration tunes the apparatus against an out-of-band reference.
  • Mean-reversion: mean-reversion is a passive restoring force; calibration is an active, performed alignment. No dynamic pulls the instrument back on its own.
  • Accuracy without a reference: internal consistency (precision, repeatability) is not calibration. Comparison to an external trusted standard is constitutive.

Structure

Internal structure of calibration: a table of its component slots and the concepts that fill them.

Relationships

Relationship neighborhood of calibration: a graph of the concepts it connects to and the concepts it is a part of.
  • drift — corrective and failure-mode. Drift is what calibration fixes; the interval between calibrations is set by how fast drift accumulates relative to tolerance.
  • error-correction — both yield trustworthy values from imperfect ones; the axis is in-band-redundancy vs out-of-band-reference.
  • similarity — calibration presupposes the reference is genuinely comparable to the measured quantity; a mismatched standard yields a meaningless alignment.
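
The drift relationship above (interval set by drift rate relative to tolerance) can be sketched as simple arithmetic. This is only an illustration under a loud assumption: drift is treated as linear in time, which real instruments need not obey. The function name and the thermometer figures are hypothetical.

```python
def recalibration_interval(tolerance, uncertainty_after_cal, drift_rate):
    """Time until worst-case error reaches tolerance, ASSUMING linear drift.

    tolerance and uncertainty_after_cal share the measured quantity's unit;
    drift_rate is that unit per unit time (e.g. degrees C per month).
    The result is expressed in the time unit of drift_rate.
    """
    if drift_rate <= 0:
        raise ValueError("drift_rate must be positive")
    margin = tolerance - uncertainty_after_cal
    if margin <= 0:
        raise ValueError("out of tolerance immediately after calibration")
    return margin / drift_rate

# Hypothetical thermometer: 0.5 C tolerance, 0.1 C uncertainty right after
# calibration, drifting at 0.05 C per month.
months = recalibration_interval(0.5, 0.1, 0.05)
print(f"recalibrate within about {months:.1f} months")
```

A faster drift rate or a tighter tolerance shortens the interval; a residual uncertainty that already exceeds the tolerance means no interval is acceptable.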

Examples

JCGM 200:2012, "International Vocabulary of Metrology — Basic and General Concepts and Associated Terms (VIM)", 3rd edition, Joint Committee for Guides in Metrology / BIPM · engineering-and-technology

In metrology, calibration is the operation that, under specified conditions, establishes a relation between the quantity values provided by measurement standards (traceable through an unbroken chain back to the SI base units) and the corresponding indications a working instrument produces, together with the associated measurement uncertainty. A factory thermometer is not trusted because it is consistent; it is trusted because its readings have been compared against a reference whose own values are traceable to a national standard, and the residual error has been characterized.

Inference: Trust in a measurement is not a property of the instrument alone but of an explicit chain to a shared reference. “Is this number right?” reduces to “what was it last calibrated against, and how big is the stated uncertainty?” An instrument with no traceable reference has no defensible claim to accuracy, only to repeatability.
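
As a rough illustration of establishing the relation between indications and reference values, the sketch below fits a least-squares line and reports the residual scatter. It assumes a linear instrument response and uses the residual standard deviation as a crude stand-in for a proper uncertainty budget; the function names and the bath temperatures are hypothetical, not taken from the VIM.

```python
from math import sqrt

def fit_calibration(indications, reference_values):
    """Least-squares line mapping instrument indications to reference values.

    Returns (intercept, slope, residual_sd). residual_sd is the standard
    deviation of the fit residuals (n - 2 degrees of freedom).
    """
    n = len(indications)
    if n < 3 or n != len(reference_values):
        raise ValueError("need at least 3 paired points")
    mean_x = sum(indications) / n
    mean_y = sum(reference_values) / n
    sxx = sum((x - mean_x) ** 2 for x in indications)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(indications, reference_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x)
                 for x, y in zip(indications, reference_values)]
    residual_sd = sqrt(sum(r * r for r in residuals) / (n - 2))
    return intercept, slope, residual_sd

# Hypothetical comparison: reference bath temperatures versus the
# thermometer's own readings, both in degrees C.
reference = [0.0, 25.0, 50.0, 75.0, 100.0]
indicated = [1.2, 26.0, 51.1, 75.9, 101.0]
intercept, slope, residual_sd = fit_calibration(indicated, reference)

def corrected(indication):
    """Apply the fitted relation to turn a raw indication into a corrected value."""
    return intercept + slope * indication

print(f"corrected(51.1) = {corrected(51.1):.2f} C, residual sd = {residual_sd:.2f} C")
```

The fitted line is only as trustworthy as the reference values fed into it, which is why the traceability of the reference matters more than the fit itself.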

Niculescu-Mizil, A. & Caruana, R., "Predicting Good Probabilities with Supervised Learning", Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pp. 625–632 · computer-science

A machine-learning classifier can be accurate in its decisions yet badly calibrated in its confidences: when it outputs “0.9 probability,” the predicted event may actually occur only 60% of the time. Niculescu-Mizil and Caruana studied this systematically across learning algorithms, showing that some (boosted trees, SVMs) produce characteristically distorted probability estimates, and that aligning the raw scores against held-out observed frequencies, via Platt scaling or isotonic regression, restores the property that a stated probability matches the empirical rate of occurrence.

Inference: The model’s score head is an instrument; the realized outcome frequencies are the reference standard. Calibration here is the post-hoc mapping from raw scores to trustworthy probabilities. An uncalibrated model need not be “inaccurate” (it can rank correctly); it is consistently miscalibrated, exactly the precise-but-unaligned failure: the number can’t be taken at face value until mapped against observed frequencies.
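
One way to picture the post-hoc mapping is a small pool-adjacent-violators routine, the standard algorithm behind isotonic regression. This is a simplified sketch rather than the paper's code: it assumes distinct scores, uses a step-function lookup instead of interpolation, and the held-out data and function names are hypothetical.

```python
def isotonic_fit(scores, outcomes):
    """Pool-adjacent-violators fit of outcome frequency as a non-decreasing
    function of raw score (assumes distinct scores for simplicity).

    Returns a list of (max_score_in_block, probability), ascending by score.
    """
    blocks = []  # each block: [sum_of_outcomes, count, max_score]
    for score, outcome in sorted(zip(scores, outcomes)):
        blocks.append([float(outcome), 1, score])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, max_score = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = max_score
    return [(max_score, total / count) for total, count, max_score in blocks]

def calibrate(score, fitted):
    """Step lookup: probability of the first block whose max score covers `score`."""
    for max_score, probability in fitted:
        if score <= max_score:
            return probability
    return fitted[-1][1]

# Hypothetical held-out set: raw classifier scores and observed 0/1 outcomes.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
outcomes = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
fitted = isotonic_fit(scores, outcomes)
print(calibrate(0.45, fitted), calibrate(0.75, fitted))
```

The mapping preserves the ranking the model already gets right and changes only the numbers attached to it, which matches the point that miscalibration and inaccuracy are separate failures.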

Lichtenstein, S., Fischhoff, B. & Phillips, L. D., "Calibration of Probabilities: The State of the Art to 1980", in Kahneman, Slovic & Tversky (eds.), Judgment Under Uncertainty: Heuristics and Biases (Cambridge University Press, 1982), pp. 306–334 · psychology

A human judge is calibrated when, across many assessments, the events they assign 70% confidence to actually happen about 70% of the time. Lichtenstein, Fischhoff and Phillips reviewed the early literature and found people are systematically overconfident: events assigned 90% confidence occur far less often than 90% of the time. The judge’s subjective probabilities are the instrument; the realized outcome frequencies are the reference standard against which calibration is measured.

Inference: Calibration transfers cleanly from physical instruments to judging minds: the same structure of instrument-aligned-to-reference holds, and the same failure (confident-but-uncalibrated) recurs. The corrective is the same: feed back realized outcomes against stated confidences, so the judge re-anchors. Expert forecasting training is, structurally, instrument calibration applied to a person.
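
The feedback step described here, comparing stated confidences with realized outcomes, might be tabulated as in the sketch below. The forecasts are invented for illustration, and the function names and the count-weighted overconfidence score are assumptions of this sketch, not measures taken from the cited review.

```python
from collections import defaultdict

def calibration_table(stated, occurred):
    """Group forecasts by stated confidence.

    Returns {confidence: (n_forecasts, observed_frequency)}, ascending.
    """
    groups = defaultdict(list)
    for confidence, happened in zip(stated, occurred):
        groups[confidence].append(1 if happened else 0)
    return {c: (len(v), sum(v) / len(v)) for c, v in sorted(groups.items())}

def overconfidence(table):
    """Count-weighted mean of (stated confidence - observed frequency).

    Positive values mean the judge is overconfident on average.
    """
    total = sum(n for n, _ in table.values())
    return sum(n * (c - freq) for c, (n, freq) in table.items()) / total

# Hypothetical judge: ten forecasts at 90% (7 came true), ten at 70% (6 came true).
stated = [0.9] * 10 + [0.7] * 10
occurred = [True] * 7 + [False] * 3 + [True] * 6 + [False] * 4
table = calibration_table(stated, occurred)
for confidence, (n, freq) in table.items():
    print(f"stated {confidence:.0%}: observed {freq:.0%} over {n} forecasts")
print(f"overconfidence = {overconfidence(table):+.2f}")
```

Showing a judge this table after each batch of resolved forecasts is the person-level analogue of comparing an instrument's readings with a reference.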