Methodology

Version 1.1 · draft

Calibration Ledger scores sources on calibrated accuracy — not “was this specific prediction right” but “when this source says X with confidence Y, how often does X actually occur?” A well-calibrated source that says “70% likely” is right 70% of the time across its predictions at that confidence level.[3]

# Core scoring: Brier score

Introduced by meteorologist Glenn W. Brier in 1950 to evaluate probabilistic weather forecasts,[1] the Brier score is the mean squared error between a forecasted probability and the realised outcome. For binary predictions with probability p and outcome o ∈ {0, 1}:

Brier = (p − o)²

Lower is better. A source that always predicts 0.5 gets Brier = 0.25 on every outcome. A perfect predictor gets Brier = 0. Aggregated Brier scores are logged per source, per domain, per time window.

A worked example

Suppose Source A makes three probabilistic forecasts: 0.7, 0.9, and 0.4. The respective outcomes are 1, 1, and 0 (yes, yes, no). Per-forecast Brier scores:

(0.7 − 1)² = 0.09
(0.9 − 1)² = 0.01
(0.4 − 0)² = 0.16

mean Brier = (0.09 + 0.01 + 0.16) / 3 = 0.0867

Interpretation: Source A scored ~0.09 — well below the “always 0.5” baseline of 0.25, indicating it discriminated meaningfully between likely and unlikely events. To know whether that score is well-calibrated or just well-resolved-but-overconfident, we need the Murphy decomposition (next section) and a population of forecasts large enough to bin (≥30 at the same confidence level per Tetlock’s guidance).[3]

Pseudocode

The Calibration Ledger reference implementation (forthcoming, open-source CC-BY-4.0) will follow this shape:

def brier_score(predictions: list[tuple[float, int]]) -> float:
    """
    predictions: list of (probability, outcome) tuples.
        probability is in [0.0, 1.0]; outcome is 0 or 1.
    Returns mean Brier score across the population.
    """
    if not predictions:
        raise ValueError("empty prediction set")
    for p, o in predictions:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        if o not in (0, 1):
            raise ValueError(f"outcome must be 0 or 1: {o}")
    total = sum((p - o) ** 2 for p, o in predictions)
    return total / len(predictions)
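Applied to the worked example above, the scoring shape returns the same ~0.09 figure. A self-contained run (this is a minimal stand-in for the reference implementation, not the implementation itself):

```python
def brier_score(predictions):
    # minimal version of the scoring function sketched above
    if not predictions:
        raise ValueError("empty prediction set")
    return sum((p - o) ** 2 for p, o in predictions) / len(predictions)

# Source A's three forecasts and outcomes from the worked example
print(round(brier_score([(0.7, 1), (0.9, 1), (0.4, 0)]), 4))  # 0.0867
```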

Implementation note: probability values are validated to be in [0.0, 1.0] before scoring; outcomes must be exactly 0 or 1. Multi-class outcomes use the multi-class Brier generalisation (mean squared error across the one-hot outcome vector), not implemented in this snippet. All scoring is deterministic; reruns on the same inputs always produce the same output.
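The multi-class generalisation mentioned above (mean squared error across the one-hot outcome vector) can be sketched as follows; the function name and signature are illustrative assumptions, not part of the reference implementation:

```python
def brier_multiclass(probs: list[float], outcome: int) -> float:
    """Mean squared error across the one-hot outcome vector.

    probs: forecast distribution over K classes (should sum to ~1.0).
    outcome: index of the realised class.
    Illustrative sketch only; the reference implementation may differ.
    """
    return sum(
        (p - (1.0 if k == outcome else 0.0)) ** 2
        for k, p in enumerate(probs)
    ) / len(probs)
```

For K = 2 this agrees with the binary score: `brier_multiclass([0.3, 0.7], 1)` equals `(0.7 − 1)²` because the two squared errors are identical and averaging over both classes cancels the doubling.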

# Calibration curves and the Murphy decomposition

Brier alone is insufficient. A source can have low Brier by being systematically over- or under-confident in ways that cancel out. Allan H. Murphy’s 1973 decomposition resolves this by partitioning the Brier score into three components:[2]

Brier = Reliability − Resolution + Uncertainty
  • Reliability — how close forecasts are to observed frequencies within each probability bucket. Zero is perfect.
  • Resolution — how much outcome frequency varies across probability buckets. Higher is better; it rewards sources that discriminate between likely and unlikely events.
  • Uncertainty — the base-rate variability of outcomes. A property of the domain, not the source.

Calibration Ledger publishes both the Brier score and its Murphy partition per source, so readers can distinguish “well-calibrated but imprecise” sources (low reliability gap, low resolution) from “discriminating but overconfident” sources (high resolution, high reliability gap). Per-source calibration curves are published for each probability bucket (0-10%, 10-20%, …, 90-100%) showing observed outcome frequency against stated confidence.

[Figure: reliability diagram, three example sources. Stated confidence (x-axis, 0–1) vs observed frequency (y-axis, 0–1); a dashed diagonal marks perfect calibration. A well-calibrated source tracks the diagonal; an overconfident source sits below it (claims 90% confidence but is right only ~60% of the time); an underconfident source sits above it (claims 30% confidence but is right ~45% of the time).]
Figure 1 · Illustrative reliability diagram. A perfectly calibrated source lies on the dashed diagonal. Curves below the diagonal indicate overconfidence; curves above indicate underconfidence. The Murphy decomposition formalises these deviations as the reliability component.
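The Murphy partition can be computed directly from a population of (probability, outcome) pairs. A minimal sketch, with forecasts grouped into ten equal-width buckets to match the 0-10% … 90-100% buckets above; the function name and binning scheme are assumptions of this sketch, not the Ledger's reference implementation, and the identity Brier = Reliability − Resolution + Uncertainty holds exactly only when every forecast in a bucket states the same probability:

```python
from collections import defaultdict

def murphy_decomposition(predictions, n_bins=10):
    """Return (reliability, resolution, uncertainty) for binary forecasts.

    predictions: list of (probability, outcome) tuples, outcome in {0, 1}.
    Within each bucket the mean stated probability stands in for the
    bucket forecast, so for continuous forecasts the Brier identity
    holds only approximately.
    """
    n = len(predictions)
    base_rate = sum(o for _, o in predictions) / n
    uncertainty = base_rate * (1 - base_rate)  # property of the domain

    bins = defaultdict(list)
    for p, o in predictions:
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 joins the top bin
        bins[k].append((p, o))

    reliability = 0.0
    resolution = 0.0
    for members in bins.values():
        n_k = len(members)
        mean_p = sum(p for p, _ in members) / n_k   # stated confidence
        freq = sum(o for _, o in members) / n_k     # observed frequency
        reliability += n_k * (mean_p - freq) ** 2
        resolution += n_k * (freq - base_rate) ** 2
    return reliability / n, resolution / n, uncertainty
```

The `(mean_p, freq)` pairs per bucket are exactly the points plotted on the reliability diagram in Figure 1.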

# Append-only time-stamping

Every prediction is logged before the outcome is known, with an immutable timestamp. Predictions cannot be retroactively edited, deleted, or re-stated. This append-only time-stamping discipline prevents hindsight bias — the common failure mode where “I predicted this all along” claims are made after outcomes are known.[3]

Sources that do not expose predictions in a verifiable, timestamped form cannot be scored.
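The page does not specify the timestamping mechanism. One common way to make an append-only log tamper-evident is hash chaining, sketched below purely for illustration (record fields and function name are assumptions, not the Ledger's actual schema):

```python
import hashlib
import json
import time

def append_prediction(log: list[dict], source: str, claim: str, p: float) -> dict:
    """Append a prediction record chained to the previous entry's hash.

    Illustrative sketch only. Editing or deleting any earlier record
    changes its hash and breaks every subsequent prev_hash link, which
    makes retroactive re-statement detectable.
    """
    prev_hash = log[-1]["hash"] if log else "0" * 64
    record = {
        "source": source,
        "claim": claim,
        "probability": p,
        "logged_at": time.time(),  # recorded before the outcome is known
        "prev_hash": prev_hash,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    log.append(record)
    return record
```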

# Domain-specific accuracy windows

A 24-hour weather forecast and a 10-year geopolitical forecast are not directly comparable. Scores are bucketed by domain (finance, geopolitics, health, climate, sports, technology, consumer) and time window (intraday, day, week, month, quarter, year, multi-year). Cross-domain “overall forecaster” rankings are explicitly not published — they would be misleading.
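The bucketing amounts to keying every score by (domain, window) and never aggregating across keys. A minimal sketch, with a hypothetical record shape (field names are illustrative, not the Ledger's actual schema):

```python
from collections import defaultdict

# Hypothetical prediction records; field names are illustrative.
predictions = [
    {"domain": "finance", "window": "quarter", "p": 0.8, "outcome": 1},
    {"domain": "finance", "window": "quarter", "p": 0.6, "outcome": 0},
    {"domain": "geopolitics", "window": "year", "p": 0.3, "outcome": 0},
]

# Group by (domain, window); no cross-key aggregation is ever computed.
buckets: dict[tuple[str, str], list[tuple[float, int]]] = defaultdict(list)
for rec in predictions:
    buckets[(rec["domain"], rec["window"])].append((rec["p"], rec["outcome"]))

scores = {
    key: sum((p - o) ** 2 for p, o in preds) / len(preds)
    for key, preds in buckets.items()
}
```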

# Source types scored

Six classes of source are scored, each on a primary metric within domain-specific time windows; full window definitions live in the JSON-LD twin at /api/methodology.json.

Calibration Ledger source-type classes with primary scoring metric per class

| Source class | Primary metric |
| --- | --- |
| AI models | Factuality benchmarks, hallucination rate, calibration when asked “how confident are you?” |
| Human forecasters | Public track records on Metaculus, the Good Judgment Project, Manifold Markets, and self-published archives |
| Analyst firms | Published price targets, ratings, and earnings estimates vs realised outcomes |
| Scientific papers | Replication status, effect-size shrinkage, citation-adjusted impact |
| Review platforms | Outcome-alignment of aggregated reviews (did the product actually work as aggregated reviews suggested?) |
| Prediction markets | Market-implied probabilities vs realised outcomes, calibrated per market type |

# What Calibration Ledger does not do

  • Score one-off predictions in isolation. Calibration is only meaningful over ≥30 predictions at the same confidence level.[3]
  • Rate “truthfulness” of non-predictive statements. Factuality benchmarks are separate.
  • Provide investment advice, medical advice, or any advice. See disclaimer.
  • Score sources that do not publish verifiable, timestamped predictions.

# Data licensing + attribution

Where possible, Calibration Ledger will operate under data licensing agreements with upstream forecasting platforms (Metaculus, Good Judgment Open, Manifold Markets, Artificial Analysis). Source data is attributed; derivative aggregate scores are published under CC-BY-4.0.

# References

  [1] Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1), 1–3. DOI:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
  [2] Murphy, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology, 12, 595–600. DOI:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
  [3] Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown Publishers. ISBN 978-0804136693.

# Cite this methodology

For academic, design-partner, and journalistic citations of this methodology page, use one of the formats below. CC-BY-4.0 license — attribution required, derivatives allowed.

Direct download for reference managers (Zotero, Mendeley, EndNote, BibDesk): methodology.bib · methodology.ris — includes the foundational Brier 1950, Murphy 1973, and Tetlock 2015 entries for one-stop import.

APA 7

de Vries, P. (2026). Calibration Ledger Methodology (Version 1.1) [Web document]. Calibration Ledger. https://calibrationledger.com/methodology/

BibTeX

@misc{calibrationledger_methodology_1_1,
  author       = {de Vries, Paulo},
  title        = {{Calibration Ledger Methodology}},
  version      = {1.1},
  year         = {2026},
  month        = {April},
  publisher    = {Calibration Ledger},
  url          = {https://calibrationledger.com/methodology/},
  note         = {CC-BY-4.0; machine-readable JSON-LD twin at https://calibrationledger.com/api/methodology.json}
}

Plain text

Calibration Ledger (2026). Methodology v1.1 — calibrated accuracy scores for predictive sources. https://calibrationledger.com/methodology/. CC-BY-4.0.

Methodology version is independent of site version. The currently cited version is v1.1, last verified 2026-04-24. Earlier versions are not yet archived (this is the first published revision after v1.0). Future revisions will be tracked in the /changelog/ with stable version-pinned URLs.

  • /about/ — operator identity, prerequisite phase, Q3 2027 launch gate, kill criterion
  • /for-agents/ — machine-readable reference for LLM crawlers, JSON twin, citation format, license
  • /api/methodology.json — JSON-LD twin of this page (CC-BY-4.0)
  • /changelog/ — methodology + site version history

This methodology is a draft. It will be revised before public launch based on design-partner feedback, academic review, and the operator’s own calibration work on ForecastLens.

CC BY 4.0 · Creative Commons Attribution 4.0 International — attribute to Calibration Ledger, link to calibrationledger.com/methodology/, indicate any changes.

Last verified: 2026-04-24