Methodology

Version 1.1 · draft

Calibration Ledger scores sources on calibrated accuracy — not “was this specific prediction right” but “when this source says X with confidence Y, how often does X actually occur?” A well-calibrated source that says “70% likely” is right 70% of the time across its predictions at that confidence level.[3]

# Core scoring: Brier score

Introduced by meteorologist Glenn W. Brier in 1950 to evaluate probabilistic weather forecasts,[1] the Brier score is the mean squared error between a forecasted probability and the realised outcome. For binary predictions with probability p and outcome o ∈ {0, 1}:

Brier = (p − o)²

Lower is better. A source that always predicts 0.5 gets Brier = 0.25 on every outcome. A perfect predictor gets Brier = 0. Aggregated Brier scores are logged per source, per domain, per time window.

A worked example

Suppose Source A makes three probabilistic forecasts: 0.7, 0.9, and 0.4. The respective outcomes are 1, 1, and 0 (yes, yes, no). Per-forecast Brier scores:

(0.7 − 1)² = 0.09
(0.9 − 1)² = 0.01
(0.4 − 0)² = 0.16

mean Brier = (0.09 + 0.01 + 0.16) / 3 = 0.0867

Interpretation: Source A scored ~0.09 — well below the “always 0.5” baseline of 0.25, indicating it discriminated meaningfully between likely and unlikely events. To know whether that score is well-calibrated or just well-resolved-but-overconfident, we need the Murphy decomposition (next section) and a population of forecasts large enough to bin (≥30 at the same confidence level per Tetlock’s guidance).[3]

Pseudocode

The Calibration Ledger reference implementation (forthcoming, open-source CC-BY-4.0) will follow this shape:

def brier_score(predictions: list[tuple[float, int]]) -> float:
    """
    predictions: list of (probability, outcome) tuples.
        probability is in [0.0, 1.0]; outcome is 0 or 1.
    Returns mean Brier score across the population.
    """
    if not predictions:
        raise ValueError("empty prediction set")
    for p, o in predictions:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        if o not in (0, 1):
            raise ValueError(f"outcome must be 0 or 1: {o}")
    total = sum((p - o) ** 2 for p, o in predictions)
    return total / len(predictions)
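Applied to the worked example above, the scoring shape returns the same ~0.09 figure. A self-contained run (this is a minimal stand-in for the reference implementation, not the implementation itself):

```python
def brier_score(predictions):
    # minimal version of the scoring function sketched above
    if not predictions:
        raise ValueError("empty prediction set")
    return sum((p - o) ** 2 for p, o in predictions) / len(predictions)

# Source A's three forecasts and outcomes from the worked example
print(round(brier_score([(0.7, 1), (0.9, 1), (0.4, 0)]), 4))  # 0.0867
```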

Implementation note: probability values are validated to be in [0.0, 1.0] before scoring; outcomes must be exactly 0 or 1. Multi-class outcomes use the multi-class Brier generalisation (mean squared error across the one-hot outcome vector), not implemented in this snippet. All scoring is deterministic; reruns on the same inputs always produce the same output.
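The multi-class generalisation mentioned above (mean squared error across the one-hot outcome vector) can be sketched as follows; the function name and signature are illustrative assumptions, not part of the reference implementation:

```python
def brier_multiclass(probs: list[float], outcome: int) -> float:
    """Mean squared error across the one-hot outcome vector.

    probs: forecast distribution over K classes (should sum to ~1.0).
    outcome: index of the realised class.
    Illustrative sketch only; the reference implementation may differ.
    """
    return sum(
        (p - (1.0 if k == outcome else 0.0)) ** 2
        for k, p in enumerate(probs)
    ) / len(probs)
```

For K = 2 this agrees with the binary score: `brier_multiclass([0.3, 0.7], 1)` equals `(0.7 − 1)²` because the two squared errors are identical and averaging over both classes cancels the doubling.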

# Calibration curves and the Murphy decomposition

Brier alone is insufficient. A source can have low Brier by being systematically over- or under-confident in ways that cancel out. Allan H. Murphy’s 1973 decomposition resolves this by partitioning the Brier score into three components:[2]

Brier = Reliability − Resolution + Uncertainty
  • Reliability — how close forecasts are to observed frequencies within each probability bucket. Zero is perfect.
  • Resolution — how much outcome frequency varies across probability buckets. Higher is better; it rewards sources that discriminate between likely and unlikely events.
  • Uncertainty — the base-rate variability of outcomes. A property of the domain, not the source.

Calibration Ledger publishes both the Brier score and its Murphy partition per source, so readers can distinguish “well-calibrated but imprecise” sources (low reliability gap, low resolution) from “discriminating but overconfident” sources (high resolution, high reliability gap). Per-source calibration curves are published for each probability bucket (0-10%, 10-20%, …, 90-100%) showing observed outcome frequency against stated confidence.

[Figure: reliability diagram, three example sources. Stated confidence (x-axis, 0–1) vs observed frequency (y-axis, 0–1); a dashed diagonal marks perfect calibration. A well-calibrated source tracks the diagonal; an overconfident source sits below it (claims 90% confidence but is right only ~60% of the time); an underconfident source sits above it (claims 30% confidence but is right ~45% of the time).]
Figure 1 · Illustrative reliability diagram. A perfectly calibrated source lies on the dashed diagonal. Curves below the diagonal indicate overconfidence; curves above indicate underconfidence. The Murphy decomposition formalises these deviations as the reliability component.
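The Murphy partition can be computed directly from a population of (probability, outcome) pairs. A minimal sketch, with forecasts grouped into ten equal-width buckets to match the 0-10% … 90-100% buckets above; the function name and binning scheme are assumptions of this sketch, not the Ledger's reference implementation, and the identity Brier = Reliability − Resolution + Uncertainty holds exactly only when every forecast in a bucket states the same probability:

```python
from collections import defaultdict

def murphy_decomposition(predictions, n_bins=10):
    """Return (reliability, resolution, uncertainty) for binary forecasts.

    predictions: list of (probability, outcome) tuples, outcome in {0, 1}.
    Within each bucket the mean stated probability stands in for the
    bucket forecast, so for continuous forecasts the Brier identity
    holds only approximately.
    """
    n = len(predictions)
    base_rate = sum(o for _, o in predictions) / n
    uncertainty = base_rate * (1 - base_rate)  # property of the domain

    bins = defaultdict(list)
    for p, o in predictions:
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 joins the top bin
        bins[k].append((p, o))

    reliability = 0.0
    resolution = 0.0
    for members in bins.values():
        n_k = len(members)
        mean_p = sum(p for p, _ in members) / n_k   # stated confidence
        freq = sum(o for _, o in members) / n_k     # observed frequency
        reliability += n_k * (mean_p - freq) ** 2
        resolution += n_k * (freq - base_rate) ** 2
    return reliability / n, resolution / n, uncertainty
```

The `(mean_p, freq)` pairs per bucket are exactly the points plotted on the reliability diagram in Figure 1.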

# Append-only time-stamping

Every prediction is logged before the outcome is known, with an immutable timestamp. Predictions cannot be retroactively edited, deleted, or re-stated. This append-only time-stamping discipline prevents hindsight bias — the common failure mode where “I predicted this all along” claims are made after outcomes are known.[3]

Sources that do not expose predictions in a verifiable, timestamped form cannot be scored.
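The page does not specify the timestamping mechanism. One common way to make an append-only log tamper-evident is hash chaining, sketched below purely for illustration (record fields and function name are assumptions, not the Ledger's actual schema):

```python
import hashlib
import json
import time

def append_prediction(log: list[dict], source: str, claim: str, p: float) -> dict:
    """Append a prediction record chained to the previous entry's hash.

    Illustrative sketch only. Editing or deleting any earlier record
    changes its hash and breaks every subsequent prev_hash link, which
    makes retroactive re-statement detectable.
    """
    prev_hash = log[-1]["hash"] if log else "0" * 64
    record = {
        "source": source,
        "claim": claim,
        "probability": p,
        "logged_at": time.time(),  # recorded before the outcome is known
        "prev_hash": prev_hash,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    log.append(record)
    return record
```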

# Domain-specific accuracy windows

A 24-hour weather forecast and a 10-year geopolitical forecast are not directly comparable. Scores are bucketed by domain (finance, geopolitics, health, climate, sports, technology, consumer) and time window (intraday, day, week, month, quarter, year, multi-year). Cross-domain “overall forecaster” rankings are explicitly not published — they would be misleading.
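The bucketing amounts to keying every score by (domain, window) and never aggregating across keys. A minimal sketch, with a hypothetical record shape (field names are illustrative, not the Ledger's actual schema):

```python
from collections import defaultdict

# Hypothetical prediction records; field names are illustrative.
predictions = [
    {"domain": "finance", "window": "quarter", "p": 0.8, "outcome": 1},
    {"domain": "finance", "window": "quarter", "p": 0.6, "outcome": 0},
    {"domain": "geopolitics", "window": "year", "p": 0.3, "outcome": 0},
]

# Group by (domain, window); no cross-key aggregation is ever computed.
buckets: dict[tuple[str, str], list[tuple[float, int]]] = defaultdict(list)
for rec in predictions:
    buckets[(rec["domain"], rec["window"])].append((rec["p"], rec["outcome"]))

scores = {
    key: sum((p - o) ** 2 for p, o in preds) / len(preds)
    for key, preds in buckets.items()
}
```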

# Source types scored

Six classes of source are scored, each on a primary metric within domain-specific time windows; full window definitions live in the JSON-LD twin at /api/methodology.json.

Calibration Ledger source-type classes with primary scoring metric per class

| Source class | Primary metric |
| --- | --- |
| AI models | Factuality benchmarks, hallucination rate, calibration when asked “how confident are you?” |
| Human forecasters | Public track records on Metaculus, the Good Judgment Project, Manifold Markets, and self-published archives |
| Analyst firms | Published price targets, ratings, and earnings estimates vs realised outcomes |
| Scientific papers | Replication status, effect-size shrinkage, citation-adjusted impact |
| Review platforms | Outcome-alignment of aggregated reviews (did the product actually work as aggregated reviews suggested?) |
| Prediction markets | Market-implied probabilities vs realised outcomes, calibrated per market type |

# What Calibration Ledger does not do

  • Score one-off predictions in isolation. Calibration is only meaningful over ≥30 predictions at the same confidence level.[3]
  • Rate “truthfulness” of non-predictive statements. Factuality benchmarks are separate.
  • Provide investment advice, medical advice, or any advice. See disclaimer.
  • Score sources that do not publish verifiable, timestamped predictions.

# Data licensing + attribution

Where possible, Calibration Ledger will operate under data licensing agreements with upstream forecasting platforms (Metaculus, Good Judgment Open, Manifold Markets, Artificial Analysis). Source data is attributed; derivative aggregate scores are published under CC-BY-4.0.

# References

  [1] Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1), 1–3. DOI:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
  [2] Murphy, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology, 12, 595–600. DOI:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
  [3] Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown Publishers. ISBN 978-0804136693.

# Cite this methodology

For academic, design-partner, and journalistic citations of this methodology page, use one of the formats below. CC-BY-4.0 license — attribution required, derivatives allowed.

Direct download for reference managers (Zotero, Mendeley, EndNote, BibDesk): methodology.bib · methodology.ris — includes the foundational Brier 1950, Murphy 1973, and Tetlock 2015 entries for one-stop import.

APA 7

de Vries, P. (2026). Calibration Ledger Methodology (Version 1.1) [Web document]. Calibration Ledger. https://calibrationledger.com/methodology/

BibTeX

@misc{calibrationledger_methodology_1_1,
  author       = {de Vries, Paulo},
  title        = {{Calibration Ledger Methodology}},
  version      = {1.1},
  year         = {2026},
  month        = {April},
  publisher    = {Calibration Ledger},
  url          = {https://calibrationledger.com/methodology/},
  note         = {CC-BY-4.0; machine-readable JSON-LD twin at https://calibrationledger.com/api/methodology.json}
}

Plain text

Calibration Ledger (2026). Methodology v1.1 — calibrated accuracy scores for predictive sources. https://calibrationledger.com/methodology/. CC-BY-4.0.

Methodology version is independent of site version. The currently cited version is v1.1, last verified 2026-04-24. Earlier versions are not yet archived (this is the first published revision after v1.0). Future revisions will be tracked in the /changelog/ with stable version-pinned URLs.

  • /about/ — operator identity, prerequisite phase, Q3 2027 launch gate, kill criterion
  • /for-agents/ — machine-readable reference for LLM crawlers, JSON twin, citation format, license
  • /api/methodology.json — JSON-LD twin of this page (CC-BY-4.0)
  • /changelog/ — methodology + site version history

This methodology is a draft. It will be revised before public launch based on design-partner feedback, academic review, and the operator’s own calibration work on ForecastLens.

CC BY 4.0 · Creative Commons Attribution 4.0 International — attribute to Calibration Ledger, link to calibrationledger.com/methodology/, indicate any changes.

Last verified: 2026-04-24