Methodology
Version 1.1 · draft
Calibration Ledger scores sources on calibrated accuracy — not “was this specific prediction right” but “when this source says X with confidence Y, how often does X actually occur?” A well-calibrated source that says “70% likely” is right 70% of the time across its predictions at that confidence level.[3]
#Core scoring: Brier score
Introduced by meteorologist Glenn W. Brier in 1950 to evaluate probabilistic weather forecasts,[1] the Brier score is the mean squared error between a forecasted probability and the realised outcome. For binary predictions with probability p and outcome o ∈ {0, 1}:
Brier = (p − o)²

Lower is better. A source that always predicts 0.5 gets Brier = 0.25 on every outcome. A perfect predictor gets Brier = 0. Aggregated Brier scores are logged per source, per domain, per time window.
A worked example
Suppose Source A makes three probabilistic forecasts: 0.7, 0.9, and 0.4. The respective outcomes are 1, 1, and 0 (yes, yes, no). Per-forecast Brier scores:
(0.7 − 1)² = 0.09
(0.9 − 1)² = 0.01
(0.4 − 0)² = 0.16
mean Brier = (0.09 + 0.01 + 0.16) / 3 ≈ 0.0867

Interpretation: Source A scored ~0.09 — well below the “always 0.5” baseline of 0.25, indicating it discriminated meaningfully between likely and unlikely events. To know whether that score is well-calibrated or just well-resolved-but-overconfident, we need the Murphy decomposition (next section) and a population of forecasts large enough to bin (≥30 at the same confidence level per Tetlock’s guidance).[3]
Pseudocode
The Calibration Ledger reference implementation (forthcoming, open-source CC-BY-4.0) will follow this shape:
def brier_score(predictions: list[tuple[float, int]]) -> float:
    """
    predictions: list of (probability, outcome) tuples.
    probability is in [0.0, 1.0]; outcome is 0 or 1.
    Returns mean Brier score across the population.
    """
    if not predictions:
        raise ValueError("empty prediction set")
    for p, o in predictions:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        if o not in (0, 1):
            raise ValueError(f"outcome must be 0 or 1: {o}")
    total = sum((p - o) ** 2 for p, o in predictions)
    return total / len(predictions)

Implementation note: probability values are validated to be in [0.0, 1.0] before scoring; outcomes must be exactly 0 or 1. Multi-class outcomes use the multi-class Brier generalisation (mean squared error across the one-hot outcome vector), not implemented in this snippet. All scoring is deterministic; reruns on the same inputs always produce the same output.
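To sanity-check the scoring shape, the worked example above can be run through it. The function body is restated here (validation omitted) so the snippet stands alone:

```python
def brier_score(predictions):
    # Mean squared error between forecast probability and realised outcome,
    # matching the reference pseudocode above.
    total = sum((p - o) ** 2 for p, o in predictions)
    return total / len(predictions)

# Source A's three forecasts from the worked example: 0.7, 0.9, 0.4
# against outcomes yes, yes, no.
source_a = [(0.7, 1), (0.9, 1), (0.4, 0)]
print(round(brier_score(source_a), 4))  # 0.0867
```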
#Calibration curves and the Murphy decomposition
Brier alone is insufficient. A source can have low Brier by being systematically over- or under-confident in ways that cancel out. Allan H. Murphy’s 1973 decomposition resolves this by partitioning the Brier score into three components:[2]
Brier = Reliability − Resolution + Uncertainty

- Reliability — how close forecasts are to observed frequencies within each probability bucket. Zero is perfect.
- Resolution — how much outcome frequency varies across probability buckets. Higher is better; it rewards sources that discriminate between likely and unlikely events.
- Uncertainty — the base-rate variability of outcomes. A property of the domain, not the source.
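The three components can be sketched in Python. This is an illustrative computation, not the Ledger's reference implementation; it bins by distinct forecast value, for which the identity Brier = reliability − resolution + uncertainty holds exactly:

```python
from collections import defaultdict

def murphy_decomposition(predictions):
    """Partition the mean Brier score into reliability, resolution, and
    uncertainty (Murphy 1973). predictions: (probability, outcome) tuples.
    Bins by distinct forecast value, so the decomposition is exact."""
    n = len(predictions)
    base_rate = sum(o for _, o in predictions) / n
    bins = defaultdict(list)  # forecast value -> outcomes issued at that value
    for p, o in predictions:
        bins[p].append(o)
    # Reliability: squared gap between stated confidence and observed
    # frequency, weighted by bin size. Zero is perfect.
    reliability = sum(
        len(os) * (p - sum(os) / len(os)) ** 2 for p, os in bins.items()
    ) / n
    # Resolution: how far each bin's observed frequency sits from the
    # overall base rate. Higher is better.
    resolution = sum(
        len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in bins.values()
    ) / n
    # Uncertainty: base-rate variability, a property of the domain.
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```

On Source A's three forecasts this returns reliability ≈ 0.087, resolution ≈ 0.222, uncertainty ≈ 0.222, and reliability − resolution + uncertainty reproduces the mean Brier score ≈ 0.087. With one forecast per bin the partition is degenerate, which is exactly why a binnable population (≥30 per confidence level) is required before the components are meaningful.
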
Calibration Ledger publishes both the Brier score and its Murphy partition per source, so readers can distinguish “well-calibrated but imprecise” sources (low reliability gap, low resolution) from “discriminating but overconfident” sources (high resolution, high reliability gap). Per-source calibration curves are published for each probability bucket (0-10%, 10-20%, …, 90-100%) showing observed outcome frequency against stated confidence.
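The published per-bucket curve can be sketched as follows; `calibration_curve` is a hypothetical helper for illustration, not part of the reference implementation:

```python
def calibration_curve(predictions, n_buckets=10):
    """Observed outcome frequency per stated-confidence bucket
    (0-10%, 10-20%, ..., 90-100%). Buckets with no forecasts
    are reported as None rather than a fabricated frequency."""
    buckets = [[] for _ in range(n_buckets)]
    for p, o in predictions:
        i = min(int(p * n_buckets), n_buckets - 1)  # p == 1.0 joins the top bucket
        buckets[i].append(o)
    return [sum(b) / len(b) if b else None for b in buckets]
```

A well-calibrated source produces a curve that tracks the bucket midpoints: forecasts in the 70-80% bucket should resolve true roughly 70-80% of the time.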
#Append-only time-stamping
Every prediction is logged before the outcome is known, with an immutable timestamp. Predictions cannot be retroactively edited, deleted, or re-stated. This append-only time-stamping discipline prevents hindsight bias — the common failure mode where “I predicted this all along” claims are made after outcomes are known.[3]
Sources that do not expose predictions in a verifiable, timestamped form cannot be scored.
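One way to make retroactive edits detectable is to hash-chain each logged prediction to its predecessor. The sketch below is illustrative only, assuming a simple in-memory log; it is not the Ledger's actual storage format:

```python
import datetime
import hashlib
import json

class AppendOnlyLog:
    """Illustrative append-only time-stamping: every entry records a UTC
    timestamp and the hash of the previous entry, so any retroactive
    edit or deletion breaks verification of the chain."""

    def __init__(self):
        self._entries = []

    def append(self, prediction: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else "0" * 64
        entry = {
            "prediction": prediction,
            "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "prev": prev,
        }
        # Canonical serialisation (sorted keys) so the hash is reproducible.
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self._entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self._entries:
            body = {k: e[k] for k in ("prediction", "logged_at", "prev")}
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if e["prev"] != prev or e["hash"] != recomputed:
                return False
            prev = e["hash"]
        return True
```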
#Domain-specific accuracy windows
A 24-hour weather forecast and a 10-year geopolitical forecast are not directly comparable. Scores are bucketed by domain (finance, geopolitics, health, climate, sports, technology, consumer) and time window (intraday, day, week, month, quarter, year, multi-year). Cross-domain “overall forecaster” rankings are explicitly not published — they would be misleading.
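The bucketing can be illustrated with a toy aggregation; the record shape and source name here are hypothetical, but the domain and window labels follow the lists above:

```python
from collections import defaultdict

# Hypothetical record shape: (source, domain, window, probability, outcome).
records = [
    ("src_a", "finance", "quarter", 0.7, 1),
    ("src_a", "finance", "quarter", 0.4, 0),
    ("src_a", "climate", "day", 0.9, 1),
]

# Group predictions by (source, domain, window) so scores are only
# ever aggregated within a comparable bucket, never across them.
by_bucket = defaultdict(list)
for source, domain, window, p, o in records:
    by_bucket[(source, domain, window)].append((p, o))

for key, preds in sorted(by_bucket.items()):
    brier = sum((p - o) ** 2 for p, o in preds) / len(preds)
    print(key, round(brier, 4))
```

Each bucket gets its own Brier score; no cross-bucket mean is ever computed, which is the concrete form of the "no overall forecaster ranking" rule.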
#Source types scored
Calibration Ledger scores six classes of source, each on a primary metric within domain-specific time windows; full window definitions live in the JSON-LD twin at /api/methodology.json.
| Source class | Primary metric |
|---|---|
| AI models | Factuality benchmarks, hallucination rate, calibration when asked “how confident are you?” |
| Human forecasters | Public track records on Metaculus, the Good Judgment Project, Manifold Markets, and self-published archives |
| Analyst firms | Published price targets, ratings, and earnings estimates vs realised outcomes |
| Scientific papers | Replication status, effect-size shrinkage, citation-adjusted impact |
| Review platforms | Outcome-alignment of aggregated reviews (did the product actually work as aggregated reviews suggested?) |
| Prediction markets | Market-implied probabilities vs realised outcomes, calibrated per market type |
#What Calibration Ledger does not do
- Score one-off predictions in isolation. Calibration is only meaningful over ≥30 predictions at the same confidence level.[3]
- Rate “truthfulness” of non-predictive statements. Factuality benchmarks are separate.
- Provide investment advice, medical advice, or any advice. See disclaimer.
- Score sources that do not publish verifiable, timestamped predictions.
#Data licensing + attribution
Where possible, Calibration Ledger will operate under data licensing agreements with upstream forecasting platforms (Metaculus, Good Judgment Open, Manifold Markets, Artificial Analysis). Source data is attributed; derivative aggregate scores are published under CC-BY-4.0.
#References
- [1] Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1), 1–3. DOI:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
- [2] Murphy, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology, 12, 595–600. DOI:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
- [3] Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown Publishers. ISBN 978-0804136693.
#Cite this methodology
For academic, design-partner, and journalistic citations of this methodology page, use one of the formats below. CC-BY-4.0 license — attribution required, derivatives allowed.
Direct download for reference managers (Zotero, Mendeley, EndNote, BibDesk): methodology.bib · methodology.ris — includes the foundational Brier 1950, Murphy 1973, and Tetlock 2015 entries for one-stop import.
APA 7
de Vries, P. (2026). Calibration Ledger Methodology (Version 1.1) [Web document]. Calibration Ledger. https://calibrationledger.com/methodology/

BibTeX
@misc{calibrationledger_methodology_1_1,
author = {de Vries, Paulo},
title = {{Calibration Ledger Methodology}},
version = {1.1},
year = {2026},
month = {April},
publisher = {Calibration Ledger},
url = {https://calibrationledger.com/methodology/},
note = {CC-BY-4.0; machine-readable JSON-LD twin at https://calibrationledger.com/api/methodology.json}
}

Plain text
Calibration Ledger (2026). Methodology v1.1 — calibrated accuracy scores for predictive sources. https://calibrationledger.com/methodology/. CC-BY-4.0.

Methodology version is independent of site version. The currently cited version is v1.1, last verified 2026-04-24. Earlier versions are not yet archived (this is the first published revision after v1.0). Future revisions will be tracked in /changelog/ with stable version-pinned URLs.
#Related
- /about/ — operator identity, prerequisite phase, Q3 2027 launch gate, kill criterion
- /for-agents/ — machine-readable reference for LLM crawlers, JSON twin, citation format, license
- /api/methodology.json — JSON-LD twin of this page (CC-BY-4.0)
- /changelog/ — methodology + site version history
This methodology is a draft. It will be revised before public launch based on design-partner feedback, academic review, and operator’s own calibration work on ForecastLens.
CC BY 4.0

Creative Commons Attribution 4.0 International — attribute to Calibration Ledger, link to calibrationledger.com/methodology/, indicate any changes.
Last verified: 2026-04-24