Honest Ceilings
The clearest way to be trusted on analytics accuracy is to be the first one to say what we can't claim. This page is the receipt for that posture — locked numbers on what the model does, and an explicit list of where it falls short.
1. What we can claim
These numbers are computed from the calibrated walk-forward backtest of the currently-shipped model (model-backtest.json) — the exact same live source that drives /methodology and /track-record, so the three surfaces can never disagree — together with SHA256-stamped analytics reports in our public results table.
| Model version | current |
| Brier score (all-time, lower is better) | 0.58 |
| 1X2 accuracy (all-time) | 53% |
| Log-loss (all-time) | 0.00 |
| Graded fixture count | 1,520,107 |
On the subset of fixtures that overlap with FiveThirtyEight's published SPI projections we beat their Brier and log-loss. The comparison is dataset-aligned, point-by-point, and re-runnable — see the head-to-head page for the breakdown: /reports/benchmark.
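For readers who want to recompute the three headline metrics themselves, here is a minimal Python sketch. It assumes the multi-class form of the Brier score (squared error summed over home/draw/away, then averaged over fixtures), under which a uniform one-third forecast scores about 0.667. The function names, the outcome ordering, and the toy data are illustrative assumptions, not the format of model-backtest.json.

```python
import math

# Assumed 1X2 ordering for this sketch: index 0 = home, 1 = draw, 2 = away.

def brier_score(probs, actual):
    """Multi-class Brier: squared error summed over outcomes, averaged over fixtures."""
    total = 0.0
    for p, a in zip(probs, actual):
        total += sum((p[i] - (1.0 if i == a else 0.0)) ** 2 for i in range(len(p)))
    return total / len(probs)

def log_loss(probs, actual, eps=1e-15):
    """Mean negative log of the probability assigned to the realised outcome."""
    total = 0.0
    for p, a in zip(probs, actual):
        total += -math.log(max(p[a], eps))  # eps guards against log(0)
    return total / len(probs)

def accuracy(probs, actual):
    """Share of fixtures where the most probable outcome was the realised one."""
    hits = 0
    for p, a in zip(probs, actual):
        predicted = max(range(len(p)), key=lambda i: p[i])
        hits += 1 if predicted == a else 0
    return hits / len(probs)

# Toy data (made up): three fixtures with (home, draw, away) probabilities.
toy_probs = [(0.5, 0.3, 0.2), (0.2, 0.3, 0.5), (0.4, 0.35, 0.25)]
toy_actual = [0, 2, 1]

if __name__ == "__main__":
    print("Brier   :", round(brier_score(toy_probs, toy_actual), 4))
    print("Log-loss:", round(log_loss(toy_probs, toy_actual), 4))
    print("Accuracy:", round(accuracy(toy_probs, toy_actual), 4))
```

A dataset-aligned comparison then amounts to calling the same functions on two forecast sets over the identical list of fixtures.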
2. What we cannot claim
- We do not beat Pinnacle's closing line. Closing lines from a sharp, high-limit market are the single hardest benchmark in soccer probability modelling. They are informed by SPX-class statistical models plus live information from deep market liquidity (sharps, syndicates, late-breaking news, lineup leaks). We have no realistic claim of out-predicting that aggregate. Anyone who tells you their public model consistently beats Pinnacle close should be asked to lock the reports pre-kickoff with a hash.
- Accuracy in low-data leagues is notably worse. The headline accuracy is league-weighted, and our worst buckets sit several points below it.
- We don't catch upsets. True upsets are by definition unpredictable from historical patterns. The model returned ~5% on Argentina-Saudi Arabia in 2022; that is not a bug, it is what 5% means.
- We are not a betting product. No expected value, no stake-sizing, no “picks”. Probabilities only.
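Locking a report pre-kickoff with a hash can be sketched as follows. This is an illustration of the general commit-then-verify idea using SHA-256, not a description of our actual pipeline: the report fields, the canonical JSON encoding, and the function names are all assumptions made for the example.

```python
import hashlib
import json

def report_digest(report: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order cannot change the hash."""
    canonical = json.dumps(report, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_report(report: dict, published_digest: str) -> bool:
    """True only if the report is byte-for-byte what was committed before kickoff."""
    return report_digest(report) == published_digest

# Hypothetical report, written before kickoff (field names are illustrative).
locked_report = {
    "fixture": "Argentina-Saudi Arabia",
    "probabilities": {"home": 0.80, "draw": 0.15, "away": 0.05},
}
published = report_digest(locked_report)  # this digest is what gets published early

# After the match, anyone can recompute the digest from the revealed report.
tampered = {
    "fixture": "Argentina-Saudi Arabia",
    "probabilities": {"home": 0.50, "draw": 0.15, "away": 0.35},
}

if __name__ == "__main__":
    print(verify_report(locked_report, published))  # unchanged report
    print(verify_report(tampered, published))       # edited after the fact
```

Because the digest is published before the result is known, any later edit to the probabilities produces a different hash and is detectable.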
3. Known structural limitations
- Bookmaker odds are not used as model input. This is a deliberate choice (the SPX commitment). Using closing odds as a feature would inflate every backtest number while being structurally impossible to reproduce live (the closing line doesn't exist 24h before kickoff). Our reports are generated from team-level signal and locked before the market converges — that is the entire integrity story.
- In-play refresh coverage varies by league. For top-coverage competitions we issue half-time and late-phase updates. Lower divisions and some continental competitions don't carry the live data feeds required, so we publish the pre-kickoff analytics report only.
- Team strength is the unit of measurement, not player. Lineups and injury context feed in, but the model is team-level today. Finer-grained player modelling is on the roadmap.
- Some contextual signals are still maturing. Where a contextual signal is sparse or unavailable, the model relies on the rest of the stack rather than guessing. These gaps close as data coverage improves.
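The reasoning behind excluding closing odds is a leakage argument: a feature only counts if it already exists when the report is locked. A hedged sketch of that check follows. The 24-hour lead is taken from the example above rather than from a stated lock policy, and the feature names, the tuple layout, and the function are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption for this sketch: reports lock 24h before kickoff.
LOCK_LEAD = timedelta(hours=24)

def usable_features(features, kickoff):
    """Keep only features whose value was already available at lock time.

    `features` maps a name to (value, available_at), where available_at is the
    moment the value first existed.
    """
    lock_time = kickoff - LOCK_LEAD
    return {
        name: value
        for name, (value, available_at) in features.items()
        if available_at <= lock_time
    }

kickoff = datetime(2022, 11, 22, 10, 0, tzinfo=timezone.utc)
candidate_features = {
    # Team-level signal computed two days out: available at lock time.
    "team_strength_diff": (0.8, kickoff - timedelta(days=2)),
    # A closing price only exists at kickoff, so it can never pass the check.
    "closing_odds_home": (1.25, kickoff),
}

if __name__ == "__main__":
    print(usable_features(candidate_features, kickoff))
```

A backtest that applies the same filter it would face live cannot be inflated by information that did not yet exist.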
4. How we calibrate honesty
These ceilings are not vibes; they are checkable. The following live surfaces hold us to them:
- Calibration plot: /methodology renders predicted-vs-actual rates with 95% Wilson CI bars. If we say 70%, the dot for the 70% bucket should sit on the diagonal. If it doesn't, the page shows you it doesn't.
- Dated reports in the repo: every retrain writes a markdown report to docs/backtest-reports/. The headline numbers above come from the live calibrated backtest of the shipped model; the per-league worst-buckets are pulled from the newest dated report that matches the running model version. You can grep the file in the repo to verify the table.
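The Wilson interval behind those CI bars is a standard formula, sketched below in Python for a single calibration bucket. The function name and the example counts are illustrative, and z = 1.96 is the usual normal quantile for a 95% interval; the bucketing used on /methodology is not reproduced here.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

def bucket_is_calibrated(predicted, successes, n):
    """A bucket passes if its stated probability lies inside the observed interval."""
    low, high = wilson_interval(successes, n)
    return low <= predicted <= high

if __name__ == "__main__":
    # Made-up bucket: 100 fixtures forecast at 70%, of which 70 came in.
    low, high = wilson_interval(70, 100)
    print(round(low, 4), round(high, 4))
    print(bucket_is_calibrated(0.70, 70, 100))
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small buckets, which matters for the sparse ends of a calibration plot.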
Methodology: /methodology · Benchmark vs 538: /reports/benchmark.