Detailed Leaderboard

Comprehensive breakdown of performance across 13 domains and temporal metrics.

NDCG@10 Performance Breakdown

Performance across 13 domains: Blockchain (Bitcoin, Cardano, Iota, Monero), Social Sciences (Economics, Law, Politics, History), Applied (Quant, Travel, Workplace, Genealogy), and STEM (HSM).
Best in bold; second best underlined.

Domain      Contriever   BGE  Inst-L  SBERT  Qwen  GritLM    E5   SFR  Rader  ReasonIR  DiVeR  Avg.
Blockchain
Bitcoin           13.3  14.4    15.7   14.3  11.4    19.1  16.3  17.6   14.9      16.3   17.4  15.5
Cardano           12.1  13.1    14.6   21.4  20.6    21.7  35.7  28.1   18.6      22.9   29.3  21.6
Iota              38.3  36.1    34.3   33.2  28.6    36.6  41.7  37.1   19.2      41.7   38.2  35.0
Monero             9.9  14.5    16.9   15.1  11.0    14.7  20.0  23.7   21.0      19.6   20.3  16.9
Social Sciences
Economics         16.3  12.6    17.5   15.3  17.1    17.2  25.0  21.9   22.7      20.0   27.8  19.4
History           26.5  27.4    28.5   28.7  25.6    27.3  28.7  32.4   25.8      34.3   34.5  28.9
Law               28.1  31.9    37.3   33.8  32.0    38.3  34.0  40.8   33.5      37.9   40.4  35.3
Politics          31.6  28.2    32.6   34.6  38.1    41.4  47.9  44.9   32.4      35.4   45.5  37.5
Applied
Genealogy         24.9  22.0    24.6   23.5  25.3    26.9  33.5  31.7   18.7      30.3   35.6  27.0
Quant             11.1  11.7    14.6   15.7  12.7    21.6  13.8  16.8   27.8      19.5   27.2  17.5
Travel            23.7  23.8    25.0   27.3  22.0    25.0  28.3  29.7   26.1      21.4   26.8  25.4
Workplace         23.9  27.2    36.2   34.6  30.3    30.8  32.9  31.6   36.6      30.0   42.6  32.4
STEM
HSM               18.9  23.2    24.4   26.1  21.3    33.4  37.7  33.5   16.9      24.7   31.0  26.5
Average           21.4  22.0    24.8   24.9  22.8    27.2  30.4  30.0   24.2      27.2   32.0  26.2
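
For readers unfamiliar with the metric, NDCG@10 can be sketched as follows. This is a minimal illustration of the standard definition with linear gain, not the benchmark's own evaluation code; the function names and the example query are hypothetical, and the assumption that table values are NDCG@10 scaled by 100 is inferred from their range.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    # Rank positions start at 1, so the discount for index i is log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents exist for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical query: two relevant documents exist, retrieved at ranks 1 and 3.
ranking = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
score = ndcg_at_k(ranking, all_relevances=[1, 1], k=10)
print(f"{100 * score:.1f}")  # 92.0
```

A leaderboard cell would then be the mean of such per-query scores over all queries in a domain.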

Temporal Evaluation Metrics

TP@10: Temporal Precision; TR@10: Temporal Relevance; TC@10: Temporal Coverage; NDCG|FC@10: NDCG conditioned on full coverage.

Model       Type       TP@10  TR@10  TC@10  NDCG|FC@10
Sparse Model
BM25        Sparse      24.0   11.1   32.8        22.5
Open-sourced Models (<1B)
BGE         Dense <1B   53.7   34.0   66.1        25.3
Contriever  Dense <1B   49.6   30.3   64.1        26.2
Inst-L      Dense <1B   53.1   33.8   66.9        25.8
SBERT       Dense <1B   54.1   35.0   67.2        26.9
Open-sourced Models (>1B)
E5          Dense >1B   53.5   31.1   63.2        33.4
GritLM      Dense >1B   53.8   35.0   69.1        27.2
Qwen        Dense >1B   46.2   28.8   61.0        29.9
Reasoning Models (>1B)
DiVeR       Reasoning   62.0   41.3   71.4        32.5
Rader       Reasoning   50.6   32.3   66.1        26.1
ReasonIR    Reasoning   57.4   38.2   72.4        28.8
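
One plausible reading of "NDCG conditioned on full coverage" is the mean NDCG@10 over only those queries whose top-10 results reach full temporal coverage. The sketch below illustrates that reading. The per-query record layout, the `ndcg` and `coverage` field names, and the threshold of 1.0 for "full" coverage are all assumptions made for illustration, not the benchmark's published definition.

```python
def ndcg_given_full_coverage(per_query, full=1.0):
    """Mean NDCG over queries whose temporal coverage reaches `full`.

    Returns None when no query qualifies, so that 'undefined' can be
    told apart from a genuine score of 0.
    """
    covered = [q["ndcg"] for q in per_query if q["coverage"] >= full]
    if not covered:
        return None
    return sum(covered) / len(covered)

# Hypothetical per-query results (values invented for illustration).
results = [
    {"ndcg": 0.40, "coverage": 1.0},
    {"ndcg": 0.10, "coverage": 0.5},  # excluded: coverage is not full
    {"ndcg": 0.20, "coverage": 1.0},
]
conditional = ndcg_given_full_coverage(results)
```

Under this reading the conditional score ignores queries with partial coverage entirely, so it can differ noticeably from the unconditional NDCG@10 of the same model.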

Dataset Insights

Temporal Reasoning Classes

TEMPO queries are classified into 9 distinct temporal reasoning tasks:

  1. Event Analysis & Localization (EAL): Pinpointing event times.
  2. Time Period Contextualization (TPC): Situating phenomena in history.
  3. Origins & Evolution Comparative (OEC): Tracking evolution over time.
  4. Trends & Cross-Period Comparison (TCP): Comparing states across eras.
  5. Event Verification & Authenticity (EVA): Verifying claims/dates.
  6. Materials & Artifacts Provenance (MAP): Dating physical items.
  7. Sources & Methods Documentation (SMD): Analyzing historical methods.
  8. Causation Analysis (CAU): Temporal cause-effect.
  9. Historical Attribution & Context (HAC): Correctly attributing eras.

Retrieval Plan Complexity

Average number of retrieval steps required per domain (proxy for difficulty).

Domain     Avg. Steps    Domain     Avg. Steps
Genealogy        3.78    HSM              3.25
History          3.42    Law              3.23
Politics         3.35    Iota             3.20
Economics        3.08    Travel           3.11
Bitcoin          2.93    Cardano          2.84
Monero           2.72    Quant            2.68
Workplace        2.42    Overall          3.28
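
The per-domain figures above can be ranked and averaged directly, as in the short sketch below. The unweighted mean of the 13 domain values comes to about 3.08, which is lower than the reported overall of 3.28. One possible explanation, stated here only as an assumption since per-domain query counts are not given, is that the overall figure is averaged per query, so that larger domains weigh more.

```python
# Average retrieval steps per domain, copied from the table above.
avg_steps = {
    "Genealogy": 3.78, "HSM": 3.25, "History": 3.42, "Law": 3.23,
    "Politics": 3.35, "Iota": 3.20, "Economics": 3.08, "Travel": 3.11,
    "Bitcoin": 2.93, "Cardano": 2.84, "Monero": 2.72, "Quant": 2.68,
    "Workplace": 2.42,
}
reported_overall = 3.28  # "Overall" entry from the table

# Rank domains from most to fewest retrieval steps.
ranked = sorted(avg_steps.items(), key=lambda item: item[1], reverse=True)

# Macro average: every domain counts equally, regardless of its size.
unweighted_mean = sum(avg_steps.values()) / len(avg_steps)

print(ranked[0][0], ranked[-1][0])  # Genealogy Workplace
print(f"{unweighted_mean:.2f}")     # 3.08
```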