Comprehensive breakdown of performance across 13 domains and temporal metrics.
Performance across 13 domains: Blockchain (Bitcoin, Cardano, Iota, Monero), Social Sciences (Economics, Law, Politics, History), Applied (Quant, Travel, Workplace, Genealogy), and STEM (HSM). Best results are in bold; second-best results are underlined.
| Domain | Contriever | BGE | Inst-L | SBERT | Qwen | GritLM | E5 | SFR | Rader | ReasonIR | DiVeR | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Blockchain | ||||||||||||
| Bitcoin | 13.3 | 14.4 | 15.7 | 14.3 | 11.4 | 19.1 | 16.3 | 17.6 | 14.9 | 16.3 | 17.4 | 15.5 |
| Cardano | 12.1 | 13.1 | 14.6 | 21.4 | 20.6 | 21.7 | 35.7 | 28.1 | 18.6 | 22.9 | 29.3 | 21.6 |
| Iota | 38.3 | 36.1 | 34.3 | 33.2 | 28.6 | 36.6 | 41.7 | 37.1 | 19.2 | 41.7 | 38.2 | 35.0 |
| Monero | 9.9 | 14.5 | 16.9 | 15.1 | 11.0 | 14.7 | 20.0 | 23.7 | 21.0 | 19.6 | 20.3 | 16.9 |
| Social Sciences | ||||||||||||
| Economics | 16.3 | 12.6 | 17.5 | 15.3 | 17.1 | 17.2 | 25.0 | 21.9 | 22.7 | 20.0 | 27.8 | 19.4 |
| History | 26.5 | 27.4 | 28.5 | 28.7 | 25.6 | 27.3 | 28.7 | 32.4 | 25.8 | 34.3 | 34.5 | 28.9 |
| Law | 28.1 | 31.9 | 37.3 | 33.8 | 32.0 | 38.3 | 34.0 | 40.8 | 33.5 | 37.9 | 40.4 | 35.3 |
| Politics | 31.6 | 28.2 | 32.6 | 34.6 | 38.1 | 41.4 | 47.9 | 44.9 | 32.4 | 35.4 | 45.5 | 37.5 |
| Applied | ||||||||||||
| Genealogy | 24.9 | 22.0 | 24.6 | 23.5 | 25.3 | 26.9 | 33.5 | 31.7 | 18.7 | 30.3 | 35.6 | 27.0 |
| Quant | 11.1 | 11.7 | 14.6 | 15.7 | 12.7 | 21.6 | 13.8 | 16.8 | 27.8 | 19.5 | 27.2 | 17.5 |
| Travel | 23.7 | 23.8 | 25.0 | 27.3 | 22.0 | 25.0 | 28.3 | 29.7 | 26.1 | 21.4 | 26.8 | 25.4 |
| Workplace | 23.9 | 27.2 | 36.2 | 34.6 | 30.3 | 30.8 | 32.9 | 31.6 | 36.6 | 30.0 | 42.6 | 32.4 |
| STEM | ||||||||||||
| HSM | 18.9 | 23.2 | 24.4 | 26.1 | 21.3 | 33.4 | 37.7 | 33.5 | 16.9 | 24.7 | 31.0 | 26.5 |
| Average | 21.4 | 22.0 | 24.8 | 24.9 | 22.8 | 27.2 | 30.4 | 30.0 | 24.2 | 27.2 | 32.0 | 26.2 |
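The Average row appears to be an unweighted (macro) mean over the 13 domains; this is inferred from the numbers rather than stated in the text. A minimal sketch that checks this for three of the eleven models, with scores copied from the table (the names `scores` and `macro_average` are illustrative only):

```python
from statistics import mean

# Per-domain scores copied from the table, in row order:
# Bitcoin, Cardano, Iota, Monero, Economics, History, Law, Politics,
# Genealogy, Quant, Travel, Workplace, HSM.
scores = {
    "Contriever": [13.3, 12.1, 38.3, 9.9, 16.3, 26.5, 28.1, 31.6,
                   24.9, 11.1, 23.7, 23.9, 18.9],
    "E5": [16.3, 35.7, 41.7, 20.0, 25.0, 28.7, 34.0, 47.9,
           33.5, 13.8, 28.3, 32.9, 37.7],
    "DiVeR": [17.4, 29.3, 38.2, 20.3, 27.8, 34.5, 40.4, 45.5,
              35.6, 27.2, 26.8, 42.6, 31.0],
}


def macro_average(values):
    """Unweighted mean over domains, rounded to one decimal like the table."""
    return round(mean(values), 1)


averages = {model: macro_average(vals) for model, vals in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg}")
```

Under this assumption, the recomputed values match the reported Average row for these three models (21.4, 30.4, and 32.0).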
TP@10: Temporal Precision; TR@10: Temporal Relevance; TC@10: Temporal Coverage; NDCG|FC@10: NDCG conditioned on full coverage.
| Model | Type | TP@10 | TR@10 | TC@10 | NDCG\|FC@10 |
|---|---|---|---|---|---|
| Sparse Model | |||||
| BM25 | Sparse | 24.0 | 11.1 | 32.8 | 22.5 |
| Open-sourced Models (<1B) | |||||
| BGE | Dense <1B | 53.7 | 34.0 | 66.1 | 25.3 |
| Contriever | Dense <1B | 49.6 | 30.3 | 64.1 | 26.2 |
| Inst-L | Dense <1B | 53.1 | 33.8 | 66.9 | 25.8 |
| SBERT | Dense <1B | 54.1 | 35.0 | 67.2 | 26.9 |
| Open-sourced Models (>1B) | |||||
| E5 | Dense >1B | 53.5 | 31.1 | 63.2 | 33.4 |
| GritLM | Dense >1B | 53.8 | 35.0 | 69.1 | 27.2 |
| Qwen | Dense >1B | 46.2 | 28.8 | 61.0 | 29.9 |
| Reasoning Models (>1B) | |||||
| DiVeR | Reasoning | 62.0 | 41.3 | 71.4 | 32.5 |
| Rader | Reasoning | 50.6 | 32.3 | 66.1 | 26.1 |
| ReasonIR | Reasoning | 57.4 | 38.2 | 72.4 | 28.8 |
TEMPO queries are classified into 9 distinct temporal reasoning tasks.
Average number of retrieval steps required per domain (proxy for difficulty).
| Domain | Avg. Steps | Domain | Avg. Steps |
|---|---|---|---|
| Genealogy | 3.78 | HSM | 3.25 |
| History | 3.42 | Law | 3.23 |
| Politics | 3.35 | Iota | 3.20 |
| Economics | 3.08 | Travel | 3.11 |
| Bitcoin | 2.93 | Cardano | 2.84 |
| Monero | 2.72 | Quant | 2.68 |
| Workplace | 2.42 | Overall | 3.28 |
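As a quick arithmetic check on the table above, the sketch below compares the unweighted mean of the 13 per-domain values with the reported Overall figure. The unweighted mean comes out lower than 3.28, which suggests the Overall value is averaged over queries rather than over domains; that reading is an assumption, since the per-domain query counts are not given here. The variable names are illustrative only.

```python
# Average retrieval steps per domain, copied from the table.
steps = {
    "Genealogy": 3.78, "HSM": 3.25, "History": 3.42, "Law": 3.23,
    "Politics": 3.35, "Iota": 3.20, "Economics": 3.08, "Travel": 3.11,
    "Bitcoin": 2.93, "Cardano": 2.84, "Monero": 2.72, "Quant": 2.68,
    "Workplace": 2.42,
}
reported_overall = 3.28  # "Overall" entry from the table

# Unweighted (per-domain) mean, for comparison with the reported value.
unweighted_mean = sum(steps.values()) / len(steps)

# Domains with the most and fewest average steps (the difficulty proxy).
hardest = max(steps, key=steps.get)
easiest = min(steps, key=steps.get)

print(f"unweighted mean: {unweighted_mean:.2f}")
print(f"reported overall: {reported_overall:.2f}")
print(f"most steps: {hardest}, fewest steps: {easiest}")
```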