TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval

Abdelrahman Abdallah1, Mohammed Ali1, Muhammad Abdul-Mageed2, Adam Jatowt1
1University of Innsbruck, 2University of British Columbia

Why a new benchmark?

Existing temporal QA benchmarks focus on fact-seeking queries over news corpora, while reasoning-intensive retrieval benchmarks lack temporal grounding. Real-world information needs often require reasoning about temporal evolution and synthesizing evidence across time periods. We introduce TEMPO to benchmark retrieval under these challenging, realistic conditions.

TEMPO

TEMPO is the first benchmark combining temporal reasoning with reasoning-intensive retrieval across 13 domains. It features 1,730 complex queries, 3,976 step-wise retrieval plans mapped to gold documents, and novel temporal metrics including Temporal Coverage@k and Temporal Precision@k.

[Figure: TEMPO overview diagram]

Leaderboard submission

If you would like to submit your results to the leaderboard, please open a pull request or issue on our GitHub repository.

Have Questions?

Contact the authors or open an issue on GitHub.

Leaderboard

Simplified leaderboard showing Average NDCG@10, Temporal Precision (TP), and Temporal Coverage (TC).

Rank  Model       Type        Avg NDCG@10  TP@10  TC@10
🥇 1   DiVeR       Reasoning   32.0         62.0   71.4
🥈 2   E5          Dense >1B   30.4         53.5   63.2
🥉 3   SFR         Dense >1B   30.0         -      -
 4    GritLM      Dense >1B   27.2         53.8   69.1
 5    ReasonIR    Reasoning   27.2         57.4   72.4
 6    SBERT       Dense <1B   24.9         54.1   67.2
 7    Inst-L      Dense <1B   24.8         53.1   66.9
 8    Rader       Reasoning   24.2         50.6   66.1
 9    Qwen        Dense >1B   22.8         46.2   61.0
10    BGE         Dense <1B   22.0         53.7   66.1
11    Contriever  Dense <1B   21.4         49.6   64.1
12    BM25        Sparse      10.8         24.0   32.8

TEMPO Overview

What Makes TEMPO Different

Temporal Reasoning

Queries require reasoning about temporal evolution, cross-period comparison, and temporal dependency resolution, with evidence spanning periods from pre-1900 to 2020+.

Step-Wise Planning

Each query includes a retrieval plan with step-specific gold documents for multi-hop evaluation across time periods.

Temporal Metrics

New metrics measure temporal relevance and period coverage, so systems are rewarded only when they retrieve evidence aligned with the required time periods.

Domains

Blockchain (Bitcoin, Cardano, IOTA, Monero), Social Sciences (Economics, Law, Politics, History), Applied (Quantitative Finance, Travel, Workplace, Genealogy), and STEM (History of Science and Mathematics).

Tasks

Task 1: Query to Documents

Retrieve temporally relevant documents for each query, ensuring evidence aligns with required time periods and supports cross-period analysis.
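
To make the input/output concrete, here is a minimal sketch of what a Task 1 instance could look like. The field names and the example query are hypothetical illustrations, not TEMPO's actual schema.

```python
# Hypothetical Task 1 record -- field names and values are illustrative,
# not TEMPO's actual dataset schema.
task1_example = {
    "query": "How did Bitcoin transaction fees evolve between 2013 and 2017?",
    "required_periods": ["2013-2014", "2015-2017"],  # evidence must span both
    "gold_docs": ["doc_0412", "doc_0897"],           # temporally relevant evidence
}
```

A system is then scored on whether its top-k results include the gold documents and cover all required periods.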

Task 2: Query to Step to Documents

Multi-step retrieval where each step targets a temporal slice or reasoning sub-goal, with gold documents mapped to each step.
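
A Task 2 instance additionally decomposes the query into a plan, with gold documents mapped to each step. As before, this is a hedged illustration rather than the dataset's real schema.

```python
# Hypothetical Task 2 record: a step-wise retrieval plan where each step
# targets a temporal slice or reasoning sub-goal (illustrative schema only).
task2_example = {
    "query": "How did Bitcoin transaction fees evolve between 2013 and 2017?",
    "plan": [
        {"step": 1, "sub_goal": "Establish the 2013-2014 fee baseline",
         "period": "2013-2014", "gold_docs": ["doc_0412"]},
        {"step": 2, "sub_goal": "Trace fee growth during 2015-2017",
         "period": "2015-2017", "gold_docs": ["doc_0897"]},
    ],
}
```

Because gold documents are attached per step, retrieval can be evaluated at each hop rather than only on the final result list.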

Temporal Evaluation Metrics

Traditional IR metrics miss temporal grounding: a run can score well on NDCG while drawing evidence from the wrong time periods. TEMPO therefore introduces four temporal metrics, computed with LLM-as-judge labels for temporal relevance and period coverage; a hedged implementation sketch follows the definitions below.

TP@k (Temporal Precision)

Position-weighted precision of temporally relevant documents.

TR@k (Temporal Relevance)

Fraction of top-k results judged temporally relevant.

TC@k (Temporal Coverage)

Coverage of required time periods in top-k results.

NDCG|FC@k (NDCG under Full Coverage)

NDCG@k computed only for queries where full temporal coverage is achieved.
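
The descriptions above fix the intent of each metric but not every implementation detail. The sketch below is one plausible realization under explicit assumptions (a log2 rank discount for TP@k, and per-query conditioning on full coverage for NDCG|FC@k); it is not the official evaluation script.

```python
# One plausible implementation of TEMPO's temporal metrics, based on the
# textual definitions above. Weighting choices marked below are assumptions.
import math
from typing import Optional, Sequence, Set

def tr_at_k(rel: Sequence[bool], k: int) -> float:
    """TR@k: fraction of the top-k results judged temporally relevant.
    Missing ranks (fewer than k results) count as non-relevant."""
    return sum(rel[:k]) / k

def tp_at_k(rel: Sequence[bool], k: int) -> float:
    """TP@k: position-weighted precision of temporally relevant documents.
    Assumption: a log2 rank discount, normalized by the ideal ranking."""
    gain = sum(r / math.log2(i + 2) for i, r in enumerate(rel[:k]))
    ideal = sum(1 / math.log2(i + 2) for i in range(k))
    return gain / ideal

def tc_at_k(doc_periods: Sequence[Set[str]], required: Set[str], k: int) -> float:
    """TC@k: share of the required time periods covered by the top-k results."""
    covered = set().union(*doc_periods[:k])
    return len(covered & required) / len(required)

def ndcg_fc_at_k(ndcg: Sequence[float], tc: Sequence[float]) -> Optional[float]:
    """NDCG|FC@k: mean NDCG@k over queries whose top-k results achieve
    full temporal coverage (TC@k == 1.0); None if no query qualifies."""
    full = [n for n, c in zip(ndcg, tc) if c == 1.0]
    return sum(full) / len(full) if full else None

# Toy usage: three results, two required periods.
rel = [True, False, True]                       # LLM-as-judge temporal relevance
periods = [{"2013-2014"}, set(), {"2015-2017"}]
print(tr_at_k(rel, 3))                                  # 2/3 ≈ 0.667
print(tp_at_k(rel, 3))                                  # (1 + 0.5) / 2.13 ≈ 0.70
print(tc_at_k(periods, {"2013-2014", "2015-2017"}, 3))  # 1.0
```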

Results Snapshot

Best Overall

32.0 NDCG@10

DiVeR (avg. across 13 domains)

Temporal Coverage

71.4% TC@10

DiVeR, measured over the top-10 results

Dataset Scale

1.65M Docs

Across 13 domains

Reasoning-enhanced retrievers lead performance, but results remain far from saturated. Dense retrievers improve over BM25, yet cross-period queries remain challenging, highlighting the need for temporal grounding in retrieval and RAG systems.

Resources

Code, data, and evaluation scripts are available through our GitHub repository.