Today we are releasing the dataset behind BTF-2, a benchmark for evaluating how well AI agents can forecast real-world events. The dataset contains 1,417 binary forecasting questions covering Oct - Dec 2025, each with detailed resolution criteria, background context, pages of relevant research, a state-of-the-art forecast, and what actually happened. The benchmark leaderboard is live at evals.futuresearch.ai, and the dataset CSV is linked from the Datasets section on that page (direct download).
The accompanying paper, Evaluating Strategic Reasoning in Forecasting Agents, describes the methodology and our findings on how frontier AI agents reason about uncertain future events.
As far as we know, this is the first large-scale dataset of forecasting questions that includes (a) enough research to forecast each question without using the web, and (b) what a high-quality forecast looks like.
The BTF-2 Dataset
Each of the 1,417 rows is a question about the future that was written before the outcome was known, and the research is based on offline web pages scraped ahead of time, so no information from after the cutoff can leak into the research.
The questions are binary yes/no questions, focused primarily on geopolitics, regulation, and economics, with the remainder drawn from topical news in late 2025. Some examples: whether Brazil's Câmara dos Deputados would pass a circular-economy bill before COP30, whether the UN Security Council would adopt a Gaza ceasefire resolution without a US veto, whether Pakistan's 2025 polio cases would reach 35, whether "Only Murders in the Building" would receive four Golden Globe nominations.
Our previous Jan 2026 paper presented evidence that these questions are significantly more challenging than most forecasting questions. Very few of them can be adequately forecast by just extrapolating a historical trend. They require weighing conflicting evidence, modeling institutional behavior, and reasoning about uncertainty: the kind of analytical judgment calls that intelligence analysts and policy researchers make.
That paper and the April 2026 paper show empirically that there are enough questions, and that they are hard enough, for the benchmark to statistically distinguish between excellent forecasters.
The twelve columns per row:
- question and question_id: the question text and a unique identifier.
- resolution_criteria: the full specification of what counts as YES and NO, including sources, time windows, and edge-case handling. These typically run several hundred words.
- background: context a forecaster needs that is not part of the resolution rule.
- research_summary: the full research dossier gathered by the state-of-the-art forecasting agent during its research window. These average roughly 12,000 characters per question, about 8 to 10 pages of analytical reading. 1,412 of the 1,417 rows have substantive research summaries.
- date_cutoff_start and date_cutoff_end: the window during which the agent was allowed to search the web.
- present_date: the simulated date the agent believed it was operating on.
- resolution: ground truth. 1.0 for YES, 0.0 for NO. Of the 1,417 questions, 385 resolved YES (27.3%) and 1,027 resolved NO.
- resolution_explanation: a written explanation of why the question resolved the way it did, with citations.
- sota_forecast_probability: the state-of-the-art agent's forecast, from 0 to 100.
- sota_summary_rationale: the agent's written reasoning, including strongest arguments for and against, key uncertainties, and the final probability.
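If you want to sanity-check a downloaded copy, a minimal pandas sketch like the one below will do; the filename is a placeholder for whatever the direct download gives you, and the column names follow the list above.

```python
import pandas as pd

# Placeholder filename; point this at the CSV downloaded from the Datasets section.
df = pd.read_csv("btf2_dataset.csv")

expected_columns = [
    "question", "question_id", "resolution_criteria", "background",
    "research_summary", "date_cutoff_start", "date_cutoff_end",
    "present_date", "resolution", "resolution_explanation",
    "sota_forecast_probability", "sota_summary_rationale",
]
missing = [c for c in expected_columns if c not in df.columns]
print(f"{len(df)} rows; missing columns: {missing or 'none'}")

# Ground-truth distribution: should show 385 YES (1.0) and 1,027 NO (0.0).
print(df["resolution"].value_counts(dropna=False))
```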
The key design challenge in any forecasting benchmark is preventing data contamination. If an agent can search the web for information published after the outcome is known, the benchmark measures retrieval, not reasoning. We solve this with RetroSearch, a frozen web-search system. Each question has a defined research window (the date_cutoff_start to date_cutoff_end columns). During evaluation, agents search and read pages through RetroSearch, which returns only content that existed before the cutoff date. The agent cannot see anything published after the window closes. This makes the benchmark reproducible: any new agent evaluated on these questions faces the same information environment the original agents faced. The full methodology is in Section 2 of the paper.
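For intuition only, here is a toy sketch of the contract a frozen-search setup enforces, not RetroSearch's actual implementation: given pre-scraped pages with known publication dates, a query can only ever return pages from before the question's cutoff.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ScrapedPage:
    url: str
    published: date  # when the page appeared on the live web
    text: str

def searchable_pages(corpus: list[ScrapedPage], cutoff_end: date) -> list[ScrapedPage]:
    """Return only pages that existed before the question's research window closed."""
    return [p for p in corpus if p.published <= cutoff_end]
```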
Please email us at evals@futuresearch.ai to get access to RetroSearch. You can also run forecasting approaches immediately without it, since each question ships with detailed research gathered by high-quality research agents at the time the question was asked; the dataset is self-contained.
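As a concrete illustration of that self-containedness, here is one way to assemble an offline forecasting prompt from a single row; the prompt wording is ours, not the SOTA agent's, and you would send the result to whatever model you are evaluating.

```python
def build_offline_prompt(row) -> str:
    """Build a forecasting prompt from one dataset row (e.g. a pandas Series).

    Uses only the bundled columns, so no live web access is required.
    """
    return (
        f"Today is {row['present_date']}. You are forecasting a binary question.\n\n"
        f"Question: {row['question']}\n\n"
        f"Resolution criteria:\n{row['resolution_criteria']}\n\n"
        f"Background:\n{row['background']}\n\n"
        f"Research dossier (gathered before the cutoff):\n{row['research_summary']}\n\n"
        "Explain your reasoning, then give a final probability of YES between 0 and 100."
    )

# Example: prompt = build_offline_prompt(df.iloc[0]), then pass it to your model of choice.
```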
Findings
We are writing up the findings from BTF-2 as a series of blog and research posts:
- Measuring one way AIs lack self-awareness. The SOTA forecaster's rationales include pre-mortem analysis, alternative perspectives, and wildcards reasoning four to nine times more often than frontier agents do.
- Opus does better research, Gemini has better judgment. When agents do their own research Opus 4.6 wins; when handed the same pre-gathered evidence Gemini 3.1 Pro wins. Implications for RAG and multi-agent pipelines.
- Run agents twice for fun and profit. The single simplest ingredient in the SOTA forecaster is running the same agent more than once and averaging the forecasts (a minimal sketch follows this list).
- AI takes people at their word. Frontier agents read rhetoric as commitment. Public statements get treated as the actor's settled position, not as moves in a strategic sequence.
- Claude can miss the motives of politicians. Agents catalog facts accurately but fail to ask "why now?" The forcing function (a host-country deadline, a statute's effective date) is visible in the record but never connected to the actor's incentive to move.
- Agents sometimes catastrophize. When a question covers a spectrum of severities, agents model the dramatic end and forget the question resolves on the minimum-sufficient version.
- History doesn't repeat itself as often as LLMs think. Agents extrapolate historical base rates without checking whether the process producing them is still active.
- Some rare examples of AIs being underconfident. The rarer pattern where an agent computes the right answer, names the right pathway, and then assigns a probability that contradicts its own analysis.
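On the "run agents twice" point, the ensemble step really is this small; the numbers below are made up for illustration, and mean pooling in probability space is only one of several reasonable pooling rules.

```python
import statistics

def pooled_forecast(run_probabilities: list[float]) -> float:
    """Average several independent runs' probabilities (0-100 scale) into one forecast."""
    return statistics.mean(run_probabilities)

# Three hypothetical runs of the same agent on the same question:
print(pooled_forecast([22.0, 35.0, 28.0]))  # 28.33...
```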
The case-study evidence sits in the dataset itself. The research summaries and rationales for the questions where agents were most wrong are detailed enough to trace exactly where reasoning broke down.
The questions are hard. The mean SOTA forecast is 29.2%, and the base rate is 27.3% YES. Many questions sit in the 20% to 40% range where calibration matters most and where the difference between a good and a mediocre forecast is largest. They span domestic US politics, international diplomacy, trade regulation, monetary policy, public health, entertainment, sports, climate, and technology, across dozens of countries. No single domain of expertise is sufficient.
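Assuming sota_forecast_probability is on the 0 to 100 scale described above, the headline numbers and a Brier score comparison against the hindsight base rate take a few lines (continuing from the loading sketch earlier):

```python
p = df["sota_forecast_probability"] / 100.0  # convert 0-100 scale to a probability
y = df["resolution"]

print(f"Mean SOTA forecast: {p.mean():.1%}")  # reported above as 29.2%
print(f"YES base rate:      {y.mean():.1%}")  # reported above as 27.3%
print(f"SOTA Brier score:          {((p - y) ** 2).mean():.4f}")
print(f"Constant base-rate Brier:  {((y - y.mean()) ** 2).mean():.4f}")
```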
The dataset is self-contained. Each row includes the question, resolution criteria, ground truth, the SOTA agent's research and reasoning, and a resolution explanation. No external data is required to understand what was asked, what the agent concluded, or what actually happened.
The paper describes the full methodology, benchmark construction, and findings. The leaderboard stays current as new agents are evaluated. The dataset is available for download.
Also see: FutureSearch's forecasting API uses the same state-of-the-art agent evaluated in this benchmark. Our earlier work on forecasting benchmarks includes Bench to the Future (Wildman et al., 2025) and Automating Forecasting Question Generation (Bosse et al., 2026).