The newly released BTF-2 benchmark shows that some agents are much better forecasters than others. What accounts for the difference? Here, I want to focus on what is sometimes called "unknown unknowns": an agent's ability to deal with its own uncertainty. This is pretty easy to spot in forecasting tasks, because it's impossible to have all the information you'd want.
The best forecasting agent we studied tends to run a "pre-mortem" before it commits to a number, imagining that the forecast turned out wrong and asking why. It considers alternative perspectives, asking how someone with different priors would read the same evidence. And it names wildcards: the things that could surprise it. When Tom Liptay and I scored every rationale on the BTF-2 benchmark against the dimensions human superforecasters use, those three dimensions stood out as the biggest reasoning-style gap we measured.
The frontier agents, Claude Opus 4.6, GPT-5.4, and Google Gemini 3.1 Pro, rarely reason explicitly in this way, and their forecasting performance is much worse. (We don't know if this is causal.)
This post is the first in a series on findings from Evaluating Strategic Reasoning in Forecasting Agents (Liptay, Schwarz, Poyiadzi, Wildman, and Bosse, 2026). The dataset release is already live.
We used Tetlock's CHAMPS KNOW framework (Tetlock and Gardner, 2015), a set of 10 dimensions developed in the Good Judgment Project that capture how the best human forecasters think. A Gemini 3.1 Pro agent ranked each dimension's prominence in every forecast rationale, across 1,367 questions. The resulting Table 7 in the paper shows that the three dimensions related to "unknown unknowns" are the main difference between the best forecaster and the rest.
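To make that grading pass concrete, here is a minimal sketch of how an LLM grader can rank dimension prominence per rationale. Everything in it is illustrative: `call_llm`, the prompt wording, and the JSON output format are my own placeholders, not the paper's actual pipeline, and only the three dimensions discussed in this post are listed.

```python
import json

# The three dimensions discussed in this post; the full CHAMPS KNOW
# framework has ten (Tetlock and Gardner, 2015).
DIMENSIONS = ["Pre/Post-mortem", "Other Perspectives", "Wildcards"]  # + 7 more

def call_llm(prompt: str) -> str:
    """Placeholder for the grader-model client (the paper used Gemini 3.1 Pro)."""
    raise NotImplementedError("swap in a real LLM client")

def rank_dimensions(rationale: str) -> list[str]:
    """Ask the grader to rank dimensions by prominence in one rationale."""
    prompt = (
        "Rank these reasoning dimensions by how prominently each appears "
        f"in the forecast rationale below:\n{json.dumps(DIMENSIONS)}\n\n"
        f"Rationale:\n{rationale}\n\n"
        "Reply with a JSON list of dimension names, most prominent first."
    )
    return json.loads(call_llm(prompt))

def top3_frequency(rationales: list[str], dimension: str) -> float:
    """Share of rationales in which `dimension` lands in the grader's top 3."""
    hits = sum(dimension in rank_dimensions(r)[:3] for r in rationales)
    return hits / len(rationales)
```

The "Top-3 frequency" numbers quoted below are this statistic: how often a dimension ranks among the three most prominent in an agent's rationales.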
As we wrote in the paper: "All three are epistemic. The primary gap is awareness of uncertainty and the limits of one's knowledge."
Unknown unknowns
Three of the ten CHAMPS KNOW dimensions are about epistemic self-awareness, and the SOTA agent's rationales on BTF-2 address all three explicitly:
- Pre/Post-mortem: the agent enumerates ways its own forecast could be wrong. For example, on the question of whether Congress would enact a federal CR with an expiration after November 21, the SOTA rationale (forecast 84%, resolved YES) included an explicit "Strongest Arguments for No" list naming three concrete pathways to its own failure, the first being "A historic, multi-month shutdown: ... If no compromise is reached, the shutdown could theoretically persist continuously through December 31, meaning no CR is enacted at all." Top-3 frequency: SOTA 37.8%, Opus 4.6 9.5%, GPT-5.4 6.8%, Gemini 3.1 Pro 4.3%.
- Other Perspectives: the agent shows how different priors would read the same evidence. The SOTA forecaster does this structurally, with "Strongest Arguments for Yes" and "Strongest Arguments for No" sections in its rationales, which you can see for yourself in the BTF-2 dataset linked in the announcement. Top-3 frequency: SOTA 20.3%, Opus 5.1%, GPT-5.4 1.6%, Gemini 1.7%.
- Wildcards: the agent names events outside any trend line. From the same CR question: "The administration's willingness to tolerate a prolonged shutdown to reshape the federal bureaucracy is a major wildcard." From a question on the WTA year-end #1 ranking: "the highly improbable scenario where Świątek suddenly reverses her break, takes a last-minute wildcard into a remaining WTA 250 or 500 event, wins it, and goes undefeated at the WTA Finals, while Sabalenka fails to win a single match." Top-3 frequency: SOTA 2.9%, Opus 0.7%, GPT-5.4 0.3%, Gemini 0.7%.
Summing those three dimensions shows a large gap: the SOTA forecaster's rationales come in at 61%, the Claude Opus 4.6 agent at 15%, the GPT-5.4 agent at 9%, and the Gemini 3.1 Pro agent at 7%.
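Those totals are just the per-dimension numbers above added up (with rounding); a quick check:

```python
# Combined top-3 frequencies, using the per-dimension figures quoted above,
# in the order: Pre/Post-mortem, Other Perspectives, Wildcards.
top3 = {
    "SOTA":            [37.8, 20.3, 2.9],
    "Claude Opus 4.6": [9.5, 5.1, 0.7],
    "GPT-5.4":         [6.8, 1.6, 0.3],
    "Gemini 3.1 Pro":  [4.3, 1.7, 0.7],
}
for agent, freqs in top3.items():
    print(f"{agent}: {sum(freqs):.1f}%")
# SOTA: 61.0%, Claude Opus 4.6: 15.3%, GPT-5.4: 8.7%, Gemini 3.1 Pro: 6.7%
```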
Also see the full paper, the BTF-2 benchmark and leaderboard, the BTF-2 dataset release, and FutureSearch's forecasting API.