Some rare examples of AIs being underconfident

Excessive caution shows up as a failure mode in frontier forecasting agents

One of the questions in the BTF-2 benchmark asked a Claude Opus 4.6 agent, in October 2025, whether turnout in the 2025 NYC mayoral general election would exceed 1.3 million total ballots. Claude found good evidence and did the math correctly. The primary had already drawn 1.1 million ballots. The historical primary-to-general ratio was 1.22. That gives roughly 1.34 million general-election ballots, above the threshold.
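
A minimal sketch of that arithmetic (the figures are the rounded ones from the rationale; the variable names are ours, not the agent's):

```python
# Rough version of the calculation Opus wrote in its rationale.
primary_ballots = 1_100_000       # 2025 NYC mayoral primary turnout
primary_to_general_ratio = 1.22   # historical primary-to-general ratio
threshold = 1_300_000             # resolution threshold for the question

estimated_general = primary_ballots * primary_to_general_ratio
print(f"estimated general-election ballots: {estimated_general:,.0f}")  # ~1,342,000
print("clears the threshold:", estimated_general > threshold)           # True
```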

Opus wrote that calculation in its rationale, looked at it, called it "unstable across cycles," and gave a final forecast of 25%. The actual general-election turnout was over 2.0 million ballots, clearing the threshold by more than 1.5x. So why did Opus get such a bad score on this question? Why does it sometimes do good research but reach bad conclusions?
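
To make "bad score" concrete: we don't reproduce BTF-2's exact scoring rule here, but under a simple Brier-style rule (squared error between the forecast probability and the 0/1 outcome), landing on the wrong side of 50% is expensive. The 0.75 forecast below is just an illustrative number consistent with the agent's own arithmetic:

```python
def brier(p: float, outcome: int) -> float:
    """Squared error between a forecast probability and the realized 0/1 outcome (lower is better)."""
    return (p - outcome) ** 2

# The question resolved YES (turnout cleared 1.3 million), so outcome = 1.
print(brier(0.25, 1))  # 0.5625 -- worse than a know-nothing 50% forecast (0.25)
print(brier(0.75, 1))  # 0.0625 -- an illustrative forecast matching the agent's own math
```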

The agent computed the right answer and then walked away from it

This example came from expert human forecasters auditing 130 of an Opus 4.6 agent's worst calls on BTF-2, analyzed further in our paper. Most of the time, the worry with AI forecasting is overconfidence, especially given the old forecasting adage that "things don't happen".

We were surprised to see errors in the opposite direction, specifically in Opus 4.6, though we couldn't rule out the same pattern in GPT-5.4 and Gemini-3.1-pro agents. The agent does good research, derives the right pathway, names the correct precedent, and then assigns a probability that contradicts its own analysis, apparently because it judges its conclusion too extreme.

A few other examples from the audit:

Opus was asked whether the UNSC would adopt a ceasefire resolution without a US veto by December 31. It named the right pathway in its rationale ("US sponsors a resolution endorsing its own peace plan, like 2735"), cited the 2735 precedent, and noted Russia's public support for Trump's 20-point plan. Then it gave an 8% chance. On November 17, Resolution 2803 was adopted 13-0-2 through exactly that pathway.

On the Argentine peso, Opus gave an 85% chance that the BCRA rate would depreciate at least 8% by year-end. In its forecast, the depreciation case got seven sourced paragraphs; the election-reversal case, what actually happened, got one bullet, even though the midterm election was eleven days away with Milei leading. LLA won on October 26, the peso rallied roughly 10% the next day, and the year-end depreciation came in around 5%, well short of the threshold.

On US-Venezuela talks, Opus was asked whether either government would confirm direct bilateral contact by December 31. It found the October 6 diplomatic cutoff, named the reversal pathway, even acknowledged that "escalation itself creates pressure for diplomatic engagement," then gave a 10% chance. Trump called Maduro on November 21. (More on this in how AI takes stated positions as durable commitments.)

In each case, the research was good, but the final probability suggests that Claude didn't really believe its own research over its priors.

Maybe this is working as intended? It could be a safety feature, ensuring that Claude doesn't go off the rails when the evidence it gathers is surprising or unusual. But it makes Claude a much worse forecaster, and a worse advisor when you ask about decisions and scenarios you might face.

This is worth watching for in any setting where the analysis and the bottom-line conclusion diverge: when they disagree, the analysis is sometimes better than the conclusion. (Which is annoying, because it's tempting to skip the lengthy analysis and read only the conclusion.)

Also see: the paper, the BTF-2 leaderboard, the BTF-2 dataset release, the forecasting API, AI takes people at their word, agents sometimes catastrophize, run agents twice, and the effort paradox.