History doesn't repeat itself as often as LLMs think

Any forecast of the future can only depend on information from the past, obviously. The "problem of induction" goes back to Hume. Generally, good forecasters (human or AI) like to anchor to historical patterns, basically asking "the last few times something like this happened, what was the outcome?".

While triaging forecast failures from Claude Opus 4.6 agents and other agents, I noticed the LLMs were sometimes weirdly faithful to historical patterns that did not fit the present situation, at least as a human would see it.

For example, one forecasting question was about EU nuclear sanctions on Iran. The EU Council had amended its Iran nuclear-sanctions instruments every two to three months throughout 2025: February, April, July, September. A Claude Opus 4.6 forecasting agent in the BTF-2 dataset gave 95% odds on at least one more such amendment between October 15 and December 31. In our second run, it gave 95% again.

But it didn't happen. No qualifying Council act was adopted in Q4 2025. (The next one was in March 2026.) That led to a very bad score: if you are well calibrated, your 95% forecasts should miss only about 1 time in 20.
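To put a number on "very bad", here is a minimal sketch of how a confident miss gets scored. I'm assuming the Brier score and log loss as the scoring rules, since they are the standard ones; this post doesn't depend on which rule is used, and the function names are just illustrative.

```python
import math


def brier_score(p_yes: float, outcome: bool) -> float:
    """Squared error between the forecast and the 0/1 outcome (lower is better)."""
    return (p_yes - (1.0 if outcome else 0.0)) ** 2


def log_loss(p_yes: float, outcome: bool) -> float:
    """Negative log of the probability given to what actually happened (lower is better)."""
    p_assigned = p_yes if outcome else 1.0 - p_yes
    return -math.log(p_assigned)


# The Iran sanctions question: 95% on "Yes", resolved "No".
confident_miss = brier_score(0.95, False)
# The same forecast if the amendment had happened.
confident_hit = brier_score(0.95, True)
# A know-nothing 50% forecast, for comparison.
coin_flip = brier_score(0.50, False)

print(f"95% and wrong: Brier {confident_miss:.4f}, log loss {log_loss(0.95, False):.2f}")
print(f"95% and right: Brier {confident_hit:.4f}")
print(f"50% and wrong: Brier {coin_flip:.4f}")
```

Under these assumed rules, the confident miss costs several times more than a coin flip would have, and it takes a long run of confident hits to make up for it.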

Figure: EU Council Iran sanctions acts spiked in September, then stopped. The agent gave 95% on continuation.

The pattern Opus identified was legit, so why didn't it apply in this case? The September package was a comprehensive sanctions overhaul timed to a hard external deadline in mid-October. The whole 2025 cadence had been building toward it. Once it landed, the legal trigger was discharged. There was no remaining reason for further activity before year-end.

Opus actually knew this. The key details were in its research: the September package by name, the deadline as the external driver, even an aside that "the massive September 29 package was comprehensive, potentially reducing the immediate need for further action". But it overrode this with: "The only scenario for 'No' would be if the Council takes an unusual multi-month pause on these instruments, which contradicts established patterns."

This felt strange, and I don't think any human would conclude it from that information.

Two more examples from the BTF-2 dataset where I spotted this "religious adherence to a historical pattern that no longer applies" from an Opus 4.6 agent:

"Will the official BCRA exchange rate depreciate at least 8% between October 15 and December 31, 2025?" The peso had moved 6.5% in five days, only 1.5 points from the threshold. Opus gave 85%, extrapolating the trajectory. One bullet in the rationale noted a midterm election six days away with the incumbent leading. Standard emerging-markets logic is that a positive election strengthens the currency. The election went the predicted way, the peso rallied roughly 10% the next day, and year-end depreciation was about 5%. Opus had treated panic-driven movement as a stable baseline.

"Will the 2025 NYC mayoral general election reach 1,300,000 total ballots?" No mayoral general had reached that since 2001, and Opus anchored on the 24-year ceiling to give 25%. The ceiling came from an era of uncompetitive races in a city whose closed primary structurally excluded ~1.4 million voters. The 2025 race had broken that pattern: an independent candidate explicitly designed to mobilize the excluded electorate, and a primary that had already matched recent general-election turnout. Actual turnout: 2.2 million.

These cases are intricate, and there are a lot of details beyond what I'm mentioning here. But I've seen it often enough, and certainly more often than the reverse error, that I can confidently say this is a failure mode of at least Opus 4.6 forecasting agents. They think history repeats itself more often than it actually does.

This might matter to you in any forecasting task, but also just in analyzing current events, even outside of politics. AI agents might find patterns in their research and anchor to them too strongly, and not think through the "inside view" of how this time may be different. I recommend keeping that in mind.

Also see: the paper, the BTF-2 dataset on Hugging Face, how agents catastrophize outcomes, when agents ignore their own math, why running agents twice improves accuracy, and FutureSearch's forecasting API.