The EU Council had amended its Iran nuclear-sanctions instruments every two to three months throughout 2025. February, April, July, September. A Claude Opus 4.6 forecasting agent in the BTF-2 dataset gave 95% odds on at least one more such amendment between October 15 and December 31. The agent, re-run, gave 95% again.
But it didn't happen. No qualifying Council act was adopted in Q4 2025. The next one was in March 2026.
What happened? The September package was a comprehensive sanctions overhaul timed to a hard external deadline in mid-October. The whole 2025 cadence had been building toward it. Once it landed, the legal trigger was discharged. There was no remaining reason for further activity before year-end.
Opus had the ingredients in its rationale (the September package by name, the deadline as the external driver, even an aside that "the massive September 29 package was comprehensive, potentially reducing the immediate need for further action") and overrode them with: "The only scenario for 'No' would be if the Council takes an unusual multi-month pause on these instruments, which contradicts established patterns."
That is exactly backwards. Opus assumed the pattern would continue without asking what had caused it in the first place, and so didn't notice that the incentive behind it was gone.
Two more from the audit:
"Will the official BCRA exchange rate depreciate at least 8% between October 15 and December 31, 2025?" The peso had moved 6.5% in five days, only 1.5 points from the threshold. Opus gave 85%, extrapolating the trajectory. One bullet in the rationale noted a midterm election six days away with the incumbent leading. Standard emerging-markets logic is that a positive election strengthens the currency. The election went the predicted way, the peso rallied roughly 10% the next day, and year-end depreciation was about 5%. Opus had treated panic-driven movement as a stable baseline.
"Will the 2025 NYC mayoral general election reach 1,300,000 total ballots?" No mayoral general had reached that since 2001, and Opus anchored on the 24-year ceiling to give 25%. The ceiling came from an era of uncompetitive races in a city whose closed primary structurally excluded ~1.4 million voters. The 2025 race had broken that pattern: an independent candidate explicitly designed to mobilize the excluded electorate, and a primary that had already matched recent general-election turnout. Actual turnout: 2.2 million.
Same skeleton each time. The agent finds a historical regularity, extrapolates, and never asks what process produced the regularity or whether it is still running. Pattern-matching is a strength. The second-order check (is the current case in the same population as the historical ones?) is the gap.
This matters anywhere someone feeds historical data to an AI agent and asks it to project forward: risk models on conditions that have structurally changed, pricing built on seasonal patterns that assumed a regulatory environment that no longer exists. Treat every base rate as the output of a process and check whether the process is still running.
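One way to make that check explicit is to store each base rate together with the process assumed to produce it, and refuse to use the number once that process has stopped. The sketch below is only an illustration: `BaseRate`, `checked_forecast`, and the 0.30 fallback are hypothetical names and values invented for the example, not anything from the BTF-2 audit or the agent's own tooling. Deciding whether the driver is still active remains a judgment call that the code does not make for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseRate:
    """A historical regularity plus the process assumed to produce it."""
    probability: float         # naive extrapolation of the pattern
    driver: str                # what is believed to generate the pattern
    driver_still_active: bool  # forecaster's judgment, not computed here


def checked_forecast(base: BaseRate, fallback: float) -> float:
    """Use the base rate only while its generating process is still running.

    `fallback` is the estimate the forecaster would give without leaning on
    the historical pattern; choosing it is outside the scope of this sketch.
    """
    for p in (base.probability, fallback):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return base.probability if base.driver_still_active else fallback


# Loosely modelled on the EU sanctions question above.
# The 0.95 comes from the text; the 0.30 fallback is made up for illustration.
eu_amendments = BaseRate(
    probability=0.95,
    driver="hard external deadline in mid-October",
    driver_still_active=False,  # the September package discharged the trigger
)
print(checked_forecast(eu_amendments, fallback=0.30))
```

The design choice is that the driver is a required field: a base rate cannot be recorded without also writing down what is supposed to be generating it, which is the question the agent skipped in all three cases.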
Also see: the paper, the BTF-2 dataset release, how agents catastrophize outcomes, when agents ignore their own math, why running agents twice improves accuracy, and FutureSearch's forecasting API.