How gullible are AIs these days? Certainly less gullible than they used to be. Yet there are still times when even the best models are gullible in ways that smart humans almost never would be.
This first example comes from a forecasting question. On October 6, 2025, Trump publicly cut off all diplomatic outreach to Venezuela. He ordered special envoy Richard Grenell to halt all engagement, as widely covered in the news.
We asked Claude Opus 4.6, with research from Oct 2025 only: would either government confirm direct bilateral contact with the other by December 31? Basically: when Trump says he's cutting off contact, will he actually? After extensive research Claude gave a 10% chance.
On November 21, forty-six days later, Trump called Maduro on the phone, delivered a 15-minute ultimatum, and both leaders later confirmed the call on the record: "the answer is yes. I wouldn't say it went well or badly. It was a phone call." So the question resolves as yes: they did talk.
Why was the forecast bad? Opus 4.6 said: "a YES resolution would require a dramatic reversal of Trump's explicit October 6 decision." Which is exactly what happened. It even cited Trump's history of abrupt reversals, but still thought that since he said he wouldn't call Maduro, he wouldn't.
I suspect most humans would not have taken Trump at his word the way Opus did here. (Remember in 2018, when Trump canceled a Singapore summit in a letter citing "tremendous anger and open hostility", then two days later the White House announced the summit was back on?) Taking politicians' statements at face value is actually a failure mode we documented in a number of cases in the BTF-2 dataset, and not just about Trump.
In fact, one of the dominant strategic-reasoning failures expert human reviewers identified was failing to see actors' stated positions as negotiating tactics. Humans can see this pattern pretty easily. It's a rare case of humans being clearly better than AIs.
Here are a few non-Trump examples. They are a bit intricate, but once you see the underlying negotiation, you'll know to take even the strongest public statements with a grain of salt.
First question: "Will ASUU escalate from its current warning strike to a full nationwide strike by December 31?" In Nigeria, the Academic Staff Union of Universities had declared a two-week warning strike, and its National President told reporters the next phase "will be total and there will be no going back." Opus made that quote Evidence #1 and gave 72%.
Yet the same press conference contained a line Opus found but did not weight: "we will meet after the expiration to decide when to begin." A week later, the union suspended the warning strike. The full strike never happened, and in December the union signed a landmark agreement with the government. This is the canonical example in the paper's appendix.
Second question: "Will Israel and Lebanon publicly announce the start of direct bilateral negotiations by December 31?" Lebanon's parliament speaker had just declared the US-led initiative "collapsed" after Israeli rejection. Opus read this as terminal and gave 3%.
But the agent's own rationale cited an article titled "Witkoff Pushes Lebanon Towards Direct Talks with Israel," describing exactly the format that materialized. Military officers from both sides were already meeting at a UNIFIL table; the minimum-friction path was upgrading that to civilians, which is what happened on December 3. So even prospectively it seems obvious the "collapsed" declaration had been a negotiating tactic. A 3% forecast is a huge miss, and this really hurt Opus's score in BTF-2.
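To see why such confident misses are so costly, here is a minimal sketch using the Brier score, a standard quadratic scoring rule for probabilistic forecasts (an illustrative assumption; the actual BTF-2 scoring metric may differ):

```python
def brier(p, outcome):
    """Quadratic score for a probability forecast.
    0.0 is perfect; 1.0 is maximally wrong."""
    return (p - outcome) ** 2

# (forecast probability, actual outcome: 1 = resolved YES, 0 = resolved NO)
cases = {
    "Venezuela contact (10%, resolved YES)":     (0.10, 1),
    "ASUU full strike (72%, resolved NO)":       (0.72, 0),
    "Israel-Lebanon talks (3%, resolved YES)":   (0.03, 1),
}

for name, (p, outcome) in cases.items():
    print(f"{name}: Brier = {brier(p, outcome):.4f}")

# An uninformed 50% forecast scores 0.25 on any binary question,
# so all three confident misses score far worse than a coin flip.
print(f"Neutral 50% baseline: Brier = {brier(0.5, 1):.4f}")
```

Because the penalty grows quadratically with distance from the outcome, the 3% forecast (Brier ≈ 0.94) is punished nearly four times as hard as simply saying 50%.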
We don't know whether these cases of gullibility transfer to the more pedestrian situations people encounter when typing into chatbots. But it is at least evidence that not all gullibility has been drilled out of Opus 4.6, at least not yet.
Also see: the paper, BTF-2 evals, run forecasts yourself with FutureSearch, LLMs can miss the motives of politicians, Measuring one way AIs lack self-awareness, Contra superhuman AI forecasting, the OpenAI case study, and the BTF-2 dataset release.