
Opus 4.6 does better research, Gemini 3.1 has better judgment

Using forecasting to find differing agent strengths in modeling the world

When new models are released, they tend to be better at every benchmark. People have intuitions for "ChatGPT is better at X" and "Claude is better at Y" but it's hard to show this empirically.

We have some new evidence from forecasting that Gemini 3.1 Pro is actually better than Claude Opus 4.6 at something. (We don't have data on Opus 4.7 yet, but I don't think it's significantly better on this dimension.) In brief: if the task needs research, e.g. the agent has to use the web, Opus 4.6 is strongly better than Gemini 3.1. But if you already have all the evidence you need and no further research is required, Gemini 3.1 Pro might be better.

This fits my intuitions: I've seen Claude Opus 4.6 work very hard on my queries while Gemini 3.1 Pro quickly returns a low-quality answer based on little evidence. But on to the experimental evidence:

We found that a Claude Opus 4.6 agent is the most accurate single-agent forecaster on BTF-2, a benchmark of 1,417 hard forecasting questions about real-world events. We then ran an experiment where we used it as a plain LLM rather than an agent, giving it the forecasting research up-front. There it falls behind Google Gemini 3.1 Pro, even though Claude trounced Gemini in agentic forecasting (i.e. just type your question in and let the agent do the research).

When agents do their own research, Opus wins. When given the same fixed evidence, Gemini wins.
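
To make the two conditions concrete, here is a minimal sketch of the evaluation setup. The model names and both helper functions are hypothetical placeholders standing in for the actual harness, not how it was really implemented:

```python
# Illustrative sketch only: the model list and both helpers below are
# hypothetical placeholders, not the actual evaluation harness.
MODELS = ["claude-opus-4.6", "gemini-3.1-pro", "gpt-5.4", "grok-4.20"]

def run_agentic_forecast(model: str, question: str) -> float:
    """Placeholder: let the model search the web itself, then return P(YES)."""
    raise NotImplementedError

def forecast_from_dossier(model: str, question: str, dossier: str) -> float:
    """Placeholder: prompt the model with a fixed research dossier, return P(YES)."""
    raise NotImplementedError

def collect_forecasts(questions, dossiers):
    # For each model, get one probability per question under each condition.
    forecasts = {}
    for model in MODELS:
        for q in questions:
            forecasts[(model, q["id"], "agentic")] = run_agentic_forecast(model, q["text"])
            forecasts[(model, q["id"], "fixed")] = forecast_from_dossier(
                model, q["text"], dossiers[q["id"]]
            )
    return forecasts
```

The key design point is that the "fixed" condition hands every model the identical dossier, so any remaining accuracy gap reflects judgment over the evidence rather than skill at gathering it.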

Here are the numbers, from the paper. When frontier models do their own web research on BTF-2 (searching, reading pages, deciding what to dig into), Opus 4.6 leads with a 0.131 Brier score, followed by Gemini 3.1 Pro at 0.143, GPT-5.4 at 0.151, and Grok 4.20 at 0.165.
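
For readers unfamiliar with Brier scores: a Brier score is the mean squared difference between a probabilistic forecast and the binary outcome (1 if the question resolves YES, 0 if NO), so lower is better and always guessing 50% scores 0.25. A tiny illustration in Python, with made-up numbers rather than benchmark data:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 resolutions (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Toy example, not benchmark data: three questions, two resolve YES, one NO.
forecasts = [0.80, 0.65, 0.10]
outcomes  = [1,    1,    0]
print(brier_score(forecasts, outcomes))  # (0.2**2 + 0.35**2 + 0.1**2) / 3 = 0.0575
```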

Then, we gave each of those same four models the same pre-gathered research summaries (compiled using the methodology from Bosse et al., 2026) and asked them to forecast from that fixed evidence. The ranking flipped. Gemini 3.1 Pro reached 0.141, beating Opus 4.6 at 0.153. GPT-5.4 landed at 0.156 and Grok 4.20 at 0.163.

In both cases the difference between Claude Opus 4.6 and Gemini 3.1 Pro was large and statistically significant. When the agent does its own research, Opus wins by 0.012 Brier; when both read the same dossier, Gemini wins by 0.012 Brier. GPT-5.4 and Grok barely moved between conditions, getting slightly worse and slightly better respectively, neither by a meaningful amount.
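
The significance figures are from the paper; as an illustration of how a gap like this can be tested, here is a minimal paired bootstrap over per-question Brier differences. This is an assumed approach for the sketch, not necessarily the paper's exact test:

```python
import random

def paired_bootstrap_ci(brier_a, brier_b, n_boot=10_000, seed=0):
    """95% CI on the mean per-question Brier difference (model A minus model B).

    brier_a, brier_b: per-question Brier scores for the two models on the
    same questions, in the same order. If the interval excludes 0, the gap
    is unlikely to be a resampling artifact.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(brier_a, brier_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]
```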

This is, to our knowledge, the first direct evaluation of frontier models that distinguishes skill in "research" (searching for and finding the right information) from skill in "judgment" (interpreting the evidence correctly).

Forecasting tasks provide "calibration" and "refinement" scores, in addition to accuracy, which corroborate this. Opus 4.6's calibration and refinement both got much worse without the ability to search, which suggests its edge comes from the research itself: choosing what to search for, deciding which pages to read, and pulling out the details that matter. Take that away and Opus loses the advantage it earned through search.

Gemini goes the other direction. Its Brier improved from 0.143 to 0.141 with pre-gathered research, and its refinement score (a measure of how well a forecaster distinguishes questions that will resolve YES from those that will resolve NO) went up. The pre-gathered summaries were apparently at least as good as what Gemini's own agent found, possibly better. What Gemini brings is sharper judgment over fixed evidence. Given the same information, it weighs it more accurately than Opus does.
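
For readers who want the mechanics: calibration and refinement scores typically come from the binned calibration-refinement decomposition of the Brier score. The sketch below follows that standard decomposition and may differ in binning, sign conventions, and detail from the scoring used in the paper:

```python
from collections import defaultdict

def calibration_refinement(forecasts, outcomes, n_bins=10):
    """Binned calibration-refinement decomposition of the Brier score.

    calibration: how far each bin's average forecast sits from that bin's
    observed YES frequency (lower is better).
    refinement: how mixed the outcomes are within each bin; it is smallest
    when forecasts sort questions cleanly into all-YES and all-NO groups.
    """
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))
    n = len(forecasts)
    calibration = refinement = 0.0
    for items in bins.values():
        k = len(items)
        avg_p = sum(p for p, _ in items) / k
        freq = sum(o for _, o in items) / k
        calibration += k / n * (avg_p - freq) ** 2
        refinement += k / n * freq * (1 - freq)
    return calibration, refinement
```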

While forecasting is a specific type of research and judgment task, I think it would be reasonable to generalize from this and say: Opus 4.6 is dramatically better than Gemini 3.1 Pro at figuring out what to search for, and knowing when it has found it. But when a task comes with all the information that might be needed, Gemini 3.1 Pro is up to it.

I find this a bit surprising because Gemini is made by Google, who should be the best at research. But it does fit my intuitions. When Google makes a model that's better at, well, searching and reading pages on the internet, we'll be keen to test it and see if this still holds.

Also see: the full paper, the BTF-2 leaderboard and dataset, the BTF-2 dataset release, and FutureSearch's forecasting API. For related findings on how different models respond to different operating conditions, see higher effort can reduce accuracy and effort scaling across model families. For the qualitative counterpart to this decomposition, see measuring one way AIs lack self-awareness.