
Run agents twice for fun and profit

A simple, if expensive, way to improve performance on hard agentic tasks

Now that LLMs are almost exclusively run as agents, and rely on the web or code for information, there is a lot of variance in their outputs. One simple technique for getting a handle on that variance is running agents twice.

We study this from the perspective of forecasting. A single Claude Opus 4.6 forecasting agent scores 0.130 Brier on 1,367 questions from the BTF-2 benchmark. Run a second Opus 4.6 agent on the same questions and it also scores 0.130. Average those two runs together, throw in a Gemini 3.1 Pro and a GPT-5.4 run, take the mean across all four, and the score drops to 0.125, which is the equivalent of a ~5% closer probability on every question in the set.

Yes, this "wisdom of crowds" effect is especially useful in forecasting. But it is a general technique that I think is underused.

Brier scores on 1,367 BTF-2 questions: the mean of four agent runs beats every individual run

The idea behind this is simple. Different runs explore different search paths, notice different evidence, weigh competing arguments differently, and land on different probabilities. The errors are partly random. Averaging cancels out the idiosyncratic mistakes while preserving the signal that every run picks up.
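Concretely, the averaging here is nothing fancier than taking the unweighted mean of each run's probability on each question and scoring that. A minimal sketch with made-up numbers (not the actual BTF-2 forecasts):

```python
import numpy as np

# Toy illustration of the ensembling step, not real BTF-2 data.
# Rows are questions, columns are independent agent runs; values are P(yes).
forecasts = np.array([
    [0.90, 0.40, 0.70, 0.60],   # outcome: yes
    [0.10, 0.60, 0.30, 0.40],   # outcome: no
    [0.40, 0.90, 0.60, 0.80],   # outcome: yes
    [0.60, 0.10, 0.50, 0.45],   # outcome: no
])
outcomes = np.array([1, 0, 1, 0])

def brier(p, y):
    # Mean squared error between forecast probabilities and 0/1 outcomes; lower is better.
    return float(np.mean((p - y) ** 2))

per_run = [brier(forecasts[:, i], outcomes) for i in range(forecasts.shape[1])]
ensemble = brier(forecasts.mean(axis=1), outcomes)

print("individual runs:", [round(b, 3) for b in per_run])  # 0.185, 0.185, 0.148, 0.141
print("mean of runs:   ", round(ensemble, 3))              # 0.130
```

With these toy numbers the mean of the runs scores better than any single run, precisely because each run is good on different questions.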

This applies not only to web research but also to things like agentic coding. There are just many degrees of freedom in what agents do, so "run it twice" reduces the variance, as long as you have a way to synthesize the results of the two runs.

An example

The Brazil COP30 question in BTF-2 is a clean illustration. The benchmark asked, on October 15, whether Brazil's Câmara dos Deputados would approve the long-stalled circular economy bill (PL 1.874/2022) by December 31. Run 1 of the Opus 4.6 agent spent all 17 of its search queries on bill numbers and procedural status. Not one query mentioned "COP30," "climate," or "Belém." It gave 30%, anchored on the bill's history of being scheduled and then not voted on.

On the second run, it happened to broaden one of its searches and surfaced the COP30 connection. Brazil was hosting COP30 in November; the summit opened twelve days after the bill in fact passed. The host-country incentive to ship a flagship environmental bill before the summit was the load-bearing reason for the vote. Run 2 still gave only 35%, but that was the better forecast. The bill passed on October 29. (I wrote separately about the deeper failure of modeling politicians' motivations, and how a better forecast on this question was possible.)

What about cost?

A single Opus 4.6 agent run, in this harness, costs about $0.55 per question, based on the number of web searches and page reads it does. Ensembling multiplies that by the number of runs. Whether the tradeoff is worth it depends on what you are using the agents for, but also on how you pay for it.
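Back-of-the-envelope, using the ~$0.55 figure above (a sketch, not a billing calculation; real costs vary with how much each run searches and reads):

```python
# Rough ensembling cost over a question set, using the ~$0.55-per-run figure above.
cost_per_run = 0.55      # dollars per question for one agent run
num_questions = 1367     # size of the BTF-2 set used here

for num_runs in (1, 2, 4):
    per_question = cost_per_run * num_runs
    total = per_question * num_questions
    print(f"{num_runs} run(s): ${per_question:.2f}/question, ~${total:,.0f} for the full set")
```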

If you're on a subscription with room to spare, running two agents in two browser windows for some research is very easy. Or having a coding task done twice in isolation (git worktrees help here) may not cost you anything at all. But you do need a way to have a reviewer look at the two solutions and figure out whether one found something important that the other missed.
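What that reviewer step looks like depends on the task, but it can be as little as a third model call that reconciles the two outputs. A minimal sketch, where ask_model is a placeholder for whichever LLM client you already use:

```python
def ask_model(prompt: str) -> str:
    # Placeholder: wire this up to whatever LLM client or CLI you already use.
    raise NotImplementedError

def synthesize(task: str, run_a: str, run_b: str) -> str:
    # Hand both independent runs to a reviewer call and ask it to reconcile them,
    # paying particular attention to anything only one of the runs found.
    prompt = (
        f"Two agents independently worked on this task:\n\n{task}\n\n"
        f"--- Run A ---\n{run_a}\n\n"
        f"--- Run B ---\n{run_b}\n\n"
        "List anything important that appears in only one run, say which conclusion "
        "is better supported, and produce a single merged answer."
    )
    return ask_model(prompt)
```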

This is easy to forget when you are building agentic systems, because the natural instinct is to make each individual run better. But if you have not tried the dumb thing first, running it twice, you are leaving the easiest, if not the cheapest, improvement on the table.

This kind of ensembling was part of what led to the best forecaster in our paper, which had a 0.011 Brier advantage over a single Opus 4.6 agent, a pretty substantial gap. It wasn't the only technique used, though.

The FutureSearch app uses multiple agents with slightly different approaches in a lot of cases. It helps, and is often worth the cost.

Also see: the paper, the BTF-2 leaderboard, the BTF-2 dataset release, higher effort settings in LLMs can reduce accuracy, and more reasoning tokens help Claude, but not GPT or Gemini.