
Run agents twice for fun and profit

A simple, if expensive, way to improve performance on hard agentic tasks

Now that LLMs are almost exclusively run as agents, and rely on the web or code for information, there is a lot of variance in their outputs. One simple technique for getting a handle on that variance is running agents twice.

We study this from the perspective of forecasting. A single Claude Opus 4.6 forecasting agent scores 0.130 Brier on 1,367 questions from the BTF-2 benchmark. Run a second Opus 4.6 agent on the same questions and it also scores 0.130. Average those two runs together, throw in a Gemini 3.1 Pro and a GPT-5.4 run, take the mean across all four, and the score drops to 0.125, which is the equivalent of a ~5% closer probability on every question in the set.

Yes, this "wisdom of crowds" effect is especially useful in forecasting. But it is a general technique that I think is underused.

Brier scores on 1,367 BTF-2 questions: the mean of four agent runs beats every individual run

The idea behind this is simple. Different runs explore different search paths, notice different evidence, weigh competing arguments differently, and land on different probabilities. The errors are partly random. Averaging cancels out the idiosyncratic mistakes while preserving the signal that every run picks up.
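Concretely, the averaging here is nothing fancier than taking the unweighted mean of each run's probability on each question and scoring that. A minimal sketch with made-up numbers (not the actual BTF-2 forecasts):

```python
import numpy as np

# Toy illustration of the ensembling step, not real BTF-2 data.
# Rows are questions, columns are independent agent runs; values are P(yes).
forecasts = np.array([
    [0.90, 0.40, 0.70, 0.60],   # outcome: yes
    [0.10, 0.60, 0.30, 0.40],   # outcome: no
    [0.40, 0.90, 0.60, 0.80],   # outcome: yes
    [0.60, 0.10, 0.50, 0.45],   # outcome: no
])
outcomes = np.array([1, 0, 1, 0])

def brier(p, y):
    # Mean squared error between forecast probabilities and 0/1 outcomes; lower is better.
    return float(np.mean((p - y) ** 2))

per_run = [brier(forecasts[:, i], outcomes) for i in range(forecasts.shape[1])]
ensemble = brier(forecasts.mean(axis=1), outcomes)

print("individual runs:", [round(b, 3) for b in per_run])  # 0.185, 0.185, 0.148, 0.141
print("mean of runs:   ", round(ensemble, 3))              # 0.130
```

With these toy numbers the mean of the runs scores better than any single run, precisely because each run is good on different questions.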

This applies not only to web research but also to things like agentic coding. There are just many degrees of freedom in what agents do, so "run it twice" reduces the variance, as long as you have a way to synthesize the results of the two runs.

An example

The Brazil COP30 question in BTF-2 is a clean illustration. The benchmark asked, on October 15, whether Brazil's Câmara dos Deputados would approve the long-stalled circular economy bill (PL 1.874/2022) by December 31. Run 1 of the Opus 4.6 agent spent all 17 of its search queries on bill numbers and procedural status. Not one query mentioned "COP30," "climate," or "Belém." It gave 30%, anchored on the bill's history of being scheduled and then not voted on.

On the second run, it happened to broaden one of its searches and surfaced the COP30 connection. Brazil was hosting COP30 in November; the summit opened twelve days after the bill in fact passed. The host-country incentive to ship a flagship environmental bill before the summit was the load-bearing reason for the vote. Run 2 still gave only 35%, but that was the better forecast. The bill passed on October 29. (I wrote separately about the deeper failure of modeling politicians' motivations, and how a better forecast on this question was possible.)

What about cost?

A single Opus 4.6 agent run, in this harness, costs about $0.55 per question, based on the number of web searches and page reads it does. Ensembling multiplies that by the number of runs. Whether the tradeoff is worth it depends on what you are using the agents for, but also on how you pay for it.
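Back-of-the-envelope, using the ~$0.55 figure above (a sketch, not a billing calculation; real costs vary with how much each run searches and reads):

```python
# Rough ensembling cost over a question set, using the ~$0.55-per-run figure above.
cost_per_run = 0.55      # dollars per question for one agent run
num_questions = 1367     # size of the BTF-2 set used here

for num_runs in (1, 2, 4):
    per_question = cost_per_run * num_runs
    total = per_question * num_questions
    print(f"{num_runs} run(s): ${per_question:.2f}/question, ~${total:,.0f} for the full set")
```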

If you're on a subscription with room to spare, running two agents in two browser windows for some research is very easy. Or having a coding task done twice in isolation (git worktrees help here) may not cost you anything at all. But you do need a way to have a reviewer look at the two solutions and figure out whether one found something important that the other missed.
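What that reviewer step looks like depends on the task, but it can be as little as a third model call that reconciles the two outputs. A minimal sketch, where ask_model is a placeholder for whichever LLM client you already use:

```python
def ask_model(prompt: str) -> str:
    # Placeholder: wire this up to whatever LLM client or CLI you already use.
    raise NotImplementedError

def synthesize(task: str, run_a: str, run_b: str) -> str:
    # Hand both independent runs to a reviewer call and ask it to reconcile them,
    # paying particular attention to anything only one of the runs found.
    prompt = (
        f"Two agents independently worked on this task:\n\n{task}\n\n"
        f"--- Run A ---\n{run_a}\n\n"
        f"--- Run B ---\n{run_b}\n\n"
        "List anything important that appears in only one run, say which conclusion "
        "is better supported, and produce a single merged answer."
    )
    return ask_model(prompt)
```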

This is easy to forget when you are building agentic systems, because the natural instinct is to make each individual run better. But if you have not tried the dumb thing first, running it twice, you are leaving the easiest, if not the cheapest, improvement on the table.

This kind of ensembling was part of what led to the best forecaster in our paper, which had a 0.011 Brier advantage over a single Opus 4.6 agent, a pretty substantial gap. It wasn't the only technique used, though.

The FutureSearch app uses multiple agents with slightly different approaches in a lot of cases. It helps, and is often worth the cost.

Also see: the paper, the BTF-2 leaderboard, the BTF-2 dataset release, higher effort settings in LLMs can reduce accuracy, and more reasoning tokens help Claude, but not GPT or Gemini.