You'd expect that cranking up the reasoning effort on an AI research agent would yield better results. More thinking surely translates to better research? We ran every major frontier model through Deep Research Bench at multiple effort levels and found the opposite: for three out of four top models, higher effort produced equal or worse scores while costing significantly more.
We evaluated each model at multiple effort levels on 150+ real web research tasks—dataset discovery, fact verification, numerical extraction—and tracked accuracy, cost per task, and wall-clock time. Here's what we found for the four best-performing model families.
| Model | Effort | Score | Cost/task | Time |
|---|---|---|---|---|
| Claude 4.5 Opus | low | 54.9% | $0.31 | 413 s |
| Claude 4.5 Opus | high | 54.8% | $0.46 | 375 s |
| Claude 4.6 Opus | low | 53.1% | $0.24 | 159 s |
| Claude 4.6 Opus | high | 55.0% | $0.55 | 461 s |
| Gemini 3 Flash | low | 49.9% | $0.05 | 390 s |
| Gemini 3 Flash | high | 47.9% | $0.14 | 631 s |
| GPT-5 | low | 49.6% | $0.25 | 526 s |
| GPT-5 | medium | 48.6% | $0.35 | 469 s |
| GPT-5 | high | 48.1% | $0.39 | 582 s |
Data from our Pareto analysis notebook. Full leaderboard at evals.futuresearch.ai.
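To make "Pareto frontier" concrete, here is a rough sketch of that kind of check, written against the rounded numbers in the table above. It is not the notebook's actual code; the data list and function names are ours.

```python
# Rough sketch: which (model, effort) configurations sit on the
# accuracy-vs-cost Pareto frontier, using the rounded table values above.
configs = [
    # (model, effort, score %, cost per task $)
    ("Claude 4.5 Opus", "low",    54.9, 0.31),
    ("Claude 4.5 Opus", "high",   54.8, 0.46),
    ("Claude 4.6 Opus", "low",    53.1, 0.24),
    ("Claude 4.6 Opus", "high",   55.0, 0.55),
    ("Gemini 3 Flash",  "low",    49.9, 0.05),
    ("Gemini 3 Flash",  "high",   47.9, 0.14),
    ("GPT-5",           "low",    49.6, 0.25),
    ("GPT-5",           "medium", 48.6, 0.35),
    ("GPT-5",           "high",   48.1, 0.39),
]

def pareto_frontier(points):
    """Keep a configuration only if nothing else is both at least as cheap
    and at least as accurate."""
    frontier = []
    for model, effort, score, cost in points:
        dominated = any(
            c <= cost and s >= score and (c, s) != (cost, score)
            for _, _, s, c in points
        )
        if not dominated:
            frontier.append((model, effort, score, cost))
    return sorted(frontier, key=lambda p: p[3])  # cheapest first

for model, effort, score, cost in pareto_frontier(configs):
    print(f"{model} ({effort}): {score}% at ${cost:.2f}/task")
```

On these rounded numbers, only the three low-effort configurations and Claude 4.6 Opus at high effort survive, which previews the story below.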
GPT-5 shows the most striking pattern: scores decrease monotonically as effort increases. Low effort scores 49.6%, medium drops to 48.6%, and high falls to 48.1%, while cost climbs from $0.25 to $0.39. That's a 56% cost premium for a 1.5-point accuracy penalty.
Gemini 3 Flash tells a similar story. Low effort scores 49.9%; crank it to high and you get 47.9%—a 2-point drop—while paying nearly 3× more per task and waiting 62% longer.
Claude 4.5 Opus is the subtlest case. Low and high effort produce virtually identical scores (54.9% vs 54.8%), but high costs 47% more. You're paying a premium for nothing.
Claude 4.6 Opus, the current leaderboard leader, is the sole exception where high effort meaningfully lifts accuracy (+1.9 points). But that comes at 128% higher cost and 3× the latency (159 s → 461 s). Whether that trade-off is worthwhile depends on your task.
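The headline figures fall straight out of the table. A quick sketch of the arithmetic, computed from the rounded table values (so they can differ slightly from unrounded internals):

```python
# Back-of-the-envelope figures from the rounded table values above.
def pct_premium(low_cost, high_cost):
    return (high_cost - low_cost) / low_cost * 100

print(f"GPT-5 high vs low, cost premium:     {pct_premium(0.25, 0.39):.0f}%")        # ~56%
print(f"GPT-5 high vs low, accuracy change:  {48.1 - 49.6:+.1f} points")             # -1.5
print(f"Claude 4.6 cost per extra point:     ${(0.55 - 0.24) / (55.0 - 53.1):.2f}")  # ~$0.16
```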
Why does more effort make things worse?
Web research is different from the math and coding problems where chain-of-thought reasoning shines. Solving a competition math problem benefits from longer reasoning traces—each additional step can catch an error or explore a different approach. But when a model gets a higher reasoning budget for research tasks, it doesn't necessarily spend those extra tokens wisely. It second-guesses good initial findings, chases increasingly marginal sources down rabbit holes, and over-qualifies straightforward answers. Spending more cycles on deliberation doesn't help when the bottleneck is information retrieval, not reasoning depth.
What this means if you're building with LLMs
If you're building AI-powered research workflows, start at the lowest effort level and escalate only if it misses your accuracy bar. For most models, the default low-effort configuration is the sweet spot: equal or better accuracy at a fraction of the cost and latency.
The one exception in our testing was Claude 4.6 Opus, where high effort does meaningfully improve accuracy. But even there, the cost-per-point ratio is steep, so benchmark on your own workload before committing to higher effort across the board.
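If you want a starting point for that benchmarking, here is a minimal sketch of the sweep. The model-calling function and result fields are placeholders (every provider names its effort or thinking parameter differently), so treat this as a pattern rather than working integration code.

```python
from statistics import mean

ACCURACY_BAR = 0.50  # whatever your application actually requires

def run_research_task(task, effort):
    """Placeholder: call your provider here and return a dict with
    'correct' (bool, judged against your gold answer) and 'cost_usd'.
    The effort/thinking parameter name varies by provider."""
    raise NotImplementedError

def evaluate(effort, tasks):
    """Mean accuracy and mean cost per task for one effort setting."""
    results = [run_research_task(task, effort) for task in tasks]
    return mean(r["correct"] for r in results), mean(r["cost_usd"] for r in results)

def pick_effort(tasks, efforts=("low", "medium", "high")):
    """Start cheap and escalate only if the accuracy bar isn't met."""
    tried = []
    for effort in efforts:
        accuracy, cost = evaluate(effort, tasks)
        tried.append((effort, accuracy, cost))
        if accuracy >= ACCURACY_BAR:
            return effort, accuracy, cost
    return max(tried, key=lambda t: t[1])  # nothing passed: take the most accurate
```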
This is exactly why we default everyrow to configurations that sit on the Pareto frontier—optimising for accuracy per dollar (and per second) rather than chasing diminishing (or negative) returns from cranking up effort.
Check the live leaderboard at evals.futuresearch.ai for the latest numbers—we keep it current as new models ship.