LLM Cost vs. Accuracy
This analysis compares 26 model configurations on the Deep Research Bench (DRB), which evaluates models on agentic web-research tasks. It computes Pareto frontiers across cost, speed, and accuracy tradeoffs to understand which models deliver the best accuracy per dollar.
| Metric | Value |
|---|---|
| Models evaluated | 26 |
| FutureSearch cost | $0.00 |
This analysis doesn't use FutureSearch's MCP tools. It fetches benchmark data from the DRB public API and computes Pareto frontiers locally.
Add FutureSearch to Claude Code if you haven't already:
claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp
Tell Claude:
Fetch results from the Deep Research Bench public API and compute the cost
Pareto frontier across all 26 model configurations. Then map FutureSearch's
effort levels to models on or near the frontier.
pip install futuresearch requests pandas
import requests
import pandas as pd

# DRB public read-only endpoint (a Supabase RPC function).
url = "https://rguraxphqescakvvzmju.supabase.co/rest/v1/rpc/get_average_scores_by_model"
PUBLIC_API_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
headers = {
    "apikey": PUBLIC_API_KEY,
    "authorization": f"Bearer {PUBLIC_API_KEY}",
    "content-type": "application/json",
}

# Only include models scored on at least 150 distinct task instances.
response = requests.post(
    url,
    headers=headers,
    json={"min_num_of_distinct_instances": 150},
    timeout=30,
)
response.raise_for_status()
df = pd.DataFrame(response.json())
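The frontier itself can then be computed locally with a single dominance scan: sort by cost and keep each configuration that scores strictly higher than everything cheaper. A sketch, assuming the response exposes per-model cost and score columns (the column and model names below are illustrative, not the API's actual schema):

```python
import pandas as pd

def cost_pareto_frontier(df: pd.DataFrame,
                         cost_col: str = "cost",
                         score_col: str = "score") -> pd.DataFrame:
    """Return rows not dominated by any cheaper, higher-scoring row."""
    ordered = df.sort_values([cost_col, score_col], ascending=[True, False])
    frontier_rows = []
    best_score = float("-inf")
    for _, row in ordered.iterrows():
        # On the frontier only if it beats every cheaper configuration.
        if row[score_col] > best_score:
            frontier_rows.append(row)
            best_score = row[score_col]
    return pd.DataFrame(frontier_rows)

# Toy example using three frontier points from the results table,
# plus one dominated configuration that the scan should drop:
models = pd.DataFrame({
    "model": ["GPT-5.1 (low)", "Gemini 3 Flash (low)",
              "Claude 4.6 Opus (high)", "Dominated model"],
    "cost":  [0.040, 0.051, 0.553, 0.060],
    "score": [0.428, 0.499, 0.550, 0.400],
})
frontier = cost_pareto_frontier(models)
# "Dominated model" is dropped: Gemini 3 Flash (low) is both cheaper and more accurate.
```

The strict inequality means that of two equal-cost configurations, only the higher-scoring one survives, which matches the "best accuracy for their price" framing above.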
To override FutureSearch's default model selection:
from futuresearch.ops import agent_map
from futuresearch.task import LLM
# Run inside an async context (e.g. asyncio.run(...) or a notebook cell).
result = await agent_map(
    task="Find each company's latest funding round",
    input=companies_df,
    effort_level=None,  # bypass the effort-level mapping...
    llm=LLM.CLAUDE_4_6_OPUS_HIGH,  # ...and pin an explicit model instead
    iteration_budget=10,
    include_research=True,
)
Results
The cost Pareto frontier (the 7 configurations for which no cheaper model achieves a higher DRB score):
| Model | Cost | DRB Score |
|---|---|---|
| GPT-5.1 (low) | $0.040 | 0.428 |
| Gemini 3 Flash (low) | $0.051 | 0.499 |
| Gemini 3 Flash (minimal) | $0.103 | 0.504 |
| Claude 4.6 Opus (low) | $0.243 | 0.531 |
| Claude 4.5 Opus (low) | $0.312 | 0.549 |
| Claude 4.6 Sonnet (high) | $0.456 | 0.549 |
| Claude 4.6 Opus (high) | $0.553 | 0.550 |
FutureSearch's effort levels map directly to models on or near these frontiers:
| Effort Level | Model | DRB Score | Cost | Runtime |
|---|---|---|---|---|
| LOW | Gemini 3 Flash (minimal) | 0.504 | $0.103 | 116s |
| MEDIUM | Gemini 3 Flash (low) | 0.499 | $0.051 | 96s |
| HIGH | Claude 4.6 Opus (low) | 0.531 | $0.243 | 73s |
Most of the achievable accuracy (0.531 of the 0.550 maximum) comes at less than half the cost of the best model. Going from HIGH to the absolute best (Claude 4.6 Opus high) more than doubles the cost for only a 3.6% accuracy improvement.
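The arithmetic behind that tradeoff, taken directly from the two relevant rows of the results table:

```python
# HIGH effort level: Claude 4.6 Opus (low), from the results table.
high_cost, high_score = 0.243, 0.531
# Best overall: Claude 4.6 Opus (high), from the results table.
best_cost, best_score = 0.553, 0.550

cost_ratio = best_cost / high_cost           # ~2.28x the cost
accuracy_gain = best_score / high_score - 1  # ~3.6% more accurate
```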