by futuresearch

LLM Cost vs. Accuracy

This analysis compares 26 model configurations on the Deep Research Bench (DRB), which evaluates models on agentic web-research tasks. It computes Pareto frontiers across cost, speed, and accuracy tradeoffs to understand which models deliver the best accuracy per dollar.

| Metric | Value |
|---|---|
| Models evaluated | 26 |
| FutureSearch cost | $0.00 |

This analysis doesn't use FutureSearch's MCP tools. It fetches benchmark data from the DRB public API and computes Pareto frontiers locally.

Add FutureSearch to Claude Code if you haven't already:

claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp

Tell Claude:

Fetch results from the Deep Research Bench public API and compute the cost
Pareto frontier across all 26 model configurations. Then map FutureSearch's
effort levels to models on or near the frontier.

To run the same analysis as a standalone notebook, fetch the benchmark data from the DRB public API and compute the Pareto frontiers locally.

pip install futuresearch requests pandas
import requests
import pandas as pd

# DRB public read endpoint (a Supabase RPC); the key below is the public anon key.
url = "https://rguraxphqescakvvzmju.supabase.co/rest/v1/rpc/get_average_scores_by_model"
PUBLIC_API_KEY = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."

headers = {
    "apikey": PUBLIC_API_KEY,
    "authorization": f"Bearer {PUBLIC_API_KEY}",
    "content-type": "application/json",
}

# Only include models evaluated on at least 150 distinct task instances
response = requests.post(url, headers=headers, json={"min_num_of_distinct_instances": 150})
response.raise_for_status()
df = pd.DataFrame(response.json())
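The cost Pareto frontier can then be computed with a simple sweep: sort models by cost and keep each one that scores strictly higher than every cheaper model. A minimal sketch, assuming columns named `model`, `cost`, and `score` (check the actual API response for the real field names):

```python
import pandas as pd

def cost_pareto_frontier(df: pd.DataFrame,
                         cost_col: str = "cost",
                         score_col: str = "score") -> pd.DataFrame:
    """Rows on the cost/accuracy Pareto frontier: models that score
    higher than every model cheaper than them."""
    ordered = df.sort_values(cost_col)
    best_so_far = float("-inf")
    keep = []
    for _, row in ordered.iterrows():
        if row[score_col] > best_so_far:
            keep.append(row)
            best_so_far = row[score_col]
    return pd.DataFrame(keep).reset_index(drop=True)

# Illustrative data: three frontier models from the results table plus
# one hypothetical dominated model (cheaper options exist that score higher).
models = pd.DataFrame({
    "model": ["GPT-5.1 (low)", "Gemini 3 Flash (low)",
              "Hypothetical mid-tier", "Claude 4.6 Opus (low)"],
    "cost":  [0.040, 0.051, 0.103, 0.243],
    "score": [0.428, 0.499, 0.450, 0.531],
})
frontier = cost_pareto_frontier(models)
# The dominated model drops out; the other three survive the sweep.
```

Applied to all 26 configurations, this sweep yields the seven-model frontier shown in the results below.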

To override FutureSearch's default model selection:

from futuresearch.ops import agent_map
from futuresearch.task import LLM

result = await agent_map(
    task="Find each company's latest funding round",
    input=companies_df,
    effort_level=None,             # skip the effort-level preset...
    llm=LLM.CLAUDE_4_6_OPUS_HIGH,  # ...and pin a specific model instead
    iteration_budget=10,
    include_research=True,
)

Results

The cost Pareto frontier (7 models that achieve the best accuracy for their price):

| Model | Cost | DRB Score |
|---|---|---|
| GPT-5.1 (low) | $0.040 | 0.428 |
| Gemini 3 Flash (low) | $0.051 | 0.499 |
| Gemini 3 Flash (minimal) | $0.103 | 0.504 |
| Claude 4.6 Opus (low) | $0.243 | 0.531 |
| Claude 4.5 Opus (low) | $0.312 | 0.549 |
| Claude 4.6 Sonnet (high) | $0.456 | 0.549 |
| Claude 4.6 Opus (high) | $0.553 | 0.550 |

FutureSearch's effort levels map directly to models on or near these frontiers:

| Effort Level | Model | DRB Score | Cost | Runtime |
|---|---|---|---|---|
| LOW | Gemini 3 Flash (minimal) | 0.504 | $0.103 | 116s |
| MEDIUM | Gemini 3 Flash (low) | 0.499 | $0.051 | 96s |
| HIGH | Claude 4.6 Opus (low) | 0.531 | $0.243 | 73s |

The bulk of the accuracy (0.531 out of a 0.550 maximum) comes at less than half the cost of the best model. Going from HIGH to the absolute best (Claude 4.6 Opus, high) more than doubles the cost for only a 3.6% relative accuracy improvement.
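The tradeoff arithmetic can be checked directly from the table above:

```python
# HIGH effort (Claude 4.6 Opus, low) vs. the single best model (Claude 4.6 Opus, high)
high_cost, high_score = 0.243, 0.531
best_cost, best_score = 0.553, 0.550

cost_ratio = best_cost / high_cost           # ~2.28x the cost
accuracy_gain = best_score / high_score - 1  # ~3.6% relative improvement

print(f"{cost_ratio:.2f}x cost for {accuracy_gain:.1%} more accuracy")
```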