Deep Research Bench

Evaluating LLM-Agents on Web Search Tasks

Paper | Live Leaderboard

Deep Research Bench (DRB) benchmarks how well LLM agents perform research on the web. Each of the 91 real-world tasks provides 10-100k webpages stored offline for search and reasoning, accompanied by carefully curated answers.

Deep Research Bench has two features that no other benchmark of LLM-powered web agents offers:

  • Stores large chunks of the web offline, so results are stable even as the web changes

  • The correct answers are carefully worked out given the state of the web at snapshot time, so scores are as objective as possible
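To make the frozen-snapshot idea concrete, here is a minimal sketch in Python (not the actual DRB implementation; all names and data are hypothetical) of serving search results from an offline page store, so the same query always returns the same results no matter how the live web changes:

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    text: str

# Hypothetical tiny stand-in for the 10-100k offline pages a DRB task provides.
SNAPSHOT = [
    Page("https://example.com/a", "LLM agents can browse and reason over web pages."),
    Page("https://example.com/b", "Benchmarks need frozen data for stable scores."),
]

def search(query: str, snapshot: list[Page]) -> list[str]:
    """Return URLs of snapshot pages containing every query term."""
    terms = query.lower().split()
    return [p.url for p in snapshot
            if all(t in p.text.lower() for t in terms)]

print(search("frozen scores", SNAPSHOT))  # → ['https://example.com/b']
```

Because the corpus is fixed, re-running the same query months later yields identical results, which is what makes agent scores reproducible.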

Leaderboard

We continuously update DRB, adding new tasks to the leaderboard and evaluating new models as they are released.

Performance of Commercial LLM Research Tools

In addition to the regular “retro” evals on a frozen snapshot, we also evaluated several “live” commercial LLM research tools, such as OpenAI Deep Research and Perplexity, on the May 2025 version of Deep Research Bench.

We found, for example, that ChatGPT o3 outperforms all other approaches (including OpenAI Deep Research!) by a comfortable margin. For full details, see the paper.

Scores across 8 task categories for various web research tools on the May 2025 version of DRB.