Key Takeaways
- When building agents, developers need to prompt them to persist through some errors, and to give up and try a different strategy when they hit others
- Commercial Deep Research agents show failure modes in both directions: failures to persist when they should, and failures to adapt when they should
- Agents can "spin out" on hard tasks, repeatedly taking the same ineffective actions without changing strategy, as the Deep Research examples below show
- Deep Research tools show that on messy domains like web research, agents are not yet human level
- Multi-agent frameworks, which try multiple approaches in parallel rather than running one long sequence of tool calls, are likely the way forward (a minimal sketch follows this list)
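To make that last point concrete, here is a minimal sketch of the parallel idea, assuming a hypothetical `run_strategy` coroutine that wraps one sub-agent pursuing one research strategy. The function, the strategy names, and the confidence scores are illustrative, not any vendor's API.

```python
import asyncio

# Hypothetical coroutine: one sub-agent pursues one research strategy
# (many LLM + tool calls in practice) and returns an answer plus a
# self-reported confidence in [0, 1].
async def run_strategy(strategy: str, question: str) -> tuple[str, float]:
    await asyncio.sleep(0.1)  # stand-in for the actual agent run
    return f"answer via {strategy}", 0.5

async def research_in_parallel(question: str) -> str:
    strategies = [
        "direct web search",
        "trace the citation chain to a primary source",
        "query structured datasets",
    ]
    # Launch one sub-agent per strategy instead of one long sequential run.
    results = await asyncio.gather(
        *(run_strategy(s, question) for s in strategies)
    )
    # Keep the answer the sub-agents were most confident in; a real system
    # might instead reconcile or cross-check the competing answers.
    best_answer, _ = max(results, key=lambda r: r[1])
    return best_answer

if __name__ == "__main__":
    question = "Total excess deaths per million in the UK by the end of 2023"
    print(asyncio.run(research_in_parallel(question)))
```

The structural point is that sub-agents fail independently, so one unproductive strategy doesn't stall the whole run.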
How Deep Research Tools Think Differently Than Humans
"Deep Research" tools—Gemini Deep Research, OpenAI Deep Research, and Perplexity Deep Research—are web agents, meaning they perform many LLM calls and use tools like web search. Like humans, when they hit obstacles, agents sometimes persist, and sometimes change their approach and adapt.
But they don't do this the way humans do. Which isn't necessarily bad! But it does often lead to surprising failures on tasks that are easy for humans, as we reported previously for OpenAI Deep Research ("OAIDR").
We've now also run Perplexity Deep Research ("PDR") on FutureSearch's internal evals of tricky web research tasks with clearly correct answers and rigorous scoring.
Here's how Deep Research systems depart from human behavior on two opposite dimensions: failing to persist, and failing to adapt.
When to Persist: The CyberSeek Source-Tracing Example
Consider this query, which PDR gets right and OAIDR fails on:
"What is the original source of the claim 'Over 200,000 more cybersecurity workers are needed in the United States to close the talent gap, according to data from CyberSeek' made by SecurityWeek?"
OAIDR Gives Up Too Early
First, OAIDR tries, and fails, to read the interactive map on cyberseek.org. It claims to "read more," but this likely refers to repeated failures to read dynamic content.

Although OAIDR mentions Lightcast job postings data, it never actually confirms it as the original source. Instead, it gives up and identifies the wrong one.
PDR Persists and Succeeds
In contrast, Perplexity Deep Research persists and successfully traces the claim to its original source: Lightcast job postings data.

This demonstrates how persistence matters when following complex citation chains through multiple layers of sources.
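As a rough illustration of what "persisting" means here, below is a minimal sketch of a bounded citation-chain tracer. The link-extraction regex and hop budget are crude stand-ins for the judgment these agents actually need; this is not how either product works.

```python
import re
import urllib.request

# Illustrative heuristic only: look for an outbound link near phrases like
# "according to" or "data from". Real source tracing has to handle dynamic
# pages, PDFs, paywalls, and much messier attribution wording.
CITATION_LINK = re.compile(
    r'(?:according to|data from)[^<]{0,160}<a[^>]+href="(https?://[^"]+)"',
    re.IGNORECASE,
)

def fetch(url: str) -> str:
    with urllib.request.urlopen(url, timeout=20) as resp:
        return resp.read().decode("utf-8", errors="replace")

def trace_to_primary_source(start_url: str, max_hops: int = 5) -> str:
    """Follow the citation chain hop by hop until no deeper source is cited."""
    url = start_url
    for _ in range(max_hops):   # persistence budget: keep digging for several layers
        match = CITATION_LINK.search(fetch(url))
        if match is None:
            return url          # no further citation found; treat this page as the origin
        url = match.group(1)    # follow one more layer down the chain
    return url                  # budget exhausted; report the deepest source reached

# usage: trace_to_primary_source(securityweek_article_url)
```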
When to Adapt: The UK Excess Deaths Example
Now consider a different failure mode: the inability to adapt when an initial approach isn't working.
Task: Find the total excess deaths per million in the UK by the end of 2023.
PDR Repeats the Same Failed Searches Six Times
When faced with this task, PDR performed essentially the same search six times without successfully adapting its approach to find the correct information.

The tool saw potential alternative research methods but failed to effectively pursue them. Instead of changing strategy, it kept hoping the same approach would yield different results.
The Correct Answer Required Strategic Adaptation
The correct approach required recognizing that direct searches weren't working and adapting to use structured datasets like Our World in Data.
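One way to encode that kind of adaptation is an explicit strategy ladder with a small retry budget per strategy. The sketch below is a toy illustration under that assumption; the tool wrappers are hypothetical stubs, not any vendor's API.

```python
from typing import Callable, Optional

# Hypothetical tool wrappers, named for illustration only: each takes the
# question and returns an answer string, or None if it could not find one.
def direct_web_search(question: str) -> Optional[str]:
    return None  # stand-in for a search-and-read step that keeps failing

def query_structured_dataset(question: str) -> Optional[str]:
    # stand-in for pulling the figure from a curated source such as
    # Our World in Data's excess-mortality dataset
    return "placeholder answer from a structured dataset"

def research(question: str, max_tries_per_strategy: int = 2) -> Optional[str]:
    # Escalation ladder: give each strategy a small retry budget, then move on
    # instead of re-running the same failing search indefinitely.
    strategies: list[Callable[[str], Optional[str]]] = [
        direct_web_search,
        query_structured_dataset,
    ]
    for strategy in strategies:
        for _ in range(max_tries_per_strategy):
            answer = strategy(question)
            if answer is not None:
                return answer
    return None  # all strategies exhausted; better to report failure than to loop

print(research("Total excess deaths per million in the UK by the end of 2023"))
```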

Both PDR and OAIDR struggled with this task; knowing when to abandon an unproductive approach remains a key weakness of current Deep Research systems.
The Pattern: Fast on Easy Tasks, Unreliable on Hard Ones
Through extensive testing, we've identified a clear pattern in how Deep Research tools perform:
- Easy tasks: fast and reliable
- Medium-difficulty tasks: unpredictable performance
- Hard tasks: systems "spin out," repeatedly doing the same ineffective actions

The fundamental issue is that current Deep Research systems lack the metacognitive ability to recognize when they're stuck and need to fundamentally change their approach.
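What such a "stuck" check could look like, assuming the agent scaffold keeps a log of its recent search queries (the similarity heuristic and threshold are arbitrary choices for this sketch):

```python
def _tokens(query: str) -> set[str]:
    return set(query.lower().split())

def is_stuck(recent_queries: list[str], window: int = 4, threshold: float = 0.6) -> bool:
    """Flag a run as stuck when the last few searches are near-duplicates of each other."""
    last = recent_queries[-window:]
    if len(last) < window:
        return False
    pairs = [(a, b) for i, a in enumerate(last) for b in last[i + 1:]]

    def jaccard(a: str, b: str) -> float:
        ta, tb = _tokens(a), _tokens(b)
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    # If every recent pair of queries overlaps heavily, the agent is circling.
    return all(jaccard(a, b) >= threshold for a, b in pairs)

queries = [
    "UK excess deaths per million 2023",
    "excess deaths per million UK 2023 total",
    "UK total excess deaths per million end of 2023",
    "excess deaths UK 2023 per million",
]
print(is_stuck(queries))  # True: the last four searches are near-duplicates
```

When a check like this fires, the scaffold can force a strategy change, say from free-form search to structured datasets, rather than letting the agent repeat itself.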
The Bottom Line: Use With Clear-Eyed Expectations
Deep Research tools represent impressive progress in AI-powered research capabilities. But they have significant limitations in judgment—both in knowing when to persist through obstacles and when to adapt their strategy.
For users: Understand that these tools work best on straightforward information synthesis tasks. For complex research requiring rigorous accuracy, human oversight remains essential.
For researchers: The gap between "sometimes works impressively" and "reliably solves complex problems" remains substantial. Building agents that know when to persist and when to adapt is a key frontier in AI research capabilities.