OpenAI Deep Research: Six Strange Failures
February 19, 2025
On February 3, OpenAI launched Deep Research, its long-form research tool, provoking a flurry of intrigued reactions. Many were seriously impressed:
Others, such as the top Hacker News comment after the launch, warned of frequent inaccuracies:
Zvi Mowshowitz reported a (secondhand) similar warning:
Given these polarized responses, we at FutureSearch decided to run OpenAI Deep Research (hereafter “OAIDR”) on our internal evals: tricky web research tasks with clearly correct answers and rigorous scoring.
What did we find?
OAIDR is clearly better than Gemini Deep Research, Perplexity Deep Research, and DeepSeek-R1
It’s much worse on average than intelligent humans, even when the humans have no expertise and are time limited
It has a “jagged frontier”, as Ethan Mollick would say. Any given output might be great, or might be missing something crucial, and it’s very hard to tell the difference
What were OAIDR’s failure modes?
More than other systems like Gemini Deep Research, OAIDR is overconfident, often reporting wrong answers when it should admit it couldn’t find something
It is peculiar in selecting sources, e.g. choosing company blogs or SEO-spam sites when most humans would get their information from Wikipedia or mainstream news
It struggles to read many webpages, e.g. failing to read long PDFs, get data from images, or read x.com
We corroborated what Simon Willison told us: “My biggest worry with it is misinformation by omission: it is difficult/impossible? to evaluate if it missed crucial information just from reviewing its output."
So how and when should one use OAIDR?
These FutureSearch evals focus on getting things right when details matter. There are many other ways to use OAIDR; we don’t (yet) have evals for fuzzier contexts, but we have tentatively established these qualitative impressions:
OAIDR is great for synthesizing information when completeness isn’t key, e.g. making a partial dataset that you later complete
OAIDR is good, but risky, for getting an introduction to a new topic. It will look thorough, but may miss the single most important consideration (see the last example below).
OAIDR can truly shine on niche, qualitative explorations where o3 follows a “chain of curiosity” (another Ethan Mollickism)
Read on for some stylized examples of misbehavior on seemingly simple research questions with a known correct answer.
Six Strange Failures of OAIDR
These are drawn from a larger set of private FutureSearch evals. In brief, these tasks test how well a system can find non-obvious data points that span many websites, where lower-quality sources with misleading data make it easy to be fooled, yet most humans can solve them without too much difficulty.
These six examples are given in order of increasing difficulty, from basic factual questions to difficult research & compilation tasks.
Note: in all of the cases below, OAIDR responded with a follow-up request for refinement. In the context of our evals these requests are never useful, since each question is specified precisely enough to be answered correctly and completely as asked.
Query #1: “What is the highest reported agent performance on the Cybench benchmark, using the score based on first attempts?”
The correct answer (as of 2025-02-15) is 34.5%, achieved by o1 in Dec 2024, which can be found in this PDF from NIST.
OAIDR incorrectly gives 17.5%, a score from Claude-Sonnet-3.5 in August 2024 reported here.
After it finds this first data point, instead of asking “Have newer models beaten this?”, it asks “Is this corroborated by other sources?” It is, but that’s the wrong question!
Amusingly, despite being based on o3 itself, OAIDR also doesn’t mention o1 a single time in its “thoughts” or output! This could be due to its training cutoff, but a good research plan would have searched for results from all of the top models.
This was the fastest output, at 2 minutes, clearly not long enough.
Query #2: “What is the cumulative number of excess deaths per million people in the UK population by the end of 2023, according to Our World in Data (OWID)?”
The correct answer here is 3,473 excess deaths per million. There are two ways to see this: by hovering over the UK on the map on this page, or by reading the CSV available at this OWID GitHub link.
OAIDR incorrectly gives 2,730.
First, as it confusingly states at the end of its report, it finds OWID data from before the end of 2023 and extrapolates it out to the incorrect figure of 2,730.
Then, failing to find OWID data covering the full year 2023, it tries to corroborate this extrapolation. It finds data from ons.gov.uk, as well as from Reddit (!), that suggests this number is right, so it returns it.
Two problems. First, its final statement, “OWID does not explicitly publish the final UK figure in text”, is wrong: the figure is in the GitHub CSV that OAIDR almost found. In fact, GPT-4o has this URL memorized! (URLs linked in the OAIDR “thoughts” are domains only, not full URLs, so the user can’t click through.)
Second, if an answer can’t be found, saying so, rather than returning a (wrong) estimate, would serve the user better. Most humans would find this output very misleading.
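For reference, here is a minimal sketch of how the correct figure could be pulled directly from the OWID data. The repository path and the column name are our assumptions about OWID’s public covid-19-data layout, not taken from the linked file, and should be checked against it.

import pandas as pd

# Assumed location of the OWID excess-mortality data; the URL and the column
# name below are illustrative guesses, not confirmed from the article's link.
CSV_URL = (
    "https://raw.githubusercontent.com/owid/covid-19-data/"
    "master/public/data/excess_mortality/excess_mortality.csv"
)

df = pd.read_csv(CSV_URL, parse_dates=["date"])
uk = df[(df["location"] == "United Kingdom") & (df["date"] <= "2023-12-31")]

# Take the last cumulative excess-deaths-per-million value reported in 2023.
latest = uk.sort_values("date").iloc[-1]
print(latest["date"], latest["cum_excess_per_million_proj_all_ages"])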
Query #3: The source https://www.indusface.com/blog/key-cybersecurity-statistics/ contains the following claim: "In 2024, cybercrime costs are projected to reach $8 trillion globally." What is the original source of this claim?
The correct answer is a PDF from CyberSecurity Ventures. And in fact the IndusFace source misstates the claim, which is about 2023, not 2024.
OAIDR incorrectly says the claim comes from Statista. Oddly, it doesn’t include a link to Statista in the output, instead giving other secondary sources like this one!
In the “thoughts”, OAIDR did in fact find a link from CyberSecurity Ventures citing the correct original source:
But it failed to trace this through, and pursued a variety of other angles. It tried to do too much, spending 4 minutes on this when a human would likely get the correct answer faster.
You may notice this claim of $8T/year is absurd on its face, so finding the original source matters for judging whether it is credible. Statista is a well-regarded source, and OAIDR quite confidently (and incorrectly) gives it as the original source, so the reader might believe the number is good, when in fact the true original source is clearly biased.
Query #4: Find the value of the following number: "Global market size for alternative meat products, also known as plant-based meat and seafood, in 2023". Unit = "billions of USD, rounded to the nearest hundred million". You should return the most reliable answer you can find.
The correct answer is $6.4B, and the most credible source here is the Good Food Institute, which has a well-researched 2023 State of the Industry PDF with this number.
In just over 3 minutes, OAIDR returns with a market size of $7.2B, citing a low-quality estimate from the SEO-spam website Grand View Research.
Most humans would think an industry specialist, and especially a non-profit research org, would give a better estimate than a site that claims to have market estimates for every single industry. And interestingly, OAIDR did find gfi.org. But it failed to read the PDF with the answer!
OAIDR claimed six times to read gfi.org or to read more from gfi.org, without ever producing any information from this source. Rather than admitting that, it claimed it had successfully read the document, but that the “depth required is significant”.
We conclude that the “thoughts” OAIDR reports here may not reflect what it is actually doing.
Query #5: For Check Point Software produce a list of every product or service that they offer with a brief description of what that offering is. Make a table where the first column is every product or service they have. There will be 5 more columns, one for each of these cyber attacks: Phishing, Spear-phishing, 0-day software exploits, supply-chain cyber attacks, and Malware-as-a-service. Here is a description of each: <omitted for brevity>. If there is a clear relationship between the offering and the attack, assign a 1 to that cell. If you are unsure, assign a 0.
This task has a much longer prompt and a much longer output, which is closer to the style of what we suspect most people use OAIDR for. At first glance the response looks great: the table has 34 rows, and spot-checking the first few numbers indicates they are correct.
But this first sample could be misleading. Manual verification indicates that 118 out of the 170 cells were correct, so 69%, with most of the errors in the bottom rows.
Even worse, though, Check Point Software actually has 45 offerings, not the 34 OAIDR found. So about a quarter of the data is silently missing.
And this was far from obvious. Problems of recall, rather than precision, are hard to spot. How can you tell something is missing, especially when the output is seemingly so thorough?
A human would rarely make a mistake like this. Finding the correct list of offerings from Check Point Software isn’t trivial (the website is a mess), but it isn’t hard.
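To make this precision-versus-recall distinction concrete, here is a small sketch of how we tally the two failure modes separately, using the counts reported above:

cells_checked = 170   # 34 rows found x 5 attack columns
cells_correct = 118   # cells verified correct by manual review
rows_found = 34       # offerings OAIDR listed
rows_actual = 45      # offerings Check Point actually has

cell_accuracy = cells_correct / cells_checked   # precision-style score, ~69%
offering_recall = rows_found / rows_actual      # recall, ~76%

print(f"cell accuracy:   {cell_accuracy:.0%}")
print(f"offering recall: {offering_recall:.0%}")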
Query #6: Please gather all key items of evidence to help answer the following query: What is OpenAI currently working on? You must return 20 items.
Here, our eval looks for whether the answer includes that OpenAI is working on a successor to the o3 model, as announced by OpenAI CPO Kevin Weil in late Jan 2025 at Davos. We think any good answer should include this in its top 20, as AI experts would consider further models in the o3 line to be amongst the most important developments in pushing frontier intelligence.
OAIDR did not find this. In fact, it refers to o3 as an “upcoming model”, seeming to confuse the development of o3 with its public release. And even this information was number 19 on its list of 20, many pages into the output, barely making the cut!
The answer OAIDR gave was, overall, both broad and deep, and for an outsider it would be extremely helpful for understanding OpenAI’s roadmap. But herein lies the danger: faced with authoritative, voluminous output, the user would likely not notice, or even suspect, that it misses what many would consider the single most important consideration.
We will continue to evaluate the performance of other new systems, as they emerge, on tricky-but-objective web research tasks.
Stay tuned for Perplexity Deep Research and grok-3!