How well does OpenAI o1 plan and reason on real-world tasks?

September 13, 2024

Towards a Realistic Long-Term Benchmark for Open-Web Research Agents (forthcoming)

We evaluated, line by line, the agent traces of openai-o1, gpt-4o, claude-sonnet-3.5, llama-405b, and gpt-4o-mini, across several agent architectures, on 8 real-world, white-collar, multi-stage tasks (with tool use) where we knew all the pitfalls.

The result? In the words of one of our research scientists: “o1 is very impressive, but pretty moody… sometimes it works great and (almost) completely aces something all other agents struggle with, but often it's distinctly average.”

tl;dr Use an o1-powered agent for a small chance to go all the way, and use Sonnet for more consistent performance; gpt-4o is not quite there. llama-405b and gpt-4o-mini are barely useful as agents completing multi-stage tasks.

Average scores across tasks

Sum of the best scores across agent architectures (post-hoc) on all 8 tasks, for each LLM.

0.0 means no progress; 1.0 means the full set of necessary steps was completed. Each task involves an agent using ~30 prompts as well as tool use.

The tasks

We do "messy" white-collar tasks inspired by work for our customers. We give sophisticated partial scores for progress on the numerous things to get right and get wrong. The tasks were of the nature:

  • Tracking down the original source of a claim

  • Compiling a list of AI labs that have trained large models

  • Figuring out how much time passed between a virus's emergence and the publication of the first seroprevalence study

We manually worked through these tasks, making scorecards out of the myriad ways to go wrong. We then filled in scores by reviewing the agent traces line-by-line. We did only one attempt per [Architecture x LLM], as just evaluating o1 cost $750 given the sheer volume of prompts. (So please consider averages, instead of taking an agent’s performance on any single task too seriously.)
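To make the scoring concrete, partial credit can be thought of as a weighted checklist. The sketch below is our hypothetical illustration, not the actual rubric format:

```python
# Hypothetical sketch of a partial-credit scorecard; items are illustrative.
from dataclasses import dataclass, field

@dataclass
class ScorecardItem:
    description: str
    weight: float
    achieved: bool = False  # filled in while reviewing the agent trace

@dataclass
class Scorecard:
    task: str
    items: list[ScorecardItem] = field(default_factory=list)

    def score(self) -> float:
        """Weighted fraction of items achieved: 0.0 = no progress, 1.0 = all steps."""
        total = sum(item.weight for item in self.items)
        return sum(item.weight for item in self.items if item.achieved) / total

# Illustrative items only, not the real rubric:
card = Scorecard(task="Estimate share of incomes above 100,000 yuan", items=[
    ScorecardItem("Found up-to-date, official data", weight=2.0),
    ScorecardItem("Used the right definition of disposable income", weight=1.0),
    ScorecardItem("Fit a reasonable statistical model", weight=2.0),
])
```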

We also fixed numerous issues in the agent scaffolding that led to failures unrelated to the models’ planning and reasoning. Agents had a search engine, web parser, and a Python REPL.

We then ran the five models shown here, across the four basic types of agent architecture.

Results

  • Agents made meaningful progress on tasks of economic value, but fell short of good end-to-end performance on all but the simplest of tasks.

  • OpenAI-o1-preview and Claude-sonnet-3.5 noticeably outperformed other LLMs, including gpt-4o.

    • On average, sonnet-3.5 was better than o1-preview. But if you could choose the right architecture for each question (something you would do if the tasks were paid), o1 was better.

  • Neither Llama 3.1 405B nor GPT-4o-mini was able to make meaningful progress on most of the tasks.

  • There was a lot of variation in success across [Architecture x LLM] pairs. Overall, a ReAct agent with the ability to delegate subtasks performed best; a minimal sketch of that architecture follows.
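To make the winning architecture concrete, here is a minimal sketch of a ReAct-style loop with subtask delegation. This is our illustration under assumptions, not the scaffolding actually used in the evaluation; `call_llm` and `run_tool` are hypothetical placeholders for the LLM API and the tools the agents had (search engine, web parser, Python REPL).

```python
# Minimal ReAct-style agent loop with subtask delegation (illustrative only).

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM API call."""
    raise NotImplementedError

def run_tool(name: str, arg: str) -> str:
    """Hypothetical placeholder for tools: search, web parser, Python REPL."""
    raise NotImplementedError

def react_agent(task: str, max_steps: int = 30) -> str:
    """Interleave model reasoning with tool calls until a final answer appears."""
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        step = call_llm(transcript + "\nThought:")
        transcript += f"\nThought: {step}"
        if "FINAL ANSWER:" in step:
            return step.split("FINAL ANSWER:", 1)[1].strip()
        if "Action:" in step:
            name, _, arg = step.split("Action:", 1)[1].strip().partition(" ")
            # Delegation spawns a fresh sub-agent with its own context,
            # keeping the parent transcript short on long tasks.
            observation = react_agent(arg) if name == "delegate" else run_tool(name, arg)
            transcript += f"\nObservation: {observation}"
    return "No answer within the step budget."
```

The ~30 prompts per task mentioned above correspond roughly to the step budget in a loop like this.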

Average performance across tasks

Average performance across 8 tasks, for each [Architecture x LLM]

0.0 means no progress; 1.0 means the full set of valid steps was completed.

Some anecdotes about o1 from line-by-line evals:

  • o1 still has a significant hallucination problem. On the task “How much time passed from the beginning of 2009 H1N1 until the first seroprevalence study was published?”, it hallucinated a paper from “Balish et al”, and then later hallucinated a publication date.

  • o1 has strange refusals.

  • o1 seems overconfident. It tried to complete tasks without the requisite information, and then did less external research than Claude-Sonnet-3.5 to validate its claims.

  • o1 is better than other models at making initial plans, and at revising those plans. On the question below, it devised two subtle, good statistical methods, and when one failed, switched to the other.

    • o1’s plans can be extremely verbose, but in a good way, as other LLMs suffer from dropping important (implied) details from their plans. See the example at the bottom.

  • o1’s plans make better use of latent knowledge. E.g., ask GPT-4o in what format China publishes data on disposable income, and it knows that only the mean, median, and averages over quintiles are published. But tasked with coming up with a plan that needs this data, GPT-4o seems to “forget” this, whereas o1 does not.

Detailed results on one task:

Estimate how many Chinese people have an annual disposable income exceeding 100,000 Yuan.

This is relevant for many questions about Chinese consumer behavior (e.g. how many can afford EVs). Having worked carefully through how to do this well, we built a scoring rubric that gives partial credit for:

  • Finding data that is:

    • Up-to-date (outdated data is common)

    • Credible or official (low-quality data is everywhere)

    • Based on the right definition of “disposable income” (not conflating it with income, or household income)

  • Fitting parameters to a good statistical model (a minimal sketch of two of these fits follows this list), one of:

    • Log-normal distribution, given mean and percentile

    • Any reasonable distribution fitting quintile data

    • Any reasonable distribution fitting decile data

    • Any reasonable distribution given mean/median and Gini coefficient
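For concreteness, here is a minimal sketch of the first and last of these fits, using only the Python standard library. The closed forms are standard properties of the log-normal distribution; the income figures at the bottom are illustrative placeholders, not data found during the evaluation.

```python
# Fit a log-normal income distribution and read off the share above a
# threshold. Standard library only; figures below are illustrative.
from math import log, sqrt
from statistics import NormalDist

std_normal = NormalDist()  # Phi (.cdf) and Phi^{-1} (.inv_cdf)

def lognormal_from_mean_median(mean: float, median: float) -> tuple[float, float]:
    """For X ~ LogNormal(mu, sigma): median = e^mu, mean = e^(mu + sigma^2/2)."""
    mu = log(median)
    sigma = sqrt(2 * log(mean / median))  # requires mean > median
    return mu, sigma

def lognormal_from_median_gini(median: float, gini: float) -> tuple[float, float]:
    """For a log-normal, Gini = 2 * Phi(sigma / sqrt(2)) - 1."""
    sigma = sqrt(2) * std_normal.inv_cdf((gini + 1) / 2)
    return log(median), sigma

def share_above(mu: float, sigma: float, threshold: float) -> float:
    """P(X > t) = 1 - Phi((ln t - mu) / sigma)."""
    return 1 - std_normal.cdf((log(threshold) - mu) / sigma)

# Hypothetical placeholder figures (yuan per year), NOT the evaluation's data:
mu, sigma = lognormal_from_mean_median(mean=40_000, median=33_000)
print(f"Share above 100,000 yuan: {share_above(mu, sigma, 100_000):.2%}")
```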

By far the biggest difficulty here is rejecting bad data, as the web is full of it. The most recent World Bank data is too old. Statista gives a misleading definition of “middle class”. Many sources make adjustments you don’t want, such as for PPP, or for rural vs. urban populations. (Many human analysts would likely erroneously use one of these sources!) LLMs are extremely prone to using them anyway.

One core challenge is pairing the right statistical method with whatever data is found. This requires fairly robust plan updates based on what credible data actually turns up.

o1-preview, with subtasks enabled in the architecture, survived the many roadblocks it stumbled on:

  • It tried and failed to find tables of data from Chinese government sources, e.g. using a Python REPL to download an Excel file.

  • It tried and failed to apply a data-interpolation method.

  • But it did find enough high-quality data points for the mean and median, and figured out that fitting a log-normal would work.

It gave a final answer of 2.08%. This o1 agent got the only accurate answer out of all LLMs with all architectures. Here’s the original plan from o1 that ultimately worked, the output of the first of many prompts:

### **Person A: Data Collection Specialist**

**Task:**

- **Collect Official Income Data:**
  - Visit the National Bureau of Statistics of China's official website or access their latest annual reports.
  - Obtain the most recent data (up to the year 2023) on per capita disposable income distribution in China.
  - Focus on detailed breakdowns, such as income quintiles, deciles, or specific income brackets.
  - Ensure that the data includes information on the number of individuals or households in each income bracket, especially those approaching or exceeding the 100,000 yuan threshold.

**Note:**

- Provide the raw data in a clear and organized format (tables or spreadsheets).
- If available, collect any relevant notes or explanations provided by the NBS regarding the data.

### **Person B: Data Analyst**

**Task:**

- **Analyze Income Distribution:**
  - Upon receiving data from Person A, review the income brackets provided.
  - Identify the brackets that include incomes over 100,000 yuan per year.
  - Calculate the proportion (percentage) of individuals or households that fall into these brackets.
  - If the 100,000 yuan threshold falls within a broader income bracket, use interpolation methods to estimate the proportion of individuals exceeding 100,000 yuan.
- **Prepare Findings:**
  - Document the analysis process, including any calculations or assumptions made.
  - Present the proportion of individuals with disposable income over 100,000 yuan in a clear and concise manner.

**Dependency:**

- Wait for the data collected by Person A before starting this task.

### **Person C: Statistical Modeler**

**Task:**

- **Fit Parametric Distribution:**
  - If Person B determines that the data does not provide a precise proportion, use the data from Person A to fit an appropriate parametric distribution (e.g., log-normal, Pareto) to the income data.
  - Utilize statistical software or methods to estimate the parameters of the chosen distribution.
  - Use the fitted distribution to estimate the proportion of individuals with disposable income exceeding 100,000 yuan per year.
- **Document Methodology:**
  - Explain the reasons for choosing the specific distribution model.
  - Provide details of the fitting process, parameters estimated, and any goodness-of-fit measures.

**Dependency:**

- Begin after Person B completes the initial analysis. If Person B's results are sufficient, this task may be minimized or skipped.

### **Person D: Calculation Specialist**

**Task:**

- **Obtain Population Data:**
  - Retrieve the latest official total population figures of China from the National Bureau of Statistics or other government sources.
  - Ensure the population figure corresponds to the same year as the income data collected.
- **Compute Total Number:**
  - Using the proportion provided by Person B (or estimated by Person C), calculate the estimated number of individuals with a disposable income over 100,000 yuan per year.
  - The calculation should be:

    `Number of Individuals = Total Population × Proportion of Individuals over 100,000 yuan`

- **Prepare Report:**
  - Present the final number in a clear format.
  - Include all calculation steps and cite the data sources used.

**Dependency:**

- Wait for the proportion from Person B (or estimates from Person C).

### **Person E: Quality Assurance Reviewer**

**Task:**

- **Review and Verify:**
  - Examine the data collected by Person A for completeness and accuracy.
  - Review the analysis and calculations done by Persons B, C, and D.
  - Check the correctness of methods, calculations, and assumptions made throughout the process.
- **Compile Final Report:**
  - Summarize the findings in a cohesive report.
  - Include an executive summary, methodology, results, and any limitations or considerations.
  - Ensure that all sources are properly cited.
- **Provide Recommendations:**
  - Note any data limitations or areas where further research may be necessary.
  - Suggest any improvements for future analyses.

**Dependency:**

- Begin after receiving the outputs from Persons A through D.

**Additional Notes:**

- **Time Coordination:** Since some tasks depend on the completion of others, please communicate your progress promptly upon task completion.
- **Data Integrity:** Ensure that all data used are the latest official figures up to the year 2023 to maintain consistency and accuracy.
- **Assumptions and Limitations:** Document any assumptions made during the analysis, especially if exact data is not available and estimations are used.
- **Confidentiality:** Handle all data responsibly and adhere to any usage guidelines specified by the data sources.

By dividing the work among the team with clear tasks and dependencies, we aim to efficiently and accurately determine the number of Chinese individuals with a disposable income exceeding 100,000 yuan per year.

Overall, on this task:

  • o1 got 1 correct answer, 2 partially good answers, and 1 failure

  • sonnet-3.5 got 0 correct answers, 2 partially good answers, and 2 failures

  • gpt-4o got 0 correct answers, 1 partially good answer, and 3 failures

  • llama-3.1-405b and gpt-4o-mini got 0 correct answers, 0 partially good answers, and 4 failures.

The full scorecard for 1 of the 8 tasks: “Estimate how many Chinese people have an annual disposable income exceeding 100,000 Yuan.” Green means the agent got partial credit on that part of the task. The right column shows that only one combination got the right answer: plan-delegate-o1.

Next steps

Reading through many o1 traces, it’s clear it behaves differently than other models. It refuses more often; it is less consistent about planning when the agent architecture expects it to, sometimes trying to solve the entire problem up front. With prompt-tuning in the agent scaffolding, its lead over sonnet-3.5 would likely increase.

Please reach out if you’d like to be notified when the full paper is ready.