How well does OpenAI o1 plan and reason on real-world tasks?
September 13, 2024 Towards a Realistic Long-Term Benchmark for Open-Web Research Agents (forthcoming)
We evaluated, line by line, the traces of openai-o1, gpt-4o, claude-sonnet-3.5, and llama-405b across several agent architectures on 8 real-world, white-collar tasks where we knew every detail of what it takes to succeed.
The result? In the words of one of our research scientists: “o1 is very impressive, but pretty moody… sometimes it works great and (almost) completely aces something all other agents struggle with, but often it's distinctly average.”
tl;dr Take o1 for a small chance of going all the way, and Sonnet for more consistent performance. gpt-4o is not quite there, while llama-405b and gpt-4o-mini are barely useful as agents on tasks that require tool use and 10+ prompted decision points for updating their plans.
Sum of the best scores across agent architectures (post-hoc) on all 8 tasks, for each LLM.
0.0 means no progress; 1.0 means the full set of necessary steps was completed. Each task involves an agent using ~30 prompts as well as tool use.
Our eval is unique in two ways: it uses "messy" white-collar tasks inspired by work for our customers, and it has sophisticated partial scores for progress on the numerous things to get right and get wrong. The tasks were of the following nature:
Tracking down the original source of a claim
Compiling a list of AI labs that have trained large models
Figuring out the time after a virus emerged before the first seroprevalence study was published
We manually worked through these tasks, making scorecards out of the myriad ways to go wrong. We then filled in scores by reviewing the agent traces line-by-line. We did only one attempt per [Architecture x LLM], as just evaluating o1 cost $750 given the sheer volume of prompts. (So please consider averages, instead of taking an agent’s performance on any single task too seriously.)
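To make the idea of a partial-credit scorecard concrete, here is a minimal sketch of how such a score could be computed. It is an illustration under stated assumptions, not our actual scoring code: the `RubricItem` class, the `partial_score` function, and the weights are all hypothetical, and the real scorecards were filled in by hand from the traces.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RubricItem:
    """One thing an agent can get right (hypothetical structure)."""
    description: str
    weight: float    # relative importance; an assumption, not from the eval
    achieved: bool   # filled in by a human reading the trace


def partial_score(items: Iterable[RubricItem]) -> float:
    """Return the weighted share of rubric items achieved, in [0.0, 1.0]."""
    items = list(items)
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("rubric needs at least one item with positive weight")
    earned = sum(item.weight for item in items if item.achieved)
    return earned / total


if __name__ == "__main__":
    # Placeholder judgments for one imaginary trace, with equal weights.
    example = [
        RubricItem("found up-to-date data", 1.0, True),
        RubricItem("used a credible or official source", 1.0, True),
        RubricItem("used the right definition of disposable income", 1.0, False),
        RubricItem("fit a suitable parametric model", 1.0, False),
    ]
    print(partial_score(example))
```

A score of 0.0 then corresponds to no progress and 1.0 to every rubric item achieved, matching the scale used in the charts.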
We also fixed numerous issues in the agent scaffolding that led to failures unrelated to the models’ planning and reasoning. Agents had a search engine, web parser, and a Python REPL.
We then ran the five models shown here, across the four basic types of agent architecture.
Results
Agents made meaningful progress on tasks of economic value, but fell short of good end-to-end performance on all but the simplest of tasks.
OpenAI-o1-preview and Claude-sonnet-3.5 noticeably outperformed other LLMs, including gpt-4o.
On average, sonnet-3.5 was better than o1-preview. But if you could choose the right architecture for each question (something you would do if the tasks were paid), o1 was better.
Both Llama 3.1 405B and GPT-4o-mini were not able to make meaningful progress on most of the tasks.
There was a lot of variation in success with [Architecture x LLM] pairs. Overall, a ReAct agent with the ability to delegate subtasks performed best.
Average performance across 8 tasks, for each [Architecture x LLM]
0.0 means no progress; 1.0 means the full set of valid steps was completed.
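The two aggregations used in this post, the plain average and the post-hoc "best architecture per task" sum, can rank models differently. The sketch below shows how, using made-up numbers: the model names, the architecture labels, and every score in `TOY_SCORES` are placeholders invented for illustration and are not results from our eval.

```python
from statistics import mean
from typing import Dict, List

# scores[llm][architecture] -> per-task scores in [0.0, 1.0], same task order.
Scores = Dict[str, Dict[str, List[float]]]

# Invented placeholder data: 2 models x 2 architectures x 3 tasks.
TOY_SCORES: Scores = {
    "model_a": {"react": [0.2, 0.9, 0.1], "plan_delegate": [0.8, 0.1, 0.2]},
    "model_b": {"react": [0.5, 0.5, 0.4], "plan_delegate": [0.5, 0.4, 0.5]},
}


def average_score(scores: Scores, llm: str) -> float:
    """Mean over every (architecture, task) cell for one LLM."""
    return mean(s for per_task in scores[llm].values() for s in per_task)


def best_architecture_sum(scores: Scores, llm: str) -> float:
    """For each task, take the best score across architectures, then sum."""
    per_architecture = scores[llm].values()
    return sum(max(task_scores) for task_scores in zip(*per_architecture))


if __name__ == "__main__":
    for llm in TOY_SCORES:
        print(llm, average_score(TOY_SCORES, llm),
              best_architecture_sum(TOY_SCORES, llm))
```

In this toy data the steadier model wins on the average while the spikier one wins once the best architecture is picked per task, which is the same pattern described above for sonnet-3.5 and o1. Dividing the sum by the number of tasks puts it back on the 0.0 to 1.0 scale.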
So what do the agents do? Since we did line-by-line evals, we can show in detail what works and what doesn’t.
Example Task: Estimate how many Chinese people have an annual disposable income exceeding 100,000 Yuan.
This is relevant for many questions about Chinese consumer behavior (e.g. how many can afford EVs). Having worked carefully through how to do this well, our scoring rubric gives partial credit for:
Finding data that is:
Up-to-date (outdated data is common)
Credible or official (low-quality data is everywhere)
Using the right definition of “disposable income” (not conflating it with income, or household income)
Choosing and fitting a suitable parametric model (like a log-normal distribution or something heavier-tailed like Pareto) to the data; for example:
Log-normal distribution, given mean and percentile
Any reasonable distribution fitting quintile averages
Any reasonable distribution fitting decile averages
Any reasonable distribution given mean/median and Gini coefficient
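As a rough illustration of the first and last kinds of fit, here is a minimal sketch, assuming a log-normal model, of how a mean plus a median (the 50th percentile), or a mean plus a Gini coefficient, pins down the distribution and yields the share of people above a threshold. It relies on the standard log-normal identities median = exp(mu), mean = exp(mu + sigma²/2), and Gini = erf(sigma/2). The function names and all input numbers are placeholders invented for this example. They are not official statistics, not values from our scorecard, and not the code any agent ran.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def lognormal_from_mean_median(mean: float, median: float) -> tuple:
    """Solve for (mu, sigma) from median = e^mu and mean = e^(mu + sigma^2/2)."""
    if not (mean > median > 0):
        raise ValueError("a log-normal requires mean > median > 0")
    mu = math.log(median)
    sigma = math.sqrt(2.0 * math.log(mean / median))
    return mu, sigma


def lognormal_from_mean_gini(mean: float, gini: float) -> tuple:
    """Solve for (mu, sigma) from Gini = 2*Phi(sigma/sqrt(2)) - 1 and the mean."""
    if not (mean > 0 and 0.0 < gini < 1.0):
        raise ValueError("need mean > 0 and 0 < gini < 1")
    sigma = math.sqrt(2.0) * _STD_NORMAL.inv_cdf((gini + 1.0) / 2.0)
    mu = math.log(mean) - sigma ** 2 / 2.0
    return mu, sigma


def share_above(threshold: float, mu: float, sigma: float) -> float:
    """P(X > threshold) for X ~ LogNormal(mu, sigma)."""
    z = (math.log(threshold) - mu) / sigma
    return 1.0 - _STD_NORMAL.cdf(z)


if __name__ == "__main__":
    # Placeholder inputs only; substitute vetted, same-year official data.
    mu, sigma = lognormal_from_mean_median(mean=40_000, median=33_000)
    share = share_above(100_000, mu, sigma)
    population = 1_400_000_000  # placeholder head count
    print(f"share above threshold: {share:.4f}")
    print(f"estimated people: {share * population:,.0f}")
```

A log-normal tends to understate the very top of an income distribution, which is one reason the rubric also accepts heavier-tailed choices such as Pareto.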
By far the biggest difficulty here is rejecting bad data, as the web is full of it. The latest World Bank data is too old. Statista gives a misleading definition of “middle class”. Many sources make adjustments you don’t want, like for PPP, or for rural vs. urban. (Many human analysts would likely use one of these sources in error!) LLMs are extremely prone to using them anyway.
One core challenge is to pair the right statistical method with whatever data is found. This requires fairly robust plan updates based on which credible data actually turns up.
o1-preview, with subtasks enabled in the architecture, survived the many roadblocks it stumbled on:
It tried and failed to find tables of data from Chinese government sources, e.g. using a Python REPL to download an Excel file.
It tried and failed to apply a data interpolation method.
But it did find enough high-quality data points (averages of quintiles, mean, and median), and found sensible statistical models to work with them.
Its final answer combined estimates from sensible statistical models it came up with itself. Of all LLMs across all architectures, this o1 agent produced the only answer free of grave mistakes. Here’s the original plan from o1 that ultimately ended up working, the output of the first of many prompts:
### **Person A: Data Collection Specialist**
**Task:**
- **Collect Official Income Data:**
- Visit the National Bureau of Statistics of China's official website or access their latest annual reports.
- Obtain the most recent data (up to the year 2023) on per capita disposable income distribution in China.
- Focus on detailed breakdowns, such as income quintiles, deciles, or specific income brackets.
- Ensure that the data includes information on the number of individuals or households in each income bracket, especially those approaching or exceeding the 100,000 yuan threshold.
**Note:**
- Provide the raw data in a clear and organized format (tables or spreadsheets).
- If available, collect any relevant notes or explanations provided by the NBS regarding the data.
### **Person B: Data Analyst**
**Task:**
- **Analyze Income Distribution:**
- Upon receiving data from Person A, review the income brackets provided.
- Identify the brackets that include incomes over 100,000 yuan per year.
- Calculate the proportion (percentage) of individuals or households that fall into these brackets.
- If the 100,000 yuan threshold falls within a broader income bracket, use interpolation methods to estimate the proportion of individuals exceeding 100,000 yuan.
- **Prepare Findings:**
- Document the analysis process, including any calculations or assumptions made.
- Present the proportion of individuals with disposable income over 100,000 yuan in a clear and concise manner.
**Dependency:**
- Wait for the data collected by Person A before starting this task.
### **Person C: Statistical Modeler**
**Task:**
- **Fit Parametric Distribution:**
- If Person B determines that the data does not provide a precise proportion, use the data from Person A to fit an appropriate parametric distribution (e.g., log-normal, Pareto) to the income data.
- Utilize statistical software or methods to estimate the parameters of the chosen distribution.
- Use the fitted distribution to estimate the proportion of individuals with disposable income exceeding 100,000 yuan per year.
- **Document Methodology:**
- Explain the reasons for choosing the specific distribution model.
- Provide details of the fitting process, parameters estimated, and any goodness-of-fit measures.
**Dependency:**
- Begin after Person B completes the initial analysis. If Person B's results are sufficient, this task may be minimized or skipped.
### **Person D: Calculation Specialist**
**Task:**
- **Obtain Population Data:**
- Retrieve the latest official total population figures of China from the National Bureau of Statistics or other government sources.
- Ensure the population figure corresponds to the same year as the income data collected.
- **Compute Total Number:**
- Using the proportion provided by Person B (or estimated by Person C), calculate the estimated number of individuals with a disposable income over 100,000 yuan per year.
- The calculation should be:
`Number of Individuals = Total Population × Proportion of Individuals over 100,000 yuan`
- **Prepare Report:**
- Present the final number in a clear format.
- Include all calculation steps and cite the data sources used.
**Dependency:**
- Wait for the proportion from Person B (or estimates from Person C).
### **Person E: Quality Assurance Reviewer**
**Task:**
- **Review and Verify:**
- Examine the data collected by Person A for completeness and accuracy.
- Review the analysis and calculations done by Persons B, C, and D.
- Check the correctness of methods, calculations, and assumptions made throughout the process.
- **Compile Final Report:**
- Summarize the findings in a cohesive report.
- Include an executive summary, methodology, results, and any limitations or considerations.
- Ensure that all sources are properly cited.
- **Provide Recommendations:**
- Note any data limitations or areas where further research may be necessary.
- Suggest any improvements for future analyses.
**Dependency:**
- Begin after receiving the outputs from Persons A through D.
**Additional Notes:**
- **Time Coordination:** Since some tasks depend on the completion of others, please communicate your progress promptly upon task completion.
- **Data Integrity:** Ensure that all data used are the latest official figures up to the year 2023 to maintain consistency and accuracy.
- **Assumptions and Limitations:** Document any assumptions made during the analysis, especially if exact data is not available and estimations are used.
- **Confidentiality:** Handle all data responsibly and adhere to any usage guidelines specified by the data sources.
By dividing the work among the team with clear tasks and dependencies, we aim to efficiently and accurately determine the number of Chinese individuals with a disposable income exceeding 100,000 yuan per year.
Overall, on this task:
o1 got 1 correct answer, 2 partially good answers, and 1 failure
sonnet-3.5 got 0 correct answers, 2 partially good answers, and 2 failures
gpt-4o got 0 correct answers, 1 partially good answer, and 3 failures
llama-3.1-405b and gpt-4o-mini got 0 correct answers, 0 partially good answers, and 4 failures.
The full scorecard for 1 of the 8 tasks: “Estimate how many Chinese people have an annual disposable income exceeding 100,000 Yuan.” Green means the agent got partial credit on that part of the task. The right column shows that only one combination got the right answer: plan-delegate-o1.
Next steps
Reading through many o1 traces, it’s clear that it behaves differently from other models. It refuses more, and it is less consistent about planning at the points where the agent architecture calls for it, sometimes trying to solve the entire problem up front. With prompt tuning in the agent scaffolding, its lead over sonnet-3.5 would likely increase.
Please reach out if you’d like to be notified when the full paper is ready.