How well does OpenAI o1 plan and reason on real-world tasks?

September 13, 2024

Towards a Realistic Long-Term Benchmark for Open-Web Research Agents (forthcoming)

We evaluated, line by line, the agent traces of openai-o1, gpt-4o, claude-sonnet-3.5, llama-405b, and gpt-4o-mini, across several agent architectures, on 8 real-world, white-collar, multi-stage tasks (with tool use) where we knew all the pitfalls.

The result? In the words of one of our research scientists: “o1 is very impressive, but pretty moody… sometimes it works great and (almost) completely aces something all other agents struggle with, but often it's distinctly average.”

tl;dr Use an o1-powered agent for a small chance to go all the way, and use Sonnet for more consistent performance; gpt-4o is not quite there. llama-405b and gpt-4o-mini are barely useful as agents completing multi-stage tasks.

Average scores across tasks

Sum of the best scores across agent architectures (post-hoc) on all 8 tasks, for each LLM.

0.0 means no progress; 1.0 means the full set of necessary steps was completed. Each task involves an agent using ~30 prompts as well as tool use.

The tasks

We do "messy" white-collar tasks inspired by work for our customers. We give sophisticated partial scores for progress on the numerous things to get right and get wrong. The tasks were of the nature:

  • Tracking down the original source of a claim

  • Compiling a list of AI labs that have trained large models

  • Figuring out how much time passed between a virus's emergence and the publication of the first seroprevalence study

We manually worked through these tasks, making scorecards out of the myriad ways to go wrong. We then filled in scores by reviewing the agent traces line-by-line. We did only one attempt per [Architecture x LLM], as just evaluating o1 cost $750 given the sheer volume of prompts. (So please consider averages, instead of taking an agent’s performance on any single task too seriously.)
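To make the scoring concrete, partial credit can be thought of as a weighted checklist. The sketch below is our hypothetical illustration, not the actual rubric format:

```python
# Hypothetical sketch of a partial-credit scorecard; items are illustrative.
from dataclasses import dataclass, field

@dataclass
class ScorecardItem:
    description: str
    weight: float
    achieved: bool = False  # filled in while reviewing the agent trace

@dataclass
class Scorecard:
    task: str
    items: list[ScorecardItem] = field(default_factory=list)

    def score(self) -> float:
        """Weighted fraction of items achieved: 0.0 = no progress, 1.0 = all steps."""
        total = sum(item.weight for item in self.items)
        return sum(item.weight for item in self.items if item.achieved) / total

# Illustrative items only, not the real rubric:
card = Scorecard(task="Estimate share of incomes above 100,000 yuan", items=[
    ScorecardItem("Found up-to-date, official data", weight=2.0),
    ScorecardItem("Used the right definition of disposable income", weight=1.0),
    ScorecardItem("Fit a reasonable statistical model", weight=2.0),
])
```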

We also fixed numerous issues in the agent scaffolding that led to failures unrelated to the models’ planning and reasoning. Agents had a search engine, web parser, and a Python REPL.

We then ran the five models shown here, across the four basic types of agent architecture.

Results

  • Agents made meaningful progress on tasks of economic value, but fell short of good end-to-end performance on all but the simplest of tasks.

  • OpenAI-o1-preview and Claude-sonnet-3.5 noticeably outperformed other LLMs, including gpt-4o.

    • On average, sonnet-3.5 was better than o1-preview. But if you could choose the right architecture for each question (something you would do if the tasks were paid), o1 was better.

  • Neither Llama 3.1 405B nor GPT-4o-mini was able to make meaningful progress on most of the tasks.

  • There was a lot of variation in success across [Architecture x LLM] pairs. Overall, a ReAct agent with the ability to delegate subtasks performed best; a minimal sketch of that architecture follows.
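To make the winning architecture concrete, here is a minimal sketch of a ReAct-style loop with subtask delegation. This is our illustration under assumptions, not the scaffolding actually used in the evaluation; `call_llm` and `run_tool` are hypothetical placeholders for the LLM API and the tools the agents had (search engine, web parser, Python REPL).

```python
# Minimal ReAct-style agent loop with subtask delegation (illustrative only).

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM API call."""
    raise NotImplementedError

def run_tool(name: str, arg: str) -> str:
    """Hypothetical placeholder for tools: search, web parser, Python REPL."""
    raise NotImplementedError

def react_agent(task: str, max_steps: int = 30) -> str:
    """Interleave model reasoning with tool calls until a final answer appears."""
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        step = call_llm(transcript + "\nThought:")
        transcript += f"\nThought: {step}"
        if "FINAL ANSWER:" in step:
            return step.split("FINAL ANSWER:", 1)[1].strip()
        if "Action:" in step:
            name, _, arg = step.split("Action:", 1)[1].strip().partition(" ")
            # Delegation spawns a fresh sub-agent with its own context,
            # keeping the parent transcript short on long tasks.
            observation = react_agent(arg) if name == "delegate" else run_tool(name, arg)
            transcript += f"\nObservation: {observation}"
    return "No answer within the step budget."
```

The ~30 prompts per task mentioned above correspond roughly to the step budget in a loop like this.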

Average performance across tasks

Average performance across 8 tasks, for each [Architecture x LLM]

0.0 means no progress; 1.0 means the full set of valid steps was completed.

Some anecdotes about o1 from line-by-line evals:

  • o1 still has a significant hallucination problem. On the task “How much time passed from the beginning of 2009 H1N1 until the first seroprevalence study was published?”, it hallucinated a paper from “Balish et al”, and then later hallucinated a publication date.

  • o1 has strange refusals.

  • o1 seems overconfident. It tried to complete tasks without the requisite information, and then did less external research than Claude-Sonnet-3.5 to validate its claims.

  • o1 is better than other models at making initial plans, and at revising those plans. On the question below, it devised two subtle, good statistical methods, and when one failed, switched to the other.

    • o1’s plans can be extremely verbose, but in a good way, as other LLMs suffer from dropping important (implied) details from their plans. See the example at the bottom.

  • o1’s plans make better use of latent knowledge. E.g., ask GPT-4o in what format China publishes data on disposable income, and it knows that only the mean, median, and averages over quintiles are published. But tasked with coming up with a plan that needs this data, GPT-4o seems to “forget” this, whereas o1 does not.

Detailed results on one task:

Estimate how many Chinese people have an annual disposable income exceeding 100,000 Yuan.

This is relevant for many questions about Chinese consumer behavior (e.g. how many can afford EVs). Having worked carefully through how to do this well, we built a scoring rubric that gives partial credit for:

  • Finding data that is:

    • Up-to-date (outdated data is common)

    • Credible or official (low-quality data is everywhere)

    • Based on the right definition of “disposable income” (not conflating it with income, or household income)

  • Fitting parameters to a good statistical model (a minimal sketch of two of these fits follows this list), one of:

    • Log-normal distribution, given mean and percentile

    • Any reasonable distribution fitting quintile data

    • Any reasonable distribution fitting decile data

    • Any reasonable distribution given mean/median and Gini coefficient
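For concreteness, here is a minimal sketch of the first and last of these fits, using only the Python standard library. The closed forms are standard properties of the log-normal distribution; the income figures at the bottom are illustrative placeholders, not data found during the evaluation.

```python
# Fit a log-normal income distribution and read off the share above a
# threshold. Standard library only; figures below are illustrative.
from math import log, sqrt
from statistics import NormalDist

std_normal = NormalDist()  # Phi (.cdf) and Phi^{-1} (.inv_cdf)

def lognormal_from_mean_median(mean: float, median: float) -> tuple[float, float]:
    """For X ~ LogNormal(mu, sigma): median = e^mu, mean = e^(mu + sigma^2/2)."""
    mu = log(median)
    sigma = sqrt(2 * log(mean / median))  # requires mean > median
    return mu, sigma

def lognormal_from_median_gini(median: float, gini: float) -> tuple[float, float]:
    """For a log-normal, Gini = 2 * Phi(sigma / sqrt(2)) - 1."""
    sigma = sqrt(2) * std_normal.inv_cdf((gini + 1) / 2)
    return log(median), sigma

def share_above(mu: float, sigma: float, threshold: float) -> float:
    """P(X > t) = 1 - Phi((ln t - mu) / sigma)."""
    return 1 - std_normal.cdf((log(threshold) - mu) / sigma)

# Hypothetical placeholder figures (yuan per year), NOT the evaluation's data:
mu, sigma = lognormal_from_mean_median(mean=40_000, median=33_000)
print(f"Share above 100,000 yuan: {share_above(mu, sigma, 100_000):.2%}")
```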

By far the biggest difficulty here is rejecting bad data, as the web is full of it. The most recent World Bank data is too old. Statista gives a misleading definition of “middle class”. Many sources make adjustments you don’t want, such as for PPP, or for rural vs. urban populations. (Many human analysts would likely erroneously use one of these sources!) LLMs are extremely prone to using them anyway.

One core challenge is pairing the right statistical method with whatever data is found. This requires fairly robust plan updates based on what credible data actually turns up.

o1-preview, with subtasks enabled in the architecture, survived the many roadblocks it stumbled on:

  • It tried and failed to find tables of data from Chinese government sources, e.g. using a Python REPL to download an Excel file.

  • It tried and failed to apply a data-interpolation method.

  • But it did find enough high-quality data points for the mean and median, and figured out that fitting a log-normal would work.

It gave a final answer of 2.08%. This o1 agent got the only accurate answer out of all LLMs with all architectures. Here’s the original plan from o1 that ultimately worked, the output of the first of many prompts:

### **Person A: Data Collection Specialist**

**Task:**

- **Collect Official Income Data:**
  - Visit the National Bureau of Statistics of China's official website or access their latest annual reports.
  - Obtain the most recent data (up to the year 2023) on per capita disposable income distribution in China.
  - Focus on detailed breakdowns, such as income quintiles, deciles, or specific income brackets.
  - Ensure that the data includes information on the number of individuals or households in each income bracket, especially those approaching or exceeding the 100,000 yuan threshold.

**Note:**

- Provide the raw data in a clear and organized format (tables or spreadsheets).
- If available, collect any relevant notes or explanations provided by the NBS regarding the data.

### **Person B: Data Analyst**

**Task:**

- **Analyze Income Distribution:**
  - Upon receiving data from Person A, review the income brackets provided.
  - Identify the brackets that include incomes over 100,000 yuan per year.
  - Calculate the proportion (percentage) of individuals or households that fall into these brackets.
  - If the 100,000 yuan threshold falls within a broader income bracket, use interpolation methods to estimate the proportion of individuals exceeding 100,000 yuan.
- **Prepare Findings:**
  - Document the analysis process, including any calculations or assumptions made.
  - Present the proportion of individuals with disposable income over 100,000 yuan in a clear and concise manner.

**Dependency:**

- Wait for the data collected by Person A before starting this task.

### **Person C: Statistical Modeler**

**Task:**

- **Fit Parametric Distribution:**
  - If Person B determines that the data does not provide a precise proportion, use the data from Person A to fit an appropriate parametric distribution (e.g., log-normal, Pareto) to the income data.
  - Utilize statistical software or methods to estimate the parameters of the chosen distribution.
  - Use the fitted distribution to estimate the proportion of individuals with disposable income exceeding 100,000 yuan per year.
- **Document Methodology:**
  - Explain the reasons for choosing the specific distribution model.
  - Provide details of the fitting process, parameters estimated, and any goodness-of-fit measures.

**Dependency:**

- Begin after Person B completes the initial analysis. If Person B's results are sufficient, this task may be minimized or skipped.

### **Person D: Calculation Specialist**

**Task:**

- **Obtain Population Data:**
  - Retrieve the latest official total population figures of China from the National Bureau of Statistics or other government sources.
  - Ensure the population figure corresponds to the same year as the income data collected.
- **Compute Total Number:**
  - Using the proportion provided by Person B (or estimated by Person C), calculate the estimated number of individuals with a disposable income over 100,000 yuan per year.
  - The calculation should be:

    `Number of Individuals = Total Population × Proportion of Individuals over 100,000 yuan`

- **Prepare Report:**
  - Present the final number in a clear format.
  - Include all calculation steps and cite the data sources used.

**Dependency:**

- Wait for the proportion from Person B (or estimates from Person C).

### **Person E: Quality Assurance Reviewer**

**Task:**

- **Review and Verify:**
  - Examine the data collected by Person A for completeness and accuracy.
  - Review the analysis and calculations done by Persons B, C, and D.
  - Check the correctness of methods, calculations, and assumptions made throughout the process.
- **Compile Final Report:**
  - Summarize the findings in a cohesive report.
  - Include an executive summary, methodology, results, and any limitations or considerations.
  - Ensure that all sources are properly cited.
- **Provide Recommendations:**
  - Note any data limitations or areas where further research may be necessary.
  - Suggest any improvements for future analyses.

**Dependency:**

- Begin after receiving the outputs from Persons A through D.

**Additional Notes:**

- **Time Coordination:** Since some tasks depend on the completion of others, please communicate your progress promptly upon task completion.
- **Data Integrity:** Ensure that all data used are the latest official figures up to the year 2023 to maintain consistency and accuracy.
- **Assumptions and Limitations:** Document any assumptions made during the analysis, especially if exact data is not available and estimations are used.
- **Confidentiality:** Handle all data responsibly and adhere to any usage guidelines specified by the data sources.

By dividing the work among the team with clear tasks and dependencies, we aim to efficiently and accurately determine the number of Chinese individuals with a disposable income exceeding 100,000 yuan per year.

Overall, on this task:

  • o1 got 1 correct answer, 2 partially good answers, and 1 failure

  • sonnet-3.5 got 0 correct answers, 2 partially good answers, and 2 failures

  • gpt-4o got 0 correct answers, 1 partially good answer, and 3 failures

  • llama-3.1-405b and gpt-4o-mini got 0 correct answers, 0 partially good answers, and 4 failures.

The full scorecard for 1 of the 8 tasks: “Estimate how many Chinese people have an annual disposable income exceeding 100,000 Yuan.” Green means the agent got partial credit on that part of the task. The right column shows that only one combination got the right answer: plan-delegate-o1.

Next steps

Reading through many o1 traces, it’s clear it behaves differently than other models. It refuses more often; it is less consistent about planning when the agent architecture expects it to, sometimes trying to solve the entire problem up front. With prompt-tuning in the agent scaffolding, its lead over sonnet-3.5 would likely increase.

Please reach out if you’d like to be notified when the full paper is ready.