Question
By what date will OpenAI release a model that is widely regarded as a 'step change' — a model that leads on a majority (>50%) of major LLM benchmarks simultaneously for at least 4 weeks after release?
Summary
As of May 2026, the AI frontier is highly fragmented: no single model leads a majority of the major benchmarks simultaneously. OpenAI's latest incremental release, GPT-5.5, leads on three of the six required benchmarks (MATH/AIME, Arena Elo, and ARC-AGI-2) but trails on HLE and GPQA. Three of six is exactly 50%, so it misses the strict majority (>50%) threshold, which requires at least four.
The earliest credible opportunity for OpenAI to take a clear lead is the anticipated release of GPT-6 in late 2026. If that release delivers a large architectural step change, it could plausibly sweep the benchmarks and hold the lead for the required 4 weeks before competitors respond. However, simultaneous dominance across such diverse, specialized tasks is becoming harder as competitors ship highly capable models at a rapid cadence.
Because holding a unified lead for at least 4 weeks against rivals like Anthropic and Google is so difficult, the central estimate is mid-2028, which allows for several iteration cycles. There is also a roughly 20-25% chance that the ecosystem remains permanently fragmented and OpenAI never regains the unified dominance it held with GPT-4 in 2023. This structural difficulty pushes the latest estimates out to the end of 2030.
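The resolution rule behind the summary (a strict majority of benchmarks, held for 4 weeks) can be sketched in Python. This is a minimal illustration, not an official resolution procedure. The function names, the `snapshot` helper, the "openai" and "rival" labels, and all scores are hypothetical placeholders. Treating a tied top score as "not leading" and reading "4 weeks" as 28 consecutive daily snapshots are both assumptions.

```python
from datetime import date, timedelta

# The six benchmarks named in this brief.
BENCHMARKS = ["MATH/AIME", "Arena Elo", "ARC-AGI-2", "SWE-bench", "HLE", "GPQA"]

def benchmarks_led(scores, model):
    """Count benchmarks where `model` is the sole top scorer.

    scores maps benchmark -> {model_name: score}.
    Assumption: a tie for first place does not count as leading.
    """
    led = 0
    for by_model in scores.values():
        best = max(by_model.values())
        leaders = [m for m, s in by_model.items() if s == best]
        if leaders == [model]:
            led += 1
    return led

def leads_majority(scores, model):
    """True only for a strict majority (>50%) of the benchmarks."""
    return 2 * benchmarks_led(scores, model) > len(scores)

def holds_lead(daily_scores, model, release, days=28):
    """True if `model` leads a majority on every day of the window.

    daily_scores maps date -> scores; a missing day counts as a failure.
    Assumption: "4 weeks" means 28 consecutive days starting at release.
    """
    for offset in range(days):
        day = release + timedelta(days=offset)
        if day not in daily_scores or not leads_majority(daily_scores[day], model):
            return False
    return True

def snapshot(n_led):
    """Placeholder scores: 'openai' is sole leader on the first n_led benchmarks."""
    return {
        b: {"openai": 1.0 if i < n_led else 0.0, "rival": 0.5}
        for i, b in enumerate(BENCHMARKS)
    }

release = date(2026, 1, 1)  # arbitrary placeholder date
four_weeks = {release + timedelta(days=i): snapshot(4) for i in range(28)}

print(leads_majority(snapshot(3), "openai"))    # 3 of 6 is not a strict majority
print(leads_majority(snapshot(4), "openai"))    # 4 of 6 is
print(holds_lead(four_weeks, "openai", release))
```

Under these assumptions, a saturated benchmark on which several models post the same top score contributes to nobody's count, which raises the bar on the remaining benchmarks.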
Strongest Arguments for Sooner
- GPT-6 is anticipated in the second half of 2026, with reports indicating it will be a generational leap rather than an incremental update.
- OpenAI's current model, GPT-5.5, already leads or closely contests 3 to 4 of the 6 benchmarks (Arena Elo, MATH/AIME, ARC-AGI-2, and SWE-bench).
- A sufficiently powerful GPT-6 release backed by vast computational resources could quickly close the remaining gaps in GPQA and HLE, achieving the required benchmark majority within a single cycle.
Strongest Arguments for Later
- The competitive landscape is extremely crowded, with Anthropic, Google, and others rapidly iterating; Anthropic's restricted Claude Mythos preview currently demonstrates superiority on several fronts.
- Maintaining a simultaneous lead across specialized domains for 4 straight weeks is exceptionally difficult when rivals routinely deploy countermeasures and updates within days.
- Key benchmarks like MATH/AIME are becoming saturated, with multiple models scoring 100%, so the top spot is effectively tied and definitive leadership is hard to establish (introl.com).
- A truly unified step change has not occurred since GPT-4 in early 2023, establishing a strong historical precedent for ongoing fragmentation.
Key Uncertainties
- The exact timing and capability ceiling of GPT-6 and subsequent releases.
- The deployment schedule of competitors, particularly whether Anthropic will launch its dominant Mythos-class capabilities into public, widely available products.
- The relevance and longevity of the specified benchmarks, especially whether saturation will make definitive 4-week leads functionally impossible over the long term.