Unit Economics of LLM APIs
On July 24, The Information reported that “OpenAI Could Lose $5 Billion This Year.” While it’s plausible that the consumer and enterprise ChatGPT businesses operate at a big loss, we find that the API business is profitable.
The Information’s numbers are not internally consistent: depending on which you believe, the API has a profit margin of -670% or +40%. (To their credit, they report that OpenAI told them their numbers are inaccurate.)
We estimate that, at the time of that report, the gross profit margin (revenue minus inference costs, as a share of revenue) was 75%. And once traffic moves to the new August 2024 model and prices, we project the gross margin will decrease to 55%.
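The projected drop is roughly what the price cut alone would produce. A minimal sketch, assuming cost per token stays fixed and taking ~1.8x as the size of the “nearly 2x” August price cut discussed below:

```python
# Gross margin = (revenue - inference cost) / revenue.
# If the cost per token is unchanged, a price cut scales up the cost share.
old_margin = 0.75
price_cut = 1.8                  # assumed factor for the "nearly 2x" cut
cost_share = 1 - old_margin      # inference cost as a share of revenue: 0.25
new_margin = 1 - cost_share * price_cut
print(round(new_margin, 2))      # 0.55
```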
We make the case that:
- OpenAI has provisioned far more Azure GPUs from Microsoft than its actual API demand requires
- Price cuts have shrunk API revenue faster than traffic growth has replaced it
- Unit economics for LLMs come down to saving on GPU costs through better memory usage
We model how many GPT-4 output tokens OpenAI can produce each second with an A100 GPU.
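A preview of the shape of that model, with every number below an illustrative assumption rather than a figure from the report: decoding is memory-bandwidth-bound, so a first-order estimate divides aggregate memory bandwidth by the bytes streamed per decode step.

```python
# Minimal sketch of bandwidth-bound decoding. All inputs are assumptions
# for illustration (parameter count, precision, batch, GPUs per replica).
HBM_BW = 2.0e12                  # A100 80GB memory bandwidth, ~2 TB/s

def decode_tokens_per_sec_per_gpu(
    active_params=280e9,         # params touched per token (assumed)
    bytes_per_param=2,           # fp16/bf16 weights
    kv_bytes_per_seq=2e8,        # KV cache read per sequence per step (assumed)
    batch=64,                    # concurrent sequences per model replica
    n_gpus=8,                    # GPUs holding one replica
    efficiency=0.5,              # fraction of peak bandwidth achieved
):
    # Each decode step streams all active weights once (shared by the
    # whole batch) plus every sequence's KV cache.
    bytes_per_step = active_params * bytes_per_param + batch * kv_bytes_per_seq
    steps_per_sec = efficiency * HBM_BW * n_gpus / bytes_per_step
    return steps_per_sec * batch / n_gpus

print(f"~{decode_tokens_per_sec_per_gpu():.0f} tokens/s per A100 (illustrative)")
```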
In the full report, we estimate the number of GPUs OpenAI is renting from Microsoft to provision the API. We account for the standard utilization rates of cloud services (demand varies over the day) and for headroom provisioned against API growth.
This does not align with The Information’s claim that OpenAI is “basically operating at full capacity” with the 60k GPUs they say OpenAI rents from Microsoft for non-ChatGPT inference. In fact, at full utilization, we estimate OpenAI could serve all of its gpt-4o API traffic with less than 10% of those 60k GPUs.
If 60k is accurate, then even if many of those GPUs are consumed by DALL·E 3, Whisper, and fine-tuned models, that capacity is sufficient, at historical API growth rates (available in our full report), for growth well into 2025.
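A stylized version of that capacity check, with both inputs chosen for illustration rather than taken from the report:

```python
# Stylized capacity check; both inputs are illustrative assumptions.
avg_output_tokens_per_sec = 5e5   # aggregate gpt-4o API demand (assumed)
tokens_per_sec_per_gpu = 100      # e.g. from a bandwidth model like the one above

gpus_at_full_utilization = avg_output_tokens_per_sec / tokens_per_sec_per_gpu
print(f"~{gpus_at_full_utilization:,.0f} of 60,000 GPUs")   # ~5,000, under 10%
```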
FutureSearch’s July OpenAI revenue report breaks down the evidence for ~$500M ARR for the API business, not the $1B reported in the news. In fact, we expect monthly API revenue to drop significantly following the nearly 2x price reduction that came with the August 2024 GPT-4o model.
The full report breaks down June’s API traffic: how many input tokens queries carry, how many output tokens they produce, and how quickly this traffic is growing.
These numbers show that, as customers switched from GPT-4-Turbo to the cheaper GPT-4o, and then again to the still-cheaper GPT-4o-2024-08-06, they are likely spending less overall even with increased usage.
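To see why, note that spend is just tokens times price. Using OpenAI’s published per-token output prices and an assumed 1.8x growth in usage:

```python
# Spend math: prices are OpenAI's published output-token prices;
# the usage growth factor is an assumption.
old_price = 30.0     # $ per 1M output tokens, GPT-4-Turbo
new_price = 10.0     # $ per 1M output tokens, GPT-4o-2024-08-06
usage_growth = 1.8   # assumed: token usage grew 1.8x over the same period

relative_spend = (usage_growth * new_price) / old_price
print(round(relative_spend, 2))   # 0.6: spend down 40% despite 80% more usage
```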
In the report, we postulate that while the ~2x price reduction from GPT-4-Turbo to GPT-4o came largely from cost savings (see next section), the ~2x price reduction from GPT-4o to GPT-4o-2024-08-06 is likely due to competition, coming straight from OpenAI’s profit margins rather than reflecting another breakthrough cost reduction.
Memory bandwidth, not FLOPs, is the bottleneck in how many GPT-4o tokens per second a GPU can produce. Because each generated token requires streaming weights and cache from memory, GPT-4-class models on A100 or H100 chips cannot come close to saturating the chips’ compute.
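A back-of-the-envelope check with public A100 specs (the batch size is an assumption): the chip needs roughly 156 FLOPs of work per byte read from memory to stay busy, while decoding supplies only about as many FLOPs per byte as there are sequences in the batch.

```python
# Why decode is bandwidth-bound on an A100.
peak_flops = 312e12     # dense fp16/bf16 tensor-core peak, FLOP/s
peak_bw = 2.0e12        # A100 80GB memory bandwidth, bytes/s

chip_ratio = peak_flops / peak_bw   # ~156 FLOPs needed per byte read
# Decoding at batch size B reuses each 2-byte weight for ~2*B FLOPs,
# i.e. roughly B FLOPs per byte of weights read.
batch = 32                          # assumed concurrent sequences
decode_ratio = batch

print(chip_ratio, decode_ratio)     # 156.0 vs 32: the GPU waits on memory
```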
Part of the ~5x cost reduction between GPT-4-0314 in March 2023 and GPT-4o in June 2024 comes from better memory usage.
With GPT-4-0314, the key and value vectors needed to compute attention between tokens are stored in a key-value (KV) cache, and the entire cache is read for each generated token, so memory costs balloon as the context window grows.
GPT-4o (and other modern models) very likely uses some form of sparse attention, so each new token attends to, and the model stores, only a subset of the prior tokens’ key-value entries. This is almost mathematically necessary to support the very large context windows models now have, and it explains why the throughput we find is much higher than guesses in the media.
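A toy comparison makes the point. The model dimensions here are assumptions (GPT-4o’s internals are not public), with a sliding window standing in for whatever sparse scheme is actually used:

```python
# Toy KV-cache sizing: full attention vs a sliding-window sparse pattern.
# Model dimensions are assumptions, not known GPT-4o internals.
n_layers, n_kv_heads, head_dim = 120, 8, 128
bytes_per_el = 2                    # fp16/bf16

def kv_cache_bytes(tokens_kept):
    # 2x for keys and values, per layer, per KV head
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el * tokens_kept

full_ctx = 128_000                  # a GPT-4o-scale context window
window = 4_096                      # assumed sliding-window size

print(f"full:   {kv_cache_bytes(full_ctx) / 1e9:.1f} GB per sequence")   # ~62.9 GB
print(f"window: {kv_cache_bytes(window) / 1e9:.2f} GB per sequence")     # ~2.01 GB
```

At these assumed dimensions, a single dense full-context sequence consumes most of an A100’s 80 GB, leaving little room for batching; some form of sparsity or cache compression is hard to avoid.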
The full calculation, including how many GPT-4o tokens per second a single A100 can generate, is included in the full report.
The currently healthy, but steeply falling, profit margins on serving LLMs over an API point to two scenarios.
Scenario one: competition intensifies. With Meta’s Llama, Google’s Gemini, and Anthropic’s Claude all comparable in quality, OpenAI may feel compelled to cut prices in half again (or more!). With margins above 50%, they can do this without running at a loss.
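The arithmetic behind that claim: halving the price doubles inference cost as a share of revenue, so any margin above 50% survives the cut.

```python
# Halving the price doubles inference cost as a share of revenue.
margin = 0.55                     # projected gross margin after August 2024 pricing
cost_share = 1 - margin           # 0.45 of revenue goes to inference
margin_after_halving = 1 - 2 * cost_share
print(round(margin_after_halving, 2))   # 0.1: thin, but not a loss
```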
Perfect competition drives profit margins toward zero in any industry. Even in cloud computing, a trillion-dollar industry, AWS might be one of the only profitable outfits.
Scenario two: quality differentiates. Opinions vary on how much quality improved across the GPT-4 models discussed in this article. In our experience at FutureSearch, GPT-4o is dramatically better at problem solving than GPT-4-0314. (But Claude-3.5-sonnet is still the best model for real-world workflows.)
GPT-5 and Claude-3.5-opus may arrive soon: more intelligent, costlier to serve, and priced higher. If labs produce LLMs with capabilities others cannot match, they can charge a hefty price premium and earn a solid profit stream for an extended period.
Making an overall projection requires the full numbers in the report.