The new o3 family of models was presented as part of the “12 Days of OpenAI” event which, as the name suggests, lasted a whopping 12 days. The company waited until the very end of its marathon to announce what it had been keeping in reserve as the highlight of the show. The announcement matters all the more to OpenAI because it lets the company address two current issues: the reliability of responses and the race toward AGI.
But why o3 and not o2? According to The Information, OpenAI feared a possible trademark conflict with the telecom operator O2 in the United Kingdom. According to TechCrunch, OpenAI's own CEO, Sam Altman, half-heartedly confirmed this on Friday. It is still surprising that the name o1 was chosen in the first place, knowing that o2 would pose a problem.
What is o3?
The family currently consists of two models, o3 and o3-mini. The latter will be the first to come out: it is expected in January, while the standard o3 model should arrive later in the year, with no further details for now.
The short gap between the actual release of o1 a few weeks ago (alongside the launch of a $200-per-month ChatGPT Pro plan) and the presentation of o3 may come as a surprise. But OpenAI had a lot to say about its model: o1 represented a first step toward “reasoning”, and progress has been rapid, according to the company.
o3 is thus capable of verifying itself, to a certain extent. It uses a technique called “deliberative alignment”, about which little information is available. It is supposed to let the model avoid some of the pitfalls that usually affect large language models, including their relative inability to guarantee the reliability of information.
These verification steps add latency. Depending on the query, the response can arrive with a delay of a few seconds to several minutes compared to a classic LLM. This was already the case with o1, but o3 intensifies it. In return, the new model is expected to be more reliable in areas such as mathematics, physics and science more generally.
o3 therefore pauses before responding and takes time to think. According to OpenAI, this time is spent computing over related questions and working out, internally, how the answer is constructed. The response it finally proposes is a summary of what o3 considers to be the most accurate answer.
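OpenAI has published few details about this mechanism, so the following is only an illustrative sketch of the general idea, not the company's actual method. A self-consistency-style approach samples several independent reasoning chains for the same question and keeps the answer they converge on most often; the `ask_model` function below is a hypothetical stand-in that simulates a model call.

```python
import random
from collections import Counter

def ask_model(question: str) -> str:
    # Hypothetical stand-in for a single model call. Here we simulate a
    # model that usually answers correctly but occasionally slips.
    return random.choices(["4", "5", "3"], weights=[0.7, 0.2, 0.1])[0]

def self_consistent_answer(question: str, n_samples: int = 8) -> str:
    # Sample several independent reasoning chains for the same question,
    # then keep the answer the chains converge on most often. More samples
    # mean more compute and more latency -- the trade-off described above.
    answers = [ask_model(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistent_answer("What is 2 + 2?"))
```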
Improvements since o1
Announced just three months ago, o1 was OpenAI's first reasoning model. Its training relied on reinforcement learning (RL). o3 takes up this idea with much heavier reinforcement, though once again OpenAI gives no precise figures. “The strength of the resulting model is very, very impressive,” Nat McAleese, a researcher at OpenAI, declared on X.
Compared to o1, o3 also makes it possible to adjust the reasoning time, and therefore how thoroughly answers are verified. Three settings are available: low, medium or high. The higher the level, the more accurate the results and the longer they take to appear. Conversely, accuracy can be traded away for quick answers. But be careful: even at the highest setting, OpenAI is careful not to claim that its model makes no errors, simply that it makes fewer than o1.
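To illustrate how such a setting might surface for developers, here is a minimal sketch using OpenAI's Python SDK. It is hypothetical: o3-mini is not yet available, and the `reasoning_effort` parameter is an assumption about how the low/medium/high levels could be exposed.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical call: "o3-mini" is not available at the time of writing,
# and reasoning_effort is an assumed name for the low/medium/high setting.
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # trade extra latency for more careful answers
    messages=[
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
)
print(response.choices[0].message.content)
```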
The company nevertheless gave a series of results comparing the new model to the old one on several benchmarks:
- ARC-AGI Semi-Private Evaluation: 75.7%
- EpochAI Frontier Math: 25.2% of problems solved, compared to 2% for “other models”
- SWE-Bench Verified: 71.7 points, compared to 48.9 for o1
- Codeforces: Elo score of 2,727
- AIME 2024: 96.7%, compared to 83.3% for o1
- GPQA Diamond: 87.7%, compared to 78% for o1
“A significant step forward”
In a tweet on Friday, François Chollet, creator of Keras and of ARC-AGI, said that o3 “represents a significant step forward in adapting AI to new tasks”. He specified that the 75.7% score was obtained in low-compute mode, at a cost of about $20 per task. At the highest setting, the score rises to 87.5%, but the compute then represents “thousands of dollars per task”.
ARC-AGI was created to assess the extent to which an artificial intelligence system can acquire new skills outside of the data on which it was trained.
In a dedicated post, the ARC Prize association emphasizes the important step that o3 represents and provides some points of comparison. It took four years to go from a score of 0% with GPT-3 in 2020 to 5% with GPT-4o. ARC Prize, which aims to be “a north star toward AGI”, notes that its next tests will have to take o3 into account.
The 2025 version of the benchmark will therefore be more difficult. The association adds that o3's performance proves “that architecture is essential”, because GPT-4 could not have achieved these results simply by scaling up compute. “It's not enough to scale the way we did from 2019 to 2023,” ARC Prize adds.
And what’s next?
o3 was presented three months after o1. According to OpenAI, this pace should be maintained. The question, however, is whether the company will manage to keep refining its models while preserving the same margin for progress.
More importantly, can we say that o3 is a milestone on the road to artificial general intelligence? It is not that obvious. On the one hand, the model's results clearly seem to point in that direction. On the other hand, does this type of model necessarily represent the future? While accuracy takes a step forward, reinforcement learning and inference are far more expensive than with more traditional LLMs like the GPT series.
The latter are not necessarily obsolete for now. Even though o-series models have the advantage in accuracy, GPT keeps the advantage in cost and latency. It is difficult to generalize, however, because only the o1 model is actually available. We will have to wait until January for o3-mini to arrive and for the first “real” results, since for the moment the only figures come from OpenAI.
Finally, remember that OpenAI is not alone on the market for “reasoning” models. Less than two weeks ago, Google presented its Gemini 2.0 Flash Thinking Experimental, which can be tested via AI Studio. A few days later, a Chinese company did the same with a model called DeepSeek-R1.
One of the most interesting aspects of this new wave is that it confirms the complexity wall facing LLMs, which we discussed with data scientist Cécile Hannotte. Adding compute and GPUs is not enough; other approaches are needed. Reasoning models are one, though not necessarily as big a step toward AGI as OpenAI claims. The evolution of their performance will therefore need to be watched closely.