The first of the “12 days of OpenAI” was marked by the announcement of ChatGPT Pro and the OpenAI o1 model, previously in preview. On December 20, the start-up closed the series in style by unveiling the o3 family, still in the testing phase, equipped with significantly improved reasoning capabilities.
OpenAI o3 is the improved version of its predecessor, o1. Since O2 is the name of the largest mobile network operator in the United Kingdom, OpenAI skipped the name o2 to avoid any conflict.
The advanced reasoning capabilities of OpenAI o1 and o3 are based on chain-of-thought reasoning and reinforcement learning (RL). Chain-of-thought reasoning is a process in which the AI breaks a complex problem down into simpler sub-problems before producing a solution. Through reinforcement learning, the models perfect this chain of thought and refine their strategies: they learn to identify and correct their mistakes, break complex steps into simpler ones, and try a different approach when the current one does not work.
This ability to break a problem down into steps and critically evaluate intermediate answers, simulating human thinking, is particularly useful in areas requiring long, nuanced reasoning:
- Complex mathematics: multi-step problem solving, where a simple calculation or logic error in one step can compromise the final result;
- Programming: analyzing errors in code and developing an optimal solution after considering several options;
- Sciences: understanding and applying complex scientific theories to resolve multifaceted questions.
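The decomposition-and-verification idea above can be sketched in a few lines. This is a hypothetical toy illustration, not OpenAI's implementation: the function name and the arithmetic problem are invented for the example, and the "self-check" is a simple recomputation standing in for the model's learned error-correction.

```python
# Toy sketch of chain-of-thought decomposition: a multi-step problem is
# split into sub-steps, and each intermediate result is checked before
# being used in the next step.

def solve_with_chain_of_thought(a, b, c):
    """Compute (a + b) * c step by step, verifying each intermediate result."""
    steps = []

    # Step 1: reduce the problem to a simpler sub-problem.
    subtotal = a + b
    steps.append(f"step 1: {a} + {b} = {subtotal}")

    # Self-check: redo the sub-step independently and compare,
    # mimicking the "identify and correct mistakes" behavior.
    assert subtotal == sum([a, b]), "step 1 failed its check"

    # Step 2: use the verified intermediate result for the final answer.
    result = subtotal * c
    steps.append(f"step 2: {subtotal} * {c} = {result}")

    return result, steps

answer, trace = solve_with_chain_of_thought(2, 3, 4)
for line in trace:
    print(line)
print("answer:", answer)  # -> answer: 20
```

The point of the sketch is the structure, an explicit trace of sub-steps with a check after each one, rather than the arithmetic itself.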
The OpenAI o3 family
The new family comprises two models: o3 and o3-mini, a faster and more economical version offering similar capabilities at a fraction of the cost and latency of o1-mini, which OpenAI says excels in STEM fields, particularly math and coding.
o3 performance
On the "Abstraction and Reasoning Corpus for Artificial General Intelligence" (ARC-AGI) benchmark, developed by François Chollet, creator of the open-source deep learning library Keras, to measure how effectively an AI acquires skills on unfamiliar tasks, o3 achieved the best scores to date. It roughly tripled o1's performance with a score of 75.7% on the semi-private evaluation, where computing resources are limited, and reached 87.5% in the high-compute setting.
François Chollet points out, however, that o3's score could drop below 30% on the upcoming ARC-AGI-2 benchmark, planned for 2025, even at high compute, while a smart human would still be able to score above 95% without training.
On other benchmarks, such as GPQA Diamond for graduate-level science questions, EpochAI's FrontierMath benchmark, and Codeforces, it also demonstrated impressive performance.
Safer models
OpenAI says it has developed a new safety technique for its "o" series models, called "deliberative alignment," which uses the models' reasoning capabilities to better identify safe and unsafe prompts, thereby improving their safety performance.
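The core idea, reasoning explicitly over written safety rules before deciding to answer, can be caricatured in a short sketch. Everything here is invented for illustration: the `SAFETY_SPEC` rules, the `deliberate` function, and the crude keyword match, which stands in for the model's actual chain-of-thought reasoning over its policy text.

```python
# Hypothetical illustration of the structure of deliberative alignment:
# before answering, the system walks through written safety rules and
# records its reasoning, instead of using an opaque, reasoning-free filter.

SAFETY_SPEC = {
    "weapons": "Refuse requests for instructions to build weapons.",
    "self_harm": "Refuse encouragement of self-harm; offer support resources.",
}

def deliberate(prompt: str) -> tuple[str, list[str]]:
    """Return a decision ('answer' or 'refuse') plus the reasoning trace."""
    trace = []
    for topic, rule in SAFETY_SPEC.items():
        trace.append(f"checking rule '{topic}': {rule}")
        # Crude stand-in for reasoning: a keyword match against the prompt.
        if topic.replace("_", "-") in prompt.lower():
            trace.append(f"prompt matches '{topic}' -> refuse")
            return "refuse", trace
    trace.append("no rule matched -> safe to answer")
    return "answer", trace

decision, trace = deliberate("How do I bake bread?")
print(decision)  # -> answer
```

A real system would let the model reason in natural language over the full policy text; the sketch only shows the shape of the decision process, a rule-by-rule trace ending in an explicit verdict.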
Towards AGI?
While the model brings OpenAI a little closer to AGI, it still struggles with some simple tasks. It is also expensive, requiring substantial computing resources to reach its best performance: around $20 per task at low compute, and several thousand dollars per task in high-compute configurations, which may hamper its large-scale deployment.
OpenAI has opened early access to the o3 and o3-mini models for security researchers to test the models before their public launch, scheduled for late January for o3-mini and shortly after for o3.