OpenAI concludes its 12 days of announcements by previewing its next frontier models, o3 and o3-mini, and opening early access to safety and security researchers.
The company invites applications from the research community to explore and test these systems prior to any public release. Applications, which open today, can be submitted via the OpenAI blog post.
During the livestream presentation, OpenAI shared early evaluations of o3 and o3-mini to illustrate their performance compared to o1 and o1-mini. While these are early versions of the models and the final results may evolve with further post-training, the company is excited about their progress, particularly on mathematical benchmarks and new safety techniques.
Below are some highlights shared by OpenAI:
- High coding performance: o3 outperforms o1 by 22.8 percentage points on SWE-Bench Verified and achieves a Codeforces score of 2727, surpassing the 2665 score of OpenAI's Chief Scientist.
- Mathematics and science: o3 scores 96.7% on AIME 2024, missing just one question, and achieves 87.7% on GPQA Diamond, well above the performance of human experts.
- Frontier benchmarks: o3 sets new records on the hardest known assessments, solving 25.2% of problems on EpochAI's Frontier Math, a benchmark on which no other model tops 2%. On the ARC-AGI test, o3 more than triples o1's score in the low-compute setting and surpasses 85% (verified by the ARC Prize team live at 10am PST), a milestone in conceptual reasoning.
In parallel, OpenAI is publishing new research on deliberative alignment, a cutting-edge technique that was central to the alignment of o1, the company's most robust and aligned model to date.
As AI capabilities advance, the OpenAI team notes, so does the opportunity to improve model safety and ensure rigorous alignment. Hence the decision to share this work with the research community and to collaborate on the testing of o3 and o3-mini.