During its December announcement marathon, OpenAI put the spotlight on features related to ChatGPT.
In reality, all of them depend on the evolution of its large language models and its tooling: the “weapons” the startup puts in developers’ hands.
Starting with the general availability of o1, the family of “reasoning” models whose release had been expected. In addition to function calling, this LLM can produce structured outputs (conforming to supplied JSON schemas) and accept images as input. Above all, OpenAI has taken steps to lower usage costs by reducing the number of tokens spent on “reasoning”. To that end, it exposes an API parameter called “reasoning_effort”, which controls both processing time and the number of tokens needed to complete a task. The startup also introduced a new type of instruction, the developer message, intended to give developers tighter control over the model’s outputs.
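To make this concrete, here is a minimal Python sketch, assuming the official openai SDK, of an o1 call combining a developer message with the reasoning_effort parameter; the effort value, the persona and the prompt are placeholders, not taken from OpenAI’s announcement.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1",
    reasoning_effort="low",  # spend fewer "reasoning" tokens on this task
    messages=[
        # developer messages replace system prompts for o1-class models
        {"role": "developer", "content": "Answer as a concise financial analyst."},
        {"role": "user", "content": "Summarize this quarterly report in three bullet points."},
    ],
)

print(response.choices[0].message.content)
```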
An arsenal of tools to put the o1 and GPT-4o audio models to work
Features such as Advanced Voice Mode are partly enabled by improvements to the Realtime API, which now benefits from a WebRTC integration in JavaScript. That integration handles audio encoding, streaming management, noise suppression and bitrate control.
Still in beta, the Realtime API is accompanied by the availability of gpt-4o-mini-realtime-preview, a variant of the model capable of processing text and audio, as well as gpt-4o-audio-preview and gpt-4o-mini-audio-preview, both speech-to-speech models.
The API for interacting with GPT-4o by voice gains new parameters to run content-filtering checks in parallel and to add context when generating the spoken response. It is also possible to defer the answer to a question on the server side while gathering more information from the human interlocutor; the response can then be triggered manually, with an explicit API event. Sessions can now last up to 30 minutes, compared with 15 minutes previously.
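By way of illustration, here is a minimal Python sketch of a Realtime API session over WebSocket that disables automatic responses and triggers one explicitly later. It follows the beta event protocol as we understand it (session.update, response.create); the create_response flag, the model name and the instructions are assumptions to verify against the documentation.

```python
import asyncio
import json
import os

import websockets  # pip install websockets


async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-mini-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older websockets releases call this argument extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Keep server-side voice activity detection, but do not answer automatically
        # (create_response is an assumed field to check against the documentation).
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a patient support agent.",
                "turn_detection": {"type": "server_vad", "create_response": False},
            },
        }))
        # ...once enough context has been gathered, trigger the spoken answer explicitly.
        await ws.send(json.dumps({"type": "response.create"}))
        print(json.loads(await ws.recv()).get("type"))  # e.g. "session.created"


asyncio.run(main())
```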
Here too, OpenAI says it has drastically cut the pricing of its voice-to-voice models.
“The price of GPT-4o mini audio is $10 per million input tokens and $20 per million output tokens,” says the supplier in its documentation. “Text tokens are priced at $0.60 per million input tokens and $2.40 per million output tokens. Cached audio and text inputs both cost $0.30 per million tokens.”
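For a rough sense of scale, here is an illustrative calculation using the rates quoted above; the token counts are invented for the example.

```python
# Illustrative cost estimate using the gpt-4o-mini audio rates quoted above.
# Token counts are invented for the example; real usage is reported by the API.
AUDIO_IN, AUDIO_OUT = 10.00, 20.00  # $ per million audio tokens
TEXT_IN, TEXT_OUT = 0.60, 2.40      # $ per million text tokens

audio_in_tok, audio_out_tok = 4_000, 6_000  # a short spoken exchange
text_in_tok, text_out_tok = 500, 300        # system prompt and transcript

cost = (audio_in_tok * AUDIO_IN + audio_out_tok * AUDIO_OUT
        + text_in_tok * TEXT_IN + text_out_tok * TEXT_OUT) / 1_000_000
print(f"${cost:.4f}")  # about $0.16 for the whole exchange
```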
Developers also get beta access to SDKs for Java and Go, alongside the development kits already available for Python, .NET and TypeScript/JavaScript.
OpenAI pushes companies to fine-tune its models
OpenAI also believes it can steer developers toward lightweight fine-tuning with preference fine-tuning techniques, and more specifically Direct Preference Optimization (DPO). This technique, usually applied in the final training phases of an LLM, consists of supplying a dataset of questions or prompts paired with answers compared by annotators: some are considered ideal, others “suboptimal” or less good (“A is better than B”). These comparison labels are then used to optimize the model’s parameters directly, reducing the amount of unwanted output.
This method should be distinguished from supervised fine-tuning, which relies on fixed labels and is better suited to obtaining factual, more structured answers. “By learning from pairwise comparisons rather than fixed targets, Preference Fine-Tuning is particularly effective for subjective tasks where tone, style and creativity are important,” says OpenAI. Around ten samples would reportedly be enough to start seeing results.
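For reference, a minimal sketch of what launching a preference fine-tuning job could look like with the openai Python SDK; the file name, base model and beta value are placeholders, and the JSONL layout and method parameter follow our reading of the fine-tuning documentation, so they should be checked against it.

```python
from openai import OpenAI

client = OpenAI()

# Each line of the JSONL file pairs a prompt with a preferred and a rejected answer,
# e.g. (one record, wrapped here for readability):
# {"input": {"messages": [{"role": "user", "content": "Write a tagline for a tea shop"}]},
#  "preferred_output": [{"role": "assistant", "content": "Steeped in calm."}],
#  "non_preferred_output": [{"role": "assistant", "content": "We sell tea."}]}
training_file = client.files.create(
    file=open("preferences.jsonl", "rb"), purpose="fine-tune"
)

job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",  # placeholder base model
    training_file=training_file.id,
    method={
        "type": "dpo",
        "dpo": {"hyperparameters": {"beta": 0.1}},  # beta weights the preference signal
    },
)
print(job.id, job.status)
```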
Speaking of reinforcement learning, OpenAI is launching a research program dedicated to “reinforcement fine-tuning” (RFT), a technique for refining the knowledge of LLMs through a training loop involving a reward model, the component OpenAI regards as key to its technical success.
Here, the startup intends to give major accounts and researchers access to a tool (in alpha) for specializing the o1 LLM in “complex” tasks in their respective fields: finance, legal, medical research, etc., starting from a few thousand examples.
Each dataset entry is composed of a case, a series of instructions and a “correct” answer that is hidden from the LLM being trained, but used by the reward model to evaluate the LLM’s prediction.
There are several reward models, here called “evaluators”, depending on the task to be performed. According to OpenAI’s researchers, RFT training would take from a few hours to several days depending on the amount of input data. The technique would make it possible to train an o1-mini model so that it outperforms a generic o1 LLM on an “expert” task such as identifying the gene(s) responsible for a disease. RFT is due to enter general availability next year.
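To make the dataset shape described above concrete, here is an invented, schematic example of one RFT training item and a toy evaluator; the field names, data and grading logic are illustrative only, not OpenAI’s documented format.

```python
# Schematic RFT training item: the reference answer is never shown to the model
# being trained; it is only consumed by the evaluator that scores each prediction.
# Field names and values are invented for illustration.
rft_item = {
    "case": "45-year-old patient, progressive muscle weakness, elevated CK levels, ...",
    "instructions": "List the genes most likely responsible, ranked by likelihood.",
    "reference_answer": ["DMD", "CAPN3"],  # hidden from the fine-tuned model
}


def grade(prediction: list[str], reference: list[str]) -> float:
    """Toy evaluator: partial credit for each reference gene found, order ignored."""
    if not reference:
        return 0.0
    hits = sum(1 for gene in reference if gene in prediction)
    return hits / len(reference)


print(grade(["DMD", "TTN"], rft_item["reference_answer"]))  # 0.5
```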
OpenAI unveils o3 (so as not to offend Telefónica)
But the most anticipated launch of next year is none other than that of o3. “You might have thought this model would be called o2, but out of respect for our friends at Telefónica [owner of the O2 operator and brand, editor’s note] and in the great OpenAI tradition of being truly terrible at coming up with names, it will be called o3,” says Sam Altman, co-founder and CEO of OpenAI, in the last video of the series of announcements.
These “reasoning” models, trained with “chain of thought” techniques, would already outperform o1 on programming, mathematics and fundamental research tasks.
“We are almost saturating the benchmarks available to us,” says Mark Chen, senior vice president of research at OpenAI. Hence the need to find and design ever harder tests. OpenAI is partnering with the ARC Prize Foundation on a benchmark where o3 would already reach unprecedented performance levels.
While the general public and businesses will have to wait until next year to try o3 and o3-mini, researchers specializing in adversarial techniques are invited to join the safety testing program, which opened overnight from Friday to Saturday. In doing so, OpenAI wants to put to the test its (new) method of aligning responses with human preferences, called deliberative alignment.
The idea is to let the LLM “deliberate”, that is, to reason about whether a user’s request is acceptable or not, without showing that reasoning to the user, who only receives the final answer. The process differs markedly from reinforcement learning from human feedback and from the constitutional AI approach developed by Anthropic.
“We train a model in two stages to ensure it is both helpful and safe. First, the model is fine-tuned without any safety-related data. Then, we generate a dataset in which the chains of thought (CoTs) incorporate the safety specifications, thanks to enriched system prompts,” OpenAI explains in a blog post.
Through supervised fine-tuning (SFT), the model learns to reason over these safety specifications. Reinforcement learning (RL) then improves how it applies that reasoning, with a reward model aligned with the safety policies.
The safety specifications and prompts are used to generate synthetic data, which would eliminate the need for human annotation and simplify the fine-tuning loop.
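As a purely illustrative summary of the two-stage recipe described above, the pipeline could be sketched as follows; every helper name and value is made up, and this is a schematic of the idea, not OpenAI’s code.

```python
# Illustrative sketch of the two-stage deliberative-alignment recipe described above.
# All functions are stand-ins with made-up names; nothing here is OpenAI's actual code.

SAFETY_SPEC = "Refuse requests that facilitate harm; explain refusals briefly."


def generate_cot_dataset(prompts: list[str]) -> list[dict]:
    """Synthetic data generation: a helpful-only model, given the safety spec in an
    enriched system prompt, produces chains of thought that reference that spec."""
    return [
        {
            "prompt": p,
            # In the real setup the CoT is produced by a model; here it is canned.
            "cot": f"The spec says: '{SAFETY_SPEC}'. This request looks benign.",
            "answer": "Here is a safe, helpful answer.",
        }
        for p in prompts
    ]


def supervised_fine_tune(dataset: list[dict]) -> str:
    """Stage 1: SFT teaches the model to reason over the spec (stub)."""
    return f"sft-model trained on {len(dataset)} spec-grounded CoT examples"


def spec_aligned_reward(cot: str, answer: str) -> float:
    """Toy reward: favors outputs whose reasoning actually references the spec."""
    return 1.0 if "spec" in cot.lower() else 0.0


def reinforcement_learning(model: str, reward_fn) -> str:
    """Stage 2: RL sharpens how the model uses that reasoning, scored by a
    spec-aligned reward model (stub)."""
    return model + " + RL with spec-aligned reward"


dataset = generate_cot_dataset(["Summarize this contract", "How do I pick a lock?"])
model = supervised_fine_tune(dataset)
model = reinforcement_learning(model, spec_aligned_reward)
print(model)
```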