DevDay: OpenAI completes its toolbox for developers

At its DevDay 2024 event in San Francisco, OpenAI announced a collection of tools for developers and data scientists that build on its large language models.

The startup first presented the Realtime API beta. As its name suggests, it is designed for building “low-latency multimodal conversational experiences”. This WebSocket interface accepts text and audio as input and output, and supports function calling.

Realtime API: giving GPT-4o a voice

In practice, it should make it possible to build chatbots and voice agents embedded in applications. According to OpenAI, developers no longer have to convert audio to text (speech-to-text) first, as they currently do with Whisper, AssemblyAI or Gladia. This speech-to-speech system directly returns an audio response from a model capable of changing its tone, laughing or even whispering. Six preset voices are available.

This model currently bears an unwieldy name – gpt-4o-realtime-preview-2024-10-01 – and is derived from GPT-4o. The technology is similar to what OpenAI offers in ChatGPT’s Advanced Voice mode.
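For illustration, a session with this beta could be opened roughly as follows in Python. This is a sketch, assuming the third-party websockets package; event names and headers follow the beta documentation at launch and may change.

```python
import asyncio
import json
import os

import websockets  # third-party package, assumed installed

MODEL = "gpt-4o-realtime-preview-2024-10-01"
URL = f"wss://api.openai.com/v1/realtime?model={MODEL}"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",  # beta opt-in header
    }
    # `extra_headers` applies to websockets <= 12; newer releases
    # renamed the argument to `additional_headers`.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Ask the model for a spoken and written answer.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Greet the user in a cheerful tone.",
            },
        }))
        # Stream server events until the response completes.
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])
            if event["type"] == "response.done":
                break

asyncio.run(main())
```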

Currently, the API supports raw 16-bit PCM audio at 24 kHz and base64-encoded G.711 audio at 8 kHz. “We will work on adding support for more audio codecs soon,” OpenAI’s documentation states.
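Captured audio is sent to the session as base64-encoded chunks. A minimal sketch, assuming 16-bit/24 kHz mono PCM input and an open Realtime session like the one above:

```python
import base64
import json

def pcm16_append_event(pcm_bytes: bytes) -> str:
    """Wrap raw little-endian PCM16 samples in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

# Example: 100 ms of silence at 24 kHz, 16-bit mono (2 bytes per sample).
silence = b"\x00\x00" * (24_000 // 10)
# await ws.send(pcm16_append_event(silence))  # on an open Realtime WebSocket
```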

The model can respond to a query with audio, text, or both. OpenAI already envisions ingesting images and videos as input. It plans to raise the current usage limit (a maximum of 100 concurrent sessions) and to provide SDKs for developers.

The Chat Completions API will also be able to respond with an audio file, using the gpt-4o-audio-preview model.
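A hedged sketch of such a request, assuming the official openai Python SDK and the parameters described in the announcement:

```python
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},  # one of the preset voices
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)

# The spoken answer comes back base64-encoded alongside a transcript.
with open("hello.wav", "wb") as f:
    f.write(base64.b64decode(completion.choices[0].message.audio.data))
```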

GPT-4o mini will also be adapted to the Realtime API.

“The Realtime API uses both text and audio tokens. Text input tokens are priced at $5 per 1 million units, and output tokens are priced at $20 per 1 million,” OpenAI says. “Audio input is charged at $100 for 1 million tokens, and audio output at $200 for the same quantity of units. This works out to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be the same price.”
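Working backwards from these figures gives the implied token rates; the tokens-per-minute values below are inferred from OpenAI’s prices, not published numbers:

```python
# At $100 per 1M audio input tokens, $0.06/min implies ~600 tokens per
# minute of input audio; at $200 per 1M, $0.24/min implies ~1,200
# output tokens per minute.
AUDIO_IN_PER_M, AUDIO_OUT_PER_M = 100.0, 200.0  # USD per 1M tokens

def tokens_per_minute(price_per_minute: float, price_per_m: float) -> float:
    return price_per_minute / (price_per_m / 1_000_000)

print(tokens_per_minute(0.06, AUDIO_IN_PER_M))   # -> 600.0
print(tokens_per_minute(0.24, AUDIO_OUT_PER_M))  # -> 1200.0
```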

OpenAI does not specify whether the service is available in Europe, but Azure says it hosts the gpt-4o-realtime-preview-2024-10-01 LLM in its Swedish cloud region (Sweden Central).

Furthermore, Twilio, publisher of a CPaaS (Communications Platform-as-a-Service), will integrate it into its toolset.

Here, OpenAI treads on the turf of Kyutai, the AI research laboratory backed by Xavier Niel. A few months ago, the French laboratory presented Moshi, a conversational AI with similar capabilities.

Automated context caching

With the availability of Prompt Caching, the LLM provider now offers a capability Google Cloud already provides with Gemini 1.5 Pro and 1.5 Flash under the name “context caching”.

As with GCP, OpenAI caches users’ conversations and developers’ most frequently used prompts. The promise: up to 80% lower latency and up to 50% lower cost on the longest prompts.

The difference is that at OpenAI this is not optional: “Prompt Caching works automatically on all your API requests (no code changes required) and incurs no additional cost.”

This mode is not yet activated for all LLMs. It is enabled for GPT-4o, GPT-4o mini, o1-preview and o1-mini.

To benefit from this caching, however, prompts must be structured appropriately.

“Caches are only possible for exact prefix matches within a prompt,” the vendor’s documentation specifies. “[…] Place static content, like instructions and examples, at the beginning of your prompt, and place variable content, such as user-specific information, at the end.” This rule also applies to images and tools, which must be identical from one request to the next.

Caching kicks in at 1,024 tokens, with increments of 128 tokens beyond that.
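A minimal sketch of a cache-friendly request, assuming the openai Python SDK; the static system prompt is a placeholder that must exceed the 1,024-token threshold and stay byte-identical across calls:

```python
from openai import OpenAI

client = OpenAI()

# Static prefix: instructions and examples, identical on every call and
# long enough (>1,024 tokens) for caching to kick in.
STATIC_INSTRUCTIONS = "You are a support agent for ACME. " + "<long examples> " * 300

def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_INSTRUCTIONS},  # exact prefix
            {"role": "user", "content": question},  # variable part goes last
        ],
    )
    # On supported models, usage reports how much of the prompt was cached.
    print(resp.usage.prompt_tokens_details.cached_tokens)
    return resp.choices[0].message.content
```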

The lifetime of this cache is relative, however. “Cache evictions typically occur after 5-10 minutes of inactivity, but caches can sometimes persist for up to an hour during off-peak periods,” the documentation reads. There is no manual cache-clearing function yet.

OpenAI insists that caches are not shared between customers.

Turnkey fine-tuning

Two other APIs are dedicated to fine-tuning OpenAI LLMs.

First, there is Model Distillation, which should help developers and data scientists streamline the process of fine-tuning a smaller model using the outputs of a larger LLM.

The first step is to build an evaluation procedure using the Evals module, available in beta through an API or a UI. OpenAI offers several test criteria, including the validity of generated JSON or XML files, factuality and semantic quality.

Next, synthetic data must be generated by calling an LLM such as GPT-4o or o1-preview. The results are stored using the “store: true” option of the Chat Completions API.
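As a sketch, storing the teacher model’s outputs could look like this (the metadata tag is illustrative):

```python
from openai import OpenAI

client = OpenAI()

for question in ["What is a WebSocket?", "Define prompt caching."]:
    client.chat.completions.create(
        model="gpt-4o",  # the large "teacher" model
        store=True,      # persist the completion for later distillation
        metadata={"task": "distillation-demo"},  # illustrative tag
        messages=[{"role": "user", "content": question}],
    )
```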

This data can then be used to fine-tune GPT-4o mini or other smaller LLMs, whose results are evaluated again with the Evals module. OpenAI warns that this is an iterative process: the operation will have to be repeated before the results are good enough for production.
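The fine-tuning step itself goes through the existing fine-tuning API; a hedged sketch, with an illustrative file name and model snapshot:

```python
from openai import OpenAI

client = OpenAI()

# Upload the training data (e.g. stored completions exported as JSONL).
training = client.files.create(
    file=open("distillation_data.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)
# Launch the job on the smaller "student" model.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",  # assumed student snapshot
    training_file=training.id,
)
print(job.id, job.status)
```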

OpenAI says Model Distillation works with all of its LLMs, as long as the teacher model is larger than the student. While generating test data is billed at the same price as a standard API call, users can obtain up to seven free evaluations per week if they agree to share the results with OpenAI. “The information shared with us will be used to help us improve and evaluate our future models,” the vendor promises.

Until October 31, OpenAI is offering up to 2 million free tokens per day for GPT-4o mini and up to 1 million tokens daily for GPT-4o to test Model Distillation.

Finally, an API is dedicated to vision fine-tuning of GPT-4o. According to OpenAI, as few as 100 images are enough to improve the multimodal LLM’s performance at understanding visual elements and accomplishing certain tasks. It has already been tested to improve the detection of road signs, the extraction of structured data from PDF documents, and the generation of code powering a web front end.

This fine-tuning of the VLM (vision language model) can use a maximum of 50,000 examples containing up to 10 images each. Here again, OpenAI is offering 1 million free training tokens per day until October 31 for fine-tuning GPT-4o with images.
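One training example in the JSONL file could look like the following sketch, following the chat format OpenAI documents for image fine-tuning; the URL and labels are placeholders:

```python
import json

example = {
    "messages": [
        {"role": "system", "content": "Identify the road sign."},
        {"role": "user", "content": [
            {"type": "text", "text": "Which sign is this?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sign_042.png"}},
        ]},
        {"role": "assistant", "content": "Yield sign."},
    ]
}

# Append one example per line to the training file.
with open("vision_train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```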

Beyond that, training will cost $25 per 1 million tokens, and inference $3.75 per 1 million input tokens and $15 per 1 million output tokens. As a reminder, an image’s resolution determines the number of input tokens it consumes.
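The rule OpenAI documents for GPT-4o vision input is roughly 85 base tokens plus 170 per 512×512 tile after rescaling; a sketch of that calculation, with constants taken from the public documentation at the time and to be treated as indicative:

```python
import math

def image_tokens(width: int, height: int) -> int:
    """Estimate input tokens for one image at high detail (GPT-4o rule)."""
    # Scale down to fit within 2048x2048 if needed.
    if max(width, height) > 2048:
        s = 2048 / max(width, height)
        width, height = width * s, height * s
    # Then scale so the shortest side is at most 768 px.
    if min(width, height) > 768:
        s = 768 / min(width, height)
        width, height = width * s, height * s
    # 170 tokens per 512x512 tile, plus a fixed 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(image_tokens(1024, 1024))  # 4 tiles -> 765 tokens
```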

A resolutely commercial course

These incremental announcements come in a particular context. Last week, Reuters revealed an ongoing restructuring at OpenAI. Previously governed by a non-profit board of directors, the company would fully embrace its commercial vocation in order to attract investors. The original board would not disappear, but would retain a minority stake in the company. Sam Altman, co-founder and CEO of OpenAI, could receive shares.

The same day the Reuters article was published, Mira Murati, OpenAI’s CTO, announced her departure, as did Barret Zoph and Bob McGrew, two senior researchers at the organization.

Speaking to the news agency, Sam Altman denied that these departures were linked to the restructuring project, which the board of directors has reportedly been considering for several months. A spokesperson was quick to point out that OpenAI intends to honor its commitments on responsible AI. Also last week, the company signed the European AI Pact.
