Kyutai lifts the veil a little more on its conversational AI Moshi

In early July, the French artificial intelligence research laboratory set the scene with a demonstration of its Moshi assistant, whose highly responsive speech synthesis can faithfully reproduce human intonations. Today, the lab is sharing additional details on how the assistant was developed. It is also an opportunity to compare Helium, the 7-billion-parameter language model behind Moshi, with equivalent models.

Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It relies on Mimi, a streaming neural audio codec that compresses 24 kHz audio down to a 12.5 Hz representation with a bandwidth of 1.1 kbps, in a fully streaming manner (80 ms latency).
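
Those figures are internally consistent, as the short calculation below checks; the codebook count and size are assumptions about Mimi's quantizer rather than numbers given in the article.

    import math

    sample_rate_hz = 24_000                              # input sampling rate
    frame_rate_hz = 12.5                                 # rate of the latent representation
    samples_per_frame = sample_rate_hz / frame_rate_hz   # 1920 samples
    frame_duration_ms = 1000 / frame_rate_hz             # 80 ms, the quoted streaming latency

    num_codebooks = 8                                    # assumption: residual-VQ depth
    codebook_size = 2048                                 # assumption: entries per codebook
    bits_per_frame = num_codebooks * math.log2(codebook_size)  # 88 bits
    bitrate_kbps = bits_per_frame * frame_rate_hz / 1000       # 1.1 kbps, matching the article

    print(samples_per_frame, frame_duration_ms, bitrate_kbps)  # 1920.0 80.0 1.1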

Two audio streams managed simultaneously

Moshi models two audio streams: one corresponding to Moshi, and one corresponding to the user. During inference, the user’s stream is sampled from the audio input, and Moshi’s stream is sampled from the model’s output. Moshi then predicts text tokens corresponding to its own speech, which greatly improves the quality of its generation.
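
As a rough illustration of that loop, the pseudocode below sketches one way such dual-stream inference could be organized; every name in it (the model's step function, capture_frame, play_frame, and so on) is hypothetical and not taken from Kyutai's code.

    def run_conversation(model, codec, mic, speaker):
        # Illustrative only: names and signatures are invented for this sketch.
        state = model.initial_state()
        while True:
            # The user's stream is sampled from the audio input...
            user_tokens = codec.encode(mic.capture_frame())   # one 80 ms frame

            # ...while Moshi's stream (and its own text tokens) is sampled
            # from the model's output, conditioned on both streams so far.
            text_token, moshi_tokens, state = model.step(user_tokens, state)

            # Decode Moshi's audio tokens and play them back immediately.
            speaker.play_frame(codec.decode(moshi_tokens))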

Kyutai researchers report that a Depth Transformer models inter-codebook dependencies for a given time step, while a large 7-billion-parameter Temporal Transformer models temporal dependencies. Moshi achieves a theoretical latency of 160 ms (80 ms for Mimi's frame size plus 80 ms of acoustic delay), and a practical overall latency as low as 200 ms on an L4 GPU.
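
Schematically, generation alternates between the two models: the large Temporal Transformer advances one 80 ms time step, and the small Depth Transformer then fills in the codebooks of that step. The sketch below, with invented names and an assumed codebook count, shows the shape of that loop rather than Kyutai's actual implementation.

    def generate_frame(temporal_tf, depth_tf, history, num_codebooks=8):
        # 7B Temporal Transformer: one forward step over everything generated so far.
        context = temporal_tf.step(history)

        # Smaller Depth Transformer: emits the codebooks of the current frame one
        # by one, each conditioned on the context and the previous codebooks.
        codes = []
        for _ in range(num_codebooks):
            codes.append(depth_tf.sample_next(context, codes))
        return codes   # the Mimi tokens for this 80 ms frame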

Helium, a compact and high-performance model

After a pre-training phase on large-scale audio, the researchers created synthetic conversations with their own models: Helium writes scripts, which the multi-stream TTS then converts into full-duplex conversations. In total, 20,000 hours of data were created, with varying recording conditions and accents for the user while keeping Moshi’s voice constant. “This makes it robust to noisy environments, whilst ensuring Moshi stays in character,” assures Kyutai.
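
That data-generation pipeline, as described, might be outlined as follows; the helper objects (helium.write_script, multistream_tts.render, the voice and noise lists) are hypothetical stand-ins for Kyutai's internal tooling.

    import random

    def synthesize_conversation(helium, multistream_tts, user_voices, noise_profiles):
        # 1) Helium writes the text of a two-speaker dialogue.
        script = helium.write_script()

        # 2) A multi-stream TTS renders the script as two synchronized audio tracks:
        #    Moshi's voice stays constant, the user's voice and conditions vary.
        return multistream_tts.render(
            script,
            moshi_voice="reference",                 # kept constant across the corpus
            user_voice=random.choice(user_voices),   # varied accents
            noise=random.choice(noise_profiles),     # varied recording conditions
        )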

The lab conducted extensive evaluations of Helium, Mimi and Moshi, covering text quality, audio language modeling and spoken question answering. “We find that Moshi significantly outperforms previously published models, while having the unique ability to model full-duplex streaming conversations.”

Helium notably scores higher than the Falcon, Llama 2 and OLMo models on benchmarks such as general knowledge and reasoning (MMLU) and scientific reasoning (ARC). Its results, however, remain below those of the Mistral AI and Google models.

Three models published under a very permissive license

To date, the lab has released three models: the Mimi audio codec, plus Moshiko and Moshika, two versions of Moshi fine-tuned from the demo model with a male and a female synthetic voice respectively. Both voices were generated artificially.
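
For readers who want to experiment with the released Mimi codec, a minimal encode/decode round trip might look like the sketch below. It assumes the Hugging Face transformers integration of Mimi and the kyutai/mimi checkpoint, neither of which is mentioned in the article.

    import torch
    from transformers import AutoFeatureExtractor, MimiModel  # assumes a recent transformers release

    model = MimiModel.from_pretrained("kyutai/mimi")
    feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

    # One second of (silent) audio at Mimi's expected 24 kHz sampling rate.
    audio = torch.zeros(24_000).numpy()
    inputs = feature_extractor(raw_audio=audio,
                               sampling_rate=feature_extractor.sampling_rate,
                               return_tensors="pt")

    with torch.no_grad():
        codes = model.encode(inputs["input_values"]).audio_codes   # discrete Mimi tokens
        reconstruction = model.decode(codes).audio_values          # waveform back at 24 kHz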

All of these models have been released under the CC BY 4.0 license, which allows others to distribute, remix, adapt and build upon them, even for commercial purposes, as long as they credit Kyutai for the original creation. It is the most flexible of the Creative Commons licenses.
