Researchers from the University of Illinois at Urbana-Champaign and Sony AI have just unveiled a particularly impressive technical feat in the field of generative AI: their new project, MMAudio, automatically generates soundtracks tightly synchronized with any video.
Unlike most existing solutions, which often just clumsily superimpose sound effects on videos, MMAudio innovates by adopting a multimodal approach combining text and video. The system analyzes visual content and textual descriptions in parallel to understand precisely what is happening on the screen and generate realistic and coherent sounds.
Here is a demo that shows videos without sound to which MMAudio was then applied:
The model was trained on several large datasets, including AudioSet, Freesound, VGGSound and AudioCaps. Thanks to this multimodal training, MMAudio develops a solid understanding of the relationships between image and sound, and produces results far superior to previous solutions.
For example, if you show a video of waves crashing onto a beach, with the description “sound of waves on a deserted beach,” MMAudio will generate a realistic surf sound, carefully synchronized with the movements of the water. It's not bad, right?
Despite its sophistication, installing MMAudio remains simple. Here's how to do it:
Clone the repository:
git clone https://github.com/hkchengrex/MMAudio.git
Install PyTorch and its companion libraries (the command below targets the CUDA 11.8 builds):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
Then install MMAudio:
cd MMAudio
pip install -e .
And there you have it, you are ready to generate your first soundtracks!
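For a first test, the repository ships a demo.py script. Here is a minimal example reusing the beach scene from earlier; the flag names follow the project's README at the time of writing and the video path is a placeholder, so adapt both to your setup:
# point --video at any silent clip of your own (the path below is hypothetical)
python demo.py --duration=8 --video=/path/to/beach.mp4 --prompt "sound of waves on a deserted beach"
The generated audio, along with a copy of the video carrying the new soundtrack, should then appear in the repository's output folder.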
One of MMAudio's great strengths is its speed: the system can generate 8 seconds of audio in just 1.23 seconds on a modern GPU.
Several variants are offered to adapt to different uses (an example of selecting one follows the list):
- 16 kHz version for “standard” sounds
- 44.1 kHz version for CD-like quality
- Different model sizes (S, M, L) depending on the power and level of detail desired
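To try a specific variant, the same demo.py script accepts a --variant flag (the large 44.1 kHz model is the default). The variant names below are the ones listed in the repository at the time of writing and may evolve, so treat this as an illustrative sketch and check the README if a name is rejected:
# smaller, faster 16 kHz model
python demo.py --variant=small_16k --video=/path/to/clip.mp4 --prompt "birdsong in a forest"
# larger 44.1 kHz model for maximum quality
python demo.py --variant=large_44k_v2 --video=/path/to/clip.mp4 --prompt "birdsong in a forest"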
The prospects for MMAudio extend to many areas. In video production, for example, the system could automatically enrich content with appropriate soundscapes. Video games could benefit from reactive sounds that adapt to player actions, and virtual reality would likely gain truly immersive audio environments.
However, like any artificial intelligence system, MMAudio has certain limitations. Human voice generation remains a major challenge and still requires significant improvement. Particularly complex musical sounds can vary in quality, and the system still struggles to reproduce some very specific sound concepts. According to the development team, these limitations can be overcome by enriching the training data.
Ethically, the creators of MMAudio have adopted a responsible approach. They chose to make the source code open and transparent, accompanied by exhaustive documentation detailing the capabilities and limitations of the system. Extensive testing was carried out before publication, and particular attention was paid to compliance with the licenses associated with the datasets used.
Now if you want to add sound to your videos, an online demo is available here.
Thanks to Lorenper for sharing this discovery!