Meta’s Audiobox

Audiobox is Meta’s foundational research model for audio generation. It can create voices and sound effects from a combination of voice inputs and natural language text prompts, which makes it easier to produce custom audio for a wide range of use cases.

The Evolution from Voicebox to Audiobox

Audiobox builds on Voicebox, Meta’s state-of-the-art AI model for speech generation. It unifies the generation and editing of speech, sound effects, and soundscapes in a single model, and it accepts several kinds of input to give users more control over the result.

Capabilities of Audiobox

Users can describe the sound or type of speech they want in a natural language prompt. The model also accepts dual input, a voice prompt together with a text description, for versatile voice restyling, which Meta describes as a first in the field. Meta reports that Audiobox offers better controllability in speech and sound effect generation than previous models and outperforms them on quality and relevance.

The Purpose Behind Audiobox

Audio plays a central role in many forms of media, but producing it is complex and often requires extensive sound libraries and deep expertise. Audiobox is being released to select researchers and institutions to advance audio generation research and to support work on responsible AI. The broader aim is to democratize audio creation and make it accessible to everyone, from professionals to hobbyists.

Audiobox’s Technological Features

In addition to generating a wide range of sounds, Audiobox inherits Voicebox’s approach to audio generation and its flow-matching modeling method, which supports audio infilling. Infilling lets users enrich an existing soundscape, for example by adding thunder to a recording of rain.
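
The article does not describe how Audiobox implements infilling, so the following is only a toy sketch of the general idea behind infilling with a flow-based sampler: start the masked span from noise, integrate a velocity field from t = 0 to t = 1, and keep the surrounding context unchanged. Everything here is an assumption made for illustration: the names `oracle_velocity` and `infill` are hypothetical, the "audio" is a synthetic sine wave, and the velocity comes from a closed-form oracle rather than a trained network conditioned on context and a text prompt.

```python
import numpy as np

def oracle_velocity(x, t, target):
    """Hypothetical stand-in for a trained network: the velocity of the
    straight-line path that reaches `target` at t = 1."""
    return (target - x) / (1.0 - t)

def infill(audio, mask, target, steps=16, seed=0):
    """Regenerate only the frames where `mask` is True; keep the rest as context."""
    rng = np.random.default_rng(seed)
    # Start the masked span from Gaussian noise, leave the context untouched.
    x = np.where(mask, rng.standard_normal(audio.shape), audio)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = oracle_velocity(x, t, target)
        # Euler step inside the mask; context frames are clamped to the input.
        x = np.where(mask, x + dt * v, audio)
    return x

# Toy "rain" track, plus a toy target that adds a low rumble in the masked span.
n = 200
ts = np.linspace(0.0, 1.0, n)
rain = 0.1 * np.sin(2 * np.pi * 40 * ts)
mask = np.zeros(n, dtype=bool)
mask[80:120] = True
thunder = 0.8 * np.sin(2 * np.pi * 3 * ts)
target = np.where(mask, rain + thunder, rain)

result = infill(rain, mask, target)
```

In this sketch the frames outside the mask come back identical to the input, while the masked span moves from noise to the rain-plus-rumble target. In a real system the target is unknown, and a learned model would predict the velocity from the noisy frames, the surrounding context, and any prompt.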

Responsible Research and Collaboration

Meta emphasizes the importance of responsible AI use in audio generation. Audiobox is released under a research-only license to selected researchers in order to advance safe and ethical AI practices. The model applies automatic audio watermarking so that generated audio can be traced to its origin, and its interactive demo uses voice authentication to prevent impersonation.

Future Directions for Audiobox

The long-term goal is to transition from specialized to generalized audio generative models, enabling a broader range of applications. Audiobox is a significant step toward this future, potentially transforming fields like content creation, narration, sound editing, game development, and AI chatbots.
