Trained on over 50,000 hours of audio data, Voicebox delivers voice synthesis up to 20 times faster than older autoregressive models. It can produce lifelike, multilingual speech in a voice it was never specifically trained on, making it one of the most advanced and versatile zero-shot TTS (text-to-speech) models developed to date.
However, this power comes with serious risks. Because it only needs a couple of seconds of someone’s voice to create a near-perfect clone, concerns about deepfake audio, misinformation, scams, and identity theft have already been raised. Due to these ethical and security concerns, Meta has chosen not to publicly release the model for now. Instead, it has developed a tool to help detect and distinguish AI-generated voices from real ones.
Voicebox is a glimpse into the future of voice AI, one that is both exciting and unsettling. As with any powerful technology, its potential uses range from accessibility breakthroughs and creative storytelling to manipulative deepfakes. The conversation now turns to responsibility, regulation, and how we, as a society, handle such transformative tools.