The Register: This open text-to-speech model needs just seconds of audio to clone your voice

Feb 16, 2025

—

Source URL: https://www.theregister.com/2025/02/16/ai_voice_clone/
Source: The Register
Title: This open text-to-speech model needs just seconds of audio to clone your voice

Feedly Summary: El Reg shows you how to run Zyphra’s speech-replicating AI on your own box
Hands on: Palo Alto-based AI startup Zyphra unveiled a pair of open text-to-speech (TTS) models this week said to be capable of cloning your voice with as little as five seconds of sample audio. In our testing, we generated realistic results with less than half a minute of recorded speech.…

AI Summary and Description: Yes

Summary: The text covers the launch of Zonos, a pair of open text-to-speech models from AI startup Zyphra that can clone a voice from minimal sample audio. One model is a pure transformer and the other a transformer-Mamba hybrid. The text also weighs the potential for misuse against the benefits for accessibility.

Detailed Description:

– **Overview of Zonos**:
– Zyphra, a Palo Alto-based AI startup, released Zonos, a pair of open text-to-speech (TTS) models said to clone a voice from as little as five seconds of sample audio.
– The company was founded in 2021, with a focus on developing a multimodal agent known as MaiaOS.

– **Technical Details**:
– Zonos comprises two models: one using a fully transformer-based architecture and the other a hybrid model that combines transformer and Mamba state space model architectures.
– Each model contains 1.6 billion parameters and was trained on more than 200,000 hours of diverse speech data, primarily in English but also in many other languages.
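
As a rough guide to what "compatible hardware" means for a model of this size, the parameter count above can be turned into a back-of-the-envelope memory estimate. This is a minimal sketch, not a figure from the article: the helper name `weight_memory_gb` is hypothetical, the bytes-per-parameter values are assumptions about how the weights might be stored, and the estimate covers the weights only (no activations, audio codec, or framework overhead).

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter count reported in the summary for each Zonos model.
ZONOS_PARAMS = 1.6e9

# Assumed storage formats: 4 bytes (32-bit floats) or 2 bytes (16-bit floats).
fp32_gb = weight_memory_gb(ZONOS_PARAMS, 4)
fp16_gb = weight_memory_gb(ZONOS_PARAMS, 2)

print(f"32-bit weights: about {fp32_gb:.1f} GB")
print(f"16-bit weights: about {fp16_gb:.1f} GB")
```

Actual memory use at runtime will be higher than either number, since inference also needs working memory beyond the weights.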

– **Deployment and Testing**:
– Users can test the models via a demo environment or deploy them locally with compatible hardware (Nvidia GPUs recommended).
– On an RTX 4090 GPU, the model generated about two seconds of audio for every second of runtime.
– In initial tests, less than half a minute of recorded speech was enough to produce realistic voice clones.
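
The throughput figure above can be converted into an expected wait time for a clip of a given length. The sketch below assumes the rate of roughly two seconds of audio per second of runtime holds steadily and ignores model loading and startup time. The function names are hypothetical and only illustrate the arithmetic.

```python
def real_time_factor(audio_per_runtime_second: float) -> float:
    """Runtime needed per second of generated audio (below 1.0 is faster than real time)."""
    return 1.0 / audio_per_runtime_second


def generation_time_s(audio_seconds: float, audio_per_runtime_second: float = 2.0) -> float:
    """Estimated runtime in seconds to generate a clip of the given length."""
    return audio_seconds * real_time_factor(audio_per_runtime_second)


# Assumed rate taken from the RTX 4090 figure in the summary: about 2 s of audio per 1 s of runtime.
rtf = real_time_factor(2.0)
wait = generation_time_s(30.0)

print(f"Real-time factor: {rtf}")
print(f"Estimated time for a 30-second clip: {wait} s")
```

Slower GPUs would lower the audio-per-second rate, and the same functions give the correspondingly longer wait.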

– **Implications of Voice Cloning Technology**:
– **Ethical Concerns**: The technology has potential for misuse, including scams or creating misleading audio to damage reputations.
– **Positive Applications**: Benefits lie in accessibility, helping individuals recover their voices post-trauma or illness.
– The article encourages responsible use of voice cloning capabilities in light of its risk factors.

– **User Experience**:
– Deployment is straightforward for anyone familiar with Linux and Docker.
– Because the models are openly available, anyone with suitable hardware can run them, which sharpens the tension between innovation and ethics in AI technologies.

Overall, the text highlights significant advances in generative AI voice technology while stressing the need for thoughtful, ethical, and security-conscious use.
