Microsoft VALL-E: This new AI tool can simulate any voice in just three seconds
Microsoft Corporation has unveiled VALL-E, a new text-to-speech artificial intelligence (AI) tool that can closely mimic a person’s voice given only a three-second audio sample. Once familiar with a particular voice, VALL-E can synthesize audio of that person saying anything, while aiming to preserve the speaker’s emotional tone.
VALL-E’s developers believe it could be used for high-quality text-to-speech applications; for speech editing, in which a recording of a person is altered from a text transcript (making them appear to say something they did not say); and for audio content creation when combined with other generative AI models such as GPT-3.
Microsoft trained VALL-E’s speech synthesis capabilities on LibriLight, an audio library assembled by Meta. It contains 60,000 hours of English-language speech from more than 7,000 speakers, mostly drawn from LibriVox public-domain audiobooks. For VALL-E to perform well, the voice in the three-second sample must closely resemble a voice in its training data.
Microsoft VALL-E based on EnCodec by Meta
VALL-E is a “neural codec language model” based mainly on EnCodec, which Meta unveiled in October 2022. Unlike conventional text-to-speech techniques, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts.
It analyzes how a person sounds, uses EnCodec to break the pertinent information into discrete components (known as “tokens”), and uses its training data to match what it “knows” about how that voice would sound speaking phrases beyond the three-second sample.
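The core idea, turning continuous audio into discrete tokens, can be illustrated with a toy residual vector quantizer, the general mechanism EnCodec-style codecs use. Everything below is a simplified sketch: the codebook sizes, frame dimension, and random codebooks are illustrative stand-ins (a real codec learns its codebooks end-to-end on much higher-dimensional latents).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 2 quantizer stages, 8 code vectors each, 4-dim "audio frames".
# (Illustrative only; real EnCodec learns these codebooks during training.)
codebooks = [rng.normal(size=(8, 4)) for _ in range(2)]

def rvq_encode(frame, codebooks):
    """Residual vector quantization: each stage quantizes what the
    previous stages left over, yielding one token index per stage."""
    tokens = []
    residual = frame
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct an approximate frame by summing the chosen code vectors."""
    return sum(cb[i] for i, cb in zip(tokens, codebooks))

frame = rng.normal(size=4)            # one stand-in audio frame
tokens = rvq_encode(frame, codebooks)  # a short list of small integers
approx = rvq_decode(tokens, codebooks)
```

A language model like VALL-E then operates on sequences of these integer tokens exactly as a text model operates on words, predicting the token stream for a new utterance rather than a raw waveform.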
There are concerns, though, about the ethical implications of this technology. Voices produced by VALL-E and comparable tools will sound increasingly convincing, opening the door to spam calls that realistically imitate the voices of real people whom a potential victim knows.