July 10th, 2024

AI speech generator 'reaches human parity' – but it's too dangerous to release

Microsoft's VALL-E 2 AI speech generator replicates human voices accurately using minimal audio input. Despite its potential in various fields, Microsoft refrains from public release due to misuse concerns.

Microsoft has developed an AI speech generator named VALL-E 2 that can replicate human voices with high accuracy using just a few seconds of audio. The AI engine, based on neural codec language models, achieves human parity in zero-shot text-to-speech synthesis, producing speech comparable to human performance. VALL-E 2 incorporates features like Repetition Aware Sampling and Grouped Code Modeling to enhance speech quality and efficiency. Despite its capabilities, Microsoft has decided not to release VALL-E 2 to the public due to concerns about potential misuse, such as voice cloning and deepfake technology. The researchers behind VALL-E 2 suggest that the technology could have practical applications in education, entertainment, journalism, accessibility features, and more, but emphasize the need for protocols to ensure ethical use, especially when dealing with unseen speakers. The AI speech generator remains a research project with no immediate plans for public access.
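The summary names Repetition Aware Sampling as one of VALL-E 2's decoding features. As a rough illustration of the idea (not the paper's exact algorithm — the window size, threshold, and helper names below are illustrative assumptions), the decoder can fall back from nucleus sampling to sampling the full distribution when a token starts looping in the recent output:

```python
import random

def nucleus_sample(probs, top_p=0.9):
    """Standard nucleus (top-p) sampling: draw from the smallest set of
    tokens whose cumulative probability mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    toks, weights = zip(*kept)
    return random.choices(toks, weights=weights, k=1)[0]

def repetition_aware_sample(probs, history, window=10, ratio=0.5):
    """Sketch of repetition-aware decoding: if the nucleus-sampled token
    already dominates the recent history, resample from the full
    distribution to break the loop. `window` and `ratio` are made-up
    values for illustration, not from the paper."""
    token = nucleus_sample(probs)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) >= ratio:
        # Token appears to be stuck in a loop; draw from all tokens instead.
        toks, weights = zip(*probs.items())
        token = random.choices(toks, weights=weights, k=1)[0]
    return token
```

The intuition is that repetitive phrases are exactly where autoregressive TTS decoders tend to get stuck, which matches the paper's claim about handling "sentences that are traditionally challenging due to their complexity or repetitive phrases."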

24 comments
By @jd115 - 3 months
I'm old enough to remember some number of months ago when GPT-2 was described as "too dangerous to release".
By @thevillagechief - 3 months
Ah, the old "it's too dangerous to release" marketing move. Why even tell us about it?
By @htrp - 3 months
https://arxiv.org/pdf/2406.05370

The model in question is Microsoft's VALL-E 2, without the clickbait headline.

By @rentonl - 3 months
Of course, this technology must only stay in the hands of our trusted corporate overlords.
By @carapace - 3 months
https://arxiv.org/abs/2406.05370

> Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases

https://www.microsoft.com/en-us/research/project/vall-e-x/va...

> This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public.

If you go back and look at older cities they almost all have the same pattern: walls and gates.

I figure now that the Internet is a badlands roamed by robots pretending to be people as they attempt to rob you for their masters, we'll see the formation of cryptologically-secured enclaves. Maybe? Who knows?

At this point I'm pretty much going to restrict online communication to encrypted authenticated channels. (Heck, I should sign this comment, eh? If only as a performance?) Hopefully it remains difficult to build an AI that can guess large numbers. ;P
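Signing a comment the way this commenter jokes about would normally use an asymmetric scheme such as Ed25519 (via PGP or a library like `cryptography`). As a stdlib-only stand-in to show the sign/verify shape — HMAC with a shared secret, which is message authentication rather than a true public signature — a minimal sketch:

```python
import hashlib
import hmac

# Illustrative placeholder only; a real deployment would use an
# asymmetric keypair, not a shared secret baked into the code.
SECRET = b"not-a-real-key"

def sign(message: str) -> str:
    """Produce an authentication tag for the message."""
    return hmac.new(SECRET, message.encode(), hashlib.sha256).hexdigest()

def verify(message: str, tag: str) -> bool:
    """Check the tag in constant time to resist timing attacks."""
    return hmac.compare_digest(sign(message), tag)
```

Any tampering with the message invalidates the tag, which is the property the commenter wants from "encrypted authenticated channels."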

By @chx - 3 months
Things are progressing just as https://youtu.be/xoVJKj8lcNQ predicted.

> so 2024 will be the last human election. And what we mean by that is not that it's just going to be an AI running as president in 2028 — maybe it will be humans as figureheads — but that whoever has the greater compute power will win

We saw already AI voices influencing elections in India https://restofworld.org/2023/ai-voice-modi-singing-politics/

> AI-generated songs, like the ones featuring Prime Minister Narendra Modi, are gaining traction ahead of India’s upcoming elections. [...] Earlier this month, an Instagram video of Modi “singing” a Telugu love song had over 2 million views, while a similar Tamil-language song had more than 2.7 million. A Punjabi song racked up more than 17 million views.

By @bitshiftfaced - 3 months
Too dangerous for PR reasons at least until after November.
By @dspillett - 3 months
People saying “too dangerous to release” usually means one (or more) of three things:

1. “… but if you and your big rich company were to acquihire us you'd get access…” — though as this is MS it probably isn't that!

2. That it only works as well as claimed in specific circumstances, or has significant flaws, so they don't want people looking at it too closely just yet. The wording "in benchmarks used by Microsoft" might point to this.

3. That a competitor is getting close to releasing something similar, or has just done so, and they don't want to look like they were in second place or too far behind.

By @abroadwin - 3 months
By @digitalsushi - 3 months
Weakest link: if they don't release this, someone else will release one. Every time someone noble invents another gen-AI toy/weapon, they lock it down with post filters so it can't be used for evil, and then a second person forks it, pops the safeties off, and tells the world to go nuts.

Social solutions take too long to deploy against the tech, but tech solutions are fallible. To be defeatist about it: there's going to be a golden window of time here where some really nasty scams face no impediment.

By @spywaregorilla - 3 months
I really want something that can do a voice change and match the emotion and articulation of a voice clip that I provide. I don't care (or want) it to be based off a real person and the manners in which they would tend to articulate a sentence. Are there any decent open models out there?
By @pphysch - 3 months
Speech generation has gotten really good, but there's simply no way to faithfully recreate someone's vocal idiosyncrasies and cadence with just "a few seconds" of real audio. That's where the models tend to fall short.
By @ChrisArchitect - 3 months
Project page: https://www.microsoft.com/en-us/research/project/vall-e-x/va...

> "This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public."

By @exe34 - 3 months
I too have an agi in my basement, but it's too dangerous to release! wanna give me some cash?
By @coeneedell - 3 months
These samples are terrible compared to commercially released models from ElevenLabs or PlayHT. This is an extension of an interesting architecture, but currently those more traditional models are far more convincing.
By @42lux - 3 months
I can't wait until the free base models get better. The flood of TikToks, Shorts, and Stories using the standard ElevenLabs voice is getting nauseating.
By @mensetmanusman - 3 months
A gun can help rob a bank.

A speech generator can help rob 1000 banks.

By @AlexDragusin - 3 months
"Too dangerous to release" - I could use that line to promote my services :)
By @mcpar-land - 3 months
I can believe a speech generator too good to release, but not even a perfect algorithm can get every one of your inflections and verbal tics with just a few seconds of sample material. Makes me think the whole thing is bs. I instantly see any "ooh our thing we are making on purpose is so dangerous oohhh" as an attempt at regulatory capture until I see proof of the danger.
By @hi_dang_ - 3 months
The classic Steven Seagal “my hands are weapons and I need a license for them” rhetoric. What a crock of shit.
By @Slyfox33 - 3 months
Riiiight
By @zazazache - 3 months
What is the point of them trying to create this? It's easy to see, before building something like this, that it would mostly be used to create disinformation and sow chaos.

Truly irresponsible