July 10th, 2024

AI speech generator 'reaches human parity' – but it's too dangerous to release

Microsoft's VALL-E 2 AI speech generator replicates human voices accurately using minimal audio input. Despite its potential in various fields, Microsoft refrains from public release due to misuse concerns.

Microsoft has developed an AI speech generator named VALL-E 2 that can replicate a human voice with high accuracy from just a few seconds of audio. The engine, built on neural codec language models, is reported to reach human parity in zero-shot text-to-speech synthesis, meaning its output is rated comparable to human speech on the benchmarks used. VALL-E 2 adds two features, Repetition Aware Sampling and Grouped Code Modeling, to improve speech quality and efficiency. Despite these capabilities, Microsoft has decided not to release VALL-E 2 to the public because of concerns about misuse, such as voice impersonation and deepfakes. The researchers suggest the technology could have practical applications in education, entertainment, journalism, and accessibility features, among others, but emphasize the need for protocols to ensure ethical use, especially when synthesizing speech for unseen speakers. VALL-E 2 remains a research project with no immediate plans for public access.

Generating audio for video

Google DeepMind introduces V2A technology for video soundtracks, enhancing silent videos with synchronized audio. The system allows users to guide sound creation, aligning audio closely with visuals for realistic outputs. Ongoing research addresses challenges like maintaining audio quality and improving lip synchronization. DeepMind prioritizes responsible AI development, incorporating diverse perspectives and planning safety assessments before wider public access.

All web "content" is freeware

Microsoft's CEO of AI describes open web content as having been treated as freeware since the 90s, raising concerns about the quality and sustainability of AI-generated content. Generative AI vendors defend their practices amid transparency and accountability issues. Experts warn of a potential tech industry bubble.

'Skeleton Key' attack unlocks the worst of AI, says Microsoft

Microsoft warns of "Skeleton Key" attack exploiting AI models to generate harmful content. Mark Russinovich stresses the need for model-makers to address vulnerabilities. Advanced attacks like BEAST pose significant risks. Microsoft introduces AI security tools.

24 comments

By @jd115 - 10 months

I'm old enough to remember some number of months ago when GPT2 was described as "too dangerous to release".

By @thevillagechief - 10 months

Ah, the old "it's too dangerous to release" marketing move. Why even tell us about it?

By @htrp - 10 months

https://arxiv.org/pdf/2406.05370

The model in question is Microsoft VALL-E 2, without the clickbait headline.

By @rentonl - 10 months

Of course, this technology must only stay in the hands of our trusted corporate overlords.

By @carapace - 10 months

https://arxiv.org/abs/2406.05370

> Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases

https://www.microsoft.com/en-us/research/project/vall-e-x/va...

> This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public.

If you go back and look at older cities they almost all have the same pattern: walls and gates.

I figure now that the Internet is a badlands roamed by robots pretending to be people as they attempt to rob you for their masters, we'll see the formation of cryptologically-secured enclaves. Maybe? Who knows?

At this point I'm pretty much going to restrict online communication to encrypted authenticated channels. (Heck, I should sign this comment, eh? If only as a performance?) Hopefully it remains difficult to build an AI that can guess large numbers. ;P

By @chx - 10 months

Things are progressing just as https://youtu.be/xoVJKj8lcNQ predicted.

> so 2024 will be the last human election and what we mean by that is not that it's just going to be an AI running as president in 2028 but that will really be although maybe um it will be you know humans as figureheads but it'll be Whoever greater compute power will win

We saw already AI voices influencing elections in India https://restofworld.org/2023/ai-voice-modi-singing-politics/

> AI-generated songs, like the ones featuring Prime Minister Narendra Modi, are gaining traction ahead of India’s upcoming elections. [...] Earlier this month, an Instagram video of Modi “singing” a Telugu love song had over 2 million views, while a similar Tamil-language song had more than 2.7 million. A Punjabi song racked up more than 17 million views.

By @bitshiftfaced - 10 months

Too dangerous for PR reasons at least until after November.

By @dspillett - 10 months

People saying “too dangerous to release” usually mean one (or more) of three things:

1. “… but if you and your big rich company were to acquihire us you'd get access…” — though as this is MS it probably isn't that!

2. That it only works as well as claimed in specific circumstances, or has significant flaws, so they don't want people looking at it too closely just yet. The wordage “in benchmarks used by Microsoft” might point to this.

3. That a competitor is getting close to releasing something similar, or has just done so, and they don't want to look like they were in second place or too far behind.

By @abroadwin - 10 months

Relevant research post from Microsoft: https://www.microsoft.com/en-us/research/project/vall-e-x/va...

By @digitalsushi - 10 months

weakest link, if they dont release this, someone else will release one. every time someone noble invents another gen ai toy/weapon, they lock it down with post filters so it cant be used for evil, and then a second person forks it, pops the safeties off, and tells the world to go nuts.

social solutions take too long to use against the tech, but tech solutions are fallible. to be defeatist about it, there's going to be a golden window of time here where some really nasty scams have no impedance.

By @spywaregorilla - 10 months

I really want something that can do a voice change and match the emotion and articulation of a voice clip that I provide. I don't care (or want) it to be based off a real person and the manners in which they would tend to articulate a sentence. Are there any decent open models out there?

By @pphysch - 10 months

Speech generation has gotten really good, but there's simply no way to faithfully recreate someone's vocal idiosyncrasies and cadence with just "a few seconds" of real audio. That's where the models tend to fall short.

By @ChrisArchitect - 10 months

Project page: https://www.microsoft.com/en-us/research/project/vall-e-x/va...

> "This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public."

By @exe34 - 10 months

I too have an agi in my basement, but it's too dangerous to release! wanna give me some cash?

By @coeneedell - 10 months

These samples are terrible when compared to commercially released models like from eleven labs or playht. This is an extension of an interesting architecture but currently those more traditionally based models are way more convincing.

By @42lux - 10 months

I can't wait until the free base models get better. The flood on tiktok, shorts and stories with the standard eleven labs voice is getting nauseating.

By @mensetmanusman - 10 months

A gun can help rob a bank.

A speech generator can help rob 1000 banks.

By @AlexDragusin - 10 months

"Too dangerous to release" - I could use that line to promote my services :)

By @mcpar-land - 10 months

I can believe a speech generator too good to release, but not even a perfect algorithm can get every one of your inflections and verbal tics with just a few seconds of sample material. Makes me think the whole thing is bs. I instantly see any "ooh our thing we are making on purpose is so dangerous oohhh" as an attempt at regulatory capture until I see proof of the danger.

By @hi_dang_ - 10 months

The classic Steven Seagal “my hands are weapons and I need a license for them” rhetoric. What a crock of shit.

By @Slyfox33 - 10 months

Riiiight

By @zazazache - 10 months

What is the point of them trying to create this? That something like this would mostly be used to create disinformation and create chaos is easily understood before making something like this.

Truly irresponsible
