July 12th, 2024

StreamVC: Real-Time Low-Latency Voice Conversion

StreamVC is a real-time voice conversion system from Google. It preserves the content and prosody of the source speech while generating the output waveform at low latency, targeting real-time uses such as calls and voice anonymization.

The article discusses StreamVC, a real-time low-latency voice conversion technology developed by a team at Google. StreamVC aims to match the voice timbre of any target speech while preserving the content and prosody of the source speech. Unlike previous methods, it can generate the resulting waveform with low latency, even on mobile platforms, which makes it suitable for real-time communication such as calls and video conferencing, and for scenarios such as voice anonymization. The design builds on the SoundStream neural audio codec, enabling lightweight, high-quality speech synthesis. The system shows that soft speech units can be learned causally, and that supplying whitened fundamental frequency (f0) information improves pitch stability without leaking the source timbre information. The work falls within Google's Speech Processing and Machine Intelligence research areas.
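As a rough illustration of what "whitened" f0 means here, the sketch below normalizes the voiced frames of a pitch contour to zero mean and unit variance. It is a minimal sketch, not StreamVC's code: the function name `whiten_f0`, the use of per-utterance statistics, and the convention that unvoiced frames are marked with 0 are all assumptions made for the example.

```python
import math

def whiten_f0(f0_hz, eps=1e-8):
    """Normalize voiced f0 values to zero mean and unit variance.

    Assumptions of this sketch: unvoiced frames are encoded as 0 and
    stay 0.0, and statistics are computed over the whole utterance.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return [0.0] * len(f0_hz)
    mean = sum(voiced) / len(voiced)
    var = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    std = math.sqrt(var) + eps  # eps guards against a perfectly flat contour
    return [(f - mean) / std if f > 0 else 0.0 for f in f0_hz]

# Two hypothetical speakers with the same contour shape in different registers.
low_voice = [100.0, 110.0, 0.0, 120.0]
high_voice = [200.0, 220.0, 0.0, 240.0]

white_low = whiten_f0(low_voice)
white_high = whiten_f0(high_voice)
```

In this toy case both contours whiten to nearly the same values, so the speaker's absolute pitch register, which is one cue to identity, is no longer visible while the rise in pitch is kept. Note that a voiced frame exactly at the mean also maps to 0.0, so a real system would need a separate voicing signal.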

Video annotator: a framework for efficiently building video classifiers

The Netflix Technology Blog presents the Video Annotator (VA) framework for efficient video classifier creation. VA integrates vision-language models, active learning, and user validation, outperforming baseline methods with an 8.3 point Average Precision improvement.

Generating audio for video

Google DeepMind introduces V2A technology for video soundtracks, enhancing silent videos with synchronized audio. The system allows users to guide sound creation, aligning audio closely with visuals for realistic outputs. Ongoing research addresses challenges like maintaining audio quality and improving lip synchronization. DeepMind prioritizes responsible AI development, incorporating diverse perspectives and planning safety assessments before wider public access.

We increased our rendering speeds by 70x using the WebCodecs API

Revideo, a TypeScript framework, increased its rendering speed 70-fold by moving to browser-based video encoding with the WebCodecs API. Limitations remain in audio processing and browser compatibility.

Introducing Voice Isolator and Background Noise Remover

The website offers a Voice Isolator tool for extracting clear speech by removing background noise from audio. Users can try a sample, access FAQs, and explore other AI audio solutions by ElevenLabs.

AI speech generator 'reaches human parity' – but it's too dangerous to release

Microsoft's VALL-E 2 AI speech generator replicates human voices accurately using minimal audio input. Despite its potential in various fields, Microsoft refrains from public release due to misuse concerns.

11 comments

By @coldblues - 10 months

https://github.com/hrnoh24/stream-vc

https://github.com/yuval-reshef/StreamVC

Unofficial implementations of StreamVC

By @huac - 10 months

The samples were released a while back: https://google-research.github.io/seanet/stream_vc/

By @judiisis - 10 months

What is the current best FOSS (or otherwise) implementation for a voice changer/anonymiser?

By @udev4096 - 10 months

Actual paper: https://arxiv.org/pdf/2401.03078

By @manishsharan - 10 months

Are there any use cases driving this? Is there a huge burning need for this technology?

Are kidnappers and con-men a huge under-served market that Google is hoping to serve? Are deepfake videos not convincing enough to serve the needs of fraudsters?

I am totally against regulating AI but shit like this gives fodder to the other side.

By @gnat - 10 months

From the poster:

In this work, we propose a light-weight (~20M param.) causal voice conversion solution that can run in real-time with low latency on a commercially available mobile device. The key design elements are: (1) using a causal encoder to learn soft speech units; (2) injecting whitened f0 to improve pitch stability without leaking source speaker info.

In our later V2 version, we found that f0 rescaling followed by a NSF-style harmonic-plus-noise conditioning (as is done in RVC) results in better quality.
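The comment does not say how the f0 rescaling is done. One common form in voice conversion maps the source contour onto the target speaker's log-f0 mean and standard deviation, and the sketch below assumes that form. The function names, the log-domain choice, and the 0-means-unvoiced convention are all hypothetical and not taken from StreamVC or RVC.

```python
import math

def log_f0_stats(f0_hz):
    """Mean and standard deviation of log-f0 over voiced frames (f0 > 0)."""
    logs = [math.log(f) for f in f0_hz if f > 0]
    if not logs:
        raise ValueError("no voiced frames")
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, math.sqrt(var)

def rescale_f0(src_f0_hz, tgt_mean, tgt_std, eps=1e-8):
    """Map source f0 onto the target speaker's log-f0 statistics.

    Unvoiced frames (0) are passed through unchanged.
    """
    src_mean, src_std = log_f0_stats(src_f0_hz)
    out = []
    for f in src_f0_hz:
        if f <= 0:
            out.append(0.0)
        else:
            z = (math.log(f) - src_mean) / (src_std + eps)
            out.append(math.exp(z * tgt_std + tgt_mean))
    return out

# Hypothetical source contour and target-speaker reference frames.
source = [100.0, 0.0, 200.0]
target_mean, target_std = log_f0_stats([200.0, 400.0])
converted = rescale_f0(source, target_mean, target_std)
```

Here the source contour is moved up one octave into the target's range while its relative shape is kept. The rescaled contour could then drive a harmonic-plus-noise excitation, which this sketch does not cover.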

By @froglus - 10 months

is it like discord or just voice chat, because i like to have things twice!!

By @neilk - 10 months

What are the anticipated use cases?

I know of one: transgender people often would like to alter the timbre of their voice and spend a lot of time training their voice. At least for online scenarios, this can just do it.

But other than that AI voice altering research seems like it benefits mostly scammers? I’m just wondering what they tell themselves they’re doing. I didn’t see this in the paper.
