StreamVC: Real-Time Low-Latency Voice Conversion
StreamVC is a real-time voice conversion technology by Google, preserving speech content and prosody while enabling low-latency waveform generation for real-time communication applications like calls and voice anonymization.
The article discusses StreamVC, a real-time, low-latency voice conversion technology developed by a team at Google. StreamVC converts speech to match the voice timbre of any target speech while preserving the content and prosody of the source speech. Unlike previous methods, StreamVC can generate the resulting waveform with low latency, even on mobile platforms, making it suitable for real-time communication applications like calls and video conferencing. The technology also addresses scenarios such as voice anonymization. The design of StreamVC is based on the SoundStream neural audio codec, enabling lightweight, high-quality speech synthesis. The system demonstrates that soft speech units can be learned causally, and that supplying whitened fundamental frequency (f0) information improves pitch stability without leaking the source timbre information. The work falls within the Speech Processing and Machine Intelligence research areas at Google.
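To make the "whitened fundamental frequency" idea concrete, here is a minimal sketch of causal f0 whitening. It is not taken from the StreamVC paper or code: the function name `whiten_f0_causal`, the use of log-f0, the convention that 0.0 marks an unvoiced frame, and the running-statistics update (Welford's method) are all assumptions made for illustration. The point is only that each frame is normalized with the mean and standard deviation of the voiced frames seen so far, so the output carries the pitch contour but not the speaker's absolute pitch level.

```python
import math


def whiten_f0_causal(f0_hz, eps=1e-5):
    """Whiten an f0 track frame by frame, using only past and current frames.

    f0_hz: per-frame f0 in Hz; 0.0 marks an unvoiced frame (assumed convention).
    Returns a list of whitened log-f0 values; unvoiced frames map to 0.0.
    """
    count = 0    # number of voiced frames seen so far
    mean = 0.0   # running mean of log-f0
    m2 = 0.0     # running sum of squared deviations (Welford's method)
    out = []
    for f0 in f0_hz:
        if f0 <= 0.0:
            # Unvoiced frame: no pitch to normalize, statistics unchanged.
            out.append(0.0)
            continue
        x = math.log(f0)  # log domain is an assumption of this sketch
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        std = math.sqrt(m2 / count) if count > 1 else 0.0
        out.append((x - mean) / (std + eps))
    return out


low_voice = whiten_f0_causal([100.0, 110.0, 120.0, 0.0, 130.0])
high_voice = whiten_f0_causal([200.0, 220.0, 240.0, 0.0, 260.0])
```

In this sketch, `low_voice` and `high_voice` come out numerically the same even though the second track is an octave higher, which illustrates how whitening can keep the contour while discarding the speaker-dependent pitch level.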
Related
Video annotator: a framework for efficiently building video classifiers
The Netflix Technology Blog presents the Video Annotator (VA) framework for efficient video classifier creation. VA integrates vision-language models, active learning, and user validation, outperforming baseline methods with an 8.3 point Average Precision improvement.
Generating audio for video
Google DeepMind introduces V2A technology for video soundtracks, enhancing silent videos with synchronized audio. The system allows users to guide sound creation, aligning audio closely with visuals for realistic outputs. Ongoing research addresses challenges like maintaining audio quality and improving lip synchronization. DeepMind prioritizes responsible AI development, incorporating diverse perspectives and planning safety assessments before wider public access.
We increased our rendering speeds by 70x using the WebCodecs API
Revideo, a TypeScript framework, increased its rendering speed 70-fold using the WebCodecs API. Challenges were overcome with browser-based video encoding; limited audio processing and browser compatibility remain.
Introducing Voice Isolator and Background Noise Remover
The website offers a Voice Isolator tool for extracting clear speech by removing background noise from audio. Users can try a sample, access FAQs, and explore other AI audio solutions by ElevenLabs.
AI speech generator 'reaches human parity' – but it's too dangerous to release
Microsoft's VALL-E 2 AI speech generator replicates human voices accurately using minimal audio input. Despite its potential in various fields, Microsoft refrains from public release due to misuse concerns.
https://github.com/yuval-reshef/StreamVC
Unofficial implementations of StreamVC
Are kidnappers and con-men a huge under-served market that Google is hoping to serve? Are deepfake videos not convincing enough to serve the needs of fraudsters?
I am totally against regulating AI but shit like this gives fodder to the other side.
In this work, we propose a light-weight (~20M param.) causal voice conversion solution that can run in real-time with low latency on a commercially available mobile device. The key design elements are: (1) using a causal encoder to learn soft speech units; (2) injecting whitened f0 to improve pitch stability without leaking source speaker info.
In our later V2 version, we found that f0 rescaling followed by an NSF-style harmonic-plus-noise conditioning (as is done in RVC) results in better quality.
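The comment does not say how the f0 rescaling is done. As a rough sketch of one common approach, the source f0 track can be scaled so that its geometric mean over voiced frames matches that of the target speaker. The helper names `voiced_log_mean` and `rescale_f0`, the 0.0-means-unvoiced convention, and the use of utterance-level statistics (which is not causal) are assumptions for illustration, and the NSF-style harmonic-plus-noise conditioning is not covered here.

```python
import math


def voiced_log_mean(f0_hz):
    """Mean of log-f0 over voiced frames (frames with f0 > 0)."""
    voiced = [math.log(f) for f in f0_hz if f > 0.0]
    if not voiced:
        raise ValueError("f0 track has no voiced frames")
    return sum(voiced) / len(voiced)


def rescale_f0(source_f0_hz, target_f0_hz):
    """Scale the source f0 so its geometric mean matches the target's.

    Unvoiced frames (0.0) are passed through unchanged.
    """
    ratio = math.exp(voiced_log_mean(target_f0_hz) - voiced_log_mean(source_f0_hz))
    return [f * ratio if f > 0.0 else 0.0 for f in source_f0_hz]


# Hypothetical tracks: the target speaks one octave above the source.
rescaled = rescale_f0([100.0, 0.0, 200.0], [200.0, 400.0])
```

A multiplicative shift like this keeps the shape of the pitch contour in the log domain and only moves it into the target speaker's range.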
I know of one: transgender people often would like to alter the timbre of their voice and spend a lot of time training their voice. At least for online scenarios, this can just do it.
But other than that AI voice altering research seems like it benefits mostly scammers? I’m just wondering what they tell themselves they’re doing. I didn’t see this in the paper.