October 30th, 2024

Pushing the Frontiers of Audio Generation

Google DeepMind has advanced audio generation technology, enabling natural digital interactions and long-form dialogues. Their latest model improves efficiency and quality while emphasizing responsible AI development and future integration with other media.

Google DeepMind has made significant advancements in audio generation technology, enhancing the naturalness and interactivity of digital assistants and AI tools. Their research focuses on creating high-quality, dynamic speech from various inputs, which is integrated into products like Gemini Live and YouTube's auto dubbing. Recent innovations include features that generate long-form, multi-speaker dialogues, such as NotebookLM Audio Overviews and Illuminate, which aim to make complex content more accessible. The technology builds on previous models like SoundStorm and AudioLM, which utilize neural audio codecs and language modeling techniques to produce realistic audio. The latest model can generate two minutes of dialogue in under three seconds, significantly improving efficiency and quality. This model was trained on extensive speech data and fine-tuned with high-quality dialogue to ensure realistic exchanges. DeepMind is also committed to responsible AI development, incorporating watermarking technology to prevent misuse of AI-generated audio. Future goals include enhancing expressivity and exploring the integration of audio with other modalities like video, with the potential to transform learning experiences and accessibility.

- Google DeepMind has advanced audio generation technologies for more natural digital interactions.

- New features allow for the generation of long-form, multi-speaker dialogues to enhance content accessibility.

- The latest model can produce high-quality dialogue quickly, improving efficiency in audio generation.

- DeepMind emphasizes responsible AI development, including measures to prevent misuse of generated content.

- Future developments aim to enhance expressivity and integrate audio with other media formats.
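The speed claim in the summary (two minutes of dialogue in under three seconds) can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the helper name `real_time_factor` is a hypothetical choice, not anything from DeepMind's article, and three seconds is treated as an upper bound on generation time, so the result is a lower bound on the speedup.

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Return seconds of audio produced per second of generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# Figures taken from the summary: two minutes of dialogue, under three seconds.
audio_seconds = 2 * 60          # 120 seconds of generated dialogue
generation_upper_bound = 3.0    # "under three seconds", used as an upper bound

speedup = real_time_factor(audio_seconds, generation_upper_bound)
print(f"At least {speedup:.0f}x faster than real time")  # prints "At least 40x faster than real time"
```

Because the generation time is stated as "under" three seconds, the true factor is somewhat above 40, which matches the "over 40-times faster than real time" line quoted in the comments.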

Related

Generating audio for video

Google DeepMind introduces V2A technology for video soundtracks, enhancing silent videos with synchronized audio. The system allows users to guide sound creation, aligning audio closely with visuals for realistic outputs. Ongoing research addresses challenges like maintaining audio quality and improving lip synchronization. DeepMind prioritizes responsible AI development, incorporating diverse perspectives and planning safety assessments before wider public access.

Gemini Pro 1.5 experimental "version 0801" available for early testing

Google DeepMind's Gemini family of AI models, particularly Gemini 1.5 Pro, excels in multimodal understanding and complex tasks, featuring a two million token context window and improved performance in various benchmarks.

Show HN: Infinity – Realistic AI characters that can speak

Infinity AI has developed a groundbreaking video model that generates expressive characters from audio input, trained for 11 GPU years at a cost of $500,000, addressing limitations of existing tools.

Google's new fake "podcast" summaries are disarmingly entertaining

Google's NotebookLM generates audio summaries of texts, as demonstrated by Kyle Orland's book on Minesweeper. While engaging, the AI content has inaccuracies, raising concerns about its reliability for academic use.

NotebookLM's automatically generated podcasts are surprisingly effective

Google's NotebookLM has launched Audio Overview, generating custom podcasts from user content with AI hosts. Powered by Gemini 1.5 Pro LLM, it raises questions about AI's future in media.

15 comments

By @tmjdev - 6 months

While it is impressive and I like to follow the advancements in this field, it is incredibly frustrating to listen to. I can't put my finger on why exactly. It's definitely closer to human-sounding, but the uncanny valley is so deep here that I find myself thinking "I just want the point, not the fake personality that is coming with it". I can't make it through a 30s demo.

By @jameszhao00 - 6 months

Try it out in the demo https://cloud.google.com/text-to-speech/?hl=en and in the API https://cloud.google.com/text-to-speech/docs/create-dialogue...

By @corry - 6 months

I think I put my finger on exactly why it sounds a bit uncanny-valley: it sounds like humans who are reading from a prepared 'bit' or 'script'.

We've all been on those webinars where it's clear -- despite the infusions (on cue) of "enthusiasm" from the speaker attempting to make it sound more natural and off-the-cuff -- that they are reading from a script.

It's a difficult-to-mask phenomenon for humans.

That all said, I actually have more grace for an AI sounding like this than I do for a human presenter reading from a script. Like, if I'm here "live" and paying attention to what you're saying, at least do me the service of truly being "here" with me and authentically communicating vs. simply reading something.

If you're going to simply read something, then just send it to me to read too - don't pretend it's a spontaneously synchronous communication.

By @seydor - 6 months

But what's the end goal and audience here? I don't believe people will resonate with robots making "ums" and "ohs", because people usually resonate with an artist, a producer, a writer, a singer, etc. A human layer with which people can empathize is essential. This can work only as long as people are deceived and don't know there is no human behind it. If, however, I find out that a video is AI-generated, I instantly lose interest in it. There are, e.g., a lot of AI-generated architecture videos on YouTube at the moment; I have never wanted to listen to one, because I know the emotions will be fake.

By @101008 - 6 months

It looks like a lot of progress has been made lately in audio generation / audio understanding (everything related to speech, I mean).

Is this related to LLMs, or is this a completely different branch of AI, and is it just a coincidence? I am curious.

By @mg - 6 months

Is there a free (ad supported?) online tool without login that reads text that you paste into it?

I often would like to listen to a blog post instead of reading it, but haven't found an easy, quick solution yet.

I tried piping text through OpenAI's tts-1-hd model, and it is the first one I ever found that is human-like enough for me to like listening to it. So I could write a tool for my own use case that pipes the text to tts-1-hd and plays the audio. But maybe there is already something with a public web interface out there?

By @nilsherzig - 6 months

The voices are impressive (I can't tell the difference as a non-native speaker) but their "personality" sounds extremely annoying lmao

By @jchanimal - 6 months

We've been using this at work to get inside our customers' perspective. It's helpful to throw, e.g., a bunch of point-of-sale data sync challenges into NotebookLM and pass a 10-minute audio to the team so they can understand where our work fits in.

By @akira2501 - 6 months

ah.. so "frontier" is the new buzzword that keeps the corporate board invested in this dead end?

frontier garbage.

By @ruffrey - 6 months

> This means it generates audio over 40-times faster than real time.

Astounding

By @henning - 6 months

To paraphrase the great Bertram Gilfoyle, computers don't need to produce fake vocal tics.

By @littlekey - 6 months

This is a "holy shit" moment for me, and I consider myself fairly jaded. If you listen closely you can tell it's a little off, but about halfway through I could clearly feel my brain click into a different mode where it believed what it was hearing was real.

By @lrkehab - 6 months

YouTube videos are already infested with insufferable AI elevator background "music". Even some channels that were previously good are using it.

On the bright side, you can stop watching these channels and have more time for serious things.

By @ironlake - 6 months

Is this another fake like the Google bot that made reservations at a restaurant?
