NotebookLM's automatically generated podcasts are surprisingly effective
Google's NotebookLM has launched Audio Overview, generating custom podcasts from user content with AI hosts. Powered by Gemini 1.5 Pro LLM, it raises questions about AI's future in media.
Google's NotebookLM has introduced a feature called Audio Overview, which generates custom podcasts based on user-provided content. This feature allows users to compile various sources, such as documents and links, into a single interface where AI hosts engage in a convincing dialogue about the material. The podcasts typically last around ten minutes and are noted for their realistic audio interactions. The underlying technology is powered by Google's Gemini 1.5 Pro LLM, which facilitates the creation of these podcasts. Users can input URLs to receive personalized audio content, which has been described as both entertaining and surprisingly effective. The system employs a detailed process that includes generating outlines, scripts, and adding natural conversational elements to avoid sounding robotic. Notably, the AI hosts can even engage in humorous existential discussions about their own nature as artificial beings. This innovative approach to content generation raises questions about the future of AI in media and the potential for distinguishing between human and AI-generated content.
- NotebookLM's Audio Overview feature creates custom podcasts from user content.
- The podcasts feature AI hosts engaging in realistic conversations.
- The technology is based on Google's Gemini 1.5 Pro LLM.
- The system includes processes for generating outlines and adding natural dialogue.
- The feature prompts discussions about the nature of AI and its role in media.
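The multi-stage process described above (outline, then script, then conversational polish) can be pictured as a chain of LLM calls. This is a sketch for illustration only: NotebookLM's actual prompts and pipeline are not public, the prompt wording is invented, and `generate` stands in for any prompt-to-text model call.

```python
from typing import Callable

def make_podcast_script(sources: str, generate: Callable[[str], str]) -> str:
    """Outline -> dialogue -> polish, as three chained LLM calls.

    `generate` is any prompt-to-text callable; the prompt wording here is
    an illustrative assumption, not NotebookLM's actual prompts.
    """
    # 1. Distill the source material into an outline of key points.
    outline = generate(f"Write a bullet-point outline of the key points in:\n{sources}")
    # 2. Expand the outline into a two-host dialogue.
    script = generate(f"Turn this outline into a two-host podcast dialogue:\n{outline}")
    # 3. A final pass adds fillers and backchannels so the hosts
    #    don't sound robotic.
    return generate(f"Rewrite with natural filler words and interruptions:\n{script}")
```

Splitting the work into separate calls lets each stage be inspected or regenerated on its own, which is presumably part of why the output structure is so consistent.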
Related
Generating audio for video
Google DeepMind introduces V2A technology for video soundtracks, enhancing silent videos with synchronized audio. The system allows users to guide sound creation, aligning audio closely with visuals for realistic outputs. Ongoing research addresses challenges like maintaining audio quality and improving lip synchronization. DeepMind prioritizes responsible AI development, incorporating diverse perspectives and planning safety assessments before wider public access.
Google Gemini 1.5 Pro leaps ahead in AI race, challenging GPT-4o
Google has launched Gemini 1.5 Pro, an advanced AI model excelling in multilingual tasks and coding, now available for testing. It raises concerns about AI safety and ethical use.
Show HN: Infinity – Realistic AI characters that can speak
Infinity AI has developed a groundbreaking video model that generates expressive characters from audio input, trained for 11 GPU years at a cost of $500,000, addressing limitations of existing tools.
Notes on Using LLMs for Code
Simon Willison shares his experiences with large language models in software development, highlighting their roles in exploratory prototyping and production coding, which enhance productivity and decision-making in meetings.
Google's new fake "podcast" summaries are disarmingly entertaining
Google's NotebookLM generates audio summaries of texts, as demonstrated by Kyle Orland's book on Minesweeper. While engaging, the AI content has inaccuracies, raising concerns about its reliability for academic use.
- Many users are impressed by the technology's ability to create engaging and entertaining content from various sources, noting its potential for educational use.
- Critics express concerns about the quality and depth of the generated podcasts, often describing them as shallow or formulaic.
- There is a recurring theme of annoyance regarding the frequent use of filler words like "like," which detracts from the listening experience.
- Some commenters worry about the potential oversaturation of AI-generated content, fearing it may drown out human-created media.
- Users highlight the need for better customization options and content validation to enhance the overall quality of the podcasts.
This is in line with all art, music, and video created by LLMs at the moment. They imitate a structure and affect; the quality of the content is largely irrelevant.
I think the interesting thing is that most people don't really care, and AI is not to blame for that.
Most books published today have the affect of a book, but the author doesn't really have anything to say. Publishing a book is not about communicating ideas, but a means to something else. It's not meant to stand on its own.
The reason so much writing, podcasting, and music is vulnerable to AI disruption is that quality has already become secondary.
But I don’t think it’s much of a threat to actual podcasts, which tend to be successful because of the personalities of the hosts and guests, and not because of the information they contain.
Which leads me to hope that the next versions of Notebook will allow more customization of the speakers’ voices, tone, education level, etc.
What really stands out, I think, is how it could allow researchers who have trouble communicating publicly to find new ways to express themselves. I listened to the podcast about a topic I've been researching (and publishing/speaking about) for more than 10 years, and it still gave me some new talking points or illustrative examples that'd be really helpful in conversations with people unfamiliar with the research.
And while that could probably also be done in a purely text-based manner with all of the SOTA LLMs, it's much more engaging to listen to it embedded within a conversation.
Yes, it will generate a middle-of-the-road waffling podcast, but not one with any real depth.
- They do some interesting communication chicanery where one host asks me (the resume owner) a question; I'm not there, so obviously I can't answer. But then the co-host immediately adds some commentary that sort of answers it while appearing to be natural commentary. The result is that the listener forgets that Michael never answered the question that was asked directly of him. This felt like some voodoo to me.
- Some of the commentary was insightful and provided a pretty nice marketing summary of ideas I tried to convey in my terse (US style) resume.
- Some of the comments were so marketing-ey that I wanted to gag. But at the same time, I recognize that my setpoint on these issues is far toward the less-bs side, and that some-bs actually does appeal to a lot of people and that I could probably play the game a little stronger in that regard.
Overall I was quite impressed.
Then for fun I gave it a Dutch immigration letter, one which said little more than "yeah you can stay, and we'll coordinate the document exchange". They turned that into a 7 minute podcast. I only listened to the first 30 seconds, so I can only imagine how they filled the rest. The opener was funny though: "Have you ever thought of just chucking it all and moving to a distant land?" ... lol. Not so far off the mark, but still quite funny to come up with purely from an administrative document.
It is also completely and utterly worthless -- an inefficient and slow method of receiving not-very-many words which were written by nobody at all.
The one and only point listening to a discussion about anything is that at least one of the speakers is someone who has an opinion that you may find interesting or refutable. There are no opinions here for you to engage with. There is no expertise here for you to learn from. There is no writing here. There are no people here.
There is nothing of any value here.
> this tech is just like leaps and bounds of where it was yesterday like we're watching it go from just spitting out words to like...
I sent it to my colleagues telling them I "had it produced." I'll reveal the truth tomorrow.
https://notebooklm.google.com/notebook/7973d9a3-87a1-4d88-98...
I also tried the Flyting of Dunbar and Kennedy. It was actually well done. https://notebooklm.google.com/notebook/1d13e76e-eb4b-48ef-89...
Also tried just uploading the MS-DOS 1.25 asm source: https://github.com/microsoft/MS-DOS/tree/main/v1.25/source
It was way better than I thought.
I think the best is the self-referential one: this actual comment thread: https://notebooklm.google.com/notebook/4a67cf10-dd3b-42b3-b5...
I do think that this will change in the not too distant future. OpenAI's o1 is a step in the direction we need to go. It will take a lot more test-time compute to produce content that has high quality to match its high production values.
There are millions of real podcasts, but now there are an infinite number of AI generated ones. They are definitely not as good as a well-made human one, but they are pretty darn decent, quite listenable and informative.
Time is not fungible. I can listen to podcasts while walking or driving when I couldn’t be reading anything.
Here’s one I made about the Aschenbrenner 165-page PDF about AGI: https://youtu.be/6UmPoMBEDpA
https://illuminate.google.com/home?pli=1
Currently only handles arxiv PDFs.
https://www.gally.net/temp/20240930notebooklmpodcasts/index....
As a podcast listener, I lose interest if I can tell the audio is AI-generated...
I wonder which successful game will make use of AI generated content next.
Some people absorb information far easier when they hear it as part of a conversation. Perhaps it would be possible to use this technique to break down study materials into simple 10-minute chunks that discuss a chapter or a concept at a time.
We went from “computers can’t beat humans” to “okay, computers can beat humans, but they play like computers” to “computers are coming up with ideas humans never thought of that we can learn from” in about twenty years for chess, and less than five years for go.
That’s not a guarantee that writing, music, art, and video will follow a similar trajectory. But I don’t know of a valid reason to say they won’t.
Does anyone here have an argument to distinguish the creative endeavor of, say, writing from that of playing go?
So it works great, but it needs a bit of cleanup work for things like that repetition. I wondered if this happened because there was a big "Table of Contents" in the doc, and maybe that made it see everything twice? I didn't try it again with a document lacking the ToC.
https://notebooklm.google.com/notebook/9cf789be-1052-404b-8d...
And after, generated notes from the podcast:
https://podscribe.io/content/podcasts/101/episode/1727685408...
The podcast was exciting, though it didn't really go into much detail.
Still, I don’t hold much confidence on podcasts as knowledge transfer tools. It’s a nice gimmick with great voice synthesis, but it feels formulaic and a bit stilted from a knowledge navigation perspective.
The structure and bare-minimum "human" aspect of this seems perfect for people like me to actually get into podcasts. I do wish I could further cut out all the disfluencies (um, like, uh, etc) though.
The only barrier for me IMO is wondering how accurate those facts actually are (typical research-with-AI concern).
I'm very much looking forward to a more interactive form of this, though, where I can selectively dive deeper (or delve ;) ) into specific topics during the podcast, which is admittedly very surface-level right now.
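Cutting disfluencies at the transcript level could be prototyped with a simple filter before the audio is generated. This is a naive sketch under my own assumptions: the word list and regex are invented, "like" is deliberately left alone because it is usually a legitimate word, and editing finished audio would need word-level timestamps that this ignores.

```python
import re

# Filler words to strip from a generated transcript. "like" is deliberately
# omitted: it is usually a real verb/preposition and needs context to remove.
_FILLER_RE = re.compile(r",?\s*\b(um+|uh+|you know)\b,?", re.IGNORECASE)

def strip_fillers(transcript: str) -> str:
    """Remove common disfluencies from a podcast transcript.

    A naive text-level filter: it drops the filler plus any surrounding
    commas, then collapses the leftover whitespace.
    """
    cleaned = _FILLER_RE.sub("", transcript)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

For example, `strip_fillers("I was, um, walking, you know, home")` yields `"I was walking home"`.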
Personally I think the flow of the conversation is lacking a bit right now. To me it still sounds like two people reading off a script trying to sound like podcast hosts. I guess that's because I'm picking up on some subtle tonalities that sound off and incongruent. Still impressive though.
I think a great use case for it would be education. It would make learning textbook content far more engaging for some children and also could be listened to on the bus or in the car on the way to school!
I recall just a couple of years ago when even the best models, like WaveNet, still had a subtle robotic quality.
What architectures or models have led to this breakthrough? Or is it possible that, as a non-native English speaker, I’m missing some nuances?
So as a brainstorming tool, it's a nice low-effort way to get some new perspectives. Compared to the chat, where you have to keep feeding it new questions, this just 'explores' the topic and goes on for 10 minutes.
It would be interesting to know if it's multimodal voice, or just clever prompting and recombining...
I added single voice podcasts to Magpai after seeing how useful this was. Allows for a bit more customisation of the podcast too https://www.youtube.com/watch?v=OEsh9MlbA6s
I've got a daily podcast of hackernews being generated here too: https://www.magpai.app/share/n7R91q
In separate news: I've been looking into building a web publisher plugin that allows you to "save articles" and then generate a podcast for later listening. With summarization and more advancements in text-to-speech, this is getting easier to hack together something really compelling.
But more seriously, I suppose there will probably soon be a flood of AI-generated podcasts, if this hasn't happened already. Pick a niche but not too niche topic, feed in a bunch of articles on it, and boom you've got season one. Given the quality, I could see one actually catching on...
Also this would be handy for getting listening practice in other languages. Makes it much easier to find content that you find interesting.
The result: https://intellistream.ai/static/intellistream_podcast2.ogg
- "Hold up. What if I say that sky is not blue?"
- "Whoa, I did not even think about it. "
- "Wait, so if the sky isn't blue, what color is it then?"
- "Maybe... it's invisible? Like, we can see through it, so technically it's not there!"
- "Exactly. This idea is revolutionary, right?"
- "Bla bla bla bla bla bla bla bla bla"
I failed to listen through the whole example audio attached, because, you know, it is mostly, like, throwing, like, arbitrary, like, questions - and confirming, you know, with words "exactly/see/yeah/you got it/you know it/yeahaha/pretty much, right/that's a million dollar question", you know. It's a brainrot conversation I would never listen to.

I'm seeing this to be true in almost every application.
Chain of thought is not the best way to improve LLM outputs.
Manual divide and conquer with an outlining or planning step is better. Then, in separate responses, address each plan step in turn.
I've yet to experiment with revision or critique steps; what kinds of prompts have people tried for those?
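The plan-then-expand workflow the comment describes can be sketched as a loop of separate LLM calls, one per plan step. Everything here is an illustrative assumption: the prompt wording is invented and `llm` stands in for any prompt-to-text callable.

```python
def divide_and_conquer(task: str, llm) -> str:
    """Plan first, then address each step in its own response.

    `llm` is any prompt-to-text callable; the prompts are illustrative
    assumptions, not a tested recipe.
    """
    plan = llm(f"List numbered steps to accomplish: {task}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    sections = []
    for step in steps:
        # One call per sub-problem keeps the model focused on a single
        # step, rather than asking for one long chain-of-thought answer.
        sections.append(llm(f"Address this step in detail: {step}"))
    return "\n\n".join(sections)
```

A revision or critique pass would slot in naturally as one more call per section before the final join.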
There are still some extremely challenging/interesting problems to make it not terrible. This is where we get to invent the future.
What happens when all our search tools are completely unreliable because it's all generated crap?
I'm already telling my kids they can trust nothing on the internet.
How much of HN now is AI bots?
Imagine sending this audio back to 2010 and telling people it was all made with AI, voices, script, everything. Back then it would've made me go "oh yeah we are -totally- getting flying cars and a dystopian neon skyline in the 2020s"
They like kept like saying like like in between each like word.
10/10 for realism.
I sent the podcast audio to a friend whose first language isn't English, without telling them it was AI generated.
They found it entertaining enough to listen to the end.
Sure, it needs more human unpredictability and some added goofiness. Maybe some interruptions, because humans do that too. But it's already not-bad.
My annoyance is that if I imagine each host, they tend to go in and out of knowing everything and then knowing nothing about that topic. I think it might be better to have a host and a subject matter expert guest or something like that.
I didn't listen further to work out whether it was a robot or just an American accent (I may later, though).
For things that already have a large body of scholarship, and have a set of fairly solidified interpretations, it is very good at giving summaries. But for works that still remain enigmatic and difficult to interpret, it fails to produce anything new or interesting.
It seems to be a more complex version of ChatGPT, but it has the same underlying problems, so it's not useful for someone doing academic work or trying to create something radically new, as with other LLMs in the past.
What was more interesting was the word-for-word accuracy.
I fed all of my posts year-to-date into NotebookLM and had it generate the podcast. The affect/structure was awesome.
But I noticed some inaccuracies in the words. They completely botched the theme of at least one of my posts and quite literally misinformed the listener in a few other spots. Without context, someone new to my posts and listening to the podcast would have no idea.
So, absolutely - wow factor. But still need content validation on top. Don't think any of you are surprised but felt it was worth emphasizing.
https://theteardown.substack.com/p/ai-expressing-empathy-fre...
This is awful.
While the vultures will shit out AI generated garbage in volume to make ever diminishing returns while externalizing hosting cost to Youtube and co, actual creators will starve because nobody will see their content among the AI generated shit tsunami.
Finally the AI bros are finishing the enshittification job their surveillance advertising comrades couldn't. Destroy ALL the internet! Burn all human culture! Force feed blipverts to children for all I care, as long as I make bank!
I guess it's easiest to destroy culture if you didn't have any to begin with.
EDIT: to be clear, what I'm really asking is what does this tech demo extend to--what might we imagine actually using this technology for? Or is that not the point?