June 20th, 2024

Generating audio for video

Google DeepMind introduces V2A technology for video soundtracks, enhancing silent videos with synchronized audio. The system allows users to guide sound creation, aligning audio closely with visuals for realistic outputs. Ongoing research addresses challenges like maintaining audio quality and improving lip synchronization. DeepMind prioritizes responsible AI development, incorporating diverse perspectives and planning safety assessments before wider public access.


Google DeepMind has introduced a new technology called video-to-audio (V2A) that generates synchronized soundtracks for videos using video pixels and text prompts. This advancement allows for the creation of rich soundscapes for silent videos, enhancing the overall viewing experience. The V2A system can produce various soundtracks for any video input, offering users the ability to guide the generated output towards desired sounds. By encoding video input and refining audio through a diffusion model, the technology aligns audio closely with visual prompts, creating realistic audio outputs. However, challenges such as maintaining audio quality with varying video inputs and improving lip synchronization for speech in videos are being addressed through ongoing research. DeepMind emphasizes responsible AI development, incorporating diverse perspectives to ensure the technology's positive impact and safeguarding against potential misuse through watermarking AI-generated content. Rigorous safety assessments are planned before wider public access to the V2A technology.
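The summary above describes two stages: encoding the video, then iteratively refining audio from noise under that conditioning. The toy sketch below illustrates only that control flow. It is not DeepMind's implementation, and it is not a real diffusion model: `encode_video`, `denoise_step`, and `generate_audio` are hypothetical names, the "encoder" is just mean pixel brightness per frame, and the "denoiser" is a fixed interpolation standing in for a learned network.

```python
import random


def encode_video(frames):
    """Hypothetical encoder: one conditioning value per frame (mean brightness)."""
    return [sum(frame) / len(frame) for frame in frames]


def denoise_step(audio, conditioning, strength):
    """Stand-in for a learned denoiser: nudge each sample toward its conditioning value."""
    return [a + strength * (c - a) for a, c in zip(audio, conditioning)]


def generate_audio(frames, steps=50, strength=0.2, seed=0):
    """Start from random noise and refine it step by step, guided by the video encoding."""
    rng = random.Random(seed)
    conditioning = encode_video(frames)
    audio = [rng.gauss(0.0, 1.0) for _ in conditioning]  # pure noise, one sample per frame
    for _ in range(steps):
        audio = denoise_step(audio, conditioning, strength)
    return audio


if __name__ == "__main__":
    # Three tiny "frames", each a list of pixel intensities (made-up data).
    frames = [[0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
    print(generate_audio(frames))
```

Changing `seed` changes the starting noise, which in a real generative system is one reason the same video can yield different soundtracks. Here every run converges to the same values because the stand-in denoiser is deterministic.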

Related

Video annotator: a framework for efficiently building video classifiers

The Netflix Technology Blog presents the Video Annotator (VA) framework for efficient video classifier creation. VA integrates vision-language models, active learning, and user validation, outperforming baseline methods with an 8.3 point Average Precision improvement.

Optimizing AI Inference at Character.ai

Character.AI optimizes LLM inference to handle more than 20,000 queries per second globally. Techniques such as Multi-Query Attention and int8 quantization have cut serving costs by a factor of 33 since late 2022, with the aim of enhancing AI capabilities worldwide.

Lessons About the Human Mind from Artificial Intelligence

In 2022, a Google engineer claimed AI chatbot LaMDA was self-aware, but further scrutiny revealed it mimicked human-like responses without true understanding. This incident underscores AI limitations in comprehension and originality.

Francois Chollet – LLMs won't lead to AGI – $1M Prize to find solution [video]

The video discusses limitations of large language models in AI, emphasizing genuine understanding and problem-solving skills. A prize incentivizes AI systems showcasing these abilities. Adaptability and knowledge acquisition are highlighted as crucial for true intelligence.

The Encyclopedia Project, or How to Know in the Age of AI

Artificial intelligence challenges information reliability online, blurring real and fake content. An anecdote underscores the necessity of trustworthy sources like encyclopedias. The piece advocates for critical thinking amid AI-driven misinformation.

10 comments
By @crazygringo - 7 months
Very very cool.

But I literally can't keep track anymore of which AI generative combinations of modalities have been released.

Crazy how two years ago this would have blown my mind. Now it's just, OK sure add it to the pile...

By @gundmc - 7 months
The AI slop problem is bad enough on TikTok/YouTube today. I shudder at the future of user-generated video platforms. I also wonder if the low barrier to create these videos will outpace the storage and processing capacity of the free platforms.
By @TheAceOfHearts - 7 months
Wouldn't it be better to generate multiple tracks that can be mixed / tweaked together, rather than a single track? That way you can also keep the parts you like and continue iterating on the parts you dislike.

If the sound is already being generated at a specific time, surely you can make it generate an output that can be consumed by existing audio mixing tools for further refinement.

The problem with doing these all-in-one integrated solutions is that you're kinda giving people an all-or-nothing option, which doesn't seem that useful. Maybe I'll end up being proven wrong.

By @peppertree - 7 months
I wonder if this can be trained to do lip reading.
By @masto - 7 months
I don't know if a computer can ever match the perfection of "shreds" videos. (The drum example came close)

https://www.youtube.com/playlist?list=PLQvwVDViTLXu4usHto8PH...

By @squarefoot - 7 months
As a wannabe drummer I can say the drumming example is quite bad, as the drummer doesn't seem to hit the toms often enough to produce tom rolls. However, the video is so heavily cropped that either I'm wrong or the AI was deliberately fed something difficult to interpret.
By @nanovision - 7 months
This is so cool.
By @animanoir - 7 months
Boooring!