Show HN: Infinity – Realistic AI characters that can speak
Infinity AI has built a foundation video model that generates expressive, speaking characters directly from audio input. Trained for 11 GPU years at a cost of about $500,000, it aims to address the limitations of existing lip-sync-based tools.
Infinity AI has developed a foundation video model that takes audio as a direct input and generates expressive, realistic characters that speak. The team describes it as the first video diffusion transformer designed for this purpose. Users can experiment with the model on the company's website or request custom video generations in the comments.

Trained for the equivalent of 11 GPU years at a cost of approximately $500,000, the model aims to overcome the limitations of existing generative video tools, which often rely on lip-syncing and can produce mismatched gestures and expressions. Instead, it takes a single image, audio, and other signals and generates the video end to end, capturing the complexities of human motion and emotion. It handles multiple languages well and produces realistic physics in its animations, but it struggles with non-humanoid images and occasionally distorts identity. The team is actively working on improvements and welcomes user feedback.
- Infinity AI has created a video model that generates characters speaking based on audio input.
- The team says the model is the first of its kind, trained for 11 GPU years at a cost of around $500,000.
- It aims to address issues with existing AI video tools that rely on lip-syncing.
- The model can handle multiple languages and generate realistic animations but struggles with non-humanoid images.
- Feedback from users is encouraged as the team continues to refine the technology.
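From the description, the model's interface reduces to one image plus audio (plus optional signals) in, video out. As a rough illustration only, here is what a client call to such a service might look like in Python. The endpoint, field names, and response shape are all hypothetical; no public API has been announced (a commenter below asks about one):

    """
    Hypothetical sketch of an image + audio -> video call.
    Nothing here is a real Infinity AI API; every name is a guess.
    """
    import requests

    HYPOTHETICAL_ENDPOINT = "https://api.example.com/v1/generate"  # placeholder URL

    def generate_talking_video(image_path: str, audio_path: str) -> bytes:
        """Send one reference image and a driving audio clip; get video bytes back."""
        with open(image_path, "rb") as img, open(audio_path, "rb") as aud:
            resp = requests.post(
                HYPOTHETICAL_ENDPOINT,
                files={"image": img, "audio": aud},  # field names are guesses
                timeout=300,  # generation is slow relative to a normal API call
            )
        resp.raise_for_status()
        return resp.content

    if __name__ == "__main__":
        with open("output.mp4", "wb") as f:
            f.write(generate_talking_video("portrait.png", "line.wav"))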
Related
Generating audio for video
Google DeepMind introduces V2A technology for video soundtracks, enhancing silent videos with synchronized audio. The system allows users to guide sound creation, aligning audio closely with visuals for realistic outputs. Ongoing research addresses challenges like maintaining audio quality and improving lip synchronization. DeepMind prioritizes responsible AI development, incorporating diverse perspectives and planning safety assessments before wider public access.
AI speech generator 'reaches human parity' – but it's too dangerous to release
Microsoft's VALL-E 2 AI speech generator replicates human voices accurately using minimal audio input. Despite its potential in various fields, Microsoft refrains from public release due to misuse concerns.
Tuning-Free Personalized Image Generation
Meta AI has launched the "Imagine yourself" model for personalized image generation, improving identity preservation, visual quality, and text alignment, while addressing limitations of previous techniques through innovative strategies.
OpenAI rolls out voice mode after delaying it for safety reasons
OpenAI is launching a new voice mode for ChatGPT, capable of detecting tones and processing audio directly. It will be available to paying customers by fall, starting with limited users.
CogVideoX: A Cutting-Edge Video Generation Model
ZhipuAI launched CogVideoX, an advanced video generation model featuring a 3D Variational Autoencoder for efficient data compression and an end-to-end understanding model, enhancing video generation and instruction responsiveness.
- Users are impressed by the technology's ability to generate expressive characters from audio, with many sharing their own creations.
- Concerns about the potential for misuse, such as deepfakes and copyright issues, are prevalent among commenters.
- Several users inquire about the model's limitations, including video length and voice variety.
- There is a desire for additional features, such as an API for integration and options for customizing voices.
- Some users express skepticism about the realism and quality of the generated videos, particularly with longer inputs.
EDIT: looks like the model doesn't like Duke Nukem: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Cropping out his pistol only made it worse lol: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
A different image works a little bit better, though: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
My go-to for checking the edges of video and face-identification models right now is Personas -- they're rendered faces done in a painterly style, and can be really hard to parse.
Here's some output: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Source image from: https://personacollective.ai/persona/1610
Overall, crazy impressive compared to competing offerings. I don't know if the mouth size problems are related to the race of the portrait, the style, the model, or the positioning of the head, but I'm looking forward to further iterations of the model. This is already good enough for a bunch of creative work, which is rad.
It’s astounding that 2 sentences generated this. (I used text-to-image, and the prompt for a space marine in power armour produced something amazing with no extra tweaks required.)
[0]: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Heads up, a little bit of language in the audio.
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Looks like too much Italian training data
I am curious whether you are in any way related to this team?
One drawback of tools like Runway (and Midjourney) is the lack of an API that allows integration into products. I would love to resell your service to my clients as part of a larger offering. Is this something you plan to offer?
The examples are very promising by the way.
Sorry if this question sounds dumb, but I am comparing it with regular image models, where the more images you have, the better the output images the model generates.
First, your (Lina's) intro is perfect: it honestly and briefly explains your work in progress.
Second, the example I tried had a perfect interpretation of the text meaning/sentiment and translated that to vocal and facial emphasis.
It's possible I hit on a pre-trained sentence. With the default manly-man I used the phrase, "Now is the time for all good men to come to the aid of their country."
Third, this is a fantastic niche opportunity - a billion+ memes a year - where each variant could require coming back to you.
Do you have plans to be able to start with an existing one and make variants of it? Is the model such that your service could store the model state for users to work from if they e.g., needed to localize the same phrase or render the same expressivity on different facial phenotypes?
I can also imagine your building different models for niches: faces speaking, faces aging (forward and back); outside of humans: cartoon transformers, cartoon pratfalls.
Finally, I can see both B2C and B2B, and growth/exit strategies for both.
NSFW! -- lyrics by Biggy$malls
A product built on top of this could split the input into reasonable chunks, generate video for each chunk separately, and stitch them together with another model that can transition from one facial expression to another in a fraction of a second.
An additional improvement might be feeding the system not one image but a few, each expressing a different emotion. The system could then analyze the split input to determine which emotional state each part of the video should start in.
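A minimal sketch of this chunk-and-stitch pipeline, with the same caveats: generate_clip() and blend_transition() are placeholders for models that would have to exist, pick_reference_image() stands in for the emotion-analysis step, and only the audio splitting (pydub) is real:

    """
    Sketch of the commenter's chunk-and-stitch idea. Only pydub is real;
    the generation, transition, and emotion-analysis steps are stubs.
    """
    from pydub import AudioSegment

    CHUNK_MS = 6_000  # stay under the ~7 s scene-reset horizon noted below

    def split_audio(path: str) -> list[AudioSegment]:
        audio = AudioSegment.from_file(path)
        return [audio[i:i + CHUNK_MS] for i in range(0, len(audio), CHUNK_MS)]

    def pick_reference_image(chunk: AudioSegment, images_by_emotion: dict[str, str]) -> str:
        # Stub: a real version would classify the chunk's sentiment and pick
        # the matching reference image from the set the user supplied.
        return images_by_emotion["neutral"]

    def generate_clip(image_path: str, chunk: AudioSegment) -> str:
        raise NotImplementedError("placeholder for the per-chunk video model")

    def blend_transition(clip_a: str, clip_b: str) -> str:
        raise NotImplementedError("placeholder for a sub-second expression-morph model")

    def build_segments(audio_path: str, images_by_emotion: dict[str, str]) -> list[str]:
        chunks = split_audio(audio_path)
        clips = [generate_clip(pick_reference_image(c, images_by_emotion), c)
                 for c in chunks]
        if not clips:
            return []
        # Interleave clips with short generated transitions between neighbors,
        # then concatenate the resulting segments in order with any video tool.
        segments = [clips[0]]
        for prev, nxt in zip(clips, clips[1:]):
            segments += [blend_transition(prev, nxt), nxt]
        return segments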
On an unrelated note ... the generated expressions seem relevant to the content of the input text. So either the text-to-speech model understands the language a bit, or the video model itself does.
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Managed to get it working with my doggo.
I have an immediate use case for this. Can you stream via API to support real-time chat this way?
Very very good!
Jonathan
founder@ixcoach.com
We deliver the most exceptional simulated life coaching, counseling and personal development experiences in the world through devotion to the belief that having all the support you need should be a right, not a privilege.
Test our capacity at ixcoach.com for free to see for yourself.
My generation: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
I thought you had to pay artists for a license before using their work in promotional material.
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Edit: Duke Nukem flubs his line: https://youtu.be/mcLrA6bGOjY
One small issue I've encountered is that sometimes images remain completely static. Seems to happen when the audio is short - 3 to 5 seconds long.
I feel like I accidentally made an advert for whitening toothpaste:
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
I am sure the service will get abused, but wish you lots of success.
I've been working on something adjacent to this concept with Ragdoll (https://github.com/bennyschmidt/ragdoll-studio), but focused not just on creating characters but producing creative deliverables using them.
I get the benefit of using celebrities because it's possible to tell if you actually hit the mark, whereas if you pick some random person you can't know if it's correct or even stable. But jeez... Andrew Tate in the first row? And it doesn't get better as I scroll down...
I noticed lots of small clips so I tried a longer script, and it seems to reset the scene periodically (every 7ish seconds). It seems hard to do anything serious with only small clips...?
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Hello I'm an AI-generated version of Yann LeCoon. As an unbiased expert, I'm not worried about AI. ... If somehow an AI gets out of control ... it will be my good AI against your bad AI. ... After all, what does history show us about technology-fueled conflicts among petty, self-interested humans?