Meta Movie Gen
Meta has launched Movie Gen, an AI model for creating and editing high-definition videos from text inputs, allowing personalized content generation and sound integration while emphasizing responsible AI development.
Meta has introduced Movie Gen, an advanced AI model designed for creating immersive media content. This innovative tool allows users to generate custom videos and sounds from simple text inputs, edit existing videos, and transform personal images into unique video content. Movie Gen is capable of producing high-definition videos in various aspect ratios, marking a significant advancement in the industry. Users can input descriptive text to create scenes, such as a girl running on the beach or a sloth floating in a pool, and the AI will generate corresponding videos. Additionally, Movie Gen offers precise video editing capabilities, enabling users to modify styles, transitions, and other elements through text commands. The platform also supports the creation of personalized videos by uploading images, ensuring that human identity and motion are preserved. Furthermore, Movie Gen can generate sound effects and soundtracks to accompany the videos, enhancing the overall experience. Meta emphasizes the importance of building AI responsibly, focusing on trust and safety in its applications. The company encourages users to explore its research paper for more insights into the benchmarks set by Movie Gen in media generation.
- Meta's Movie Gen allows video creation and editing from text inputs.
- The AI can produce high-definition videos in various aspect ratios.
- Users can upload images to create personalized videos while preserving identity.
- The platform generates sound effects and soundtracks to enhance video content.
- Meta prioritizes responsible AI development focused on trust and safety.
Related
Generating audio for video
Google DeepMind introduces V2A technology for video soundtracks, enhancing silent videos with synchronized audio. The system allows users to guide sound creation, aligning audio closely with visuals for realistic outputs. Ongoing research addresses challenges like maintaining audio quality and improving lip synchronization. DeepMind prioritizes responsible AI development, incorporating diverse perspectives and planning safety assessments before wider public access.
Meta 3D Gen
Meta introduces Meta 3D Gen (3DGen), a fast text-to-3D asset tool with high prompt fidelity and PBR support. It integrates AssetGen and TextureGen components, outperforming industry baselines in speed and quality.
Instagram starts letting people create AI versions of themselves
Meta has launched AI Studio, enabling US users to create customizable AI versions of themselves for Instagram, aimed at enhancing interaction while managing content and engagement with followers.
Show HN: Infinity – Realistic AI characters that can speak
Infinity AI has developed a groundbreaking video model that generates expressive characters from audio input, trained for 11 GPU years at a cost of $500,000, addressing limitations of existing tools.
Meta confirms it trains its AI on any image you ask Ray-Ban Meta AI to analyze
Meta can use images shared with its Ray-Ban Meta AI for training, raising privacy concerns as users may unknowingly provide sensitive data. Users must opt out to prevent data usage.
- Many commenters express concerns about the quality and authenticity of AI-generated videos, noting a distinct "AI sheen" and unrealistic elements.
- There are worries about the potential misuse of the technology for misinformation and deepfakes, with calls for regulation and watermarking.
- Some users see the technology as a tool for democratizing content creation, allowing more people to produce videos without significant resources.
- Critics question the societal impact of AI-generated content, fearing it may overwhelm genuine human creativity and lead to a decline in quality.
- Overall, the comments reflect a mix of excitement for the technology's potential and apprehension about its implications for the future of media and creativity.
The videos look cool, but I can’t really enjoy reading about them if my phone freezes every two seconds.
That any pre-schooler will be able to produce anything imaginable (watch out, parents) in seconds doesn't make it better to me or give it any real value.
OK, I needed to edit this again to add: maybe this IS the value of it. We can totally forget about fantasizing stories with visuals (movies) because nobody will care anymore.
It’s already nearly impossible to find quality content on the internet if you don’t know where to look.
- Every script in Hollywood will now be submitted with a previs movie.
- Manga to anime converters.
- Online commercials for far more products.
My mind instantly assumes it's a money thing and they just want to charge millions for it, putting it out of reach for the general public. But then, given Meta's whole stance on open AI models, that doesn't seem to ring true.
Always important to bear in mind that the examples they show are likely the best examples they were able to produce.
Many times over the past few years a new AI release has "wowed" me, but none of them resulted in any sudden overnight changes to the world as we know it.
VFX artists: You can sleep well tonight, just keep an eye on things!
The problem: In my limited playing with these tools, they don't quite hit the mark, and I would easily be able to tweak something if I had all the layers used. I imagine future products could be used to tweak this to match what I think the output should be...
At least the code generation tools are providing source code. Imagine them only giving compiled bytecode.
Scale? I have access to an H100. Meta trained their cat video stuff on six thousand H100s.
They mention that these consume 700W each. Do they pay domestic rates for power? Is that really only $500 per hour of electricity?
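For a rough sanity check of that number, here is a sketch assuming a ~$0.12/kWh commercial rate (hyperscalers typically negotiate lower industrial rates) and ignoring cooling/PUE overhead:

```python
# Back-of-the-envelope GPU power cost. The $0.12/kWh rate is an
# assumption, and cooling overhead is ignored.
num_gpus = 6_000
watts_per_gpu = 700           # H100 SXM board power
usd_per_kwh = 0.12            # assumed commercial electricity rate

total_kw = num_gpus * watts_per_gpu / 1_000       # 4,200 kW = 4.2 MW
cost_per_hour = total_kw * usd_per_kwh            # kWh consumed per hour
print(f"{total_kw / 1_000:.1f} MW -> ${cost_per_hour:,.0f}/hour")
# 4.2 MW -> $504/hour
```

So ~$500/hour is plausible for the GPUs alone; the all-in figure would be higher once cooling and the rest of the cluster are counted.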
At the level of image/video synthesis: Some leading companies have suggested they put watermarks in the content they create. Nice thought, but open source will always be an option, and people will always be able to build un-watermarked tools.
At the level of law: You could attempt to pass a law banning image/video generation entirely, or those without watermarks, but same issue as before– you can't stop someone from building this tech in their garage with open-source software.
At the level of social media platforms: If you know how GANs work, you already know this isn't possible. Half of image generation AI is an AI image detector itself. The detectors will always be just about as good as the generators; that's how the generators are able to improve themselves. It is, I will not mince words, IMPOSSIBLE to build an AI detector that works long-term, because as soon as you have a great AI content classifier, it's used to make a better generator that outsmarts the classifier.
So... smash the looms..?
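The adversarial dynamic that comment describes is literally the GAN training loop: the detector (discriminator) is the generator's training signal. A minimal PyTorch-style sketch, assuming `G` and `D` are ordinary `torch.nn.Module` networks (note that modern video models like Movie Gen are diffusion/flow based rather than GANs, but the arms-race argument carries over):

```python
import torch

def gan_step(G, D, real_images, opt_G, opt_D, latent_dim=128):
    """One adversarial step: the detector D trains the generator G."""
    bce = torch.nn.functional.binary_cross_entropy_with_logits
    batch = real_images.size(0)
    z = torch.randn(batch, latent_dim)

    # 1. Train the detector to separate real from generated images.
    fake = G(z).detach()
    loss_D = bce(D(real_images), torch.ones(batch, 1)) \
           + bce(D(fake), torch.zeros(batch, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2. Train the generator to fool that very detector: any improvement
    # to the detector is immediately recycled into a better generator.
    loss_G = bce(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```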
It being 30B gives me hope.
Seriously though. This is the company that is betting hard on VR goggles. And these are engines that can produce real time dreams, 3d, photographic quality, obedient to our commands. No 3d models needed, no physics simulations, no ray tracing, no prebuilt environments and avatars. All simply dreamed up in real time, as requested by the user in natural language. It might be one of the most addictive technologies ever invented.
Digital minimalism is looking more and more attractive.
It's going to be interesting to see how that plays out when you can make just about any kind of media you wish. (Especially when you can mix this as a form of 'embodiment' to realize relationships with virtual agents operated by LLMs.)
RIP Pika and ElevenLabs… tho I guess they always can offer convenience and top tier UX. Still, gotta imagine they’re panicking this morning!
Upload an image of yourself and transform it into a personalized video. Movie Gen’s cutting-edge model lets you create personalized videos that preserve human identity and motion.
Given how effective the still images of Trump saving people in floodwater and fixing electrical poles have been despite being identifiable as AI if you look closely (or think…), this is going to be nuts. 16 seconds is more than enough to convince people; I’m guessing the average video watch time is much less than that on social media. Also, YouTube Shorts (and whatever Meta’s version is) is about to get even worse, yet also probably more addicting! It would be hard to explain to an alien why we got so unreasonably good at optimal content to keep people scrolling. Imagine an automated YouTube channel running 24/7 A/B experiments for some set of audiences…
These are smooth and consistent: no sliding scenery (except the sloth floating in water, where the stones on the right move much faster than the approaching dock), no things appearing out of nowhere. Editing seems not as high quality (the candle-to-bubble example).
To me, the fact that these didn't induce nausea while being very high quality makes this the best among current video generators.
Before you downvote: don't take this as belittling the effort and the results, which are stunning, but as a sincere question.
I do plenty of photography, I do a lot of videography. I know my way around Premiere Pro, Lightroom and After Effects. I also know a decent amount about computer vision and cg.
If I look at the "edited" videos, they look fake. Immediately, and not just a little. They look like they were put through a washing machine full of effects: too contrasty, too much gamma, too much clarity, levels too low, like a baby playing with the effect controls. I can't exactly put my finger on it, but comparing the "original" videos to the ones that change just one element, like "add blue pom poms to his hands", the whole video changes and becomes a bit cartoony, for lack of a better word.
I am simply wondering why?!
Is that a change in general through the model that processes the video? Is that something that is easy to get rid of in future versions, or inherently baked into how the model transforms the video?
Curious if anybody has a solution or if this works for that
Anything longer than a single clip is just a bunch of these clips stitched together.
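For concreteness, the "stitching" is typically nothing fancier than concatenating clips end to end; a minimal sketch using ffmpeg's concat demuxer, with hypothetical clip filenames standing in for whatever the generator produced:

```python
import pathlib
import subprocess
import tempfile

# Hypothetical generator outputs, to be stitched in order.
clips = ["clip_000.mp4", "clip_001.mp4", "clip_002.mp4"]

# Write the playlist file that ffmpeg's concat demuxer expects.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.writelines(f"file '{pathlib.Path(c).resolve()}'\n" for c in clips)
    playlist = f.name

# Concatenate without re-encoding (assumes all clips share a codec).
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", playlist,
     "-c", "copy", "stitched.mp4"],
    check=True,
)
```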
What I hope (since I am building a storytelling front-end for AI-generated video) is that they consider B2C and sell this as a bulk service over an API.
But I'm worried about this tech being used for propaganda and disinformation.
Someone with a $1K computer and enough effort can generate a video that looks real enough. Add some effects to make it look like it was captured by a CCTV or another low-res camera.
This is what we know about, who knows what's behind NDAs or security clearances.
I will now review some of the standout clips.
That alien thing in the water is horrifying. The background fish look pretty convincing, except for the really flamboyant one in the dark.
I guess I should be impressed that the kite string seems to be rendered every frame and appears to be connected between the hand and the kite most of the time. The whole thing is really stressful though.
Drunk sloth with weirdly crisp shadow should take the top slot from girl in danger of being stolen by kite.
Man demonstrates novel chain-sword fire stick with four or five dimensions might be better off in the bin...
> The camera is behind a man. The man is shirtless, wearing a green cloth around his waist. He is barefoot. With a fiery object in each hand, he creates wide circular motions. A calm sea is in the background. The atmosphere is mesmerizing, with the fire dance.
This just reads like slightly clumsy lyrics to a lost Ween song.
https://ai.meta.com/blog/movie-gen-media-foundation-models-g...
I'd rather have those people work on climate change solutions
I can see myself paying a little too much to have a local setup for this.
Anyone able to update/inform a dinosaur?
> Upload an image of yourself and transform it
> into a personalized video. Movie Gen’s
> cutting-edge model lets you create personalized
> videos that preserve human identity and motion.
A stalker’s dream! I’m sure my ex is going to love all the videos I’m going to make of her! Jokes aside, it’s a little bizarre to me that they treat identity preservation as a feature while competitors treat that as a bug, explicitly trying not to preserve the identity of generated content to minimize deepfake reputation risk.
Any woman could have flagged this as an issue before this hit the public.
Especially based on the examples on this site, it's not a far reach to say that they will start to generate video ads of you (yes, YOU! your face! You've already uploaded hundreds of photos for them to reference!) using a specific product and showing how happy you are because you bought it. Imagine scrolling Instagram and seeing your own face smelling some laundry detergent or laughing because you took some prescription medicine.
Is it available for use now? Nope
When will it be available for use? On FB, IG and WhatsApp in 2025
Will it be open sourced? Maybe
What are they doing before releasing it? Working with filmmakers, improving video quality, reducing inference time
For a long time people have speculated about The Singularity. What happens when AI is used to improve AI in a virtuous circle of productivity? Well, that day has come. To generate videos from text you need video+text pairs to train on. They get that text from more AI. They trained a special Llama3 model that knows how to write detailed captions from images/video and used it to consistently annotate their database of approx 100M videos and 1B images. This is only one of many ways in which they deployed AI to help them train this new AI.
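A minimal sketch of that caption-the-data loop, using a small public captioner (BLIP via Hugging Face) as a stand-in; Meta's actual fine-tuned multimodal Llama3 is not publicly available in this form:

```python
from PIL import Image
from transformers import pipeline

# Public stand-in captioner for the AI-labels-data-for-AI loop.
captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")

def build_training_pairs(image_paths):
    """Turn raw media into (media, caption) pairs for generator training."""
    pairs = []
    for path in image_paths:
        caption = captioner(Image.open(path))[0]["generated_text"]
        pairs.append({"media": path, "caption": caption})
    return pairs
```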
They do a lot of pre-filtering on the videos to ensure training on high quality inputs only. This is a big recent trend in model training: scaling up data works but you can do even better by training on less data after dumping the noise. Things they filter out: portrait videos (landscape videos tend to be higher quality, presumably because it gets rid of most low effort phone cam vids), videos without motion, videos with too much jittery motion, videos with bars, videos with too much text, video with special motion effects like slideshows, perceptual duplicates etc. Then they work out the "concepts" in the videos and re-balance the training set to ensure there are no dominant concepts.
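A sketch of what that filter-then-rebalance pass could look like; every threshold and per-video score below is an assumption standing in for Meta's internal classifiers (motion detectors, OCR for overlay text, duplicate detection, and so on):

```python
from collections import Counter

def keep_video(v):
    """Assumed per-video metadata/scores standing in for real classifiers."""
    return (
        v["width"] > v["height"]            # landscape only
        and 0.1 < v["motion_score"] < 0.9   # drop static and jittery clips
        and v["text_area_ratio"] < 0.05     # not too much overlay text
        and not v["has_bars"]
        and not v["is_duplicate"]
    )

def rebalance(videos, max_share=0.05):
    """Cap any single concept at roughly max_share of the kept set."""
    budget = max(1, int(len(videos) * max_share))
    counts, kept = Counter(), []
    for v in videos:
        if counts[v["concept"]] < budget:
            counts[v["concept"]] += 1
            kept.append(v)
    return kept

# dataset = rebalance([v for v in raw_videos if keep_video(v)])
```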
You can control the camera because they trained a dedicated camera motion classifier and ran that over all the inputs, the outputs are then added to the text captions.
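Roughly, that amounts to appending a predicted motion label to each caption; a sketch with an assumed classifier and label set (the real model and taxonomy are internal to Meta):

```python
# Assumed camera-motion taxonomy and classifier.
CAMERA_LABELS = ["static", "pan left", "pan right", "zoom in", "zoom out",
                 "tilt up", "tilt down", "tracking shot"]

def augment_caption(pair, camera_classifier):
    """Append the motion label so the model learns camera-control phrases."""
    label = camera_classifier(pair["video"])   # -> one of CAMERA_LABELS
    pair["caption"] += f" The camera motion is: {label}."
    return pair
```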
The text embeddings they mix in are actually a concatenation of several models. There's MetaCLIP providing the usual understanding of what's in the request, but they also mix in a model trained on character-level text so you can request specific spellings of words too.
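A sketch of that concatenation, with stand-in encoders for MetaCLIP and the character-level model, both assumed to return token embeddings projected to a shared dimension:

```python
import torch

def build_text_conditioning(prompt, clip_encoder, char_encoder):
    """Concatenate semantic and character-level token embeddings.

    Both encoders are assumed stand-ins returning (seq_len, dim) tensors.
    """
    semantic = clip_encoder(prompt)     # what the scene should contain
    chars = char_encoder(prompt)        # exact spellings, for on-screen text
    return torch.cat([semantic, chars], dim=0)
```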
The AI sheen mentioned in other comments mostly isn't due to it being AI but rather because they fine-tune the model on videos selected for being "cinematic" or "aesthetic" in some way. It looks how they want it to look. For instance, they select for natural lighting, absence of too many small objects (clutter), vivid colors, interesting motion and absence of overlay text. What remains of the sheen is probably due to the AI upsampling they do, which lets them render videos at a smaller scale followed by a regular bilinear upsample + a "computer, enhance!" step.
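A sketch of that render-small-then-upsample pipeline; the `enhancer` is an assumed learned super-resolution module:

```python
import torch.nn.functional as F

def upscale_frames(frames, enhancer, scale=2):
    """frames: (batch, channels, height, width) low-res renders.

    Bilinear upsample, then a learned "computer, enhance!" pass;
    `enhancer` is an assumed super-resolution module.
    """
    big = F.interpolate(frames, scale_factor=scale,
                        mode="bilinear", align_corners=False)
    return enhancer(big)
```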
They just casually toss in some GPU cluster management improvements along the way for training.
Because Movie Gen was trained on Llama3-generated captions, it expects much more detailed and higher-effort captions than users normally provide. To bridge the gap they use a modified Llama3 to rewrite people's prompts to be higher detail and more consistent with the training set. They dedicate a few paragraphs to this step, but it nonetheless involves a ton of effort: distillation for efficiency, human evals to ensure rewrite quality, etc.
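A sketch of that rewrite bridge, using a generic chat-style LLM client; the instructions and the `llm.chat` interface are assumptions, not Meta's actual distilled rewriter:

```python
# Assumed rewrite instructions and a generic chat-completion interface.
REWRITE_INSTRUCTIONS = (
    "Rewrite the user's video request as a detailed, literal scene "
    "description: subjects, actions, setting, lighting, camera motion. "
    "Do not invent events the user did not ask for."
)

def rewrite_prompt(user_prompt, llm):
    """Expand a terse request into the caption style the model trained on."""
    return llm.chat(system=REWRITE_INSTRUCTIONS, user=user_prompt)

# "a sloth in a pool" might become: "A sloth floats on an inflatable
# ring in a sunlit swimming pool. The camera slowly zooms in. ..."
```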
I can't even begin to imagine how big of a project this must have been.
Impressive on the relative quality of the output. And of the productivity gains, sure.
But meh on the substance of it. It may be a dream for (financial) producers. For the direct customers as well (advertisement obviously, again). But for creators themselves (who are to be their own producers at some point, for some)?
On the maker side, art/work you don't sweat upon has little interest and emotional appeal. You shape it about as much as it shapes you.
On the viewer side, art that's not directed and produced by a human has little interest, connection and appeal as well. You can't be moved by something that's been produced by someone or something you can't relate to. Especially not a machine. It may have some accidental aesthetic interest, much like generative art had in the past. But uninhabited by someone's intent, it's just void of anything.
I know it's not the mainstream opinion, but generative AI sounds more and more every day like cryptocurrencies and NFTs: technologies that have not _yet_ found the defining problem to which they could be a solution.
It will not make you creative. It will not give you taste or talent. It is a technical tool that will mostly be used to produce cheap garbage unless you develop the skills to use it as a part of your creative toolkit -- which should also include many, many other things.
Yeah, we might get the bad killer robots. But it's more likely this will make it unnecessary to wonder where on this blue planet you can still live when we power the deserts with solar and go to space. Getting clean nutrition and environment will be within reach. I think that's great.
As with all technology: Yes a car is faster than you. And you can buy or rent one. But it's still great to be healthy and able to jog. So keep your brains folks and get some skills :)
#cabincrew
#scarletjohanson
#amen
As it stands, the only chance you have of depicting a consistent story across a series of shots is image-to-video, presuming you can use LoRAs or similar techniques to get the seed photos consistent in themselves.
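A sketch of that workflow with Hugging Face diffusers; the LoRA path and "<hero>" trigger token are hypothetical placeholders for a character LoRA you trained yourself, not a specific recommended stack:

```python
import torch
from diffusers import DiffusionPipeline

# Character LoRA keeps the seed stills consistent across shots.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("./my_character_lora")  # hypothetical trained LoRA

shots = [
    "the hero enters a rainy alley at night",
    "the hero looks up at a flickering neon sign",
]
# Consistent seed stills, one per shot; each would then be animated
# separately by an image-to-video model (e.g. Stable Video Diffusion).
seed_images = [pipe(f"photo of <hero>, {s}").images[0] for s in shots]
```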
Like, cool, a movie doesn’t need to cost $200 million or whatever.
Imagine if those creative types were freed up to do something different. What would we see? Better architecture and factories? Maybe better hospitals?
That's the most amenable approach to AI filmmaking I've seen available yet.
I'd have to see way more pencil-sketch conversions to see exactly what's going on...
...but that right there is the easiest way to hack making movies, with the most control... so far...
I’m here looking at users and wondering - the content pipelines are broader, but the exit points of attention and human brains are constant. How the heck are you supposed to know if your content is valid?
During a recent Apple event, someone on YT had an AI-generated video of Tim Cook announcing a crypto collaboration; it had 100k viewers before it was taken down.
Right now, all the videos of rockets falling on Israel can be faked. Heck, the responses on the communities are already populated by swathes of bots.
It’s simply cheaper to create content and overwhelm society level filters we inherited from an era of more expensive content creation.
Before anyone throws the sink at me for being a Luddite or raining on the parade - I’m coming from the side where you deal with the humans who consume content, and then decide to target your user base.
Yes, the vast majority of this is going to be used to create lovely cat memes and other great stuff.
At the same time, it takes just 1 post to act as a lightning rod and blow up things.
Edit:
From where I sit, there are 3 levels of issues.
1) Day to day arguments - this is organic normal human stuff
2) Bad actors - this is spammers, hate groups, hackers.
3) REALLY Bad actors - this is nation states conducting information warfare. This is countries seeding African user bases with faked stories, then using that as a basis for global interventions.
This is fake videos of war crimes, which incense their base and overshadow the harder won evidence of actual war crimes.
This doesn’t seem real, but political forces are about perception, not science and evidence.
It's only going to get better, faster, cheaper, easier.[a]
Sooner than anyone could have expected, we'll be able to ask the machines: "Turn this book into a two-hour movie with the likeness of [your favorite actor/actress] in the lead role."
Sooner than anyone could have expected, we'll be able to have immersive VR experiences that are crafted to each person.
Sooner than anyone could have expected, we won't be able to identify deepfakes anymore.
We sure live in interesting times!
---
[a] With apologies to Daft Punk: https://www.youtube.com/watch?v=gAjR4_CbPpQ
“I want a funny road trip movie starring Jim Carrey and Chris Farley, based in Europe, in the fall, where they have to rescue their mom, played by Lucille Ball, from making the mistake of marrying a character played by an older Steve Martin.”
10 minutes later your movie is generated.
If you like it, you save it, share it, etc.
You have a queue of movies shared by your friends that they liked.
Content will be endless and generated.
They're not really showing signs of slowing down either. Hey, Zuck, always thought you were kind of lame in the past. But maybe you weren't a one trick pony after all.
From Twitter/X:
Today we’re premiering Meta Movie Gen: the most advanced media foundation models to date.
Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in entirely new possibilities for casual creators and creative professionals alike.
More details and examples of what Movie Gen can do https://go.fb.me/kx1nqm
Movie Gen models and capabilities
Movie Gen Video: A 30B parameter transformer model that can generate high-quality and high-definition images and videos from a single text prompt.
Movie Gen Audio: A 13B parameter transformer model that can take a video input along with optional text prompts for controllability to generate high-fidelity audio synced to the video. It can generate ambient sound, instrumental background music and foley sound — delivering state-of-the-art results in audio quality, video-to-audio alignment and text-to-audio alignment.
Precise video editing: Using a generated or existing video and accompanying text instructions as input, it can perform localized edits such as adding, removing or replacing elements, or global changes like background or style changes.
Personalized videos: Using an image of a person and a text prompt, the model can generate a video with state-of-the-art results on character preservation and natural movement in video.
We’re continuing to work closely with creative professionals from across the field to integrate their feedback as we work towards a potential release. We look forward to sharing more on this work and the creative possibilities it will enable in the future.