Genie 2: A large-scale foundation world model
Google DeepMind's Genie 2 is a foundation world model that generates diverse, action-controllable 3D environments from a single image, supporting the training and evaluation of AI agents on the path toward general artificial intelligence through complex interactions and long-horizon memory.
Google DeepMind has introduced Genie 2, a large-scale foundation world model designed to generate diverse, action-controllable 3D environments for training and evaluating AI agents. Building on the capabilities of its predecessor, Genie 1, which focused on 2D worlds, Genie 2 can create a vast array of rich 3D environments based on a single prompt image. This model allows for rapid prototyping of interactive experiences, enabling researchers to experiment with novel environments and train embodied AI agents effectively. Genie 2 simulates virtual worlds, predicting the consequences of actions taken within them, and demonstrates emergent capabilities such as complex character animations, physics modeling, and object interactions. It can generate consistent environments for up to a minute, responding intelligently to user inputs. The model also supports the creation of counterfactual experiences and long-horizon memory, allowing it to remember and accurately render parts of the world that are no longer in view. Genie 2's ability to transform concept art into interactive environments accelerates the creative process for designers and researchers alike. Although still in early development, Genie 2 represents a significant advancement in the quest for general artificial intelligence (AGI) by addressing the challenges of training embodied agents in diverse and complex settings.
- Genie 2 generates diverse 3D environments from a single image prompt.
- It enables rapid prototyping for interactive experiences and AI training.
- The model simulates complex interactions and maintains consistency in environments.
- Genie 2 supports counterfactual experiences and long-horizon memory.
- It aims to advance research towards achieving general artificial intelligence (AGI).
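DeepMind has not published Genie 2's architecture, code, or API, but the description above maps onto a familiar pattern: an autoregressive, action-conditioned rollout. The sketch below is a minimal illustration of that loop, assuming a frame predictor conditioned on past frames and a user action; every class and function name here is a placeholder for illustration, not Genie 2's actual interface.

```python
# Minimal sketch of an "action-controllable world model" rollout loop.
# The real Genie 2 architecture is unpublished; this is a stand-in.
import numpy as np

class ToyWorldModel:
    """Placeholder for a learned frame predictor conditioned on actions."""
    def predict_next_frame(self, frames: list, action: str) -> np.ndarray:
        # A real model would run a large video model here; we return noise.
        return np.random.rand(720, 1280, 3)

def rollout(model: ToyWorldModel, prompt_image: np.ndarray, actions: list) -> list:
    frames = [prompt_image]          # a single image seeds the world
    for action in actions:           # e.g. keyboard/mouse inputs each step
        frames.append(model.predict_next_frame(frames, action))
    return frames                    # consistency must hold across the rollout

frames = rollout(ToyWorldModel(), np.zeros((720, 1280, 3)), ["forward", "jump", "left"])
print(len(frames), "frames generated")
```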
Related
GenAI does not Think nor Understand
GenAI excels in language processing but struggles with logic-based tasks. An example reveals inconsistencies, prompting caution in relying on it. PartyRock is recommended for testing language models effectively.
Gemini Pro 1.5 experimental "version 0801" available for early testing
Google DeepMind's Gemini family of AI models, particularly Gemini 1.5 Pro, excels in multimodal understanding and complex tasks, featuring a two million token context window and improved performance in various benchmarks.
Google Releases Powerful AI Image Generator You Can Use for Free
Google launched Imagen 3, a free AI image generator in the U.S., producing images in 30 seconds with improved detail. It has restrictions on certain requests and raises copyright concerns.
New AI model can hallucinate a game of 1993's Doom in real time
Researchers from Google and Tel Aviv University developed GameNGen, an AI model that simulates Doom in real time, generating over 20 frames per second, but faces challenges with graphical glitches and visual consistency.
Roblox announces AI model for 3D game worlds
Roblox is developing an open-source generative AI tool to help users create 3D environments from text prompts, enhancing accessibility and aiming to capture 10% of global gaming content revenue.
That said, I cannot find any:
- architecture explanation
- code
- technical details
- API access information
Feels very DeepMind / 2015, and that's a bummer. I think the point of the "we have no moat" email has been taken to heart at Google, and they continue to be on the path of great demos, bleh product launches two years later, and no open access in the interim.
That said, just knowing this is possible -- world navigation based on a photo and a text description, with up to a minute of held context -- is amazing, and I believe it will inspire some groups out there to put out open versions.
Are there any game developers working on infinite story games? I don't care if it looks like Minecraft, I want a Minecraft that tells intriguing stories with infinite quest generation. Procedural infinite world gen recharged gaming; where is the procedural infinite story generation?
Still, awesome demo. I imagine by the time my kids are in their prime video game age (another 5 years or so) we will be in a new golden age of interactive storytelling.
Hey Siri, tell me the epic of Gilgamesh over 40 hours of gameplay, set 50,000 years in the future where genetic engineering has become trivial and Enkidu is a child's creation.
A key reason why current Large Multimodal Models (LMMs) still have inferior visual understanding compared to humans is their lack of deep comprehension of the 3D world. Such understanding requires movement, interaction, and feedback from the physical environment. Models that incorporate these elements will likely yield much more capable LMMs.
As a result, we can expect significant improvements in robotics and self-driving cars in the near future.
Simulations + Limited robot data from labs + Algorithms advancement --> Better spatial intelligence
which will lead to a positive feedback loop:
Better spatial intelligence --> Better robots --> More robot deployment --> Better spatial intelligence --> ...
> That large alien? That's a tree.
> That other large alien? It's a bush.
> That herd of small creatures? Fugghedaboutit.
> The lightning storm? I can do one lightning pole.
> Those towering baobab/acacia hybrids? Actually only two stories tall.
It feels so insulting to the concept artist to show those two videos off.
That one would reimagine the world any time you looked at the sky or ground. Sounds like Genie 2 solves that: "Genie 2 is capable of remembering parts of the world that are no longer in view and then rendering them accurately when they become observable again."
Something I find interesting about generative AI is how it adds a huge layer of flexibility, but at the cost of lots of computation, while a very narrow set of constraints (a traditional program) is comparatively incredibly efficient.
If someone spent a ton of time building out something simple in Unity, they could get the same thing running with a small fraction of the computation, but this has seemingly infinite flexibility based on so little and that's just incredible.
The reason I mention it is because I'm interested in where we end up using these. Will traditional programming be used for most "production" workloads with gen AI being used to aid in the prototyping and development of those traditional programs, or will we get to the point where our gen AI is the primary driver of software?
I assume that concrete code will always be faster and the best way to have deterministic results, but I really have no idea how to conceptualize what the future looks like now.
Games are about interactions, and this actively works against it. You don't want the model to infer mechanics, the designer needs deep control over every aspect of it.
People mentioned using this for prototyping a game, but that's completely meaningless. What would it even mean to use this to prototype something? It doesn't help you figure out anything mechanically or visually. It's just, "what if you were an avatar in a world?" What do you do after you run around with your random character controller in your random environments?
I think the most useful part of this is the world generation part, not the mechanics inference part.
I would consider a different approach, in which the training phase watches games (or video recordings) and refines the formulas that describe their physics, the geometry of the area, the optics, etc. The result would be a "map" that is "playable" without much if any inference involved, and with no time limitation dictated by the size of the context to keep.
Certainly, video game map generation by AI is a thing, and creating models of motion by watching and then fitting reasonably simple functions (fewer than millions of parameters) is also known.
I cannot be the first person to think about such possibilities, so I wonder what the current SOTA looks like there.
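A minimal sketch of that second idea, assuming the "watching" step has already extracted object positions from footage: fit a small parametric formula (here, a gravity constant from a falling object's trajectory) by least squares, so playback only needs the fitted constants rather than per-frame inference. The vision pipeline is stubbed out with synthetic data.

```python
# Hypothetical sketch of "watch, then fit a simple formula":
# recover a physics parameter (gravity) from observed trajectory samples.
import numpy as np

# Pretend these positions were extracted from gameplay footage at 30 fps;
# here we just synthesize them with a bit of measurement noise.
true_g = 9.81
t = np.arange(0, 1.0, 1 / 30)
y_observed = 100.0 - 0.5 * true_g * t**2 + np.random.normal(0, 0.05, t.shape)

# Fit y(t) = y0 + v0*t - 0.5*g*t^2 with ordinary least squares:
# design matrix columns are [1, t, -0.5*t^2], unknowns are [y0, v0, g].
A = np.column_stack([np.ones_like(t), t, -0.5 * t**2])
y0, v0, g_est = np.linalg.lstsq(A, y_observed, rcond=None)[0]

print(f"estimated gravity: {g_est:.2f} (true {true_g})")
# A handful of such fitted constants plus extracted geometry would form a
# compact, replayable "map"; no per-frame inference or context window needed.
```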
I will add that the totally inconsistent backgrounds in the "prototyping" example suggest the AI is simply cribbing from four different games with a flying avatar, which makes it kind of useless unless you're prototyping cynical AI slop. And what are we even doing here by calling this a "world model" if the details of the world can change on a whim? In my world model I can imagine a small dragon flying through my friend's living room without needing to turn her electric lights into sconces and fireplaces.
To state the obvious: if you train your model on thousands of hours of video games, you're also gonna get a bunch of stuff like "leaves are flat and don't bend" or "sometimes humans look like plastic" or "sometimes dragons clip through the scenery," which wouldn't fly in an actual world model. Just call it a "video game world model"! Google is intentionally misusing a term which (although mysterious) has real meaning in cognitive science.
I am sure Genie 2 took an awful lot of work and technical expertise. But this advertisement isn't just unscientific, it's an assault on language itself.
I have a feeling many AI researchers are trying to fix things which are not broken.
Game engines are not broken; no reasonable amount of AI TFLOPS is going to approach a professional with UE5. DAWs are not broken; no reasonable amount of AI TFLOPS is going to approach a professional with Steinberg Cubase or Apple Logic.
I wonder why so many AI researchers are trying to generate the complete output with their models, as opposed to training a model to generate some intermediate representation and/or realtime commands for industry-standard software.
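As a rough illustration of that alternative (purely hypothetical, not anything Genie 2 does): a model could decode into a constrained vocabulary of commands that an engine or DAW then executes, keeping the final rendering deterministic and inspectable. The command names and fields below are made up for the sketch.

```python
# Hypothetical "intermediate representation" approach: the model emits
# structured commands that existing tools execute, rather than raw pixels.
from dataclasses import dataclass
import json

@dataclass
class EngineCommand:
    target: str      # e.g. an engine or DAW operation name (made up here)
    params: dict     # arguments the host application understands

def plan_from_model(prompt: str) -> list[EngineCommand]:
    """Stand-in for a trained model; in practice this would be a sequence
    model decoding into a constrained command vocabulary."""
    return [
        EngineCommand("engine.spawn", {"asset": "tree_oak", "x": 12.0, "y": 0.0}),
        EngineCommand("engine.set_light", {"type": "directional", "intensity": 0.8}),
    ]

# The host tool consumes deterministic, inspectable commands:
for cmd in plan_from_model("a sparse forest at dusk"):
    print(json.dumps({"target": cmd.target, **cmd.params}))
```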
Interesting they're framing this more from the world model/agent environment angle, when this seems like the best example so far of generative games.
720p, realtime, mostly consistent games for a minute is amazing, considering Stable Diffusion was originally released only about two years ago.
People of the internet, you were right. Now, this is incredible.
Seems that it's only "consistent" for up to a minute, but if progress keeps up at the same rate... just wow.
This is huge; the Minecraft demos we saw recently were just toys because you couldn't actually do anything in them.
For the time being I will gloss over the fact that this might just be a consumer-facing product for Google that ends up having nothing to do with younger developers.
I'm torn between two ideas:
a. Show kids awesome stuff that motivates them to code
b. Show kids how to code something that might not be as awesome, but they actually made it
On the one hand you want to show kids something cool and get them motivated. What Google is doing here is certainly capable of doing that.
On the other hand I want to show kids what they can actually do and empower them. The days of making a game on your own in your basement are mostly dead, but I don't think that makes the idea of being someone who can control a large amount of your vision, both technical and non-technical, any less important.
Not everyone is the same either. I have met kids who would never spend a few hours learning some Python with pygame to get a few rectangles and sprites on screen, but who might get more interested if they saw something this flashy. But experience also tells me those kids are far less likely to get much value from a tool like this beyond entertainment.
I have a 14 year old son myself and I struggle to understand how he sees the world in this capacity sometimes. I don't understand what he thinks is easy or hard and it warps his expectations drastically. I come from a time period where you would grind for hours at a terminal pecking in garbage from a magazine to see a few seconds of crappy graphics. I don't think there should be meaningless labor attached to programming for no reason, but I also think that creating a "cost" to some degree may have helped us. Given two programs to peck into the terminal, which one do you peck? Very few of us had the patience (and lack of sanity) to peck them all.
It's fascinating how much understanding of the world is being extracted and learned by these models in order to do this. (For the 'that's not really understanding' crowd, what definition of 'understanding' are you using?)
The current tooling we have is just way too good to discard; think of Maya, Blender, and the like. How will these interfaces, with the tools they already provide, enable sculpting these word-based worlds?
I wonder if some kind of translator will be required, one which precisely instructs "User holds a brush pointing 33° upwards and 56° to the left of the world's x-axis with a brush consisting of ... applied with a strength of ...", or how this will be translated into embeddings or whatever else is required to communicate with that engine.
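One hypothetical shape such a translator could take, sketched below with made-up field names: a structured stroke record of the kind existing tools already understand, serialized into tokens (or, in a real system, embeddings) that a world model could condition on.

```python
# Hypothetical structured command a "translator" might produce from natural
# brush input before handing it to a generative world engine.
from dataclasses import dataclass

@dataclass
class BrushStroke:
    pitch_deg: float      # angle above the horizon, e.g. 33.0
    yaw_deg: float        # rotation left of the world x-axis, e.g. 56.0
    brush_id: str         # which brush preset the host tool should use
    strength: float       # application strength in the range 0.0..1.0

def to_prompt_tokens(stroke: BrushStroke) -> str:
    """Serialize the command into text a model could condition on; a real
    system might instead map it directly into the model's latent space."""
    return (f"<brush id={stroke.brush_id} pitch={stroke.pitch_deg} "
            f"yaw={stroke.yaw_deg} strength={stroke.strength}>")

print(to_prompt_tokens(BrushStroke(33.0, 56.0, "soft_round", 0.7)))
```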
This is probably the most exciting time for the CG industry in decades, and that means a lot, since we've been seeing incredible progress in every area of traditional CG generation. It's also a scary time for those who learned the skills and will now occasionally see random people with zero knowledge of the entire CG pipeline producing incredible visuals.
What this should tell you instead is that things are really bad on the training-data side if you have to start scraping billions of game streams off the internet; it's hard to imagine a bigger chunk of training data than this. Stagnation incoming.
Can we let other models generate in this model's world and vice versa?
What if both output in a single instance of a world? What if both output in their own private world and only share data about location and some other metrics?
But, I didn’t expect this much progress towards that quite this fast…
Of all things this must be the most boring use case for this crazy looking new technology. But hybrid video meetings have always annoyed me and I think to myself that surely there must be a better way (and why hasn't it arrived yet?).
I love the advancement of the tech but this still looks very young and I'd be curious what the underlying output code looks like (how well it's formatted, documented, organized, optimized, etc.)
Also, this seems oddly related to the recent post from WorldLabs https://www.worldlabs.ai/blog. Wonder if this was timed to compete directly and overtake the related news cycle.
This also means that my dreams will keep looking like this iteration of Genie 2, but compute will scale up and the worlds won't look anything like my dreams in the next versions (it's already more colorful anyway).
I remember image generation used to look like dreams too in the beginning. Now it doesn't look anything like that.
I remember there was a brief window where some gamers bought a PhysX card for high-fidelity physics in games. Ultimately they rolled that tech into the CPUs themselves, right?
Depending on how controllable the tech ends up being, I suppose. Could be anywhere from a gimmick (which is still nice) to a game engine replacement.
We'll see which so-called AI companies are really "dying" when a correction, market crash, or a new AI winter arrives.
Lighting, gravity, character animation and what not internalized by the model... from a single image...!
Statistical models will output a compressed mishmash of what they were trained on.
No matter how hard they try to cover that inherent basic reality, it is still there.
Not to mention the upkeep of training on new "creative" material on a regular basis, and the never-ending bugs due to non-determinism. That is, aside from contrived cases of looking up and synthesizing information (Search Engine 2.0).
The tech industry is over-investing in this area, exposing an inherent bias towards producing output rather than solving actual problems for humanity.
Yippee, finally Google posts a non-conforming cookie popup with no way to reject the ad cookies!
I'll keep my stance: give it two years and very realistic movies, with plot and everything, will be generated on demand.
What you haven't been able to do so far, after many years of trying, is to go from the virtual to the real. Go from Arkanoid to a robot that can play, I dunno, squash, without dying. A robot that can navigate an arbitrary physical location without drowning, or falling off a cliff, or getting run over by a bus. Or build any Lego kit from instructions. Where's all that?
You've conquered games. Bravo! Now where's the real world autonomy?