December 4th, 2024

Genie 2: A large-scale foundation world model

Google DeepMind's Genie 2 is a foundation world model that creates diverse, action-controllable 3D environments from a single image, advancing AI training and research towards artificial general intelligence through complex interactions and memory capabilities.

Google DeepMind has introduced Genie 2, a large-scale foundation world model designed to generate diverse, action-controllable 3D environments for training and evaluating AI agents. Building on the capabilities of its predecessor, Genie 1, which focused on 2D worlds, Genie 2 can create a vast array of rich 3D environments based on a single prompt image. This model allows for rapid prototyping of interactive experiences, enabling researchers to experiment with novel environments and train embodied AI agents effectively. Genie 2 simulates virtual worlds, predicting the consequences of actions taken within them, and demonstrates emergent capabilities such as complex character animations, physics modeling, and object interactions. It can generate consistent environments for up to a minute, responding intelligently to user inputs. The model also supports the creation of counterfactual experiences and long-horizon memory, allowing it to remember and accurately render parts of the world that are no longer in view. Genie 2's ability to transform concept art into interactive environments accelerates the creative process for designers and researchers alike. Although still in early development, Genie 2 represents a significant advancement in the quest for artificial general intelligence (AGI) by addressing the challenges of training embodied agents in diverse and complex settings.

- Genie 2 generates diverse 3D environments from a single image prompt.

- It enables rapid prototyping for interactive experiences and AI training.

- The model simulates complex interactions and maintains consistency in environments.

- Genie 2 supports counterfactual experiences and long-horizon memory.

- It aims to advance research towards achieving artificial general intelligence (AGI).

88 comments
By @vessenes - 5 months
This is.. super impressive. I'd like to know how large this model is. I note that the first thing they have it do is talk to agents who can control the world gen; geez - even robots get to play video games while we work.

That said; I cannot find any:

- architecture explanation

- code

- technical details

- API access information

Feels very DeepMind / 2015, and that's a bummer. I think the point of the "we have no moat" email has been taken to heart at Google, and they continue to be on the path of great demos, bleh product launches two years later, and no open access in the interim.

That said, just knowing this is possible - world navigation based on a photo and a text description with up to a minute of held context -- is amazing, and I believe will inspire some groups out there to put out open versions.

By @erulabs - 5 months
It’s interesting to me that we continue to see such pressure on video and world generation, despite the fact that for years now we’ve gotten games and movies that have beautiful worlds filled with lousy, limited, poorly written stories. Star Wars movies have looked phenomenal for a decade, full of bland stories we’ve all heard a thousand times.

Are there any game developers working on infinite story games? I don’t care if it looks like Minecraft, I want a Minecraft that tells intriguing stories with infinite quest generation. Procedural infinite world gen recharged gaming, where is the procedural infinite story generation?

Still, awesome demo. I imagine by the time my kids are in their prime video game age (another 5 years or so) we will be in a new golden age of interactive storytelling.

Hey siri, tell me the epic of Gilgamesh over 40 hours of gameplay set 50,000 years in the future where genetic engineering has become trivial and Enkidu is a child’s creation.

By @freedryk - 5 months
Forget video games. This is a huge step forward for AGI and Robotics. There's a lot of evidence from Neurobiology that we must be running something like this in our brains--things like optical illusions, the editing out of our visual blind spot, the relatively low bandwidth measured in neural signals from our senses to our brain, hallucinations, our ability to visualize 3d shapes, to dream. This is the start of adding all those abilities to our machines. Low bandwidth telepresence rigs. Subatomic VR environments synthesized from particle accelerator data. Glasses that make the world 20% more pleasant to look at. Schizophrenic automobiles. One day a power surge is going to fry your doorbell camera and it'll start tripping balls.
By @nopinsight - 5 months
The real goal of this research is developing models that match or exceed human understanding of the 3D world -- a key step toward AGI.

A key reason why current Large Multimodal Models (LMMs) still have inferior visual understanding compared to humans is their lack of deep comprehension of the 3D world. Such understanding requires movement, interaction, and feedback from the physical environment. Models that incorporate these elements will likely yield much more capable LMMs.

As a result, we can expect significant improvements in robotics and self-driving cars in the near future.

Simulations + Limited robot data from labs + Algorithmic advances --> Better spatial intelligence

which will lead to a positive feedback loop:

Better spatial intelligence --> Better robots --> More robot deployment --> Better spatial intelligence --> ...

By @cptroot - 5 months
For all that this is lauded as a "prototyping tool", it's frustrating to see Genie2 discarding entire portions of the concept art demo. The original images drawn by Max Cant have these beautiful alien creatures. Large ones floating, and small ones being herded(?). Genie2 just ignores these beautiful details entirely:

> That large alien? That's a tree.
> That other large alien? It's a bush.
> That herd of small creatures? Fugghedaboutit.
> The lightning storm? I can do one lightning pole.
> Those towering baobab/acacia hybrids? Actually only two stories tall.

It feels so insulting to the concept artist to show those two videos off.

By @simonw - 5 months
Related recent project you can try out yourself (Chrome only) which hallucinates new frames of a Minecraft style game: https://oasis.decart.ai/

That one would reimagine the world any time you look at the sky or ground. Sounds like Genie2 solves that: "Genie 2 is capable of remembering parts of the world that are no longer in view and then rendering them accurately when they become observable again."

By @jjice - 5 months
I don't understand this space very well, but this seems incredible.

Something I find interesting about generative AI is how it adds a huge layer of flexibility, but at the cost of lots of computation, while a very narrow set of constraints (a traditional program) is comparatively incredibly efficient.

If someone spent a ton of time building out something simple in Unity, they could get the same thing running with a small fraction of the computation, but this has seemingly infinite flexibility based on so little and that's just incredible.

The reason I mention it is because I'm interested in where we end up using these. Will traditional programming be used for most "production" workloads with gen AI being used to aid in the prototyping and development of those traditional programs, or will we get to the point where our gen AI is the primary driver of software?

I assume that concrete code will always be faster and the best way to have deterministic results, but I really have no idea how to conceptualize what the future looks like now.

By @lifeformed - 5 months
Neat tech, but people might mistake this as being useful for game development, where it'll be worse than useless.

Games are about interactions, and this actively works against that. You don't want the model to infer mechanics; the designer needs deep control over every aspect of them.

People mentioned using this for prototyping a game, but that's completely meaningless. What would it even mean to use this to prototype something? It doesn't help you figure out anything mechanically or visually. It's just, "what if you were an avatar in a world?" What do you do after you run around with your random character controller in your random environments?

I think the most useful part of this is the world generation part, not the mechanics inference part.

By @nine_k - 5 months
While cool, this also seems utterly wasteful. Video games offer known "analytical" solutions for the interactions that the model provides as a "statistical approximation", so to speak.

I would consider a different approach, where the training phase watches games (or video recordings) and refines the formulas that describe their physics, the geometry of the area, the optics, etc. The result would be a "map" that is "playable" without much if any inference involved, and with no time limitation dictated by the size of the context to keep.

Certainly, video game map generation by AI is a thing, and creating models of motion by watching and then fitting reasonably simple functions (fewer than millions of parameters) is also known.

I cannot be the first person to think about such possibilities, so I wonder what the current SOTA looks like there.
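
To make the "fit simple formulas from observation" idea concrete, here is a minimal, purely illustrative sketch (not anything Genie 2 does): it assumes a hypothetical upstream tracker has already extracted noisy (time, height) samples of a falling object from gameplay video, and it recovers the three parameters of a quadratic motion model by least squares instead of learning a huge frame predictor.

    import numpy as np

    # Hypothetical tracker output: noisy (time, height) samples of a falling
    # object, synthesized here from y(t) = y0 + v0*t - 0.5*g*t^2.
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 2.0, 30)
    y_obs = 10.0 + 2.0 * t - 0.5 * 9.8 * t**2 + rng.normal(scale=0.05, size=t.shape)

    # Fit the three-parameter motion model with ordinary least squares.
    A = np.stack([np.ones_like(t), t, t**2], axis=1)  # columns: 1, t, t^2
    (y0, v0, c2), *_ = np.linalg.lstsq(A, y_obs, rcond=None)
    g_est = -2.0 * c2

    print(f"y0 = {y0:.2f} m, v0 = {v0:.2f} m/s, g = {g_est:.2f} m/s^2")

A real system in this spirit would need far richer function families (rigid bodies, contacts, camera models), but the contrast with a billion-parameter frame predictor is the point the comment is making.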

By @brink - 5 months
What is actually of value here? There's no actual game, it's incredibly expensive to compute, the behavior is erratic.. It's cool because it's new - but that will quickly wear off, and once that's gone, what's left? There's insane amounts of money being spent on this, and for what?
By @aithrowawaycomm - 5 months
It is jaw-dropping and dismaying how for-profit AI companies use long-standing terms like "world model" and "physics" when they mean "video game model" and "video game physics." Or, as you can plainly see, "models gravity" when they mean "models Red Dead Redemption 2's gravity function, along with its cinematic lighting effects and Rockstar's distinctively weighty animations." Which is to say Google is not modeling gravity at all.

I will add that the totally inconsistent backgrounds in the "prototyping" example suggest the AI is simply cribbing from four different games with a flying avatar, which makes it kind of useless unless you're prototyping cynical AI slop. And what are we even doing here by calling this a "world model" if the details of the world can change on a whim? In my world model I can imagine a small dragon flying through my friend's living room without needing to turn her electric lights into sconces and fireplaces.

To state the obvious: if you train your model on thousands of hours of video games, you're also gonna get a bunch of stuff like "leaves are flat and don't bend" or "sometimes humans look like plastic" or "sometimes dragons clip through the scenery," which wouldn't fly in an actual world model. Just call it "video game world model!" Google is intentionally misusing a term which (although mysterious) has real meaning in cognitive science.

I am sure Genie 2 took an awful lot of work and technical expertise. But this advertisement isn't just unscientific, it's an assault on language itself.

By @Const-me - 5 months
The scrolling doesn’t work in my MS Edge so I opened the page in Firefox. Firefox has “Open Video in New Tab” context menu command. When viewed that way, the videos are not that impressive. Horrible visual quality, Egyptian pyramids of random shapes which cast round shadows, etc.

I have a feeling many AI researchers are trying to fix things which are not broken.

Game engines are not broken; no reasonable amount of AI TFlops is going to approach a professional with UE5. DAWs are not broken; no reasonable amount of AI TFlops is going to approach a professional with Steinberg Cubase and Apple Logic.

I wonder why so many AI researchers are trying to generate the complete output with their models, as opposed to training a model to generate some intermediate representation and/or realtime commands for industry-standard software.

By @bix6 - 5 months
Genuine question: What is the point of telling us about this if we can’t use it? Is it just to flex on everyone?
By @binalpatel - 5 months
This is super impressive.

Interesting they're framing this more from the world model/agent environment angle, when this seems like the best example so far of generative games.

720p, realtime, mostly consistent games for a minute is amazing, considering Stable Diffusion was originally released 2ish years ago.

By @devonsolomon - 5 months
Yesterday I laughed with my brother about how harsh people on the internet were about the World Labs launch (“you can only walk three steps, this demo sucks!”). I was thinking, “this was unthinkable a few years ago, this is incredible”.

People of the internet, you were right. Now, this is incredible.

By @mdrzn - 5 months
Wow.. I can't even imagine where we'll be in 5 or 10 years from now.

Seems that it's only "consistent" up to a minute, but if the progress keeps the same rate.. just wow.

By @beeflet - 5 months
These game-video models remind me of the dream-like "Mind Game" game described in Ender's Game, because of how it has to spontaneously come up with a new environment to address player input. The game in that book is also described as an AI.
By @bearjaws - 5 months
> Genie 2 is capable of remembering parts of the world that are no longer in view and then rendering them accurately when they become observable again.

This is huge; the Minecraft demos we saw recently were just toys because you couldn't actually do anything in them.

By @notsylver - 5 months
I doubt it, but it would be interesting if they recorded Stadia sessions and trained on that data (... somehow removing the hud?), seems like it would be the easiest way for them to get the data for this.
By @rndmize - 5 months
These clips feel like watching someone dream in real time. Particularly the door ones, where the environment changes in wild fashion, or the middle NPC one, where you see a character walk into shadow and mostly disappear and a different character walk out.
By @jdlyga - 5 months
It's very cool, but we've gotten too many of these big bold announcements with no payoff. All it takes is a very limited demo and we'd be much happier.
By @ddtaylor - 5 months
This is very impressive technology and I am active in this space. Very active. I make an (unreleased) Steam game that helps users create their own games without knowing how to program. I also (unknowingly) co-authored tools that K-12 and university programs are using to teach game programming.

For the time being I will gloss over the fact this might just be a consumer facing product for Google that ends up having nothing to do with younger developers.

I'm torn between two ideas:

a. Show kids awesome stuff that motivates them to code

b. Show kids how to code something that might not be as awesome, but they actually made it

On the one hand you want to show kids something cool and get them motivated. What Google is doing here is certainly capable of doing that.

On the other hand I want to show kids what they can actually do and empower them. The days of making a game on your own in your basement are mostly dead, but I don't think that means the idea of being someone who can control a large part of your vision - both technical and non-technical - is any less important.

Not everyone is the same either. I have met kids who would never spend a few hours learning some Python with pygame to get a few rectangles and sprites on screen, but who might get more interested if they saw something this flashy. But experience also tells me those kids are far less likely to get much value from a tool like this beyond entertainment.

I have a 14 year old son myself and I struggle to understand how he sees the world in this capacity sometimes. I don't understand what he thinks is easy or hard and it warps his expectations drastically. I come from a time period where you would grind for hours at a terminal pecking in garbage from a magazine to see a few seconds of crappy graphics. I don't think there should be meaningless labor attached to programming for no reason, but I also think that creating a "cost" to some degree may have helped us. Given two programs to peck into the terminal, which one do you peck? Very few of us had the patience (and lack of sanity) to peck them all.

By @taneq - 5 months
I don't see any mention of DIAMOND (https://diamond-wm.github.io/) which does something pretty similar, training a model to predict a game or otherwise 3D world based on videos of gameplay plus corresponding user inputs.

It's fascinating how much understanding of the world is being extracted and learned by these models in order to do this. (For the 'that's not really understanding' crowd, what definition of 'understanding' are you using?)

By @qwertox - 5 months
This is... something different. It will be interesting to see how we will integrate our current 3D tooling into that prompt-based world. Sometimes a "place a button next to the door" isn't the same as selecting a button and then clicking on the place next to the door, as it is today, or sculpting a terrain with a brush - all heavily 3D-oriented operations involving transformation matrix calculations, while that prompt-based world is built through words.

The current tooling we have is just way too good to just discard it, think of Maya, Blender and the like. How will these interfaces, with the tools they already provide, enable sculpting these word-based worlds?

I wonder if some kind of translator will be required, one which precisely instructs "User holds a brush pointing 33° upwards and 56° to the left of the world's x-axis with a brush consisting of ... applied with a strength of ...", or how this will be translated into embeddings or whatever will be required to communicate with that engine.

This is probably the most exciting time for the CG industry in decades, and that is saying a lot, since we've been seeing incredible progress in every area of traditional CG generation. Also a scary time for those who learned the skills and will now occasionally see some random person producing incredible visuals with zero knowledge of the entire CG pipeline.

By @enbugger - 5 months
Just like with images, this will never be in good enough shape to actually use for a real product, as it discards details completely, leaving a generic third-person controller animation.

What this should tell you instead is that things are really bad on the training data side if you have to start scraping billions of game streams from the internet - it's hard to imagine a bigger chunk of training data than this. Stagnation incoming.

By @amaurose - 5 months
I am wondering if this sort of thing could be used in the real world, in particular as a navigation helper for a blind pedestrian. Products like Orcam have shown a cam + headphones can more or less easily be packed onto some glasses (for OCR). Navigation helper tools have existed since the 80s, but all they basically did until now is scan the environment in a primitive way and use some sort of vibration to alert the user. This is very unspecific, and mostly useless in real life. However, having a vision AI that looks down the path of a blind person could potentially revolutionize this sort of application. For obstacle detection and navigation help. From "Careful, construction site on the sidewalk, 20 meters ahead" to "tactile paving 1 meter to your left". Let's take the game to the streets! If the tech is there, that sounds like a good startup idea...
By @brap - 5 months
While this is very (very) cool, what is the upside to having a model render everything at runtime, vs. having it render the 3D assets during development (or even JIT), and then rendering it as just another game? I can think of many reasons why the latter is preferable.
By @asdaqopqkq - 5 months
First thing that comes to mind: what about multiplayer?

Can we let another model generate in this model's world and vice versa?

What if both output in a single instance of a world? What if both output in their own private world and only share data about location and some other metrics?

By @artninja1988 - 5 months
Looking at the list of authors, is this from their open endedness team? I found their position paper on it super convincing https://arxiv.org/abs/2406.02061
By @Stevvo - 5 months
You can see artifacts common in screen-space reflections in the videos. I suspect they are not due to the model rendering reflections based on screen-space information, but the model being trained on games that render reflections in such a manner.
By @m3kw9 - 5 months
“Generating unlimited diverse training environments for future general agents” - it may seem unlimited, but at some point a pattern will emerge. I don't buy that an AI can use a static model and train itself with data generated from it.
By @corysama - 5 months
For quite a while now David Holz of Midjourney has mused that videogames will be AI generated. Like a theoretical PlayStation 7 with an AI processor replacing the GPU.

But, I didn’t expect this much progress towards that quite this fast…

By @fowlie - 5 months
One cool use case for this could be "generative hybrid video meetings": when I participate in a Teams meeting and the majority is in the same physical room, the video conference software could read the wall camera video feed and generate individual video streams of each person as if they sat right in front of me.

Of all things this must be the most boring use case for this crazy looking new technology. But hybrid video meetings have always annoyed me and I think to myself that surely there must be a better way (and why hasn't it arrived yet?).

By @lacoolj - 5 months
OpenAI launched Sora (quite a while ago now), so Google needs to fire back with something else groundbreaking.

I love the advancement of the tech but this still looks very young and I'd be curious what the underlying output code looks like (how well it's formatted, documented, organized, optimized, etc.)

Also, this seems oddly related to the recent post from WorldLabs https://www.worldlabs.ai/blog. Wonder if this was timed to compete directly and overtake the related news cycle.

By @smusamashah - 5 months
It's so much like my lucid dreams, where the world sometimes stays consistent for a while when I take control of it. It's a strange feeling seeing a computer hallucinate a world just like I hallucinate a world in dreams.

This also means that my dreams will keep looking like this iteration of Genie 2, but the computer will scale up and the worlds won't look anything like my dreams anymore in future versions (it's already more colorful anyway).

I remember image generation used to look like dreams too in the beginning. Now it doesn't look anything like that.

By @andelink - 5 months
Is this type of on-the-fly graphics generation more expensive than purely text based LLMs? What is the inference energy impact of these types of models?
By @jckahn - 5 months
At first I was excited to see a new model, but then I saw no indication that the model is open source so I closed the page.
By @dartos - 5 months
> Genie 2 can generate consistent worlds for up to a minute, with the majority of examples shown lasting 10-20s.
By @josvdwest - 5 months
I understand the value of infinite NPC dialogues and story arcs, but why do we need live scene generation? Don't we already get that with procedural generation?
By @sergiotapia - 5 months
Will the GPU go the way of the soundcard, and will we all purchase an "LPU", a Language Processing Unit for AIs to run fast?

I remember there was a brief window where some gamers bought a PhysX card for high-fidelity physics in games. Ultimately they rolled that tech into the CPUs themselves, right?

By @CaptainFever - 5 months
As a game developer, I'm impressed and thinking of ideas of what to do with this kind of tech. The sailboat example was my favourite.

Depending on how controllable the tech ends up being, I suppose. Could be anywhere from a gimmick (which is still nice) to a game engine replacement.

By @KaoruAoiShiho - 5 months
This is where the GPU limits on China really hurt: Chinese companies have been dropping great proofs of concept, but because they have been so compute-bottlenecked they can't ever really make something actually competitive or transformative.
By @jerpint - 5 months
I have a sneaking suspicion OpenAI will announce something very similar in a few days
By @xcodevn - 5 months
On a very similar theme, here is the work from World Labs (founded by Fei-Fei Li of ImageNet, et al.) about creating 3D worlds:

https://www.worldlabs.ai/blog

By @ata_aman - 5 months
We're about to have on-demand video content and games simply based on prompts. My prediction is we'll have "prompt marketplaces" where you can gen content based on 3rd party prompts (or your own). 3-5 years.
By @rvz - 5 months
Hmmm.... But we were told on HN that "Google is dying", remember? In reality, it isn't.

We'll see which so-called AI-companies are really "dying" when either a correction, market crash or a new AI winter arrives.

By @tsunamifury - 5 months
I'm guessing from the demo sophisticated indoor architectures do not work yet.
By @worldmerge - 5 months
This looks really cool. How can I use it? Like can I mix it with Unity/Unreal?
By @k2xl - 5 months
This is impressive, but why do they all still look like a video game? Could they have this render movie scenes with realistic-looking humans? I wonder if it is due to the training set they use being mostly video games.
By @wg0 - 5 months
Google is not coming slow... This is magic. As a casual gamer and someone wanting to make my own game, this is black magic.

Lighting, gravity, character animation and what not internalized by the model... from a single image...!

By @empiricus - 5 months
Feed it the inputs from the real world and then it will recreate in its mind a mirror of the world. Some say this is what we do also, we live in a virtual reality created by our minds.
By @rationalfaith - 5 months
As impressive as this might seem, let's think about fundamentals.

Statistical models will output a compressed mishmash of what they were trained on.

No matter how hard they try to cover that inherent basic reality, it is still there.

Not to mention the upkeep of training on new "creative" material on a regular basis and the never ending bugs due to non-determinism. Aside from contrived cases for looking up and synthesizing information (Search Engine 2.0).

The tech industry is over-investing in this area, exposing an inherent bias towards output rather than solving actual problems for humanity.

By @zja - 5 months
I love the outtakes section at the bottom. It made me laugh but it also feels more transparent than a lot of GenAI stuff that's being announced.
By @42lux - 5 months
I don't know, I get the excitement, but as soon as you turn around and there is something completely different behind you, it breaks the immersion.
By @mrbungie - 5 months
Google doing the "look how we can do this but you can't and you won't with our help" with more force than ever.
By @aussieguy1234 - 5 months
If it can play video games that simulate the laws of physics, could it control a robot in the physical world?
By @baalimago - 5 months
To me, this is a bit like web3: Can't we already do this? What's the benefit?
By @swyx - 5 months
I was wondering when Genie 1 came out and... it didn't seem to get much love? https://news.ycombinator.com/item?id=39509937 @dang was there a main thread here?
By @stoicjumbotron - 5 months
Do people within Google get to try it? If yes, how long is the approval process?
By @andsoitis - 5 months
Will the agents in these worlds realize the worlds were sparked by humans?
By @david_shi - 5 months
"On the back part of the step, toward the right, I saw a small iridescent sphere of almost unbearable brilliance. At first I thought it was revolving; then I realised that this movement was an illusion created by the dizzying world it bounded. The Aleph's diameter was probably little more than an inch, but all space was there, actual and undiminished. Each thing (a mirror's face, let us say) was infinite things, since I distinctly saw it from every angle of the universe. I saw the teeming sea; I saw daybreak and nightfall; I saw the multitudes of America; I saw a silvery cobweb in the center of a black pyramid; I saw a splintered labyrinth (it was London); I saw, close up, unending eyes watching themselves in me as in a mirror; I saw all the mirrors on earth and none of them reflected me; I saw in a backyard of Soler Street the same tiles that thirty years before I'd seen in the entrance of a house in Fray Bentos; I saw bunches of grapes, snow, tobacco, lodes of metal, steam; I saw convex equatorial deserts and each one of their grains of sand..."
By @ingen0s - 5 months
So when is Google Glass coming back to spawn this for my pleasure?
By @infinite-hugs - 5 months
Do you want the Matrix? Because this is how you get the Matrix.
By @me551ah - 5 months
So when can I try this?
By @anthonymax - 5 months
Wow, is this artificial intelligence creating this already?
By @lionkor - 5 months
> deepmind.google uses cookies from Google to deliver and enhance the quality of its services and to analyze traffic. Learn more.

Yippee, finally Google posts a non-conforming cookie popup with no way to reject the ad cookies!

By @diimdeep - 5 months
What will be the equivalent of ChatGPT for world models, for them to really blow up in utility?
By @maxglute - 5 months
2000s graphics vibes.
By @rougka - 5 months
Waiting for OpenAI to take this concept and make it into a product
By @robblbobbl - 5 months
Release please
By @bbstats - 5 months
who is asking for this?
By @De_333 - 5 months
looks amazing!
By @xavirodriguez - 5 months
uoou
By @dangoodmanUT - 5 months
this page loads like shit
By @wildermuthn - 5 months
The technology is incredible, but the path to AGI isn't single-player. Qualia is the missing dataset required for AGI. See attention-schema theory for how social pressures lead to qualia-driven minds capable of true intelligence.
By @moralestapia - 5 months
Not even a month ago HN was discussing Ben Affleck's take on actors and AI, somehow siding with him and arguing that the tech is "just not there", etc.

I'll keep my stance: give it two years, and very realistic movies, with plot and everything, will be generated on demand.

By @tigerlily - 5 months
I can.. see this being used to solve crime, even solving unsolved mysteries and cold cases, among other alternative applications.
By @YeGoblynQueenne - 5 months
Hey, DeepMind folks, are you listening? Listen. We believe you: you can conquer any virtual world you put your mind to. Minecraft, Starcraft, Warcraft (?), Atari, anything. You can do it! With the power of RL and Neural Nets. Well done.

What you haven't been able to do so far, after many years of trying, is to go from the virtual to the real. Go from Arkanoid to a robot that can play, I dunno, squash, without dying. A robot that can navigate an arbitrary physical location without drowning, or falling off a cliff, or getting run over by a bus. Or build any Lego kit from instructions. Where's all that?

You've conquered games. Bravo! Now where's the real world autonomy?