August 28th, 2024

Diffusion Models Are Real-Time Game Engines

GameNGen, developed by Google and Tel Aviv University, simulates DOOM in real-time at over 20 frames per second using a two-phase training process, highlighting the potential of neural models in gaming.

GameNGen is a novel game engine developed by researchers from Google and Tel Aviv University, utilizing a neural model to enable real-time interaction within complex environments. It successfully simulates the classic game DOOM at over 20 frames per second on a single TPU, achieving a peak signal-to-noise ratio (PSNR) of 29.4, comparable to lossy JPEG compression. The model's effectiveness is demonstrated by human raters, who struggle to distinguish between actual gameplay and simulated clips.

GameNGen's training involves two phases: first, a reinforcement learning (RL) agent plays the game, generating training data from its actions and observations; second, a diffusion model is trained to predict the next frame based on previous frames and actions. To enhance stability during long gameplay sequences, conditioning augmentations are applied and Gaussian noise is introduced to the context frames during training. Additionally, the latent decoder of the pre-trained Stable Diffusion model is fine-tuned to improve image quality, particularly for small details. The research highlights the potential of diffusion models in real-time game engine applications.

- GameNGen simulates DOOM in real-time at over 20 frames per second.

- The model uses a two-phase training process involving reinforcement learning and diffusion modeling.

- Human raters find it challenging to differentiate between real and simulated gameplay.

- Conditioning augmentations and noise addition are critical for maintaining visual stability.

- The project showcases the capabilities of neural models in gaming technology.
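
To make the training recipe concrete, here is a minimal PyTorch-style sketch of the second phase. Every name (model, vae, q_sample) is a placeholder rather than the paper's code; the point is the shape of the objective: denoise the next frame's latent, conditioned on noise-corrupted context frames, the action history, and the corruption level itself.

    import torch
    import torch.nn.functional as F

    def training_step(model, vae, frames, actions, max_ctx_noise=0.7):
        # frames: (B, T+1, C, H, W) video clips; actions: (B, T+1) ints.
        # `model` (a UNet exposing q_sample) and `vae` are placeholders.
        context, target = frames[:, :-1], frames[:, -1]

        with torch.no_grad():                 # frozen latent encoder
            ctx = vae.encode(context)         # (B, T, c, h, w)
            tgt = vae.encode(target)          # (B, c, h, w)

        # Conditioning augmentation: corrupt the context with a random
        # amount of Gaussian noise so the model learns, at inference, to
        # correct the imperfect frames it generated itself.
        lvl = torch.rand(ctx.size(0), device=ctx.device)
        ctx = ctx + lvl.view(-1, 1, 1, 1, 1) * torch.randn_like(ctx)

        # Standard latent-diffusion objective on the *next* frame,
        # conditioned on corrupted context, actions, and the noise level.
        t = torch.randint(0, 1000, (tgt.size(0),), device=tgt.device)
        eps = torch.randn_like(tgt)
        eps_hat = model(model.q_sample(tgt, t, eps), t,
                        context=ctx, actions=actions[:, :-1], ctx_noise=lvl)
        return F.mse_loss(eps_hat, eps)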

Related

HybridNeRF: Efficient Neural Rendering

HybridNeRF combines surface and volumetric representations for efficient neural rendering, achieving 15-30% error rate improvement over baselines. It enables real-time framerates of 36 FPS at 2K×2K resolutions, outperforming VR-NeRF in quality and speed on various datasets.

Doom on Playdate

Nic Magnier successfully ported Doom to the Playdate, facing challenges with makefiles and compilers. He plans to enhance controls and optimize the game, aiming to integrate features like using the crank for interactions. Despite encountering crashes, Nic remains committed to refining the port. The community eagerly awaits further developments.

PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

PhysGen is a novel method for generating realistic videos from a single image using physical simulation and data-driven techniques, developed by researchers from the University of Illinois and Apple.

GPUDrive: Data-driven, multi-agent driving simulation at 1M FPS

The paper introduces GPUDrive, a GPU-accelerated simulator that generates over a million experience steps per second, enhancing multi-agent planning and training reinforcement learning agents using the Waymo Motion dataset.

Show HN: I Trained a 2D Game Animation Generation Model (Fully Open-Source)

The "God Mode Animation" GitHub repository offers an open-source platform for generating 2D game animations from text and images, featuring demo games, trained models, and comprehensive training instructions.

AI: What people are saying
The comments on the article about GameNGen reveal a mix of excitement and skepticism regarding the implications of using diffusion models in gaming.
  • Many commenters are impressed by the technology's ability to simulate DOOM in real-time, noting its potential for future applications in gaming.
  • There is a significant debate about whether diffusion models can truly function as game engines, with some arguing they merely replicate existing games rather than create new ones.
  • Concerns are raised about the lack of interactivity and the model's reliance on pre-existing game data, leading to questions about its practical utility in game development.
  • Several users express curiosity about the future of AI in gaming, including the possibility of generating new game content and improving visual quality.
  • Some comments highlight the need for further exploration and experimentation with AI in gaming, suggesting that the current implementation is just the beginning.
88 comments
By @vessenes - 8 months
So, this is surprising. Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected, which would be roughly ‘none’. Google here uses SD 1.4 as the core of the diffusion model, which is a nice reminder that open models are useful to even giant cloud monopolies.

The two main things of note I took away from the summary were: 1) they got infinite training data using agents playing doom (makes sense), and 2) they added Gaussian noise to source frames and rewarded the agent for ‘correcting’ sequential frames back, and said this was critical to get long range stable ‘rendering’ out of the model.

That last is intriguing — they explain the intuition as teaching the model to do error correction / guide it to be stable.

Finally, I wonder if this model would be easy to fine tune for ‘photo realistic’ / ray traced restyling — I’d be super curious to see how hard it would be to get a ‘nicer’ rendering out of this model, treating it as a doom foundation model of sorts.

Anyway, a fun idea that worked! Love those.

By @wkcheng - 8 months
It's insane that this works, and that it works fast enough to render at 20 fps. It seems like they almost made a cross between a diffusion model and an RNN, since they had to encode the previous frames and actions and feed them into the model at each step.

Abstractly, it's like the model is dreaming of a game that it played a lot of, and real time inputs just change the state of the dream. It makes me wonder if humans are just next moment prediction machines, with just a little bit more memory built in.
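
That "dreaming with a sliding window" structure is straightforward to sketch. A rough outline of the rollout loop, with model.sample, vae, and get_action as hypothetical placeholders rather than the paper's API:

    from collections import deque

    def play(model, vae, get_action, init_latents, init_actions, ctx=64):
        # Sliding window of past frame latents and actions -- the model's
        # only "memory", which is what makes it feel RNN-like.
        frames = deque(init_latents, maxlen=ctx)
        actions = deque(init_actions, maxlen=ctx)
        while True:
            actions.append(get_action())              # real-time input
            nxt = model.sample(context=list(frames),  # denoise next latent
                               actions=list(actions))
            yield vae.decode(nxt)                     # frame to display
            frames.append(nxt)                        # output feeds back in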

By @SeanAnderson - 8 months
After some discussion in this thread, I found it worth pointing out that this paper is NOT describing a system which receives real-time user input and adjusts its output accordingly, but, to me, the way the abstract is worded heavily implied this was occurring.

It's trained on a large set of data in which agents played DOOM and video samples are given to users for evaluation, but users are not feeding inputs into the simulation in real-time in such a way as to be "playing DOOM" at ~20FPS.

There are some key phrases within the paper that hint at this such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's just the omission of a section describing real-time user gameplay.

By @zzanz - 8 months
The quest to run doom on everything continues. Technically speaking, isn't this the greatest possible anti-Doom, the Doom with the highest possible hardware requirement? I just find it funny that on a linear scale of hardware specification, Doom now finds itself on both ends.
By @godelski - 8 months
Doom system requirements:

  - 4 MB RAM
  - 12 MB disk space 
Stable Diffusion v1

  > 860M UNet and CLIP ViT-L/14 (540M)
  Checkpoint size:
    4.27 GB
    7.7 GB (full EMA)
  Running on a TPU-v5e
    Peak compute per chip (bf16)  197 TFLOPs
    Peak compute per chip (Int8)  393 TFLOPs
    HBM2 capacity and bandwidth  16 GB, 819 GBps
    Interchip Interconnect BW  1600 Gbps
This is quite impressive, especially considering the speed. But there's still a ton of room for improvement. It seems it didn't even memorize the game despite having the capacity to do so hundreds of times over. So we definitely have lots of room for optimization methods. Though who knows how such things would affect existing tech since the goal here is to memorize.
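
(A quick sanity check on "hundreds of times over", using the figures quoted above; sizes are approximate:)

    # Back-of-the-envelope: how many copies of Doom's assets fit in the
    # SD 1.4 checkpoint? Figures taken from the comment above.
    doom_disk = 12 * 1024**2           # ~12 MB of disk space
    sd_ckpt = 4.27 * 1024**3           # ~4.27 GB checkpoint
    print(round(sd_ckpt / doom_disk))  # ~364 copies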

What's also interesting about this work is it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (how much prior knowledge do they get, considering pretrained models and the ViZDoom environment? Was Doom source code in T5? And which ViT checkpoint was used? I can't keep track of Google ViT checkpoints).

I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.

- https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...

- https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...

- https://cloud.google.com/tpu/docs/v5e

- https://github.com/Farama-Foundation/ViZDoom

- https://zdoom.org/index

By @Sohcahtoa82 - 8 months
It's always fun reading the dead comments on a post like this. People love to point out how pointless this is.

Some of ya'll need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.

Not everything has to be made for profit. Not everything has to be made to make the world a better place. Sometimes, people create things just for the learning experience, the challenge, or they're curious to see if something is possible.

Time spent enjoying yourself is never time wasted. Some of ya'll are going to be on your death beds wishing you had allowed yourself to have more fun.

By @HellDunkel - 8 months
Although impressive, I must disagree. Diffusion models are not game engines. A game engine is a component to propel your game (along the time axis?). In that sense it is similar to the engine of a car, hence the name. It does not need a single working car, nor a road to drive on, to do its job. The above is a dynamic, interactive replication of what happens when you put a car on a given road, requiring a million test drives with working vehicles. An engine would also work offroad.
By @refibrillator - 8 months
There is no text conditioning provided to the SD model because they removed it, but one can imagine a near future where text prompts are enough to create a fun new game!

Yes they had to use RL to learn what DOOM looks like and how it works, but this doesn’t necessarily pose a chicken vs egg problem. In the same way that LLMs can write a novel story, despite only being trained on existing text.

IMO one of the biggest challenges with this approach will be open world games with essentially an infinite number of possible states. The paper mentions that they had trouble getting RL agents to completely explore every nook and corner of DOOM. Factorio or Dwarf Fortress probably won’t be simulated anytime soon…I think.

By @danjl - 8 months
So, diffusion models are game engines as long as you already built the game? You need the game to train the model. Chicken. Egg?
By @dtagames - 8 months
A diffusion model cannot be a game engine because a game engine can be used to create new games and modify the rules of existing games in real time -- even rules which are not visible on-screen.

These tools are fascinating but, as with all AI hype, they need a disclaimer: The tool didn't create the game. It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).

By @alkonaut - 8 months
The job of the game engine is also to render the world given only the world's properties (textures, geometries, physics rules, ...), and not given "training data that had to be supplied from an already written engine".

I'm guessing that the "This door requires a blue key" doesn't mean that the user can run around, the engine dreams up a blue key in some other corner of the map, and the user can then return to the door and the engine now opens the door? THAT would be impressive. It's interesting to think that all that would be required for that task to go from really hard to quite doable, would be that the door requiring the blue key is blue, and the UI showing some icon indicating the user possesses the blue key. Without that, it becomes (old) hidden state.

By @helloplanets - 8 months
So, any given sequence of inputs is rebuilt into a corresponding image, twenty times per second. I wonder how separate the game logic and the generated graphics are in the fully trained model.

Given a sufficient enough separation between these two, couldn't you basically boil the game/input logic down to an abstract game template? Meaning, you could just output a hash that corresponds to a specific combination of inputs, and then treat the resulting mapping as a representation of a specific game's inner workings.

To make it less abstract, you could save some small enough snapshot of the game engine's state for all given input sequences. This could make it much less dependent on what's recorded off of the agents' screens. And you could map the objects that appear in the saved states to graphics, in a separate step.

I imagine this whole system would work especially well for games that only update when player input is given: Games like Myst, Sokoban, etc.
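
A minimal sketch of that idea for a deterministic, input-driven game; step_engine is a hypothetical stand-in for the real game logic, and rendering the stored states is deliberately left as a separate concern:

    import hashlib, pickle

    def seq_key(inputs):
        # Stable hash of an input sequence (OK for simple picklable inputs).
        return hashlib.sha256(pickle.dumps(list(inputs))).hexdigest()

    class InputStateCache:
        # Maps every input sequence to the engine state it produces, so
        # "game logic" becomes a lookup; mapping stored states to graphics
        # can then happen in a completely separate step.
        def __init__(self, initial_state, step_engine):
            self.states = {seq_key([]): initial_state}
            self.step = step_engine

        def state_after(self, inputs):
            key = seq_key(inputs)
            if key not in self.states:
                prev = self.state_after(inputs[:-1])  # fine for short sequences
                self.states[key] = self.step(prev, inputs[-1])
            return self.states[key]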

By @panki27 - 8 months
> Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation.

I can hardly believe this claim, anyone who has played some amount of DOOM before should notice the viewport and textures not "feeling right", or the usually static objects moving slightly.

By @golol - 8 months
What I don't understand is the following: if this works so well, why didn't we have good video generation much earlier? After diffusion models were seen to work, the most obvious thing to do was to generate the next frame based on previous frames, but... it took 1-2 years for good video models to appear. For example, compare Sora generating Minecraft video versus this method generating Minecraft video. Say in both cases the player is standing on a meadow with few inputs and watching some pigs. In the Sora video you'd expect the typical glitches to appear: erratic, sliding movement, overlapping legs, multiplication of pigs, etc. Would these glitches not appear in the GameNGen video? Why?
By @mo_42 - 8 months
An implementation of the game engine in the model itself is theoretically the most accurate solution for predicting the next frame.

I'm wondering when people will apply this to other areas like the real world. Would it learn the game engine of the universe (ie physics)?

By @icoder - 8 months
This is impressive. But at the same time, it can't count. We see this every time, and I understand why it happens, but it is still intriguing. We are so close, or in some ways even way beyond, and yet at the same time so extremely far away from 'our' intelligence.

(I say it can't count because there are numerous examples where the bullet count glitches. It goes right impressively often, but still, counting, whether up or down, is something computers have been able to do flawlessly basically since forever.)

(It is the same with chess, where the LLM models are becoming really good, yet sometimes make mistakes that even my 8yo niece would not make)

By @lIl-IIIl - 8 months
How does it know how many times it needs to shoot the zombie before it dies?

Most enemies have enough hit points to survive the first shot. If the model is only trained on the previous frame, it doesn't know how many times the enemy was already shot at.

From the video it seems like it is probability based - they may die right away or it might take way longer than it should.

I love how the player's health goes down when he stands in the radioactive green water.

In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.

By @masterspy7 - 8 months
There's been a ton of work to generate assets for games using AI: 3d models, textures, code, etc. None of that may even be necessary with a generative game engine like this! If you could scale this up, train on all games in existence, etc., I bet some interesting things would happen
By @nolist_policy - 8 months
Makes me wonder... If you stand still in front of a door so all past observations only contain that door, will the model teleport you to another level when opening the door?
By @smusamashah - 8 months
Has this model actually learned the 3d space of the game? Is it possible to break the camera free and roam around the map freely and view it from different angles?

I noticed a few hallucinations, e.g. when it picked up the green jacket from a corner, walking back it generated another corner. Therefore I don't think it has any clue about the 3D world of the game at all.

By @ravetcofx - 8 months
There is going to be a flood of these dreamlike "games" in the next few years. This feels like a bit of a breakthrough in the engineering of these systems.
By @Kapura - 8 months
What is useful about this? I am a game programmer, and I cannot imagine a world where this improves any part of the development process. It seems to me to be a way to copy a game without literally copying the assets and code; plagiarism with extra steps. What am I missing?
By @arduinomancer - 8 months
How does the model “remember” the whole state of the world?

Like if I kill an enemy in some room and walk all the way across the map and come back, would the body still be there?

By @rldjbpin - 8 months
this is truly a cool demo, but a very misleading title.

to me it seems like a very bruteforce or greedy way to give the impression to a user that they are "playing" a game. the difference being that you already own the game to make this possible, but cannot let the user use that copy!

using generative AI for game creation is at a nascent stage, but there are much more elegant ways to go about the end goal. perhaps in the future, with computing so far ahead that we've moved beyond the current architecture, this might be worth doing instead of emulation.

By @dabochen - 8 months
So there is no interactivity, but the generated content is not the exact view in the training data, is this the correct understanding?

If so, is it more like imagination/hallucination rather than rendering?

By @rrnechmech - 8 months
> To mitigate auto-regressive drift during inference, we corrupt context frames by adding Gaussian noise to encoded frames during training. This allows the network to correct information sampled in previous frames, and we found it to be critical for preserving visual stability over long time periods.

I get this (mostly). But would any kind soul care to elaborate on this? What is this "drift" they are trying to avoid and how does (AFAIU) adding noise help?
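
Loosely: at inference the model runs auto-regressively, consuming its own slightly imperfect outputs, a distribution it never saw if trained only on clean frames, so small artifacts feed back and compound. Training on noise-corrupted context (with the noise level given as an input) teaches the model to pull a corrupted context back toward plausible frames. A toy numeric illustration of the compounding, not from the paper:

    import random

    def rollout(steps=300, gain=1.02, correction=0.0):
        # Toy model of auto-regressive error: each frame inherits the
        # previous frame's error (slightly amplified) plus fresh noise.
        # `correction` stands in for a model trained to denoise its own
        # context frames.
        err = 0.0
        for _ in range(steps):
            err = gain * err + random.gauss(0.0, 0.01)
            err *= (1.0 - correction)
        return abs(err)

    print(rollout(correction=0.00))  # error keeps growing -> visual drift
    print(rollout(correction=0.05))  # effective gain < 1 -> stays bounded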

By @jamilton - 8 months
I wonder if the MineRL (https://www.ijcai.org/proceedings/2019/0339.pdf and minerl.io) dataset would be sufficient to reproduce this work with Minecraft.

Any other similar existing datasets?

A really goofy way I can think of to get a bunch of data would be to get videos from youtube and try to detect keyboard sounds to determine what keys they're pressing.

By @throwthrowuknow - 8 months
Several thoughts for future work:

1. Continue training on all of the games that used the Doom engine to see if it is capable of creating new graphics, enemies, weapons, etc. I think you would need to embed more details for this, perhaps information about what is present in the current level, so that you could prompt it to produce a new level from some combination.

2. Could embedding information from the map view or a raytrace of the surroundings of the player position help with consistency? I suppose the model would need to predict this information as the neural simulation progressed.

3. Can this technique be applied to generating videos with consistent subjects and environments by training on a camera view of a 3D scene and embedding the camera position and the position and animation states of objects and avatars within the scene?

4. What would the result of training on a variety of game engines and games with different mechanics and inputs be? The space of possible actions is limited by the available keys on a keyboard or buttons on a controller but the labelling of the characteristics of each game may prove a challenge if you wanted to be able to prompt for specific details.

By @bufferoverflow - 8 months
That's probably how our reality is rendered.
By @TheRealPomax - 8 months
If by "game" you mean "literal hallucination" then yes. But if we're not trying to click-bait, then no: it's not really a game when there is no permanence or determinism to be found anywhere. It might be a "game-flavoured dream simulator", but it's absolutely not a game engine.
By @t1c - 8 months
They got DOOM running on a diffusion engine before GTA 6
By @broast - 8 months
Maybe one day this will be how operating systems work.
By @KhoomeiK - 8 months
NVIDIA did something similar with GANs in 2020 [1], except users could actually play those games (unlike in this diffusion work which just plays back simulated video). Sentdex later adapted this to play GTA with a really cool demo [2].

[1] https://research.nvidia.com/labs/toronto-ai/gameGAN/

[2] https://www.youtube.com/watch?v=udPY5rQVoW0

By @dysoco - 8 months
Ah, finally we are starting to see something gaming-related. I'm curious why we haven't seen more neural networks applied to games, even in a completely experimental fashion; we used to have a lot of little experimental indie games such as Façade (2005), and I'm surprised we don't have something similar years after the advent of LLMs.

We could have mods for old games that generate voices for the characters for example. Maybe it's unfeasible from a computing perspective? There are people running local LLMs, no?

By @troupo - 8 months
Key: "predicts next frame, recreates classic Doom". A game that was analyzed and documented to death. And the training included uncountable runs of Doom.

A game engine lets you create a new game, not predict the next frame of an existing and copiously documented one.

This is not a game engine.

Creating a new good game? Good luck with that.

By @throwmeaway222 - 8 months
You know how when you're dreaming and you walk into a room at your house and you're suddenly naked at school?

I'm convinced this is the code that gives Data (ST TNG) his dreaming capabilities.

By @kcaj - 8 months
Take a bunch of videos of the real world and calculate the differential camera motion with optical flow or feature tracking. Call this the video’s control input. Now we can play SORA.
By @jetrink - 8 months
What if instead of a video game, this was trained on video and control inputs from people operating equipment like warehouse robots? Then an automated system could visualize the result of a proposed action or series of actions when operating the equipment itself. You would need a different model/algorithm to propose control inputs, but this would offer a way for the system to validate and refine plans as part of a problem solving feedback loop.
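
That idea is essentially model-predictive control with a learned visual world model. A rough sketch, where world_model.predict and score are hypothetical stand-ins for the frame predictor and a task-specific evaluator:

    def plan(world_model, score, context, candidates, horizon=16):
        # Roll each proposed action sequence through the learned frame
        # predictor, score the imagined outcome, and execute the best
        # first action (then replan on the next tick).
        best_seq, best_val = None, float("-inf")
        for seq in candidates:
            frames = list(context)
            for a in seq[:horizon]:
                frames.append(world_model.predict(frames, a))  # imagine
            val = score(frames)          # e.g. task progress, safety
            if val > best_val:
                best_seq, best_val = seq, val
        return best_seq[0]
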
By @yair99dd - 8 months
YouTube user hu-po streams critical, in-depth reviews of AI papers. Here is his take on this paper (and other relevant ones): https://www.youtube.com/live/JZgqQB4Aekc
By @lynx23 - 8 months
Hehe, this sounds like the backstory of a remake of the Terminator, or "I have no mouth, but I must scream." In the aftermath of AI killing off humanity, researchers look deeply into how this could have happened. And after a number of dead ends, they finally realize: it was trained, in its infancy, on Doom!
By @wantsanagent - 8 months
Anyone have reliable numbers on the file sizes here? Doom.exe from my searches was around 715k, and with all assets somewhere around 10MB. It looks like the SD 1.4 files are over 2GB, so it's likely we're looking at a 200-2000x increase in file size depending on if you think of this as an 'engine' or the full game.
By @lukol - 8 months
I believe future game engines will be state machines with deterministic algorithms that can be reproduced at any time. However, rendering said state into visual / auditory / etc. experiences will be taken over by AI models.

This will also allow players to easily customize what they experience without changing the core game loop.
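
A minimal sketch of that split, with every name hypothetical: the core loop stays deterministic and replayable, and the learned model only turns state into presentation:

    def game_loop(init_state, update, neural_renderer, present, player_input, dt=1/60):
        # Deterministic core: same seed and inputs -> same game, always.
        state = init_state(seed=42)
        while True:
            state = update(state, player_input(), dt)  # reproducible logic
            present(neural_renderer(state))            # AI-rendered output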

By @nuz - 8 months
I wonder how overfit it is, though. You could fit a lot of Doom-resolution JPEG frames into 4 GB (the size of SD 1.4).
By @JDEngi - 8 months
This is going to be the future of cloud gaming, isn't it? In order to deal with the latency, we just generate the next frame locally, and we'll have the true frame coming in later from the cloud, so we're never dreaming too far ahead of the actual game.
By @KETpXDDzR - 8 months
I think the correct title should be "Diffusion Models Are Fake Real-Time Game Engines". I don't think just more training will ever be sufficient to create a complete game engine. It would need to "understand" what it's doing.
By @ciroduran - 8 months
Congrats on running Doom on a Diffusion Model :D

I was really entranced by how combat is rendered (the grunt doing weird stuff, very much in the style that the model generates images). Now I'd like to see this implemented in a shader in a game

By @seydor - 8 months
I wonder how far it is from this to generating language reasoning about the game from the game itself, rather than learning a large corpus of language, like LLMs do. That would be a true grounded language generator
By @golol - 8 months
Certain categories of youtube videos can also be viewed as some sort of game where the actions are the audio/transcript advanced a couple of seconds. Add two eggs. Fetch the ball. I'm walking in the park.
By @darrinm - 8 months
So… is it interactive? Playable? Or just generating a video of gameplay?
By @holoduke - 8 months
I saw a video a while ago where they recreated actual Doom footage with a diffusion technique so it looked like a jungle or anything you liked. Can't find it anymore, but it looked impressive.
By @jumploops - 8 months
This seems similar to how we use LLMs to generate code: generate, run, fix, generate.

Instead of working through a game, it’s building generic UI components and using common abstractions.

By @qnleigh - 8 months
Could a similar scheme be used to drastically improve the visual quality of a video game? You would train the model on gameplay rendered at low and high quality (say with and without ray tracing, and with low and high density meshing), and try to get it to convert a quick render into something photorealistic on the fly.

When things like DALL-E first came out, I was expecting something like the above to make it into mainstream games within a few years. But that was either too optimistic or I'm not up to speed on this sort of thing.
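
One off-the-shelf way to prototype the idea today (offline, nowhere near real-time) is image-to-image diffusion over a cheaply rendered frame, keeping strength low so geometry and layout survive. This uses the real diffusers API, but the file names and prompt are made up for illustration:

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    # "lowpoly_frame.png" is a made-up stand-in for a quick, low-quality render.
    init = Image.open("lowpoly_frame.png").convert("RGB").resize((512, 512))
    out = pipe(prompt="photorealistic ray-traced render of the same scene",
               image=init,
               strength=0.35,   # low strength: keep geometry, restyle surfaces
               guidance_scale=7.5).images[0]
    out.save("restyled_frame.png")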

By @LtdJorge - 8 months
So is it taking inputs from a player and simulating the gameplay or is it just simulating everything (effectively, a generated video)?
By @lackoftactics - 8 months
I think Alan's conservative countdown to AGI will need to be updated after this. https://lifearchitect.ai/agi/ This is really impressive stuff. I thought about it a couple of months ago, that probably this is the next modality worth exploring for data, but didn't imagine it would come so fast. On the other side, the amount of compute required is crazy.
By @acoye - 8 months
Nvidia CEO reckons your GPU will be replaced with AI in “5-10 years”. So this is sort of the first working game, I guess.
By @acoye - 8 months
I'd love to see John Carmack come back from his AGI hiatus and advance AI-based rendering. This would be super cool.
By @amunozo - 8 months
This is amazing and an interesting discovery. It is a pity that I don't find it capable of creating anything new.
By @harha_ - 8 months
This is so sick I don't know what to say. I never expected this. Aren't the implications of this huge?
By @maxglute - 8 months
RL tetris effect hallucination.

Wish there were 1000s of hours of Hardcore Henry to train on. Maybe scrape GoPro war cams.

By @nicman23 - 8 months
what i want from something like this is a mix: a model that can infinitely "zoom" into an object's texture (even if not perfect, it would be fine), and a model that would create 3d geometry from bump maps / normals
By @mobiuscog - 8 months
Video Game streamers are next in line to be replaced by AI I guess.
By @EcommerceFlow - 8 months
FYI: Jensen said a few months ago that this is the future of gaming.
By @kqr - 8 months
I have been kind of "meh" about the recent AI hype, but this is seriously impressive.

Of course, we're clearly looking at complete nonsense generated by something that does not understand what it is doing – yet, it is astonishingly sensible nonsense given the type of information it is working from. I had no idea the state of the art was capable of this.

By @gwbas1c - 8 months
Am I the only one who thinks this is faked?

It's not that hard to fake something like this: Just make a video of DOSBox with DOOM running inside of it, and then compress it with settings that will result in compression artifacts.

By @amelius - 8 months
Yes, and you can use an LLM to simulate role playing games.
By @piperswe - 8 months
This is honestly the most impressive ML project I've seen since... probably O.G. DALL-E? Feels like a gem in a sea of AI shit.
By @jasonkstevens - 8 months
AI no longer plays Doom; it is Doom.
By @aghilmort - 8 months
looking forward to &/or wondering about overlap with notion of ray tracing LLMs
By @itomato - 8 months
The gibs are a dead giveaway
By @joseferben - 8 months
impressive, imagine this but photorealistic, with VR goggles.
By @thegabriele - 8 months
Wow, I bet Boston Dynamics and such are quite interested
By @YeGoblynQueenne - 8 months
Misleading Titles Are Everywhere These Days.
By @danielmarkbruce - 8 months
What is the point of this? It's hard to see how this is useful. Maybe it's just an exercise to show what a diffusion model can do?
By @richard___ - 8 months
Uhhh… demos would be more convincing with enemies and decreasing health
By @dean2432 - 8 months
So in the future we can play FPS games given any setting? Pog
By @sitkack - 8 months
What most programmers don't understand is that in the very near future, the entire application will be delivered by an AI model: no source, no text, just connect to the app over RDP. The whole app will be created by example; the app developer will train the app like a dog trainer trains a dog.