Diffusion Models Are Real-Time Game Engines
GameNGen, developed by Google and Tel Aviv University, simulates DOOM in real-time at over 20 frames per second using a two-phase training process, highlighting the potential of neural models in gaming.
GameNGen is a novel game engine developed by researchers from Google and Tel Aviv University, using a neural model to enable real-time interaction within complex environments. It simulates the classic game DOOM at over 20 frames per second on a single TPU, achieving a peak signal-to-noise ratio (PSNR) of 29.4, comparable to lossy JPEG compression. Human raters struggle to distinguish short clips of the simulation from actual gameplay. GameNGen's training involves two phases: first, a reinforcement learning (RL) agent plays the game, and its actions and observations are recorded as training data; second, a diffusion model is trained to predict the next frame conditioned on previous frames and actions. To keep long gameplay sequences stable, conditioning augmentation is applied by adding Gaussian noise to the context frames during training. Additionally, the latent decoder of the pre-trained Stable Diffusion model is fine-tuned to improve image quality, particularly for small details. The research highlights the potential of diffusion models in real-time game engine applications.
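For a concrete picture of the second training phase, here is a minimal sketch in PyTorch. It is not the paper's code: a toy convolutional denoiser stands in for Stable Diffusion's UNet, a single noise level replaces the real diffusion schedule, and all names (NextFrameDenoiser, train_step, the tensor shapes) are made up for illustration. It shows the two ingredients the summary describes: conditioning on past frames plus actions, and Gaussian noise added to the context frames so the model stays stable when it is later fed its own imperfect outputs.

```python
import torch
import torch.nn as nn

N_CONTEXT, N_ACTIONS, H, W = 4, 8, 64, 64  # toy sizes, not the paper's

class NextFrameDenoiser(nn.Module):
    """Toy stand-in for the action-conditioned UNet."""
    def __init__(self):
        super().__init__()
        self.action_emb = nn.Embedding(N_ACTIONS, 16)
        self.action_proj = nn.Linear(16 * N_CONTEXT, 64)
        # Input channels: noisy target frame (3) + context frames (3 * N_CONTEXT).
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3 * N_CONTEXT, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, noisy_target, context, actions):
        # Inject the action history as a per-channel bias (crude conditioning).
        a = self.action_emb(actions).flatten(1)            # (B, N_CONTEXT * 16)
        bias = self.action_proj(a)[:, :, None, None]       # (B, 64, 1, 1)
        x = torch.cat([noisy_target, context.flatten(1, 2)], dim=1)
        h = self.net[0](x) + bias
        for layer in self.net[1:]:
            h = layer(h)
        return h                                           # predicted noise

def train_step(model, opt, frames, actions, ctx_noise_std=0.1):
    """frames: (B, N_CONTEXT + 1, 3, H, W); actions: (B, N_CONTEXT) ints."""
    context, target = frames[:, :-1], frames[:, -1]
    # Conditioning augmentation: corrupt the context so the model learns to
    # tolerate (and correct) imperfect previous frames at inference time.
    context = context + ctx_noise_std * torch.randn_like(context)
    noise = torch.randn_like(target)
    noisy_target = target + noise
    pred = model(noisy_target, context, actions)
    loss = nn.functional.mse_loss(pred, noise)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Random tensors standing in for recorded agent rollouts:
model = NextFrameDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frames = torch.rand(2, N_CONTEXT + 1, 3, H, W)
actions = torch.randint(0, N_ACTIONS, (2, N_CONTEXT))
print(train_step(model, opt, frames, actions))
```

The key line is the context corruption: at inference time the context consists of the model's own previous generations, so training on clean ground-truth frames alone would let errors compound over a rollout.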
- GameNGen simulates DOOM in real-time at over 20 frames per second.
- The model uses a two-phase training process involving reinforcement learning and diffusion modeling.
- Human raters find it challenging to differentiate between real and simulated gameplay.
- Conditioning augmentations and noise addition are critical for maintaining visual stability.
- The project showcases the capabilities of neural models in gaming technology.
Related
HybridNeRF: Efficient Neural Rendering
HybridNeRF combines surface and volumetric representations for efficient neural rendering, achieving 15-30% error rate improvement over baselines. It enables real-time framerates of 36 FPS at 2K×2K resolutions, outperforming VR-NeRF in quality and speed on various datasets.
Doom on Playdate
Nic Magnier successfully ported Doom to the Playdate, facing challenges with makefiles and compilers. He plans to enhance controls and optimize the game, aiming to integrate features like using the crank for interactions. Despite encountering crashes, Nic remains committed to refining the port. The community eagerly awaits further developments.
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
PhysGen is a novel method for generating realistic videos from a single image using physical simulation and data-driven techniques, developed by researchers from the University of Illinois and Apple.
GPUDrive: Data-driven, multi-agent driving simulation at 1M FPS
The paper introduces GPUDrive, a GPU-accelerated simulator that generates over a million experience steps per second, enhancing multi-agent planning and training reinforcement learning agents using the Waymo Motion dataset.
Show HN: I Trained a 2D Game Animation Generation Model(Fully Open-Source)
The "God Mode Animation" GitHub repository offers an open-source platform for generating 2D game animations from text and images, featuring demo games, trained models, and comprehensive training instructions.
- Many commenters are impressed by the technology's ability to simulate DOOM in real-time, noting its potential for future applications in gaming.
- There is a significant debate about whether diffusion models can truly function as game engines, with some arguing they merely replicate existing games rather than create new ones.
- Concerns are raised about the lack of interactivity and the model's reliance on pre-existing game data, leading to questions about its practical utility in game development.
- Several users express curiosity about the future of AI in gaming, including the possibility of generating new game content and improving visual quality.
- Some comments highlight the need for further exploration and experimentation with AI in gaming, suggesting that the current implementation is just the beginning.
The two main things of note I took away from the summary were: 1) they got effectively infinite training data using agents playing DOOM (makes sense), and 2) they added Gaussian noise to the context frames and trained the model to 'correct' the corrupted frames, and said this was critical to get long-range stable 'rendering' out of the model.
That last is intriguing — they explain the intuition as teaching the model to do error correction / guide it to be stable.
Finally, I wonder if this model would be easy to fine tune for ‘photo realistic’ / ray traced restyling — I’d be super curious to see how hard it would be to get a ‘nicer’ rendering out of this model, treating it as a doom foundation model of sorts.
Anyway, a fun idea that worked! Love those.
Abstractly, it's like the model is dreaming of a game it has played a lot, and real-time inputs just change the state of the dream. It makes me wonder if humans are just next-moment prediction machines, with a little bit more memory built in.
It's trained on a large set of data in which agents played DOOM and video samples are given to users for evaluation, but users are not feeding inputs into the simulation in real-time in such a way as to be "playing DOOM" at ~20FPS.
There are some key phrases within the paper that hint at this such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's just the omission of a section describing real-time user gameplay.
Original DOOM system requirements:
- 4 MB RAM
- 12 MB disk space

Stable Diffusion v1: 860M-parameter UNet plus CLIP ViT-L/14 text encoder (540M). Checkpoint size:
- 4.27 GB
- 7.7 GB (full EMA)

Running on a TPU v5e:
- Peak compute per chip (bf16): 197 TFLOPs
- Peak compute per chip (Int8): 393 TFLOPs
- HBM2 capacity and bandwidth: 16 GB, 819 GBps
- Interchip interconnect bandwidth: 1,600 Gbps
This is quite impressive, especially considering the speed, but there's still a ton of room for improvement. It seems it didn't even memorize the game despite having the capacity to do so hundreds of times over, so we definitely have lots of room for optimization methods, though who knows how such things would affect existing tech, since the goal here is essentially to memorize.

What's also interesting about this work is that it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (and how much prior knowledge they should get, considering the pretrained models and the ViZDoom environment; was the Doom source code in T5? And which ViT checkpoint was used? I can't keep track of Google's ViT checkpoints).
I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.
- https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...
- https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...
- https://cloud.google.com/tpu/docs/v5e
Some of y'all need to learn how to make things for the fun of making things. Is this useful? No, not really. Is it interesting? Absolutely.
Not everything has to be made for profit. Not everything has to be made to make the world a better place. Sometimes, people create things just for the learning experience, the challenge, or they're curious to see if something is possible.
Time spent enjoying yourself is never time wasted. Some of y'all are going to be on your deathbeds wishing you had allowed yourself to have more fun.
Yes they had to use RL to learn what DOOM looks like and how it works, but this doesn’t necessarily pose a chicken vs egg problem. In the same way that LLMs can write a novel story, despite only being trained on existing text.
IMO one of the biggest challenges with this approach will be open world games with essentially an infinite number of possible states. The paper mentions that they had trouble getting RL agents to completely explore every nook and corner of DOOM. Factorio or Dwarf Fortress probably won’t be simulated anytime soon…I think.
These tools are fascinating but, as with all AI hype, they need a disclaimer: The tool didn't create the game. It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).
I'm guessing that the "This door requires a blue key" doesn't mean that the user can run around, the engine dreams up a blue key in some other corner of the map, and the user can then return to the door and the engine now opens the door? THAT would be impressive. It's interesting to think that all that would be required for that task to go from really hard to quite doable, would be that the door requiring the blue key is blue, and the UI showing some icon indicating the user possesses the blue key. Without that, it becomes (old) hidden state.
Given a sufficient enough separation between these two, couldn't you basically boil the game/input logic down to an abstract game template? Meaning, you could just output a hash that corresponds to a specific combination of inputs, and then treat the resulting mapping as a representation of a specific game's inner workings.
To make it less abstract, you could save some small enough snapshot of the game engine's state for all given input sequences. This could make it much less dependent to what's recorded off of the agents' screens. And you could map the objects that appear in the saved states to graphics, in a separate step.
I imagine this whole system would work especially well for games that only update when player input is given: Games like Myst, Sokoban, etc.
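A minimal sketch of the "abstract game template" idea above, under assumed toy semantics: every prefix of the input sequence is hashed and mapped to a small snapshot of engine state, which a separate step could later map to graphics. None of these names (GameState, apply_input, simulate) come from the paper; they are hypothetical and only meant to make the suggestion concrete.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class GameState:
    player_pos: tuple
    inventory: frozenset

def hash_inputs(inputs: list[str]) -> str:
    # Deterministic key for a specific combination of inputs.
    return hashlib.sha256("|".join(inputs).encode()).hexdigest()

def apply_input(state: GameState, key: str) -> GameState:
    # Stand-in for the real game logic; a single toy "pick up blue key" rule.
    if key == "use" and state.player_pos == (3, 7):
        return GameState(state.player_pos, state.inventory | {"blue_key"})
    if key == "forward":
        return GameState((state.player_pos[0], state.player_pos[1] + 1),
                         state.inventory)
    return state

snapshots: dict[str, GameState] = {}

def simulate(inputs: list[str]) -> GameState:
    state = GameState((3, 0), frozenset())
    for i, key in enumerate(inputs):
        state = apply_input(state, key)
        # Save a snapshot per input prefix instead of per rendered frame.
        snapshots[hash_inputs(inputs[: i + 1])] = state
    return state

final = simulate(["forward"] * 7 + ["use"])
print(final.inventory)  # frozenset({'blue_key'})
```

For input-driven games like Myst or Sokoban the snapshot table stays small, which is presumably why the comment singles them out.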
I can hardly believe this claim; anyone who has played some amount of DOOM before should notice the viewport and textures not "feeling right", or the usually static objects moving slightly.
I'm wondering when people will apply this to other areas like the real world. Would it learn the game engine of the universe (ie physics)?
(I say it can't count because there are numerous examples where the bullet count glitches. It gets it right impressively often, but still, counting up or down is something computers have been able to do flawlessly basically since forever.)
(It is the same with chess, where the LLM models are becoming really good, yet sometimes make mistakes that even my 8yo niece would not make)
Most enemies have enough hit points to survive the first shot. If the model only sees the last few frames, it doesn't know how many times the enemy has already been shot at.
From the video it seems like it is probability based - they may die right away or it might take way longer than it should.
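A back-of-the-envelope way to see why this looks probabilistic (my own toy example, not from the paper): if an enemy needs, say, three hits to die and the accumulated damage is hidden from the frames the model sees, the best any frame-conditioned predictor can learn is a per-shot death probability.

```python
import random

HITS_TO_KILL = 3  # assumed for illustration; varies per enemy in real DOOM

def empirical_death_rate(n=100_000):
    deaths = 0
    for _ in range(n):
        prior_hits = random.randint(0, HITS_TO_KILL - 1)  # hidden state
        if prior_hits + 1 >= HITS_TO_KILL:                # one visible shot lands
            deaths += 1
    return deaths / n

# With uniformly distributed hidden damage, one observed shot kills ~1/3 of
# the time: sometimes enemies drop instantly, sometimes they take far longer,
# which matches the behaviour described in the comment above.
print(f"P(death | one visible shot) ~ {empirical_death_rate():.3f}")
```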
I love how the player's health goes down when he stands in the radioactive green water.
In Doom the enemies fight with each other if they accidentally incur "friendly fire". It would be interesting to see it play out in this version.
I noticed a few hallucinations, e.g. when it picked up a green jacket from a corner; walking back, it generated another corner. Therefore I don't think it has any clue about the 3D world of the game at all.
Like if I kill an enemy in some room and walk all the way across the map and come back, would the body still be there?
To me it seems like a very brute-force or greedy way to give a user the impression that they are "playing" a game. The difference being that you already have to own the game to make this possible, but you can't let the user use that copy!
Using generative AI for game creation is at a nascent stage, but there are much more elegant ways to go about the end goal. Perhaps in the future, with computing so far ahead that we've moved beyond the current architecture, this might be worth doing instead of emulation.
If so, is it more like imagination/hallucination rather than rendering?
I get this (mostly). But would any kind soul care to elaborate on this? What is this "drift" they are trying to avoid and how does (AFAIU) adding noise help?
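To the question above, a toy illustration of the drift (my own, not from the paper): each generated frame carries a small error, and because the next frame is conditioned on that slightly-off frame, errors compound over an autoregressive rollout. Adding noise to the context during training teaches the model to pull corrupted inputs back toward plausible frames, which keeps the accumulated error bounded.

```python
import random

def rollout(steps, per_step_error=0.02, correction=0.0):
    """Track a scalar 'distance from a valid game frame' over a rollout."""
    drift = 0.0
    for _ in range(steps):
        drift += per_step_error * random.uniform(0.5, 1.5)  # fresh error each frame
        drift -= correction * drift                          # learned pull-back
    return drift

print(f"no augmentation:   drift after 300 frames ~ {rollout(300, correction=0.0):.2f}")
print(f"with augmentation: drift after 300 frames ~ {rollout(300, correction=0.1):.2f}")
```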
Any other similar existing datasets?
A really goofy way I can think of to get a bunch of data would be to get videos from YouTube and try to detect keyboard sounds to determine what keys they're pressing.
1. Continue training on all of the games that used the Doom engine to see if it is capable of creating new graphics, enemies, weapons, etc. I think you would need to embed more details for this, perhaps information about what is present in the current level, so that you could prompt it to produce a new level from some combination.
2. Could embedding information from the map view or a raytrace of the surroundings of the player position help with consistency? I suppose the model would need to predict this information as the neural simulation progressed.
3. Can this technique be applied to generating videos with consistent subjects and environments by training on a camera view of a 3D scene and embedding the camera position and the position and animation states of objects and avatars within the scene?
4. What would the result of training on a variety of game engines and games with different mechanics and inputs be? The space of possible actions is limited by the available keys on a keyboard or buttons on a controller but the labelling of the characteristics of each game may prove a challenge if you wanted to be able to prompt for specific details.
We could have mods for old games that generate voices for the characters for example. Maybe it's unfeasible from a computing perspective? There are people running local LLMs, no?
A game engine lets you create a new game, not predict the next frame of an existing and copiously documented one.
This is not a game engine.
Creating a new good game? Good luck with that.
I'm convinced this is the code that gives Data (ST TNG) his dreaming capabilities.
This will also allow players to easily customize what they experience without changing the core game loop.
I was really entranced by how combat is rendered (the grunt doing weird stuff, very much in the style of the images the model generates). Now I'd like to see this implemented as a shader in a game.
Instead of working through a game, it’s building generic UI components and using common abstractions.
When things like DALL-E first came out, I was expecting something like the above to make it into mainstream games within a few years. But that was either too optimistic or I'm not up to speed on this sort of thing.
Wish there were thousands of hours of Hardcore Henry to train on. Maybe scrape GoPro war cams.
Of course, we're clearly looking at complete nonsense generated by something that does not understand what it is doing – yet, it is astonishingly sensible nonsense given the type of information it is working from. I had no idea the state of the art was capable of this.
It's not that hard to fake something like this: Just make a video of DOSBox with DOOM running inside of it, and then compress it with settings that will result in compression artifacts.