July 4th, 2024

Diffusion Forcing: Next-Token Prediction Meets Full-Sequence Diffusion

Diffusion Forcing combines full-sequence diffusion models and next-token models for generative modeling. It optimizes token likelihoods, excels in video prediction, stabilizes auto-regressive rollout, and enhances robustness in real-world applications.

Diffusion Forcing is introduced as a novel training paradigm that combines the strengths of full-sequence diffusion models and next-token models for sequence generative modeling. By training a diffusion model to denoise tokens with varying noise levels, Diffusion Forcing can generate future tokens without diffusing past ones, offering benefits such as variable-length generation and guiding sampling to desired trajectories. This approach optimizes a variational lower bound on token likelihoods and allows for flexible behaviors like stabilizing auto-regressive rollout and planning with causal uncertainty. The method excels in video prediction tasks, outperforming baselines in stability and consistency. It enables stable infinite rollout without a sliding window, showcasing its stabilization effect. Additionally, Diffusion Forcing can be used for diffusion planning and long-horizon imitation learning tasks, demonstrating success in non-Markovian scenarios where traditional techniques fail. The method's ability to handle noisy observations enhances robustness in real-world applications.
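The core training idea above, noising each token with its own independent noise level, can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the scalar tokens, the cosine schedule, and the names `cosine_alpha_bar`, `noise_sequence`, and `rollout_levels` are all assumptions made for the example.

```python
import math
import random


def cosine_alpha_bar(k, num_levels):
    """Fraction of signal kept at noise level k (0 = clean, num_levels = pure noise).

    The cosine schedule is an assumption; any monotone schedule would do here.
    """
    return math.cos(0.5 * math.pi * k / num_levels) ** 2


def noise_sequence(tokens, num_levels, rng):
    """Noise every token with its own independently sampled noise level.

    Returns the noisy tokens, the per-token levels, and the noise that a
    denoiser would be trained to predict from (noisy token, level, past tokens).
    """
    noisy, levels, targets = [], [], []
    for x in tokens:
        k = rng.randint(0, num_levels)      # independent level per token
        a = cosine_alpha_bar(k, num_levels)
        eps = rng.gauss(0.0, 1.0)           # standard Gaussian noise
        noisy.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * eps)
        levels.append(k)
        targets.append(eps)
    return noisy, levels, targets


def rollout_levels(num_past, num_future, num_levels):
    """Hypothetical sampling-time pattern: past tokens clean, future tokens pure noise."""
    return [0] * num_past + [num_levels] * num_future


if __name__ == "__main__":
    rng = random.Random(0)
    noisy, levels, targets = noise_sequence([1.0, -2.0, 3.0, 0.5], 10, rng)
    print(levels)
    print(rollout_levels(3, 2, 10))
```

Because a level of 0 leaves a token untouched and the maximum level replaces it with noise, the same per-token level vector can express "keep the past fixed, denoise only the future", which is how the summary's claim about generating future tokens without diffusing past ones can be read.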

Related

Diffusion Limited Aggregation (1991)

Diffusion Limited Aggregation (DLA) models randomly moving particles that aggregate into intricate structures resembling coral growth. Extending the model from 2D to 3D produces complex forms that resemble natural phenomena. Software for Mac OS X is available for high-quality rendering.
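A simplified 2D version of this process can be sketched as below. It is an illustrative variant under stated assumptions, not the linked article's code: particles start at random cells instead of on a distant launch circle, the grid wraps around at its edges, and the name `grow_dla` is hypothetical.

```python
import random


def grow_dla(size, num_particles, rng):
    """Grow a DLA cluster on a size x size wrap-around grid.

    Each particle performs a random walk and freezes as soon as it is
    4-adjacent to the cluster. Returns the set of occupied (x, y) cells.
    """
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    center = size // 2
    cluster = {(center, center)}  # single seed particle in the middle

    def touches_cluster(x, y):
        return any(((x + dx) % size, (y + dy) % size) in cluster
                   for dx, dy in steps)

    while len(cluster) < num_particles + 1:
        x, y = rng.randrange(size), rng.randrange(size)
        if (x, y) in cluster:
            continue  # launch cell already occupied, pick another
        while not touches_cluster(x, y):
            dx, dy = rng.choice(steps)
            x, y = (x + dx) % size, (y + dy) % size
        cluster.add((x, y))  # particle sticks next to the cluster
    return cluster


if __name__ == "__main__":
    cells = grow_dla(21, 30, random.Random(1))
    for row in range(21):
        print("".join("#" if (col, row) in cells else "." for col in range(21)))
```

The branching shape comes from the sticking rule alone: outer tips of the cluster are more likely to be reached by a wandering particle than sheltered cells, so the tips keep growing.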

The Magic of Participatory Randomness

Randomness is vital in cryptography, gaming, and civic processes. Techniques like "Finger Dice" enable fair outcomes through participatory randomness, ensuring transparency and trust in provably fair games.

How to generate realistic people in Stable Diffusion

The tutorial focuses on creating lifelike portrait images using Stable Diffusion. It covers prompts, lighting, facial details, blending faces, poses, and models like F222 and Hassan Blend 1.4 for realistic results. It also emphasizes clothing terms and model licenses.

Show HN: UNet diffusion model in pure CUDA

The GitHub repository describes optimizing a UNet diffusion model in C++/CUDA to match PyTorch's performance. It covers custom convolution kernels, forward pass improvements, backward pass challenges, and future optimization plans.

Fractional Brownian Motion (2019)

Fractional Brownian Motion (fBM) is crucial in computer graphics and terrain generation. It uses random increments over time to create self-similar paths, controlled by the Hurst Exponent (H) for smoothness and fractal dimension. Varying noise signals construct natural shapes, with H influencing self-similarity and G affecting amplitude decay. fBM efficiently models terrains and clouds, offering realistic computer-generated environments through parameter manipulation.
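The usual graphics construction of fBM sums several octaves of a noise signal, doubling the frequency and multiplying the amplitude by a gain G at each octave. The sketch below assumes the common choice G = 2^(-H) and uses a simple hash-based 1D value noise; the helper names `lattice_value`, `value_noise`, and `fbm` are hypothetical, and the hash constants are arbitrary.

```python
import math


def lattice_value(i, seed=0):
    """Deterministic pseudo-random value in [-1, 1] for integer lattice point i."""
    h = (i * 374761393 + seed * 668265263) & 0xFFFFFFFF
    h = ((h ^ (h >> 13)) * 1274126177) & 0xFFFFFFFF
    h ^= h >> 16
    return h / 0xFFFFFFFF * 2.0 - 1.0


def value_noise(x, seed=0):
    """Smoothly interpolate lattice values (1D value noise)."""
    i = math.floor(x)
    t = x - i
    t = t * t * (3.0 - 2.0 * t)  # smoothstep easing
    a = lattice_value(i, seed)
    b = lattice_value(i + 1, seed)
    return a + (b - a) * t


def fbm(x, hurst, octaves=6, seed=0):
    """Sum octaves of noise; amplitude decays by G = 2**(-hurst) per octave."""
    gain = 2.0 ** (-hurst)
    freq, amp, total = 1.0, 1.0, 0.0
    for octave in range(octaves):
        total += amp * value_noise(freq * x, seed + octave)
        freq *= 2.0   # each octave doubles the frequency
        amp *= gain   # and shrinks the amplitude by the gain
    return total


if __name__ == "__main__":
    for hurst in (0.0, 0.5, 1.0):
        print(hurst, [round(fbm(0.1 * n, hurst), 3) for n in range(5)])
```

With H close to 1 the high-frequency octaves contribute little and the curve is smooth; with H close to 0 every octave has similar amplitude and the result looks much rougher.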

7 comments
By @vessenes - 6 months
A number of ideas seem notable to me here; first, they are merging the idea of sequence masking (the key training idea for LLMs) with diffusion models; they do this by keeping track of an ‘uncertainty’ level per pixel. This ‘uncertainty’ level is treated as the ‘noise’ level for the diffusion model, (a model which denoises controlled by some sort of embedding).

There are a bunch of neat things you can do with this: in particular, you can firm up parts of the image earlier than others, and thus use it for, say maze solving. They even show it controlling a robot arm moving fruit around, which is pretty wild.

In a way the title undersells the idea - this is a way to do fractional masking, since the masking level is a float - and I think is really a pretty profound and interesting idea.

However, there’s a lot not talked about in this paper; I’d be very curious to see their codebase. How exactly do you set up a maze-following task vs a video extension task? How do you hook up a robot arm to this model, and tell the model what you want done? The architecture itself deserves a significant number of papers / explication.

By @luke-stanley - 6 months
Anyone know of research or tools for using an existing text-generating LLM with diffusion-like techniques with no new pre-training, or at most, a bit of fine-tuning, such that it works with a small GPT / Phi 3 / Qwen model, for example? I know about Tree of Thoughts with MCTS etc., which are somewhat similar (though often with a different learned reward goal), but I'm interested in something closer to token-level generation. Is this possible?
By @jimsimmons - 6 months
I work in the field and the work is presented in an extremely obtuse manner.

What is the problem you're trying to solve? Are you proposing a new generative model?

By @treprinum - 6 months
Russ is doing diffusion now? Must be very applicable to robotics.
By @blovescoffee - 6 months
Am I missing something about training time? Does adding per token noise cause training to slow significantly? Cool paper though!
By @omerhac - 6 months
Very cool, but why is it called diffusion forcing?