Diffusion Is Spectral Autoregression
Sander Dieleman's blog post explores the relationship between diffusion and autoregressive models in generative modeling, emphasizing their iterative refinement process and the impact of Gaussian noise on image spectra.
Sander Dieleman's blog post discusses the relationship between diffusion models and autoregressive models in generative modeling, particularly in the context of image generation. He argues that diffusion models can be viewed as performing approximate autoregression in the frequency domain. The post highlights the iterative refinement process common to both paradigms, where complex data generation tasks are broken down into simpler subtasks. Dieleman explains how diffusion models generate images in a coarse-to-fine manner, which can be analyzed using spectral analysis. By applying the Fourier transform, he illustrates how image spectra reveal the distribution of spatial frequencies, showing that natural images often follow a power law in their spectral power density. The post also examines the impact of Gaussian noise on image spectra, demonstrating how noise alters the frequency representation of images. Dieleman concludes that understanding these connections can enhance the application of generative models across various domains, including language and image processing.
- Diffusion models and autoregressive models share a common iterative refinement approach in generative modeling.
- Diffusion models generate images in a coarse-to-fine manner, which can be analyzed through spectral analysis.
- Natural image spectra typically follow a power law, indicating a relationship between frequency and power.
- The addition of Gaussian noise affects the frequency representation of images, highlighting the interplay between noise and image structure.
- Understanding the connections between these modeling paradigms can improve generative modeling applications across different fields.
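As a concrete illustration of the spectral analysis the post describes, here is a minimal sketch of computing an image's radially averaged power spectrum with NumPy and checking the power-law claim. The function name, binning scheme, and parameters are illustrative assumptions, not code from the post.

```python
import numpy as np

def radial_power_spectrum(image: np.ndarray, n_bins: int = 64):
    """Radially averaged power spectral density of a 2D grayscale image."""
    h, w = image.shape
    # Power spectrum: squared magnitude of the centred 2D FFT.
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    # Radial frequency of every FFT bin, measured from DC at the centre.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    # Average the power within concentric rings of frequency.
    edges = np.linspace(0.0, radius.max(), n_bins + 1)
    which = np.clip(np.digitize(radius.ravel(), edges) - 1, 0, n_bins - 1)
    totals = np.bincount(which, weights=power.ravel(), minlength=n_bins)
    counts = np.maximum(np.bincount(which, minlength=n_bins), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, totals / counts

# For natural images, a log-log plot of PSD vs. frequency is close to a
# straight line, i.e. power ~ 1 / frequency**alpha with alpha around 2.
```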
The lower frequencies (roughly below 4 kHz) are created by the vocal cords opening and closing at the fundamental frequency, plus harmonics of that fundamental (e.g. a 100 Hz fundamental with 200/300/400 Hz harmonics), with this spectrum then modulated by the resonances of the vocal tract, which change during pronunciation. What we perceive as speech is primarily the changes to these resonances (aka formants) due to articulation/pronunciation.
The higher frequencies present in speech mostly come from "white noise" created by the turbulence of forcing air out through closed teeth etc. (e.g. an "S" sound), and our perception of these "fricative" speech sounds is based on the onset/offset of energy in these higher 4–8 kHz frequencies. Frequencies above 8 kHz are not very perceptually relevant and may be filtered out (e.g. they are not present in analog telephone speech).
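As a rough illustration of the band split this comment describes, here is a sketch of measuring the energy in the "voiced" (< 4 kHz) and "fricative" (4–8 kHz) bands of a short audio frame. The 16 kHz sample rate, the windowing, and the function name are assumptions for illustration only.

```python
import numpy as np

def band_energies(frame: np.ndarray, sample_rate: int = 16_000):
    """Split a short audio frame's energy into a 'voiced' band (< 4 kHz)
    and a 'fricative' band (4-8 kHz), per the rough division above."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))   # windowed FFT
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    power = np.abs(spectrum) ** 2
    low = power[freqs < 4_000].sum()                          # fundamental, harmonics, formants
    high = power[(freqs >= 4_000) & (freqs < 8_000)].sum()    # fricative noise
    return low, high

# A rising high/low ratio across frames is a crude cue for sounds like "s".
```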
Joseph Fourier's solution to the heat equation (linear diffusion) was in fact the origin of the FT. The high-frequency coefficients decay exponentially there (as e^{-k^2 t}, IIRC); the reverse is also known to be "unstable" (numerically, and singular starting from equilibrium).
Moreover, the reformulation doesn't immediately reveal any computational speedup, or a better alternative formulation (which is usually a measure of how epistemically valuable it is).
(Edit: note that the heat equation is more akin to the Fokker-Planck equation, not the actual diffusion SDE as used in diffusion models.)
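For reference, the decay being recalled here: separating the 1D heat equation $u_t = \kappa\, u_{xx}$ into Fourier modes gives

$$\hat{u}_k(t) = \hat{u}_k(0)\, e^{-\kappa k^2 t},$$

so high frequencies (large $k$) are damped fastest, and reversing time multiplies each mode by $e^{+\kappa k^2 t}$, which is why the backward problem is numerically unstable.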
Images have a large near-DC component (solid colors) and useful spatial-domain properties, while human hearing starts at ~20 Hz and the frequencies needed to understand speech range from roughly 300 Hz to 4 kHz (spitballing based on the bandwidth of analog phones).
What would happen if you built a diffusion model using pink noise to corrupt all coefficients simultaneously? Alternatively, what if you used something other than noise (like a direct blur) for the model to reverse?
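Purely as a sketch of the first idea (not something from the post): pink — or more generally "colored" — noise can be produced by shaping white Gaussian noise so its power spectrum falls off as 1/f^alpha, and then substituted for white noise in the corruption step. The function name and the corruption line are hypothetical.

```python
import numpy as np

def colored_noise_like(shape, alpha=1.0, rng=None):
    """Gaussian noise with power spectrum ~ 1/f^alpha.
    alpha=0 is white, alpha=1 is classic pink, alpha~2 roughly matches
    the power-law spectra of natural images discussed in the post."""
    rng = rng or np.random.default_rng()
    h, w = shape
    white = rng.standard_normal((h, w))
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    f = np.sqrt(fy ** 2 + fx ** 2)
    f[0, 0] = 1.0                              # avoid division by zero at DC
    shaped = np.fft.ifft2(np.fft.fft2(white) / f ** (alpha / 2)).real
    return shaped / shaped.std()               # rescale to unit variance

# Hypothetical corruption step at noise level sigma:
# noisy = clean_image + sigma * colored_noise_like(clean_image.shape, alpha=1.0)
```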
As such it seems the statement is that stable diffusion is like an autoregressive model which predicts the next set of higher-order FT coefficients from the lower-order ones.
Seems like this is something one could do with a "regular" autoregressive model; has this been tried? It seems obvious, so I assume so, but I'm curious how it compares.
Also, you would probably need to enforce some kind of frequency cutoff when generating the high frequencies, so that later steps in the process don't destroy low-frequency details.
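A toy sketch of the coarse-to-fine scheme these comments describe — the sequence of band-limited targets (increasing frequency cutoffs) that such a "spectral autoregression" would predict step by step, with earlier bands held fixed as the frequency-cutoff suggestion above implies. The cutoff schedule and names are made up for illustration.

```python
import numpy as np

def band_limited(image: np.ndarray, cutoff: float) -> np.ndarray:
    """Zero out all Fourier coefficients above a normalized radial frequency cutoff."""
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    mask = np.sqrt(fy ** 2 + fx ** 2) <= cutoff
    return np.fft.ifft2(np.fft.fft2(image) * mask).real

def coarse_to_fine_targets(image: np.ndarray, n_steps: int = 8):
    """Sequence of increasingly detailed versions; step t would be predicted
    from step t-1 by a model that only adds the next band of frequencies."""
    cutoffs = np.linspace(0.05, 0.5, n_steps)   # 0.5 = Nyquist, i.e. full detail
    return [band_limited(image, c) for c in cutoffs]
```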
I'm not really sure how current video generating models work, but maybe we could get some insight into them by looking at how current audio models work?
I think we are looking at an autoregression of autoregressions of sorts, where each PSD + phase is used to output the next, right? Probably with different-sized windows of persistence as "tokens". But I'm way out of my depth here!
But another issue not mentioned in the article is that in images we can zoom in/out arbitrarily, so the width of a pixel can change – it might be 1mm in one image, or 1cm in another, or 1m or 1km. Whereas in audio, the "width of a pixel" (the time between two audio samples) is a fixed amount of time – usually 1/44.1 kHz, but even if a recording is at a different sample rate, we would convert all clips to the same rate before training an NN. The equivalent for images would be rescaling every image so that a picture of a cat is, say, 100x100 pixels, while a picture of a tiger is 300x300.
Which, come to think of it, would be potentially an interesting thing to do.
Huh. Does this mean that pink noise would be a better prior for diffusion models than white Gaussian noise, since the denoiser wouldn't need to learn to adjust the overall distribution? Or is this shift not actually hard to learn at the scale of a training run?
> basically an approximate version of the Fourier transform!
You should take a step back and ask “am I actually muddying the water right now?”
Related
How to generate realistic people in Stable Diffusion
The tutorial focuses on creating lifelike portrait images using Stable Diffusion. It covers prompts, lighting, facial details, blending faces, poses, and models like F222 and Hassan Blend 1.4 for realistic results. Emphasis on clothing terms and model licenses is highlighted.
Diffusion Forcing: Next-Token Prediction Meets Full-Sequence Diffusion
Diffusion Forcing combines full-sequence diffusion models and next-token models for generative modeling. It optimizes token likelihoods, excels in video prediction, stabilizes auto-regressive rollout, and enhances robustness in real-world applications.
Diffusion Texture Painting
Researchers introduce Diffusion Texture Painting, a method using generative models for interactive texture painting on 3D meshes. Artists can paint with complex textures and transition seamlessly. The innovative approach aims to inspire generative model exploration.
AI models collapse when trained on recursively generated data
Recent research in Nature reveals "model collapse" in AI, where training on data from previous models leads to irreversible defects and misrepresentation of original data, emphasizing the need for genuine human-generated data.
Diffusion Training from Scratch on a Micro-Budget
The paper presents a cost-effective method for training text-to-image generative models by masking image patches and using synthetic images, achieving competitive performance at significantly lower costs.