September 3rd, 2024

Diffusion Is Spectral Autoregression

Sander Dieleman's blog post explores the relationship between diffusion and autoregressive models in generative modeling, emphasizing their iterative refinement process and the impact of Gaussian noise on image spectra.

Sander Dieleman's blog post discusses the relationship between diffusion models and autoregressive models in generative modeling, particularly in the context of image generation. He argues that diffusion models can be viewed as performing approximate autoregression in the frequency domain. The post highlights the iterative refinement process common to both paradigms, where complex data generation tasks are broken down into simpler subtasks. Dieleman explains how diffusion models generate images in a coarse-to-fine manner, which can be analyzed using spectral analysis. By applying the Fourier transform, he illustrates how image spectra reveal the distribution of spatial frequencies, showing that natural images often follow a power law in their spectral power density. The post also examines the impact of Gaussian noise on image spectra, demonstrating how noise alters the frequency representation of images. Dieleman concludes that understanding these connections can enhance the application of generative models across various domains, including language and image processing.

- Diffusion models and autoregressive models share a common iterative refinement approach in generative modeling.

- Diffusion models generate images in a coarse-to-fine manner, which can be analyzed through spectral analysis.

- Natural image spectra typically follow a power law, indicating a relationship between frequency and power.

- The addition of Gaussian noise affects the frequency representation of images, highlighting the interplay between noise and image structure.

- Understanding the connections between these modeling paradigms can improve generative modeling applications across different fields.
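
As a concrete illustration of the spectral analysis described above (a minimal sketch of my own, not code from the post), the radially averaged power spectral density (RAPSD) of an image can be computed with a 2D FFT; for natural images the curve is roughly a straight, downward-sloping line on a log-log plot, while for Gaussian noise it is roughly flat:

```python
import numpy as np

def rapsd(image):
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2       # 2D power spectrum
    h, w = image.shape
    yy, xx = np.indices((h, w))
    r = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2).astype(int)  # radial frequency bin per pixel
    radial_sum = np.bincount(r.ravel(), weights=spectrum.ravel())
    counts = np.bincount(r.ravel())
    counts[counts == 0] = 1                                           # guard against empty bins
    return radial_sum / counts                                        # average power per radius

noise = np.random.default_rng(0).standard_normal((256, 256))
print(rapsd(noise)[:5])   # roughly constant: Gaussian noise has a flat spectrum
```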

13 comments
By @HarHarVeryFunny - 7 months
The high and low frequency components of speech are produced and perceived in different ways.

The lower frequencies (roughly below 4 kHz) are created by the vocal cords opening and closing at the fundamental frequency, plus harmonics of that fundamental (e.g. a 100 Hz fundamental with 200/300/400 Hz harmonics), with this frequency spectrum then being modulated by the resonances of the vocal tract, which change during pronunciation. What we perceive as speech is primarily the changes to these resonances (aka formants) due to articulation/pronunciation.

The higher frequencies present in speech mostly come from the "white noise" created by the turbulence of forcing air out through closed teeth etc. (e.g. the "S" sound), and our perception of these "fricative" speech sounds is based on the onset/offset of energy in these higher 4-8 kHz frequencies. Frequencies above 8 kHz are not very perceptually relevant and may be filtered out (e.g. they are not present in analog telephone speech).
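
A minimal sketch (my own illustration, not from this comment) of how one might separate those two bands with a spectrogram, so the onset/offset of fricative energy described above becomes visible per time frame:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                               # assume a 16 kHz speech recording
x = np.random.default_rng(0).standard_normal(fs)         # placeholder; substitute a real speech waveform
f, t, Sxx = spectrogram(x, fs=fs, nperseg=512)
voiced = Sxx[f < 4000].sum(axis=0)                       # energy from vocal-cord harmonics and formants
fricative = Sxx[(f >= 4000) & (f <= 8000)].sum(axis=0)   # high-frequency "hiss" energy
print(voiced.shape, fricative.shape)                     # one energy value per time frame
```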

By @thho23i4234343 - 7 months
I don't mean to be mean, but: what is surprising about any of this?

Joseph Fourier's solution to the heat equation (linear diffusion) was in fact the origin of the FT. The high-frequency coefficients decay exponentially there (as e^(-k^2 t), IIRC); the reverse is also known to be "unstable" (numerically, and is singular at equilibrium).

Moreover, the reformulation doesn't immediately reveal some computational speedup, or a better alternative formulation (which is usually a measure of how epistemically valuable it is).

(Edit: note that the heat equation is more akin to the Fokker-Planck equation, not the actual diffusion SDE used in diffusion models.)

By @nyanpasu64 - 7 months
> I won’t speculate about why images exhibit this behaviour and sound seemingly doesn’t, but it is certainly interesting (feel free to speculate away in the comments!).

Images have a large near-DC component (solid colors) and useful time-domain properties, while human hearing starts at ~20 Hz and the frequencies needed to understand speech range from roughly 300 Hz to 4 kHz (spitballing based on the bandwidth of analog phones).

What would happen if you built a diffusion model using pink noise to corrupt all coefficients simultaneously? Alternatively what if you used something other than noise (like a direct blur) for the model to reverse?
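
A minimal sketch (my own, and purely illustrative of the speculation above) of generating 2D pink noise by shaping white Gaussian noise with a ~1/f amplitude falloff in the frequency domain; such noise could, in principle, replace the usual Gaussian corruption:

```python
import numpy as np

def pink_noise_2d(h, w, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((h, w))
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = radius[0, 1]                      # avoid dividing by zero at the DC bin
    shaped = np.fft.fft2(white) / radius ** alpha    # impose a ~1/f^alpha amplitude falloff
    noise = np.real(np.fft.ifft2(shaped))
    return (noise - noise.mean()) / noise.std()      # normalise to zero mean, unit variance

image = np.zeros((64, 64))                           # placeholder; substitute a real image
sigma = 0.5                                          # hypothetical noise level
corrupted = image + sigma * pink_noise_2d(*image.shape)
```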

By @magicalhippo - 7 months
Not my area, enjoyed the read. It reminded me of how you can decode a scaled-down version of a JPEG image by simply ignoring the higher-order DCT coefficients.
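
A rough sketch (my own, not from the comment) of that JPEG-style trick: keeping only the low-order DCT coefficients of an 8x8 block and inverting them yields, up to a scale factor, a half-resolution version of the block:

```python
import numpy as np
from scipy.fft import dctn, idctn

block = np.arange(64, dtype=float).reshape(8, 8)   # stand-in for an 8x8 image block
coeffs = dctn(block, norm='ortho')                 # full 2D DCT of the block
low = coeffs[:4, :4]                               # keep only the lowest-frequency 4x4 corner
thumb = idctn(low, norm='ortho')                   # approximate 4x4 thumbnail (up to scaling)
print(thumb.shape)                                 # (4, 4)
```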

As such it seems the statement is that stable diffusion is like an autoregressive model which predicts the next set of higher-order FT coefficients from the lower-order ones.

Seems like this is something one could do with a "regular" autoregressive model; has this been tried? It seems obvious enough that I assume so, but I'm curious how it compares.

By @andersbthuesen - 7 months
This post reminded me of a conversation I had with my cousins about language and learning. It’s interesting how (most?) languages seem inherently sequential, while ideas and knowledge tend to have a more hierarchical structure, with a “base frequency” communicating the basic idea and higher frequency overtones adding the nuances. I wonder what implications this might have in teaching current LLMs to reason?
By @WithinReason - 7 months
To me this means that you could significantly speed up image generation by using a lower resolution at the beginning of the generation process and gradually transitioning to higher resolutions. This would also help with the attention mechanism not getting overwhelmed when generating a high resolution image from scratch.

Also, you should probably enforce some kind of frequency cutoff when generating the high frequencies later in the process, so that you don't destroy the low-frequency detail established earlier.

By @nowayno583 - 7 months
Intuitively, audio is way more sensitive to phase and persistence because of the time domain. So maybe audio models look more like video models instead of image models?

I'm not really sure how current video generating models work, but maybe we could get some insight into them by looking at how current audio models work?

I think we are looking at an autoregression of autoregressions of sorts, where each PSD + phase is used to output the next, right? Probably with different-sized windows of persistence as "tokens". But I'm way out of my depth here!

By @shaunregenbaum - 7 months
This was a fascinating read. I wonder if anyone has done an analysis on the FT structures of various types of data from molecular structures to time series data. Are all domains different, or do they share patterns?
By @jmmcd - 7 months
I was struck by the comparison between audio spectra and image spectra. Image spectra have a strong power law effect, but audio spectra have more power in middle bands. Why? One part of the issue is that the visual spectrum is very narrow (just 1 order of magnitude from red to blue) compared to audio (4 orders of magnitude from 20Hz to 20kHz).

But another issue not mentioned in the article is that in images we can zoom in/out arbitrarily. So the width of a pixel can change – it might be 1mm in one image, or 1cm in another, or 1m or 1km. Whereas in audio, the “width of a pixel” (the time between two audio samples) is a fixed amount of time – usually 1/44.1kHz, but even if a recording is at a different sample rate, we would convert all audio to the same sample rate before training an NN. The equivalent of this for images would be rescaling all images so that a picture of a cat is, say, 100x100 pixels, while a picture of a tiger is 300x300.

Which, come to think of it, would be potentially an interesting thing to do.

By @theptip - 7 months
> The RAPSD of Gaussian noise is also a straight line on a log-log plot; but a horizontal one, rather than one that slopes down. This reflects the fact that Gaussian noise contains all frequencies in equal measure

Huh. Does this mean that pink noise would be a better prior for diffusion models than Gaussian noise, as your denoiser doesn’t need to learn to adjust the overall distribution? Or is this shift in practice not a hard thing to learn in the scale of a training run?

By @catgary - 7 months
I feel like Song et al. characterized diffusion models as SDEs pretty unambiguously, and that formulation connects to Optimal Transport in an equally unambiguous manner. I understand the desire to give different perspectives, but once you start using multiple hedge words/qualifiers like:

> basically an approximate version of the Fourier transform!

You should take a step back and ask “am I actually muddying the water right now?”

By @slashdave - 7 months
This has little to do with diffusion. The aspects described relate to images (and sound) and are true for VAE models, for example. I mean, what else is a UNet?
By @theo1996 - 7 months
Well, yes: econometrics and time-series analysis had already described all the methods and functions behind """AI""", but marketing idiots decided to create new names for 30-year-old knowledge.