July 30th, 2024

Diffusion Training from Scratch on a Micro-Budget

The paper presents a cost-effective method for training text-to-image generative models by masking image patches and using synthetic images, achieving competitive performance at significantly lower costs.


The paper "Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget" by Vikash Sehwag and colleagues addresses the high computational cost of training large-scale text-to-image (T2I) generative models, a cost that currently favors well-resourced developers. The authors cut this cost substantially by randomly masking up to 75% of image patches during training. Their deferred masking technique preprocesses all patches with a lightweight patch-mixer before masking, which degrades performance far less than the traditional alternative of downscaling the model. They also leverage recent advances in transformer architecture, including mixture-of-experts layers, and show that synthetic images are effective training data on a micro-budget.

Using only 37 million publicly available real and synthetic images, the authors trained a 1.16 billion parameter sparse transformer for $1,890, reaching a Fréchet Inception Distance (FID) of 12.7 in zero-shot generation on the COCO dataset. That is roughly 1/118th the cost of Stable Diffusion models and 1/14th the cost of the current leading approach, which costs $28,400. The authors plan to release their end-to-end training pipeline to democratize large-scale diffusion model training for those with limited budgets.
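
The deferred-masking recipe can be sketched in a few lines. This is a minimal illustration of the idea as the abstract describes it, not the authors' released pipeline; the `PatchMixer` depth, the patch dimensions, and the 25% keep ratio are assumptions.

```python
import torch
import torch.nn as nn

class PatchMixer(nn.Module):
    """Lightweight transformer that sees all patches before any masking.
    A hypothetical stand-in for the paper's patch-mixer."""
    def __init__(self, dim: int, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.mixer(patches)        # (B, N, dim) -> (B, N, dim)

def deferred_masking(patches: torch.Tensor, mixer: PatchMixer,
                     keep_ratio: float = 0.25):
    """Mix all patch tokens first, then keep a random subset so the
    expensive diffusion backbone only processes ~25% of them."""
    B, N, D = patches.shape
    mixed = mixer(patches)
    n_keep = max(1, int(N * keep_ratio))
    # Per-sample random permutation; the first n_keep indices survive.
    idx = torch.rand(B, N, device=patches.device).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(mixed, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, idx                      # backbone trains on `kept` only
```

Because the mixer is far cheaper than the backbone, every patch still contributes some signal, while the backbone's compute scales only with the fraction of patches that is kept.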

7 comments
By @worstspotgain - 3 months
Asymptotic improvements are flattening the cost curves so fast that AI regulation might become practically meaningless by the end of the year. If you want unregulated output you'll have tons of offshore models to choose from.

The risk is that the good guys end up being the only ones hampered by it. Hopefully it won't be so large a burden that the bad guys and especially the so-so guys (those with a real chance, e.g. Alibaba) get a massive leg up.

By @Flux159 - 3 months
This kind of research is great for reducing training costs as well as enabling more people to experiment with training large models. Hopefully in 5-10 years we'll be able to train a model on par with SD 1.5 on consumer GPUs, since that would be great for teaching model development.

By @orbital-decay - 3 months
Reminds me of PixArt-α, which was also trained on a similarly tiny budget ($28,000). [0] How good is their result, though? Training a toy model is one thing; making something usable (let alone competitive) is another.

Edit: they do have comparisons in the paper, and PixArt-α seems to be... more coherent?

[0] https://pixart-alpha.github.io/

By @p1esk - 3 months
Interesting - they say using FP8 didn't provide any speedup.
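
For reference, "using FP8" in this context usually means running the large matmuls under an FP8 autocast, e.g. via NVIDIA's Transformer Engine. A minimal sketch of that generic setup, assuming a Hopper-class GPU and the `transformer_engine` package (this is the standard recipe, not the paper's specific configuration):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# FP8 GEMMs only beat BF16 when matmuls dominate; smaller layers stay
# memory-bound, which is one plausible reason no speedup was observed.
layer = te.Linear(4096, 4096, bias=True).cuda()
recipe = DelayedScaling(fp8_format=Format.HYBRID)

x = torch.randn(8, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
```
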
By @sorenjan - 3 months
One thing I've wondered about is fine-tuning a large model from multiple LoRAs. If the model doesn't fit in your VRAM, you can train a LoRA, apply it to the model, train another LoRA from the same data, apply it, and so on: iterative low-rank parameter updates. Would that work?
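
A rough sketch of the loop the comment describes, assuming standard LoRA mechanics: the merge is the usual W ← W + α·BA fold (as implemented in e.g. PEFT's `merge_and_unload`), and `train_lora` is a hypothetical placeholder for any LoRA training routine.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_lora(linear: nn.Linear, A: torch.Tensor, B: torch.Tensor,
               alpha: float = 1.0) -> None:
    """Fold a trained low-rank update into the frozen base weight:
    W <- W + alpha * (B @ A), with A: (r, in) and B: (out, r)."""
    linear.weight += alpha * (B @ A)

def iterative_lora_finetune(model: nn.Module, data, rounds: int = 3,
                            rank: int = 8) -> None:
    """Train a LoRA, merge it into the base weights, then repeat,
    so each round applies another rank-`rank` correction."""
    modules = dict(model.named_modules())
    for _ in range(rounds):
        # train_lora is a hypothetical helper returning {name: (A, B)}
        adapters = train_lora(model, data, rank=rank)
        for name, (A, B) in adapters.items():
            merge_lora(modules[name], A, B)
```

Mechanically this works, with one caveat: each merged update is itself rank-r, so k rounds add at most rank k·r to the total weight delta. Whether repeated retraining on the same data converges toward something like a full fine-tune is the open question.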