July 30th, 2024

Diffusion Training from Scratch on a Micro-Budget

The paper presents a cost-effective method for training text-to-image generative models by masking image patches and using synthetic images, achieving competitive performance at significantly lower costs.


The paper "Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget" by Vikash Sehwag and colleagues addresses the high computational cost of training large-scale text-to-image (T2I) generative models, a cost that currently favors well-resourced developers. The authors cut this cost substantially by randomly masking up to 75% of image patches during training. Their deferred masking technique preprocesses all patches with a lightweight patch-mixer before masking, which degrades performance far less than the traditional alternative of downscaling the model. They also leverage recent advances in transformer architecture, including mixture-of-experts layers, and show that synthetic images are effective training data on a micro-budget.

Using only 37 million publicly available real and synthetic images, the authors trained a 1.16 billion parameter sparse transformer for $1,890, reaching a Fréchet Inception Distance (FID) of 12.7 in zero-shot generation on the COCO dataset. That is roughly 1/118th the cost of Stable Diffusion models and 1/14th the cost of the current leading approach, which costs $28,400. The authors plan to release their end-to-end training pipeline to democratize large-scale diffusion model training for those with limited budgets.
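
The deferred-masking recipe can be sketched in a few lines. This is a minimal illustration of the idea as the abstract describes it, not the authors' released pipeline; the `PatchMixer` depth, the patch dimensions, and the 25% keep ratio are assumptions.

```python
import torch
import torch.nn as nn

class PatchMixer(nn.Module):
    """Lightweight transformer that sees all patches before any masking.
    A hypothetical stand-in for the paper's patch-mixer."""
    def __init__(self, dim: int, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.mixer(patches)        # (B, N, dim) -> (B, N, dim)

def deferred_masking(patches: torch.Tensor, mixer: PatchMixer,
                     keep_ratio: float = 0.25):
    """Mix all patch tokens first, then keep a random subset so the
    expensive diffusion backbone only processes ~25% of them."""
    B, N, D = patches.shape
    mixed = mixer(patches)
    n_keep = max(1, int(N * keep_ratio))
    # Per-sample random permutation; the first n_keep indices survive.
    idx = torch.rand(B, N, device=patches.device).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(mixed, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, idx                      # backbone trains on `kept` only
```

Because the mixer is far cheaper than the backbone, every patch still contributes some signal, while the backbone's compute scales only with the fraction of patches that is kept.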

7 comments
By @worstspotgain - 3 months
Asymptotic improvements are flattening the cost curves so fast that AI regulation might become practically meaningless by the end of the year. If you want unregulated output you'll have tons of offshore models to choose from.

The risk is that the good guys end up being the only ones hampered by it. Hopefully it won't be so large a burden that the bad guys and especially the so-so guys (those with a real chance, e.g. Alibaba) get a massive leg up.

By @Flux159 - 3 months
This kind of research is great for reducing training costs as well as enabling more people to experiment with training large models. Hopefully in 5-10 years we'll be able to train a model on par with SD 1.5 on consumer GPUs, since that would be great for teaching model development.

By @orbital-decay - 3 months
Reminds me of PixArt-α, which was also trained on a similarly tiny budget ($28,000). [0] How good is their result, though? Training a toy model is one thing; making something usable (let alone competitive) is another.

Edit: they do have comparisons in the paper, and PixArt-α seems to be... more coherent?

[0] https://pixart-alpha.github.io/

By @p1esk - 3 months
Interesting - they say using FP8 didn't provide any speedup.
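
For reference, "using FP8" in this context usually means running the large matmuls under an FP8 autocast, e.g. via NVIDIA's Transformer Engine. A minimal sketch of that generic setup, assuming a Hopper-class GPU and the `transformer_engine` package (this is the standard recipe, not the paper's specific configuration):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# FP8 GEMMs only beat BF16 when matmuls dominate; smaller layers stay
# memory-bound, which is one plausible reason no speedup was observed.
layer = te.Linear(4096, 4096, bias=True).cuda()
recipe = DelayedScaling(fp8_format=Format.HYBRID)

x = torch.randn(8, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
```
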
By @sorenjan - 3 months
One thing I've wondered about is fine-tuning a large model from multiple LoRAs. If the model doesn't fit in your VRAM, you can train a LoRA, apply it to the model, train another LoRA from the same data, apply it, and so on: iterative low-rank parameter updates. Would that work?
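
A rough sketch of the loop the comment describes, assuming standard LoRA mechanics: the merge is the usual W ← W + α·BA fold (as implemented in e.g. PEFT's `merge_and_unload`), and `train_lora` is a hypothetical placeholder for any LoRA training routine.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_lora(linear: nn.Linear, A: torch.Tensor, B: torch.Tensor,
               alpha: float = 1.0) -> None:
    """Fold a trained low-rank update into the frozen base weight:
    W <- W + alpha * (B @ A), with A: (r, in) and B: (out, r)."""
    linear.weight += alpha * (B @ A)

def iterative_lora_finetune(model: nn.Module, data, rounds: int = 3,
                            rank: int = 8) -> None:
    """Train a LoRA, merge it into the base weights, then repeat,
    so each round applies another rank-`rank` correction."""
    modules = dict(model.named_modules())
    for _ in range(rounds):
        # train_lora is a hypothetical helper returning {name: (A, B)}
        adapters = train_lora(model, data, rank=rank)
        for name, (A, B) in adapters.items():
            merge_lora(modules[name], A, B)
```

Mechanically this works, with one caveat: each merged update is itself rank-r, so k rounds add at most rank k·r to the total weight delta. Whether repeated retraining on the same data converges toward something like a full fine-tune is the open question.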