Diffusion Training from Scratch on a Micro-Budget
The paper presents a cost-effective method for training text-to-image generative models by masking image patches and using synthetic images, achieving competitive performance at significantly lower costs.
The paper titled "Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget" by Vikash Sehwag and colleagues addresses the high computational cost of training large-scale text-to-image (T2I) generative models, a cost that favors well-resourced developers. The authors reduce this cost substantially by randomly masking up to 75% of image patches during training. Their deferred masking technique first preprocesses all patches with a lightweight patch-mixer before masking, which minimizes the performance degradation seen with traditional model downscaling. They also leverage recent advances in transformer architecture, including mixture-of-experts layers, and demonstrate that synthetic images are effective training data on a micro-budget. Using only 37 million publicly available real and synthetic images, they trained a 1.16 billion parameter sparse transformer for just $1,890, achieving a Fréchet Inception Distance (FID) of 12.7 in zero-shot generation on the COCO dataset. This is 118 times cheaper than Stable Diffusion and 14 times cheaper than the current leading approach, which costs $28,400. The authors plan to release their end-to-end training pipeline to democratize access to large-scale diffusion model training for those with limited budgets.
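To make the deferred-masking idea concrete, here is a minimal PyTorch sketch. Since the authors' pipeline has not yet been released, the names (`PatchMixer`, `deferred_masking_step`) and the random-subset selection below are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class PatchMixer(nn.Module):
    """Lightweight transformer that processes ALL patches before any masking."""
    def __init__(self, dim, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                       # x: (B, N, dim) patch embeddings
        return self.blocks(x)

def deferred_masking_step(patches, patch_mixer, backbone, mask_ratio=0.75):
    """Mix all patches cheaply, then drop most of them before the expensive
    backbone runs, so heavy compute scales with the kept fraction only."""
    B, N, D = patches.shape
    mixed = patch_mixer(patches)                # every patch sees global context

    keep = max(1, int(N * (1.0 - mask_ratio)))  # e.g. 25% of patches survive
    scores = torch.rand(B, N, device=patches.device)
    keep_idx = scores.argsort(dim=1)[:, :keep]  # random subset per sample

    visible = torch.gather(
        mixed, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return backbone(visible), keep_idx          # loss on retained patches only

# Toy usage: nn.Identity() stands in for the large diffusion transformer.
mixer = PatchMixer(dim=768)
x = torch.randn(2, 256, 768)
out, idx = deferred_masking_step(x, mixer, nn.Identity())  # out: (2, 64, 768)
```

The key contrast with naive masking is that masked patches still contribute semantic context through the patch-mixer before being dropped, which is why the paper reports far smaller quality loss at a 75% mask ratio than simply shrinking the model or masking at the input.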
Related
Diffusion Forcing: Next-Token Prediction Meets Full-Sequence Diffusion
Diffusion Forcing combines full-sequence diffusion models and next-token models for generative modeling. It optimizes token likelihoods, excels in video prediction, stabilizes auto-regressive rollout, and enhances robustness in real-world applications.
Image Self Supervised Learning on a Shoestring
I-JEPA, a cost-effective self-supervised approach in machine learning, trains image encoders by predicting representations of missing image regions internally. Released on GitHub, it produces useful image embeddings while reducing computational demands for researchers.
AuraFlow v0.1: an open-source alternative to Stable Diffusion 3
AuraFlow v0.1 is an open-source large rectified flow model for text-to-image generation. Developed to boost transparency and collaboration in AI, it optimizes training efficiency and achieves notable advancements.
Transformer Layers as Painters
The study "Transformer Layers as Painters" by Qi Sun et al. delves into transformer models, showcasing layer impact variations and potential for model optimization through strategic layer adjustments.
Diffusion Texture Painting
Researchers introduce Diffusion Texture Painting, a method using generative models for interactive texture painting on 3D meshes. Artists can paint with complex textures and transition between them seamlessly; the authors hope the approach inspires further exploration of generative models.
The risk is that the good guys end up being the only ones hampered by it. Hopefully it won't be so large a burden that the bad guys and especially the so-so guys (those with a real chance, e.g. Alibaba) get a massive leg up.
Edit: they do have comparisons in the paper, and PixArt-α seems to be... more coherent?