The open weight Flux text to image model is next level
Black Forest Labs has launched Flux, the largest open-weight text-to-image model to date at 12 billion parameters, available in three versions. It features enhanced image quality and speed, alongside the release of AuraSR V2.
Black Forest Labs has announced Flux, the largest open-weight text-to-image model to date, featuring 12 billion parameters. The model aims to enhance creativity and performance, producing images with aesthetics comparable to Midjourney. Flux is available in three variants: FLUX.1 [dev], an open-weight base model under a non-commercial license; FLUX.1 [schnell], a distilled version that runs up to ten times faster and is licensed under Apache 2.0; and FLUX.1 [pro], a closed-source version accessible only through an API. A new inference engine allows the Flux models to run up to twice as fast as previous methods while maintaining high-quality outputs. Key features of Flux include enhanced image quality, improved human anatomy and photorealism, stronger prompt adherence, and exceptional speed, particularly with the schnell variant. Users are encouraged to explore Flux through the provided playgrounds and API documentation. The announcement also mentions the release of AuraSR V2, an upgraded version of a single-step GAN upscaler, following the positive community response to its predecessor. AuraSR is based on Adobe's GigaGAN paper and is designed to substantially upscale low-resolution images. The overall message emphasizes the ongoing development and potential of open-weight AI models, countering claims that such initiatives are declining.
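For readers wanting to try the open-weight variants locally, the following is a minimal sketch using Hugging Face diffusers. The repo ids, the `FluxPipeline` class, and the per-variant defaults (schnell is step-distilled, so very few steps and no classifier-free guidance; dev is guidance-distilled, so a modest guidance scale with more steps) are assumptions based on common diffusers conventions, not details confirmed by the announcement.

```python
import os

def sampler_settings(variant: str) -> dict:
    """Reasonable starting settings per open-weight variant (assumed defaults).

    schnell is step-distilled, so it needs only a handful of steps and no
    classifier-free guidance; dev is guidance-distilled and uses a modest
    guidance scale with more steps.
    """
    if variant == "schnell":
        return {"num_inference_steps": 4, "guidance_scale": 0.0}
    if variant == "dev":
        return {"num_inference_steps": 28, "guidance_scale": 3.5}
    raise ValueError(f"unknown open-weight variant: {variant!r}")

# Guarded behind an env flag: actually running this needs a large GPU
# and a multi-gigabyte model download.
if os.environ.get("RUN_FLUX_DEMO"):
    import torch
    from diffusers import FluxPipeline  # assumed pipeline class

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell",  # assumed repo id
        torch_dtype=torch.bfloat16,
    )
    pipe.enable_model_cpu_offload()  # helps fit the 12B model in less VRAM
    image = pipe(
        "a photo of a forest clearing at dawn",
        **sampler_settings("schnell"),
    ).images[0]
    image.save("flux_schnell.png")
```

The settings helper is the portable part; swap the repo id and defaults against the official model card before relying on them.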
Related
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision
A new attention mechanism, FlashAttention-3, boosts Transformer speed and accuracy on Hopper GPUs by up to 75%. Leveraging asynchrony and low-precision computing, it achieves 1.5-2x faster processing, utilizing FP8 for quicker computations and reduced costs. FlashAttention-3 optimizes for new hardware features, enhancing efficiency and AI capabilities. Integration into PyTorch is planned.
AuraFlow v0.1: an open-source alternative to Stable Diffusion 3
AuraFlow v0.1 is an open-source large rectified flow model for text-to-image generation. Developed to boost transparency and collaboration in AI, it optimizes training efficiency and achieves notable advancements.
Meta releases an open-weights GPT-4-level AI model, Llama 3.1 405B
Meta has launched Llama 3.1 405B, a free AI language model with 405 billion parameters, challenging closed AI models. Users can download it for personal use, promoting open-source AI principles. Mark Zuckerberg endorses this move.
Diffusion Training from Scratch on a Micro-Budget
The paper presents a cost-effective method for training text-to-image generative models by masking image patches and using synthetic images, achieving competitive performance at significantly lower costs.
Stable Fast 3D: Rapid 3D Asset Generation from Single Images
Stability AI launched Stable Fast 3D, a model generating high-quality 3D assets from images in 0.5 seconds, suitable for various industries. It offers rapid prototyping and is available on Hugging Face.
- Users express mixed feelings about the model's ability to generate complex prompts, with some noting failures in accurately depicting spatial relationships and specific scenarios.
- Many users praise the image quality and speed of Flux, highlighting its potential for local use and comparisons to other models like DALL-E 3 and MidJourney.
- Concerns are raised about the model's licensing and the sustainability of open-source projects, with some questioning the business model behind such offerings.
- Several comments focus on the technical aspects, including the model's performance on different hardware and its ability to handle text rendering.
- Users share links to try the model and discuss their experiences, indicating a strong interest in exploring its capabilities further.
What we did at fal is take the model and run it on our inference engine, optimized to run these kinds of models really, really fast. Feel free to give it a shot on the playgrounds: https://fal.ai/models/fal-ai/flux/dev
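For the API route the commenter alludes to, here is a hedged sketch of calling fal's hosted Flux endpoint from Python. The `fal_client.subscribe` call shape and the argument names (`prompt`, `num_inference_steps`, `seed`) are assumptions based on fal's client conventions; check their API docs before use.

```python
import os

def build_flux_arguments(prompt: str, steps: int = 4, seed=None) -> dict:
    """Assemble the request payload for a Flux generation call.

    Field names are assumed from fal's conventions, not verified.
    """
    if not prompt.strip():
        raise ValueError("prompt must be non-empty")
    args = {"prompt": prompt, "num_inference_steps": steps}
    if seed is not None:
        args["seed"] = seed  # fixed seed for reproducible comparisons
    return args

# Guarded behind an env flag: the real call needs a fal API key (FAL_KEY).
if os.environ.get("RUN_FAL_DEMO"):
    import fal_client

    result = fal_client.subscribe(
        "fal-ai/flux/schnell",
        arguments=build_flux_arguments("a horse sitting on a dog", seed=42),
    )
    print(result["images"][0]["url"])
```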
It is very fast and very good at rendering text, and appears to have a text encoder such that the model can handle both text and positioning much better: https://x.com/minimaxir/status/1819041076872908894
A fun consequence of better text rendering is that it means text watermarks from its training data appear more clearly: https://x.com/minimaxir/status/1819045012166127921
(available without sign-in) FLUX.1 [schnell] (Apache 2.0, open weights, step distilled): https://fal.ai/models/fal-ai/flux/schnell
(requires sign-in) FLUX.1 [dev] (non-commercial, open weights, guidance distilled): https://fal.ai/models/fal-ai/flux/dev
FLUX.1 [pro] (closed source [only available thru APIs], SOTA, raw): https://fal.ai/models/fal-ai/flux-pro
If this runs locally, it is very, very close to that in terms of both image quality and prompt adherence.
It did fail at writing text clearly when the text was a bit complicated. This Ideogram image's prompt, for example: https://ideogram.ai/g/GUw6Vo-tQ8eRWp9x2HONdA/0
> A captivating and artistic illustration of four distinct creative quarters, each representing a unique aspect of creativity. In the top left, a writer with a quill and inkpot is depicted, showcasing their struggle with the text "THE STRUGGLE IS NOT REAL 1: WRITER". The scene is comically portrayed, highlighting the writer's creative challenges. In the top right, a figure labeled "THE STRUGGLE IS NOT REAL 2: COPY ||PASTER" is accompanied by a humorous comic drawing that satirically demonstrates their approach. In the bottom left, "THE STRUGGLE IS NOT REAL 3: THE RETRIER" features a character retrieving items, complete with an entertaining comic illustration. Lastly, in the bottom right, a remixer, identified as "THE STRUGGLE IS NOT REAL 4: THE REMI
Otherwise, the quality is great. I stopped using Stable Diffusion a long time ago; the tools and tech around it became very messy, and it's not fun anymore. I've been using Ideogram for fun, but I want something like Ideogram that I can run locally without any filters. This is looking perfect so far.
This is not Ideogram, but it's very, very good.
Would love to see an AI company attack engineering diagrams head-on; my current hunch is that they just aren't in the training dataset. (I'm very tempted to make a synthetic dataset/benchmark.)
"An upside down house" -> regular old house
"A horse sitting on a dog" -> horse and dog next to each other
"An inverted Lockheed Martin F-22 Raptor" -> yikes https://fal.media/files/koala/zgPYG6SqhD4Y3y_E9MONu.png
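The synthetic benchmark the comment is tempted by could start as small as this: enumerate (subject, relation, object) triples so that each generated image has a known ground truth to check against, exactly the kind of spatial relationships the examples above fail on. Everything here is hypothetical scaffolding, not an existing dataset.

```python
import itertools

# Hypothetical building blocks for a spatial-relation benchmark.
SUBJECTS = ["a horse", "a dog", "a red cube"]
RELATIONS = {
    # relation phrase -> (expected subject state, expected object state)
    "on top of": ("above", "below"),
    "to the left of": ("left", "right"),
    "upside down above": ("inverted", "upright"),
}

def make_benchmark():
    """Yield (prompt, expected) pairs for every ordered subject pair."""
    for subj, obj in itertools.permutations(SUBJECTS, 2):
        for relation, (subj_state, obj_state) in RELATIONS.items():
            prompt = f"{subj} {relation} {obj}"
            yield prompt, {
                "subject": subj,
                "object": obj,
                "subject_state": subj_state,
                "object_state": obj_state,
            }

cases = list(make_benchmark())  # 3P2 ordered pairs x 3 relations = 18 cases
```

Scoring the generated images against the expected states (e.g. with a VQA model) is the hard part this sketch deliberately leaves out.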
I have seen a lot of promises made by diffusion models.
This is in a whole different world. I legitimately feel bad for the people still at StabilityAI.
The playground testing is really something else!
The licensing model isn’t bad, although I would like to see them promise to open up their old closed source models under Apache when they release new API versions.
The prompt adherence and the breadth of topics it seems to know without a finetune and without any LORAs, is really amazing.
The best prompt adherence on the market right now BY FAR is DALL-E 3, but it still falls down on more complicated concepts and is obviously heavily censored - though, weirdly, it is significantly less censored if you hit their API directly.
I quickly mocked up a few weird/complex prompts and did some side-by-side comparisons between Flux and DALL-E 3. Flux is impressive and notably fast, particularly since Black Forest has confirmed that both the dev and schnell models can be run via ComfyUI.
This is missing from the image. The generated image looks good, but while reading the prompt I was surprised it was missing.
...then it's not open source. At least the others are Apache 2.0 (real open source) and correctly labeled proprietary, respectively.
Photo of teen girl in a ski mask making an origami swan in a barn. There is caption on the bottom of the image: "EAT DRUGS" in yellow font. In the background there is a framed photo of obama
https://i.imgur.com/RifcWZc.png
Donald Trump on the cover of "Leopards Ate My Face" magazine
It looks like this is the case for LLMs, that the training quality of the data has a significant impact on the output quality of the model, which makes sense.
So the real magic is in designing a system to curate that high quality data.
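The curation idea in the two lines above can be made concrete with a toy filter: keep only image-caption pairs whose quality scores clear a threshold. The score fields and cutoffs here are hypothetical; real pipelines derive them from aesthetic predictors and image-text similarity models (e.g. CLIP score).

```python
def curate(samples, min_aesthetic=5.0, min_caption_match=0.3):
    """Filter a dataset of dicts down to the high-quality subset.

    Field names and thresholds are illustrative, not from any real pipeline.
    """
    return [
        s for s in samples
        if s["aesthetic"] >= min_aesthetic
        and s["caption_match"] >= min_caption_match
    ]

raw = [
    {"id": 1, "aesthetic": 6.2, "caption_match": 0.41},  # keep
    {"id": 2, "aesthetic": 3.1, "caption_match": 0.55},  # low aesthetic, drop
    {"id": 3, "aesthetic": 5.9, "caption_match": 0.12},  # caption mismatch, drop
]
kept = curate(raw)
```

At scale the interesting engineering is in producing the scores themselves and in tuning the thresholds against downstream model quality, not in the filter.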
Complete and utter UX/first-impression fail. I had no desire to actually try the model after this.
I assume you're offering this as an API? It would be nice to have a pricing page, as I didn't see one on your website.
Reddit message: https://www.reddit.com/r/StableDiffusion/comments/1ehh1hx/an...
Linked image: https://preview.redd.it/dz3djnish2gd1.png?width=1024&format=...
The prompt:
> Photo of Criminal in a ski mask making a phone call in front of a store. There is caption on the bottom of the image: "It's time to Counter the Strike...". There is a red arrow pointing towards the caption. The red arrow is from a Red circle which has an image of Halo Master Chief in it.
Some of the images I generated using the schnell model with 8-10 steps and this prompt: https://imgur.com/a/3mM9tKf
Really curious to see what other low-hanging fruit people are finding.
Result (distilled schnell model) for
"Photo of Criminal in a ski mask making a phone call in front of a store. There is caption on the bottom of the image: "It's time to Counter the Strike...". There is a red arrow pointing towards the caption. The red arrow is from a Red circle which has an image of Halo Master Chief in it."