Stable Fast 3D: Rapid 3D Asset Generation from Single Images
Stability AI has launched Stable Fast 3D, a model that generates high-quality 3D assets from a single image in 0.5 seconds, suitable for a range of industries. It enables rapid prototyping and is available on Hugging Face.
Stability AI has launched Stable Fast 3D, a model that generates high-quality 3D assets from a single image in just 0.5 seconds. This technology is built on TripoSR and features significant architectural enhancements, making it suitable for game and virtual reality developers, as well as professionals in retail, architecture, and design. Users can upload an image, and the model produces a complete 3D asset, including a UV-unwrapped mesh and material parameters, with options for quad or triangle remeshing. The model is available on Hugging Face and can be accessed via the Stability AI API and the Stable Assistant chatbot, allowing users to share their 3D creations and interact with them in augmented reality.
Stable Fast 3D is particularly beneficial for rapid prototyping in 3D work, with applications in gaming, movie production, and e-commerce. It boasts unmatched speed and quality, outperforming previous models like SV3D, which took 10 minutes for similar tasks. The new model's capabilities include reduced illumination entanglement in textures and the generation of additional material parameters and normal maps. The model is released under the Stability AI Community License, permitting non-commercial use and commercial use for organizations with annual revenues up to $1 million. For those exceeding this threshold, enterprise licenses are available. The model's code is accessible on GitHub, and a technical report detailing its architecture and performance improvements is also provided.
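To make the output description above concrete, here is a minimal sketch (not official usage) of inspecting a generated asset with the trimesh library. The file name "output.glb" is an assumption; substitute whatever path the model or demo actually writes.

```python
# Minimal sketch: inspect a Stable Fast 3D output mesh with trimesh.
# "output.glb" is an assumed file name, not an official artifact path.
import trimesh

loaded = trimesh.load("output.glb")  # GLB files typically load as a trimesh.Scene
scene = loaded if isinstance(loaded, trimesh.Scene) else trimesh.Scene(loaded)

for name, mesh in scene.geometry.items():
    print(name, "vertices:", len(mesh.vertices), "faces:", len(mesh.faces))
    visual = mesh.visual
    # A UV-unwrapped mesh carries per-vertex texture coordinates
    if getattr(visual, "uv", None) is not None:
        print("  UVs:", visual.uv.shape)
    # PBR material parameters (base color, metallic/roughness textures) sit on the material
    material = getattr(visual, "material", None)
    if material is not None:
        print("  material:", type(material).__name__)
```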
Related
Unique3D: Image-to-3D Generation from a Single Image
The GitHub repository hosts Unique3D, offering efficient 3D mesh generation from a single image. It includes author details, project specifics, setup guides for Linux and Windows, an interactive demo, ComfyUI, tips, acknowledgements, collaborations, and citations.
Meta 3D Gen
Meta introduces Meta 3D Gen (3DGen), a fast text-to-3D asset tool with high prompt fidelity and PBR support. It integrates AssetGen and TextureGen components, outperforming industry baselines in speed and quality.
MASt3R – Matching and Stereo 3D Reconstruction
MASt3R, a model within the DUSt3R framework, excels in 3D reconstruction and feature mapping for image collections. It enhances depth perception, reduces errors, and revolutionizes spatial awareness across industries.
Depth Anything V2
Depth Anything V2 is a monocular depth estimation model trained on synthetic and real images, offering improved details, robustness, and speed compared to previous models. It focuses on enhancing predictions using synthetic images and large-scale pseudo-labeled real images.
The open weight Flux text to image model is next level
Black Forest Labs has launched Flux, the largest open-source text-to-image model with 12 billion parameters, available in three versions. It features enhanced image quality and speed, alongside the release of AuraSR V2.
- Many users express enthusiasm for the potential of AI-generated 3D assets, likening it to the transformative impact of Photoshop.
- Some users report mixed results, noting that the quality of generated models can vary significantly based on input images.
- Concerns are raised about the limitations of the technology, particularly regarding the accuracy of 3D outputs from 2D images.
- There is speculation about the cost-saving potential for industries like gaming, where 3D asset creation is expensive.
- Users are eager for further improvements and applications, including AI-assisted photogrammetry and animation generation.
* so-called "hallucination" (actually just how generative models work) is a feature, not a bug.
* anyone can easily see the unrealistic and biased outputs without complex statistical tests.
* human intuition is useful for evaluation, and not fundamentally misleading (i.e. the equivalent of "this text sounds fluent, so the generator must be intelligent!" hype doesn't really exist for imagery. We're capable of treating it as technology and evaluating it fairly, because there's no equivalent human capability.)
* even lossy, noisy, collapsed and over-trained methods can be valuable for different creative pursuits.
* perfection is not required. You can easily see distorted features in output, and iteratively try to improve them.
* consistency is not required (though it will unlock hugely valuable applications, like video, should it ever arrive).
* technologies like LoRA allow even unskilled users to train character-, style- or concept-specific models with ease.
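For context on the LoRA point above, a toy sketch of the low-rank adapter idea (an illustration only, not any specific library's implementation): the pretrained weights stay frozen, and only a small low-rank update is trained, which is why adapters are cheap enough for hobbyist fine-tuning.

```python
# Toy illustration of LoRA: freeze the pretrained layer, train a small
# low-rank update B @ A on top of it. Not any library's actual implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # frozen pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # frozen path + trainable low-rank path
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # ~12k trainable params vs ~590k in the full layer
```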
I've been amazed at how much better image / visual generation models have become in the last year, and IMO the pace of improvement has not been slowing as much as it has for text models. Moreover, it's becoming increasingly clear that the future isn't the wholesale replacement of photographers, cinematographers, etc., but rather a generation of crazy AI-based power tools that can do things like add and remove concepts in imagery with a few text prompts. It's insanely useful, and just like Photoshop in the 90s, a new generation of power users is already emerging and doing wild things with the tools.
In any case it would be cool if they specified the set of inputs that is expected to give decent results.
Holy cow - I was thinking this might be one of those datacenter-only models but here I am proven wrong. 7GB of VRAM suggests this could run on a lot of hardware that 3D artists own already.
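A quick way to check (assuming an NVIDIA GPU and PyTorch installed) whether a local machine clears the roughly 7 GB mentioned above:

```python
# Check whether the local GPU has at least ~7 GB of VRAM available to PyTorch.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    verdict = "enough" if vram_gb >= 7 else "probably too little"
    print(f"{props.name}: {vram_gb:.1f} GB VRAM ({verdict})")
else:
    print("No CUDA device visible to PyTorch")
```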
I see these as usable not for main assets, but as low-effort embellishments that add complexity to the main scene. The fact that they maintain their profile makes them usable in situations where a mere 2D billboard impostor (i.e. the original image always oriented towards the camera) would not cut it.
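For comparison, a minimal sketch of what a billboard impostor does: a yaw-only rotation that keeps a quad (assumed to face +Z by default) pointed at the camera. The function name and axis convention are illustrative assumptions.

```python
# Yaw-only billboard: rotate the quad about the vertical (Y) axis so it faces the camera.
import math

def billboard_yaw(obj_pos, cam_pos):
    dx = cam_pos[0] - obj_pos[0]
    dz = cam_pos[2] - obj_pos[2]
    return math.atan2(dx, dz)  # rotation about Y, in radians

print(math.degrees(billboard_yaw((0, 0, 0), (3, 0, 3))))  # 45.0
```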
You can totally create a figure image (Midjourney|Bing|Dalle3), drag and drop it into the image input, and get a surprisingly good 3D representation: not a highly detailed model, but something you could very well put on a shelf in a 3D scene as an embellishment, where the camera never sees the back of it and the model is never the center of attention.
However... mixed success. It's not good with (real) cats yet - which was obvs the first thing I tried. It did reasonably well with a simple image of an iPhone, and actually pretty impressively with a pancake with fruit on top, terribly with a rocket, and impressively again with a rack of pool balls.
[0] https://huggingface.co/spaces/stabilityai/stable-fast-3d
I wonder what the optimum group of technologies is that would enable that kind of mapping? Would you pile on LIDAR, RADAR, this tech, ultrasound, magnetic sensing, etc etc. Although, you're then getting a flying tricorder. Which could enable some cool uses even outside the stereotypical search and rescue.
You can test here: https://huggingface.co/spaces/stabilityai/stable-fast-3d
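A hedged sketch of driving that Space programmatically with gradio_client; the endpoint name and arguments in the commented-out call are assumptions, so check the Space's "Use via API" panel for the real signature.

```python
# Sketch: call the Hugging Face Space programmatically with gradio_client.
from gradio_client import Client

client = Client("stabilityai/stable-fast-3d")
client.view_api()  # lists the Space's actual endpoints and argument names

# Hypothetical call -- endpoint name and parameters are assumptions:
# result = client.predict("photo.png", api_name="/run")
# print(result)  # typically a path to the generated mesh file
```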
I wonder whether RAG-based 3D animation generation can be done with this (a rough skeleton follows the list below).
1. Textual description of a story.
2. Extract/generate keywords from the story using LLM.
3. Search and look up 2D images by the keywords.
4. Generate 3D models from the 2D images using Stable Fast 3D.
5. Extract/generate path description from the story using LLM.
6. Generate movement/animation/gait using some AI.
...
7. Profit??
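A purely illustrative skeleton of that pipeline; every helper below is a hypothetical stub standing in for an LLM call, an image search, Stable Fast 3D inference, or an animation model.

```python
# Illustrative skeleton of the story-to-animation pipeline above.
# All helpers are hypothetical stubs, not real APIs.
from dataclasses import dataclass

@dataclass
class SceneAsset:
    keyword: str
    image_path: str
    mesh_path: str

def extract_keywords(story: str) -> list[str]:
    raise NotImplementedError("step 2: ask an LLM for key objects/characters")

def search_images(keyword: str) -> str:
    raise NotImplementedError("step 3: image search, return a local image path")

def image_to_3d(image_path: str) -> str:
    raise NotImplementedError("step 4: run Stable Fast 3D, return a mesh path")

def extract_paths(story: str, assets: list[SceneAsset]) -> dict:
    raise NotImplementedError("step 5: ask an LLM for per-asset motion paths")

def animate(assets: list[SceneAsset], paths: dict) -> str:
    raise NotImplementedError("step 6: drive movement/gait with an animation model")

def story_to_animation(story: str) -> str:
    assets = [
        SceneAsset(k, (img := search_images(k)), image_to_3d(img))
        for k in extract_keywords(story)
    ]
    return animate(assets, extract_paths(story, assets))
```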