Stable Fast 3D: Rapid 3D Asset Generation from Single Images
Stability AI has launched Stable Fast 3D, a model that generates high-quality 3D assets from a single image in 0.5 seconds, suitable for a range of industries. It enables rapid prototyping and is available on Hugging Face.
Stability AI has launched Stable Fast 3D, a model that generates high-quality 3D assets from a single image in just 0.5 seconds. This technology is built on TripoSR and features significant architectural enhancements, making it suitable for game and virtual reality developers, as well as professionals in retail, architecture, and design. Users can upload an image, and the model produces a complete 3D asset, including a UV-unwrapped mesh and material parameters, with options for quad or triangle remeshing. The model is available on Hugging Face and can be accessed via the Stability AI API and the Stable Assistant chatbot, allowing users to share their 3D creations and interact with them in augmented reality.
Stable Fast 3D is particularly beneficial for rapid prototyping in 3D work, with applications in gaming, movie production, and e-commerce. It boasts unmatched speed and quality, outperforming previous models like SV3D, which took 10 minutes for similar tasks. The new model's capabilities include reduced illumination entanglement in textures and the generation of additional material parameters and normal maps. The model is released under the Stability AI Community License, permitting non-commercial use and commercial use for organizations with annual revenues up to $1 million. For those exceeding this threshold, enterprise licenses are available. The model's code is accessible on GitHub, and a technical report detailing its architecture and performance improvements is also provided.
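To make the output description above concrete, here is a minimal sketch (not official usage) of inspecting a generated asset with the trimesh library. The file name "output.glb" is an assumption; substitute whatever path the model or demo actually writes.

```python
# Minimal sketch: inspect a Stable Fast 3D output mesh with trimesh.
# "output.glb" is an assumed file name, not an official artifact path.
import trimesh

loaded = trimesh.load("output.glb")  # GLB files typically load as a trimesh.Scene
scene = loaded if isinstance(loaded, trimesh.Scene) else trimesh.Scene(loaded)

for name, mesh in scene.geometry.items():
    print(name, "vertices:", len(mesh.vertices), "faces:", len(mesh.faces))
    visual = mesh.visual
    # A UV-unwrapped mesh carries per-vertex texture coordinates
    if getattr(visual, "uv", None) is not None:
        print("  UVs:", visual.uv.shape)
    # PBR material parameters (base color, metallic/roughness textures) sit on the material
    material = getattr(visual, "material", None)
    if material is not None:
        print("  material:", type(material).__name__)
```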
Related
Unique3D: Image-to-3D Generation from a Single Image
The GitHub repository hosts Unique3D, offering efficient 3D mesh generation from a single image. It includes author details, project specifics, setup guides for Linux and Windows, an interactive demo, ComfyUI, tips, acknowledgements, collaborations, and citations.
Meta 3D Gen
Meta introduces Meta 3D Gen (3DGen), a fast text-to-3D asset tool with high prompt fidelity and PBR support. It integrates AssetGen and TextureGen components, outperforming industry baselines in speed and quality.
MASt3R – Matching and Stereo 3D Reconstruction
MASt3R, a model within the DUSt3R framework, excels in 3D reconstruction and feature mapping for image collections. It enhances depth perception, reduces errors, and revolutionizes spatial awareness across industries.
Depth Anything V2
Depth Anything V2 is a monocular depth estimation model trained on synthetic and real images, offering improved details, robustness, and speed compared to previous models. It focuses on enhancing predictions using synthetic images and large-scale pseudo-labeled real images.
The open weight Flux text to image model is next level
Black Forest Labs has launched Flux, the largest open-source text-to-image model with 12 billion parameters, available in three versions. It features enhanced image quality and speed, alongside the release of AuraSR V2.
- Many users express enthusiasm for the potential of AI-generated 3D assets, likening it to the transformative impact of Photoshop.
- Some users report mixed results, noting that the quality of generated models can vary significantly based on input images.
- Concerns are raised about the limitations of the technology, particularly regarding the accuracy of 3D outputs from 2D images.
- There is speculation about the cost-saving potential for industries like gaming, where 3D asset creation is expensive.
- Users are eager for further improvements and applications, including AI-assisted photogrammetry and animation generation.
* so-called "hallucination" (actually just how generative models work) is a feature, not a bug.
* anyone can easily see the unrealistic and biased outputs without complex statistical tests.
* human intuition is useful for evaluation, and not fundamentally misleading (i.e. the equivalent of "this text sounds fluent, so the generator must be intelligent!" hype doesn't really exist for imagery. We're capable of treating it as technology and evaluating it fairly, because there's no equivalent human capability.)
* even lossy, noisy, collapsed and over-trained methods can be valuable for different creative pursuits.
* perfection is not required. You can easily see distorted features in output, and iteratively try to improve them.
* consistency is not required (though it will unlock hugely valuable applications, like video, should it ever arrive).
* technologies like LoRA allow even unskilled users to train character-, style- or concept-specific models with ease.
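For context on the LoRA point above, a toy sketch of the low-rank adapter idea (an illustration only, not any specific library's implementation): the pretrained weights stay frozen, and only a small low-rank update is trained, which is why adapters are cheap enough for hobbyist fine-tuning.

```python
# Toy illustration of LoRA: freeze the pretrained layer, train a small
# low-rank update B @ A on top of it. Not any library's actual implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # frozen pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # frozen path + trainable low-rank path
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # ~12k trainable params vs ~590k in the full layer
```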
I've been amazed at how much better image / visual generation models have become in the last year, and IMO the pace of improvement has not been slowing as much as it has for text models. Moreover, it's becoming increasingly clear that the future isn't the wholesale replacement of photographers, cinematographers, etc., but rather a generation of crazy AI-based power tools that can do things like add and remove concepts in imagery with a few text prompts. It's insanely useful, and just like Photoshop in the 90s, a new generation of power users is already emerging and doing wild things with the tools.
In any case it would be cool if they specified the set of inputs that is expected to give decent results.
Holy cow - I was thinking this might be one of those datacenter-only models but here I am proven wrong. 7GB of VRAM suggests this could run on a lot of hardware that 3D artists own already.
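A quick way to check (assuming an NVIDIA GPU and PyTorch installed) whether a local machine clears the roughly 7 GB mentioned above:

```python
# Check whether the local GPU has at least ~7 GB of VRAM available to PyTorch.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    verdict = "enough" if vram_gb >= 7 else "probably too little"
    print(f"{props.name}: {vram_gb:.1f} GB VRAM ({verdict})")
else:
    print("No CUDA device visible to PyTorch")
```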
I see these as usable not for main assets, but as low-effort embellishments that add complexity to the main scene. The fact that they maintain their profile makes them usable in situations where a mere 2D billboard impostor (i.e. the original image always oriented towards the camera) would not cut it.
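For comparison, a minimal sketch of what a billboard impostor does: a yaw-only rotation that keeps a quad (assumed to face +Z by default) pointed at the camera. The function name and axis convention are illustrative assumptions.

```python
# Yaw-only billboard: rotate the quad about the vertical (Y) axis so it faces the camera.
import math

def billboard_yaw(obj_pos, cam_pos):
    dx = cam_pos[0] - obj_pos[0]
    dz = cam_pos[2] - obj_pos[2]
    return math.atan2(dx, dz)  # rotation about Y, in radians

print(math.degrees(billboard_yaw((0, 0, 0), (3, 0, 3))))  # 45.0
```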
You can totally create a figure image (Midjourney|Bing|Dalle3), drag and drop it into the image input, and get a surprisingly good 3D representation: not a highly detailed model, but something you could very well put on a shelf in a 3D scene as an embellishment, where the camera never sees the back of it and the model is never the center of attention.
However... mixed success. It's not good with (real) cats yet - which was obvs the first thing I tried. It did reasonably well with a simple image of an iPhone, and actually pretty impressively with a pancake with fruit on top, terribly with a rocket, and impressively again with a rack of pool balls.
[0] https://huggingface.co/spaces/stabilityai/stable-fast-3d
I wonder what the optimum group of technologies is that would enable that kind of mapping? Would you pile on LIDAR, RADAR, this tech, ultrasound, magnetic sensing, etc etc. Although, you're then getting a flying tricorder. Which could enable some cool uses even outside the stereotypical search and rescue.
You can test here: https://huggingface.co/spaces/stabilityai/stable-fast-3d
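A hedged sketch of driving that Space programmatically with gradio_client; the endpoint name and arguments in the commented-out call are assumptions, so check the Space's "Use via API" panel for the real signature.

```python
# Sketch: call the Hugging Face Space programmatically with gradio_client.
from gradio_client import Client

client = Client("stabilityai/stable-fast-3d")
client.view_api()  # lists the Space's actual endpoints and argument names

# Hypothetical call -- endpoint name and parameters are assumptions:
# result = client.predict("photo.png", api_name="/run")
# print(result)  # typically a path to the generated mesh file
```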
I wonder whether RAG-based 3D animation generation can be done with this (a rough skeleton follows the list below).
1. Textual description of a story.
2. Extract/generate keywords from the story using LLM.
3. Search and look up 2D images by the keywords.
4. Generate 3D models from the 2D images using Stable Fast 3D.
5. Extract/generate path description from the story using LLM.
6. Generate movement/animation/gait using some AI.
...
7. Profit??
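A purely illustrative skeleton of that pipeline; every helper below is a hypothetical stub standing in for an LLM call, an image search, Stable Fast 3D inference, or an animation model.

```python
# Illustrative skeleton of the story-to-animation pipeline above.
# All helpers are hypothetical stubs, not real APIs.
from dataclasses import dataclass

@dataclass
class SceneAsset:
    keyword: str
    image_path: str
    mesh_path: str

def extract_keywords(story: str) -> list[str]:
    raise NotImplementedError("step 2: ask an LLM for key objects/characters")

def search_images(keyword: str) -> str:
    raise NotImplementedError("step 3: image search, return a local image path")

def image_to_3d(image_path: str) -> str:
    raise NotImplementedError("step 4: run Stable Fast 3D, return a mesh path")

def extract_paths(story: str, assets: list[SceneAsset]) -> dict:
    raise NotImplementedError("step 5: ask an LLM for per-asset motion paths")

def animate(assets: list[SceneAsset], paths: dict) -> str:
    raise NotImplementedError("step 6: drive movement/gait with an animation model")

def story_to_animation(story: str) -> str:
    assets = [
        SceneAsset(k, (img := search_images(k)), image_to_3d(img))
        for k in extract_keywords(story)
    ]
    return animate(assets, extract_paths(story, assets))
```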