Foundation for Human Vision Models
Sapiens, developed by Facebook Research, enhances human-centric vision tasks with models pretrained on 300 million human images, offering lite and full installation options, finetuning guides, and support for a range of tasks.
Sapiens is a foundation for human vision models developed by Facebook Research, aimed at human-centric vision tasks such as 2D pose estimation, body-part segmentation, depth estimation, and surface normal estimation. The models are pretrained on a dataset of 300 million human images and optimized for high-resolution feature extraction at 1024 x 1024 pixels. To get started, users clone the repository and set the SAPIENS_ROOT path. Two installation options are available: a lite version for inference-only use and a full installation that replicates the complete training setup. Sapiens supports image encoding, pose estimation, body-part segmentation, depth estimation, and surface normal estimation, and the repository provides finetuning guides, notably for surface normal estimation. The project acknowledges contributions from OpenMMLab and encourages users to report issues for support. It is licensed under the terms outlined in the repository, and users are asked to cite the project in their research using the provided BibTeX entry.
- Sapiens is designed for human-centric vision tasks and pretrained on 300 million images.
- The models operate at a resolution of 1024 x 1024 pixels.
- Users can choose between a lite installation for inference or a full installation for training.
- The repository includes guides for various tasks and finetuning models.
- Contributions from OpenMMLab are acknowledged, and users can cite the project in research.
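For a concrete sense of how the lite, inference-only setup might be used, here is a minimal sketch. It assumes the lite checkpoints are exported as TorchScript and that a segmentation checkpoint sits at the hypothetical path checkpoints/sapiens_lite_seg.pt; the preprocessing constants (ImageNet mean/std) and the output layout are also assumptions, so consult the repository's task-specific demo scripts for the authoritative pipeline.

```python
# Minimal inference sketch for a Sapiens-lite task head (assumptions noted above).
import torch
from torchvision import transforms
from PIL import Image

CHECKPOINT = "checkpoints/sapiens_lite_seg.pt"  # hypothetical path, not from the repo
device = "cuda" if torch.cuda.is_available() else "cpu"

# Sapiens models are pretrained for 1024 x 1024 inputs.
preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    # ImageNet normalization constants are an assumption here.
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Load the (assumed) TorchScript export and switch to eval mode.
model = torch.jit.load(CHECKPOINT, map_location=device).eval()

image = Image.open("person.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0).to(device)

with torch.inference_mode():
    logits = model(batch)       # assumed shape: (1, num_classes, H, W) for segmentation

mask = logits.argmax(dim=1)     # per-pixel body-part labels
print(mask.shape, mask.unique())
```

The same pattern (resize to 1024 x 1024, normalize, run the exported model) would apply to the other task heads, with only the checkpoint and the interpretation of the output tensor changing.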
Related
Sam 2: Segment Anything in Images and Videos
The GitHub repository for Segment Anything Model 2 (SAM 2) by Facebook Research enhances visual segmentation with real-time video processing, a large dataset, and APIs for image and video predictions.
Meta introduces Segment Anything Model 2
Meta has launched the Segment Anything Model 2 (SAM 2) for segmenting objects in images and videos, featuring real-time processing, zero-shot performance, and open-sourced resources for enhanced user interaction.
The open weight Flux text to image model is next level
Black Forest Labs has launched Flux, the largest open-source text-to-image model with 12 billion parameters, available in three versions. It features enhanced image quality and speed, alongside the release of AuraSR V2.
Segment Anything Model and Friends
The Segment Anything Model (SAM) advances image segmentation with a promptable architecture, trained on 1B masks for zero-shot tasks, leading to efficient variants like FastSAM and MobileSAM for improved performance.
Segment Anything 2: Demo-First Model Development
Segment Anything 2 (SAM 2) enhances image and video segmentation with improved accuracy and speed, utilizing a large dataset and innovative features like memory attention for real-time processing.
Disclaimer: Co-author here.
Serious question: who are the people in the illustrations they used in the paper?[1] Are they Facebook/Instagram users? Did the authors ask permission to use their photos for an arXiv publication? Including their kids? Meta researchers really should be answering questions like this before they are asked - but these authors didn't even include an impact statement!
Yo @yoknapthawa, can this be finetuned on an M3 chip? How much RAM is needed? What are the current low-hanging-fruit tasks you think the community could go after? What's latency like? I didn't see anything on the page / in the paper / on GitHub about speeds.
I'm also curious about the classes you use for the segmentation task -- do you have a list of them somewhere?
Finally, your generalization results are all on photorealistic images; did you look at paintings / animation / other styles? I'm curious how broadly the generalization goes.
As always, thank you for opening the weights.