August 24th, 2024

Foundation for Human Vision Models

Sapiens, developed by Facebook Research, advances human-centric vision tasks with models pretrained on 300 million human images; it offers lite and full installation options, finetuning guides, and support for pose estimation, segmentation, depth, and normal estimation.


Sapiens is a family of foundation models for human-centric vision developed by Facebook Research, targeting tasks such as 2D pose estimation, body-part segmentation, depth estimation, and surface normal estimation. The models are pretrained on a dataset of 300 million human images and optimized for high-resolution feature extraction at 1024 x 1024 pixels. To get started, users clone the repository and set the SAPIENS_ROOT path. Two installation options are available: a lite version for inference-only use and a full installation for replicating the complete training setup. Beyond the core tasks, Sapiens also supports standalone image encoding, and guides are provided for finetuning the models, particularly for surface normal estimation. The project acknowledges contributions from OpenMMLab, encourages users to report issues for support, is released under the license terms outlined in the repository, and asks users to cite it in research using the provided BibTeX entry.
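
To make the quick-start concrete, here is a minimal inference sketch in PyTorch. The checkpoint filename, the TorchScript export format, the ImageNet normalization statistics, and the output layout are all assumptions for illustration; the repository's lite-installation docs define the actual interface.

    import torch
    import torch.nn.functional as F
    from torchvision.io import read_image

    # Load an exported Sapiens checkpoint. TorchScript and the filename
    # below are assumptions; see the repo for the real entry point.
    model = torch.jit.load("sapiens_pose_1b.pt")  # hypothetical filename
    model.eval()

    # The models are optimized for high-resolution input: 1024 x 1024.
    img = read_image("person.jpg").float() / 255.0  # (3, H, W) in [0, 1]
    img = F.interpolate(img.unsqueeze(0), size=(1024, 1024),
                        mode="bilinear", align_corners=False)

    # Standard ImageNet normalization -- an assumption; the repo may
    # ship its own preprocessing configuration.
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    img = (img - mean) / std

    with torch.no_grad():
        out = model(img)  # e.g. keypoint heatmaps for a pose checkpoint
    print(tuple(out.shape))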

- Sapiens is designed for human-centric vision tasks and pretrained on 300 million images.

- The models operate at a resolution of 1024 x 1024 pixels.

- Users can choose between a lite installation for inference or a full installation for training.

- The repository includes guides for the various tasks and for finetuning the models; a hedged finetuning sketch follows this list.

- Contributions from OpenMMLab are acknowledged, and users can cite the project in research.
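
To make the finetuning idea concrete, the sketch below freezes a pretrained encoder and trains a small head for surface normal estimation in plain PyTorch. The checkpoint name, the 1536-channel feature width, the head design, and the cosine loss are illustrative assumptions, not the repository's documented recipe, which builds on OpenMMLab tooling.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical exported encoder; the repo's actual finetuning flow
    # goes through its training configs rather than raw TorchScript.
    encoder = torch.jit.load("sapiens_encoder_1b.pt")
    for p in encoder.parameters():
        p.requires_grad_(False)  # freeze the pretrained backbone

    # Lightweight task head: project encoder features to 3-channel
    # surface normals. The 1536 input channels are an assumed width.
    head = nn.Conv2d(in_channels=1536, out_channels=3, kernel_size=1)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

    def train_step(images, target_normals):
        # images: (B, 3, 1024, 1024); target_normals: (B, 3, h, w),
        # matching the encoder's assumed output grid.
        with torch.no_grad():
            feats = encoder(images)  # (B, 1536, h, w), assumed shape
        pred = F.normalize(head(feats), dim=1)  # unit normal vectors
        tgt = F.normalize(target_normals, dim=1)
        loss = (1 - (pred * tgt).sum(dim=1)).mean()  # 1 - cosine sim
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

Training only the head keeps memory requirements modest; end-to-end finetuning, which the full installation is meant to support, would unfreeze the backbone at a much higher compute cost.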

5 comments
By @yoknapathawa - 8 months
Vision transformer trained on 300M human images with state-of-the-art results on a bunch of human tasks (keypoints, segmentation, depth, normals).

Disclaimer: Co-author here.

By @aithrowaway1987 - 8 months
The shadiness about Facebook's proprietary dataset of 300 million photos is concerning and should draw more attention. At the very least it is scientifically unacceptable - we should not high-five Big Tech researchers for intentionally unreproducible research. And if Meta is harvesting user photos for AI research and commercialization, they should tell their users about it directly (I am sure there is something buried in the TOS). Does the dataset include only public photos, or are Instagram DMs fair game? Does it include CSAM? Who cares!

Serious question: who are the people in the illustrations they used in the paper?[1] Are they Facebook/Instagram users? Did the authors ask permission to use their photos for an arXiv publication? Including their kids? Meta researchers really should be answering questions like this before they are asked - but these authors didn't even include an impact statement!

https://arxiv.org/abs/2408.12569

By @notjoemama - 8 months
At one point, well before Facebook was 'Facebook' in modern parlance, I posted a photo of a potato on my timeline knowing that somewhere in their object graph, I == potato. I'm certain my potatodentity isn't in this dataset, but one can hope that joke eventually lands.
By @vessenes - 8 months
Um, this looks really, really good.

Yo @yoknapathawa, can this be finetuned on an M3 chip? How much RAM is needed? What are the current low-hanging-fruit-type tasks you think the community could go at? What's latency like? I didn't see anything on the page / in the paper / github about speeds.

I'm also curious about the classes you use for the segmentation task -- do you have a list of them somewhere?

Finally, your generalization results are all on photorealistic images; did you look at paintings / animation / other? I'm curious how broadly the generalization goes.

As always, thank you for opening the weights.

By @Dig1t - 8 months
Anyone know how feasible it would be to use this for doing mocap for a game?