August 24th, 2024

Foundation for Human Vision Models

Sapiens, developed by Facebook Research, advances human-centric vision tasks with models pretrained on 300 million human images; it offers lite and full installation options, finetuning guides, and support for pose estimation, segmentation, depth, and normal estimation.


Sapiens is a family of foundation models for human-centric vision developed by Facebook Research, targeting tasks such as 2D pose estimation, body-part segmentation, depth estimation, and surface normal estimation. The models are pretrained on a dataset of 300 million human images and optimized for high-resolution feature extraction at 1024 x 1024 pixels. To get started, users clone the repository and set the SAPIENS_ROOT path. Two installation options are available: a lite version for inference-only use and a full installation for replicating the complete training setup. Beyond the core tasks, Sapiens also supports standalone image encoding, and guides are provided for finetuning the models, particularly for surface normal estimation. The project acknowledges contributions from OpenMMLab, encourages users to report issues for support, is released under the license terms outlined in the repository, and asks users to cite it in research using the provided BibTeX entry.
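
To make the quick-start concrete, here is a minimal inference sketch in PyTorch. The checkpoint filename, the TorchScript export format, the ImageNet normalization statistics, and the output layout are all assumptions for illustration; the repository's lite-installation docs define the actual interface.

    import torch
    import torch.nn.functional as F
    from torchvision.io import read_image

    # Load an exported Sapiens checkpoint. TorchScript and the filename
    # below are assumptions; see the repo for the real entry point.
    model = torch.jit.load("sapiens_pose_1b.pt")  # hypothetical filename
    model.eval()

    # The models are optimized for high-resolution input: 1024 x 1024.
    img = read_image("person.jpg").float() / 255.0  # (3, H, W) in [0, 1]
    img = F.interpolate(img.unsqueeze(0), size=(1024, 1024),
                        mode="bilinear", align_corners=False)

    # Standard ImageNet normalization -- an assumption; the repo may
    # ship its own preprocessing configuration.
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    img = (img - mean) / std

    with torch.no_grad():
        out = model(img)  # e.g. keypoint heatmaps for a pose checkpoint
    print(tuple(out.shape))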

- Sapiens is designed for human-centric vision tasks and pretrained on 300 million images.

- The models operate at a resolution of 1024 x 1024 pixels.

- Users can choose between a lite installation for inference or a full installation for training.

- The repository includes guides for the various tasks and for finetuning the models; a hedged finetuning sketch follows this list.

- Contributions from OpenMMLab are acknowledged, and users can cite the project in research.
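
To make the finetuning idea concrete, the sketch below freezes a pretrained encoder and trains a small head for surface normal estimation in plain PyTorch. The checkpoint name, the 1536-channel feature width, the head design, and the cosine loss are illustrative assumptions, not the repository's documented recipe, which builds on OpenMMLab tooling.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical exported encoder; the repo's actual finetuning flow
    # goes through its training configs rather than raw TorchScript.
    encoder = torch.jit.load("sapiens_encoder_1b.pt")
    for p in encoder.parameters():
        p.requires_grad_(False)  # freeze the pretrained backbone

    # Lightweight task head: project encoder features to 3-channel
    # surface normals. The 1536 input channels are an assumed width.
    head = nn.Conv2d(in_channels=1536, out_channels=3, kernel_size=1)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

    def train_step(images, target_normals):
        # images: (B, 3, 1024, 1024); target_normals: (B, 3, h, w),
        # matching the encoder's assumed output grid.
        with torch.no_grad():
            feats = encoder(images)  # (B, 1536, h, w), assumed shape
        pred = F.normalize(head(feats), dim=1)  # unit normal vectors
        tgt = F.normalize(target_normals, dim=1)
        loss = (1 - (pred * tgt).sum(dim=1)).mean()  # 1 - cosine sim
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

Training only the head keeps memory requirements modest; end-to-end finetuning, which the full installation is meant to support, would unfreeze the backbone at a much higher compute cost.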

5 comments
By @yoknapathawa - 8 months
Vision transformer trained on 300M human images with state-of-the-art results on a bunch of human tasks (keypoints, segmentation, depth, normals).

Disclaimer: Co-author here.

By @aithrowaway1987 - 8 months
The shadiness about Facebook's proprietary dataset of 300 million photos is concerning and should draw more attention. At the very least it is scientifically unacceptable - we should not high-five Big Tech researchers for intentionally unreproducible research. And if Meta is harvesting user photos for AI research and commercialization, they should tell their users about it directly (I am sure there is something buried in the TOS). Does the dataset include only public photos, or are Instagram DMs fair game? Does it include CSAM? Who cares!

Serious question: who are the people in the illustrations they used in the paper?[1] Are they Facebook/Instagram users? Did the authors ask permission to use their photos for an arXiv publication? Including their kids? Meta researchers really should be answering questions like this before they are asked - but these authors didn't even include an impact statement!

https://arxiv.org/abs/2408.12569

By @notjoemama - 8 months
At one point, well before Facebook was 'Facebook' in modern parlance, I posted a photo of a potato on my timeline knowing that somewhere in their object graph, I == potato. I'm certain my potatodentity isn't in this dataset, but one can hope that joke eventually lands.
By @vessenes - 8 months
Um, this looks really, really good.

Yo @yoknapathawa, can this be finetuned on an M3 chip? How much RAM is needed? What are the current low-hanging-fruit-type tasks you think the community could go at? What's latency like? I didn't see anything on the page / in the paper / github about speeds.

I'm also curious about the classes you use for the segmentation task -- do you have a list of them somewhere?

Finally, your generalization results are all on photorealistic images; did you look at paintings / animation / other? I'm curious how broadly the generalization goes.

As always, thank you for opening the weights.

By @Dig1t - 8 months
Anyone know how feasible it would be to use this for doing mocap for a game?