August 28th, 2024

Sapiens: Foundation for Human Vision Models

The "Sapiens" models enhance human-centric vision tasks through self-supervised pretraining on 300 million images, showing strong generalization and scalability, outperforming benchmarks in several datasets.

The paper "Sapiens: Foundation for Human Vision Models" introduces a family of models for four key human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. The models support high-resolution inference and are easy to adapt to individual tasks by fine-tuning models pretrained on over 300 million in-the-wild human images. The authors find that self-supervised pretraining on this curated dataset significantly improves performance across the tasks, even when labeled data is limited. The resulting models generalize well to real-world data and scale consistently, with performance improving as the parameter count grows from 0.3 to 2 billion. Sapiens surpasses previous state-of-the-art results on several benchmarks, including Humans-5K, Humans-2K, Hi4D, and THuman2. The work was presented at ECCV 2024.
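As a rough illustration of the adaptation step, the sketch below attaches a small task-specific head (here, body-part segmentation) on top of a pretrained encoder and fine-tunes the whole model. The class names, feature shapes, and hyperparameters are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal fine-tuning sketch: pretrained encoder + lightweight task head.
# `PretrainedEncoder`, embed_dim=1024, and num_classes=28 are illustrative assumptions.
import torch
import torch.nn as nn


class SegmentationHead(nn.Module):
    """Maps encoder feature maps to per-pixel class logits at input resolution."""

    def __init__(self, embed_dim: int, num_classes: int, patch_size: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(embed_dim, num_classes, kernel_size=1)
        self.upsample = nn.Upsample(scale_factor=patch_size, mode="bilinear",
                                    align_corners=False)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, embed_dim, H/patch, W/patch) -> (B, num_classes, H, W)
        return self.upsample(self.proj(feats))


class FineTuneModel(nn.Module):
    """Pretrained backbone with a task-specific head, trained end to end."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder  # weights assumed loaded from pretraining
        self.head = SegmentationHead(embed_dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(images))


# Usage sketch (assumes the encoder returns a (B, C, H/16, W/16) feature map):
# encoder = load_pretrained_encoder()                     # hypothetical loader
# model = FineTuneModel(encoder, embed_dim=1024, num_classes=28)
# logits = model(batch_images)                            # (B, 28, H, W)
# loss = nn.CrossEntropyLoss()(logits, batch_labels)
```

The same pattern applies to the other three tasks by swapping the head: keypoint heatmaps for pose, a single-channel regressor for depth, and a three-channel regressor for surface normals.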

- The Sapiens models address four fundamental human-centric vision tasks.

- They utilize self-supervised pretraining on a large curated dataset to enhance performance (see the pretraining sketch after this list).

- The models show strong generalization to in-the-wild data.

- Performance improves with increased model parameters, demonstrating scalability.

- Significant improvements were achieved over previous state-of-the-art results on multiple benchmark datasets.
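
Self-supervised pretraining of this kind is typically implemented as masked image modeling: hide a large fraction of image patches and train the network to reconstruct them. The sketch below shows that general pattern in PyTorch; the function names, mask ratio, and patch counts are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of masked-image-modeling pretraining utilities.
import torch


def random_patch_mask(batch_size: int, num_patches: int, mask_ratio: float) -> torch.Tensor:
    """Boolean mask of shape (B, N); True marks patches hidden from the model."""
    num_masked = int(num_patches * mask_ratio)
    noise = torch.rand(batch_size, num_patches)          # random score per patch
    masked_idx = noise.topk(num_masked, dim=1).indices   # choose num_masked patches
    mask = torch.zeros(batch_size, num_patches, dtype=torch.bool)
    mask.scatter_(1, masked_idx, True)
    return mask


def masked_reconstruction_loss(pred: torch.Tensor,
                               target: torch.Tensor,
                               mask: torch.Tensor) -> torch.Tensor:
    """Mean squared error computed only on masked patches.

    pred, target: (B, N, patch_dim) patch pixel values; mask: (B, N) bool.
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)      # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


# Usage sketch with illustrative sizes (e.g. a 64x64 patch grid, 75% masking):
# mask = random_patch_mask(batch_size=8, num_patches=4096, mask_ratio=0.75)
# loss = masked_reconstruction_loss(pred_patches, target_patches, mask)
```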
