SAM 2: Segment Anything in Images and Videos
The GitHub repository for Segment Anything Model 2 (SAM 2) by Facebook Research enhances visual segmentation with real-time video processing, a large dataset, and APIs for image and video predictions.
The GitHub repository for Segment Anything Model 2 (SAM 2), developed by Facebook Research, focuses on promptable visual segmentation in images and videos, extending the capabilities of its predecessor, SAM. SAM 2 employs a transformer architecture with streaming memory, enabling real-time video processing, and introduces the SA-V dataset, the largest video segmentation dataset available. Installation on a GPU machine is done through pip, with Jupyter and Matplotlib required for the example notebooks. The repository offers APIs for image and video prediction, allowing users to add prompts and track multiple objects across a video.
To install SAM 2, users can clone the repository and install it using pip. Example usage for image prediction involves importing necessary libraries, loading a model checkpoint, and using the SAM2ImagePredictor to set an image and predict masks based on input prompts. For video prediction, a similar approach is taken with the build_sam2_video_predictor, where users initialize the state with a video and propagate predictions across frames.
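As a rough illustration of that workflow, here is a minimal sketch following the repository's documented usage. The checkpoint path, config name, and the your_image / your_points / your_labels / your_video placeholders are assumptions, and exact argument names may differ between releases.

    import torch
    from sam2.build_sam import build_sam2, build_sam2_video_predictor
    from sam2.sam2_image_predictor import SAM2ImagePredictor

    checkpoint = "./checkpoints/sam2_hiera_large.pt"   # assumed checkpoint path
    model_cfg = "sam2_hiera_l.yaml"                    # assumed model config

    # --- image prediction ---
    predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        predictor.set_image(your_image)                # HxWx3 uint8 array
        masks, scores, logits = predictor.predict(
            point_coords=your_points,                  # e.g. [[x, y]] click(s)
            point_labels=your_labels,                  # 1 = foreground, 0 = background
        )

    # --- video prediction ---
    video_predictor = build_sam2_video_predictor(model_cfg, checkpoint)
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        state = video_predictor.init_state(your_video)  # video path or frame folder
        # prompt one object on one frame, then propagate across the whole clip
        video_predictor.add_new_points(state, frame_idx=0, obj_id=1,
                                       points=your_points, labels=your_labels)
        for frame_idx, object_ids, mask_logits in video_predictor.propagate_in_video(state):
            pass  # per-frame masks for every tracked object

Multiple objects can be tracked in the same video by adding prompts with different obj_id values before propagating.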
The repository serves as a significant advancement in visual segmentation, providing essential tools for researchers and developers working with image and video data. Additional resources, including a project page, demo, and research paper, are available for further exploration of SAM 2's capabilities.
The discussion around SAM 2 touched on several recurring themes:
- Users are eager to explore practical applications, such as plant health monitoring and wood stock counting.
- Many are curious about the model's performance, particularly in handling complex objects and real-time video processing.
- There are inquiries about the model's accessibility, including concerns about self-hosting and compatibility with different platforms.
- Some users express interest in the technical aspects, such as training costs, memory attention mechanisms, and potential use in OCR.
- Concerns about the model's limitations, such as tracking objects out of frame and handling transparent materials, are also raised.
1. SAM 2 was trained on 256 A100 GPUs for 108 hours (SAM 1 took 68 hours on the same cluster). Taking the upper-end $2/hr A100 price off gpulist, that works out to 256 GPUs × 108 hrs × $2 ≈ $55k, so SAM 2 cost roughly $50k to train - surprisingly cheap for adding video understanding?
2. New dataset: the SA-V dataset is "only" 50k videos, with careful attention given to scene/object/geographical diversity, including that of the annotators. I wonder if LAION or Datacomp (AFAICT the only other real players in the open image data space) can reach this standard...
3. Bootstrapped annotation: similar to SAM 1, a three-phase approach in which 16k initial annotations across 1.4k videos were then expanded with 63k and 197k more using SAM 1 and SAM 2 assistance, with annotation time accelerating dramatically (89% faster than with SAM 1 alone) by the end.
4. Memory attention: SAM 2 is a transformer with memory across frames! Special "object pointer" tokens are stored in a "memory bank" FIFO queue of recent and prompted frames (a toy sketch of such a memory bank follows this list). Has this been explored in language models? whoa?
(written up in https://x.com/swyx/status/1818074658299855262)
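To make point 4 concrete, here is a toy sketch of a FIFO memory bank holding features of recent and prompted frames plus compact object-pointer tokens, which the current frame's features would cross-attend to. This is my own illustration, not SAM 2's actual code; the class, sizes, and shapes are invented.

    from collections import deque
    import torch

    class ToyMemoryBank:
        """Toy FIFO memory over recent frames plus always-kept prompted frames."""

        def __init__(self, max_recent: int = 6, max_obj_ptrs: int = 16):
            self.recent = deque(maxlen=max_recent)      # oldest unprompted frame is evicted first
            self.prompted = []                          # prompted frames are never evicted
            self.obj_ptrs = deque(maxlen=max_obj_ptrs)  # one compact token per past frame

        def add(self, frame_tokens: torch.Tensor, obj_ptr: torch.Tensor, prompted: bool):
            # frame_tokens: (num_tokens, dim) spatial memory; obj_ptr: (1, dim) pointer token
            (self.prompted if prompted else self.recent).append(frame_tokens)
            self.obj_ptrs.append(obj_ptr)

        def memory_tokens(self) -> torch.Tensor:
            # the current frame's features cross-attend to everything stored here
            entries = list(self.prompted) + list(self.recent) + list(self.obj_ptrs)
            return torch.cat(entries, dim=0)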
I selected each shoe as individual objects and the model was able to segment them even as they overlapped.
Would be extremely useful to be able to semantically "chunk" text for RAG applications compared to the generally naive strategies employed today.
If I somehow overlooked it, would be very interested in hearing about what you've seen.
I was initially thinking the obvious case would be some sort of system for monitoring your plant health. It could check for shrinkage / growth, colour change etc and build some sort of monitoring tool / automated watering system off that.
[1] - page 11 of https://ai.meta.com/research/publications/sam-2-segment-anyt...
Splashing water or orange juice, snow spray from skis, rain and snowfall, foliage, fences and meshes, veils, etc.
Does anyone know if this already happened?
Alright, I'll bite, why not?
I can't make sense of this sentence. Is there some mistake?
Is SAM-2 useful to use as a base model to finetune a classifier layer on? Or are there better options today?
I do have 2 questions: 1. Isn't processing the video frame by frame expensive? 2. In the web demo, when the leg moves fast it loses track of the shoe. Shouldn't the memory part apply some heuristics to overcome this edge case?
i.e. if I stand in the center of my room and take a video while slowly spinning around over 5 seconds, then spin back the other way for 5 seconds.
Will it see the same couch? Or will it see two couches?
Quote: "Sorry Firefox users! The Firefox browser doesn’t support the video features we’ll need to run this demo. Please try again using Chrome or Safari."
Wtf is this shit? Seriously!
I was wondering why the original one got deprecated.
Is there now also a good way for finetuning from the official / your side?
Any benchmarks against SAM1?
Related
Video annotator: a framework for efficiently building video classifiers
The Netflix Technology Blog presents the Video Annotator (VA) framework for efficient video classifier creation. VA integrates vision-language models, active learning, and user validation, outperforming baseline methods with an 8.3 point Average Precision improvement.
Show HN: AI assisted image editing with audio instructions
The GitHub repository hosts "AAIELA: AI Assisted Image Editing with Language and Audio," a project enabling image editing via audio commands and AI models. It integrates various technologies for object detection, language processing, and image inpainting. Future plans involve model enhancements and feature integrations.
LivePortrait: A fast, controllable portrait animation model
The GitHub repository contains the PyTorch implementation of the "LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control" project. It provides instructions for setup, inference, Gradio interface usage, speed evaluation, acknowledgements, and citations.
From the Tensor to Stable Diffusion
The GitHub repository offers a comprehensive machine learning guide covering deep learning, vision-language models, neural networks, CNNs, RNNs, and paper implementations like LeNet, AlexNet, ResNet, GRU, LSTM, CBOW, Skip-Gram, Transformer, and BERT. Ideal for exploring machine learning concepts.
TreeSeg: Hierarchical Topic Segmentation of Large Transcripts
Augmend is creating a platform to automate tribal knowledge for development teams, featuring the TreeSeg algorithm, which segments session data into chapters by analyzing audio transcriptions and semantic actions.