SAM 2: Segment Anything in Images and Videos
The GitHub repository for Segment Anything Model 2 (SAM 2) by Facebook Research enhances visual segmentation with real-time video processing, a large dataset, and APIs for image and video predictions.
The GitHub repository for Segment Anything Model 2 (SAM 2), developed by Facebook Research, focuses on promptable visual segmentation in images and videos, extending the capabilities of its predecessor, SAM. SAM 2 employs a transformer architecture with streaming memory, enabling real-time video processing, and introduces the SA-V dataset, the largest video segmentation dataset available. Installation on a GPU machine is done through pip, with Jupyter and Matplotlib required for the example notebooks. The repository offers APIs for image and video prediction, allowing users to add prompts and track multiple objects across a video.
To install SAM 2, users can clone the repository and install it using pip. Example usage for image prediction involves importing necessary libraries, loading a model checkpoint, and using the SAM2ImagePredictor to set an image and predict masks based on input prompts. For video prediction, a similar approach is taken with the build_sam2_video_predictor, where users initialize the state with a video and propagate predictions across frames.
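As a rough illustration of that workflow, here is a minimal sketch following the repository's documented usage. The checkpoint path, config name, and the your_image / your_points / your_labels / your_video placeholders are assumptions, and exact argument names may differ between releases.

    import torch
    from sam2.build_sam import build_sam2, build_sam2_video_predictor
    from sam2.sam2_image_predictor import SAM2ImagePredictor

    checkpoint = "./checkpoints/sam2_hiera_large.pt"   # assumed checkpoint path
    model_cfg = "sam2_hiera_l.yaml"                    # assumed model config

    # --- image prediction ---
    predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        predictor.set_image(your_image)                # HxWx3 uint8 array
        masks, scores, logits = predictor.predict(
            point_coords=your_points,                  # e.g. [[x, y]] click(s)
            point_labels=your_labels,                  # 1 = foreground, 0 = background
        )

    # --- video prediction ---
    video_predictor = build_sam2_video_predictor(model_cfg, checkpoint)
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        state = video_predictor.init_state(your_video)  # video path or frame folder
        # prompt one object on one frame, then propagate across the whole clip
        video_predictor.add_new_points(state, frame_idx=0, obj_id=1,
                                       points=your_points, labels=your_labels)
        for frame_idx, object_ids, mask_logits in video_predictor.propagate_in_video(state):
            pass  # per-frame masks for every tracked object

Multiple objects can be tracked in the same video by adding prompts with different obj_id values before propagating.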
The repository serves as a significant advancement in visual segmentation, providing essential tools for researchers and developers working with image and video data. Additional resources, including a project page, demo, and research paper, are available for further exploration of SAM 2's capabilities.
The discussion around SAM 2 touched on several recurring themes:
- Users are eager to explore practical applications, such as plant health monitoring and wood stock counting.
- Many are curious about the model's performance, particularly in handling complex objects and real-time video processing.
- There are inquiries about the model's accessibility, including concerns about self-hosting and compatibility with different platforms.
- Some users express interest in the technical aspects, such as training costs, memory attention mechanisms, and potential use in OCR.
- Concerns about the model's limitations, such as tracking objects out of frame and handling transparent materials, are also raised.
1. SAM 2 was trained on 256 A100 GPUs for 108 hours (SAM 1 took 68 hours on the same cluster). Taking the upper-end $2/hr A100 price off gpulist, that works out to 256 GPUs × 108 hrs × $2 ≈ $55k, so SAM 2 cost roughly $50k to train - surprisingly cheap for adding video understanding?
2. New dataset: the SA-V dataset is "only" 50k videos, with careful attention given to scene/object/geographical diversity, including that of the annotators. I wonder if LAION or Datacomp (AFAICT the only other real players in the open image data space) can reach this standard...
3. Bootstrapped annotation: similar to SAM 1, a three-phase approach in which 16k initial annotations across 1.4k videos were then expanded with 63k and 197k more using SAM 1 and SAM 2 assistance, with annotation time accelerating dramatically (89% faster than with SAM 1 alone) by the end.
4. Memory attention: SAM 2 is a transformer with memory across frames! Special "object pointer" tokens are stored in a "memory bank" FIFO queue of recent and prompted frames (a toy sketch of such a memory bank follows this list). Has this been explored in language models? whoa?
(written up in https://x.com/swyx/status/1818074658299855262)
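To make point 4 concrete, here is a toy sketch of a FIFO memory bank holding features of recent and prompted frames plus compact object-pointer tokens, which the current frame's features would cross-attend to. This is my own illustration, not SAM 2's actual code; the class, sizes, and shapes are invented.

    from collections import deque
    import torch

    class ToyMemoryBank:
        """Toy FIFO memory over recent frames plus always-kept prompted frames."""

        def __init__(self, max_recent: int = 6, max_obj_ptrs: int = 16):
            self.recent = deque(maxlen=max_recent)      # oldest unprompted frame is evicted first
            self.prompted = []                          # prompted frames are never evicted
            self.obj_ptrs = deque(maxlen=max_obj_ptrs)  # one compact token per past frame

        def add(self, frame_tokens: torch.Tensor, obj_ptr: torch.Tensor, prompted: bool):
            # frame_tokens: (num_tokens, dim) spatial memory; obj_ptr: (1, dim) pointer token
            (self.prompted if prompted else self.recent).append(frame_tokens)
            self.obj_ptrs.append(obj_ptr)

        def memory_tokens(self) -> torch.Tensor:
            # the current frame's features cross-attend to everything stored here
            entries = list(self.prompted) + list(self.recent) + list(self.obj_ptrs)
            return torch.cat(entries, dim=0)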
I selected each shoe as individual objects and the model was able to segment them even as they overlapped.
Would be extremely useful to be able to semantically "chunk" text for RAG applications compared to the generally naive strategies employed today.
If I somehow overlooked it, would be very interested in hearing about what you've seen.
I was initially thinking the obvious case would be some sort of system for monitoring your plant health. It could check for shrinkage / growth, colour change etc and build some sort of monitoring tool / automated watering system off that.
[1] - page 11 of https://ai.meta.com/research/publications/sam-2-segment-anyt...
Splashing water or orange juice, snow spray from skis, rain and snowfall, foliage, fences and meshes, veils, etc.
Does anyone know if this already happened?
Alright, I'll bite, why not?
I can't make sense of this sentence. Is there some mistake?
Is SAM-2 useful to use as a base model to finetune a classifier layer on? Or are there better options today?
I do have 2 questions: 1. Isn't processing the video frame by frame expensive? 2. In the web demo, when the leg moves fast it loses track of the shoe. Shouldn't the memory part apply some heuristics to overcome this edge case?
i.e. if I stand in the center of my room and take a video while slowly spinning around over 5 seconds, then spin back the other way for 5 seconds.
Will it see the same couch? Or will it see two couches?
Quote: "Sorry Firefox users! The Firefox browser doesn’t support the video features we’ll need to run this demo. Please try again using Chrome or Safari."
Wtf is this shit? Seriously!
I was wondering why the original one got deprecated.
Is there now also a good way for finetuning from the official / your side?
Any benchmarks against SAM1?
Related
Video annotator: a framework for efficiently building video classifiers
The Netflix Technology Blog presents the Video Annotator (VA) framework for efficient video classifier creation. VA integrates vision-language models, active learning, and user validation, outperforming baseline methods with an 8.3 point Average Precision improvement.
Show HN: AI assisted image editing with audio instructions
The GitHub repository hosts "AAIELA: AI Assisted Image Editing with Language and Audio," a project enabling image editing via audio commands and AI models. It integrates various technologies for object detection, language processing, and image inpainting. Future plans involve model enhancements and feature integrations.
LivePortrait: A fast, controllable portrait animation model
The GitHub repository contains the PyTorch implementation of the "LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control" project. It provides instructions for setup, inference, Gradio interface usage, speed evaluation, acknowledgements, and citations.
From the Tensor to Stable Diffusion
The GitHub repository offers a comprehensive machine learning guide covering deep learning, vision-language models, neural networks, CNNs, RNNs, and paper implementations like LeNet, AlexNet, ResNet, GRU, LSTM, CBOW, Skip-Gram, Transformer, and BERT. Ideal for exploring machine learning concepts.
TreeSeg: Hierarchical Topic Segmentation of Large Transcripts
Augmend is creating a platform to automate tribal knowledge for development teams, featuring the TreeSeg algorithm, which segments session data into chapters by analyzing audio transcriptions and semantic actions.