SAM tried to build a foundation model for segmentation, which consists of three interconnected parts.
promptable segmentation task
prompts can be fed into the model in various forms, e.g. points, box rectangles, free form texts or just masks.
prompts and image labels guide the model to produce reliable and precise segmentation masks.
In order to achieve real time or amortized real-time inference, the model leverages an image encoder to encode the image into image embeddings that are readily to be queried, while various kind of queries are encoded by the prompt encoder under an uniform interface.
The image encoder is huge, so image encoding is slow. Because of the adoption of the prompt-image two paths architecture, the image can be encoded before queries, and the image embeddings can be reused for all following queries without the need to re-encode, so the expensive time cost of encoding can be amortized.
image encodings and prompt encoding are fed into a lightweight decoder that produce segmentation maps.
data engine that aggregates large scale data annotations
data accumulation undergoes three phases
- manually assisted
 Kirillov, Alexander, et al. “Segment anything.” arXiv preprint arXiv:2304.02643 (2023).