Grounding DINO Explained

A Small Step Towards Open-Set Object Detection

Aug 22, 2023

Open-set detection leverages the learning of language-aware region embeddings, so that each region can be classified into novel categories in a language-aware semantic space. [1]
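To make the idea of classifying a region in a language-aware space concrete, here is a minimal sketch. It assumes that region embeddings and category-name embeddings already live in the same vector space, and it uses cosine similarity as the matching score, which is an assumption made for illustration. The names `cosine` and `classify_region` and the toy vectors are hypothetical, not taken from any library or from the papers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def classify_region(region_embedding, text_embeddings):
    """Return the category name whose text embedding best matches the region."""
    return max(
        text_embeddings,
        key=lambda name: cosine(region_embedding, text_embeddings[name]),
    )

# Toy, hand-made embeddings (hypothetical); real ones come from trained encoders.
text_embeddings = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.1, 1.0, 0.0],
    "zebra": [0.0, 0.1, 1.0],
}

region = [0.1, 0.2, 0.9]
label = classify_region(region, text_embeddings)
print(label)
```

Because the categories are just entries in a dictionary of text embeddings, a new category can be added at inference time by embedding its name, without retraining a fixed classification head.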


Key points:

The success of Grounding DINO [2] is attributed to the effective fusion of the vision and language modalities from an early stage of the model, in three places:

  1. Feature enhancer
  2. Language-guided query selection
  3. Cross-modality decoder for cross-modality fusion
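Of the three components, language-guided query selection is the easiest to sketch in a few lines. As I understand the description in [2], each image token is scored by its highest dot product with any text token, and the indices of the top-scoring image tokens are used to initialize the decoder queries. The version below is a simplified, single-image sketch using plain Python lists instead of batched tensors. The names `dot` and `select_queries` and the toy features are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def select_queries(image_features, text_features, num_query):
    """Return indices of the num_query image tokens most relevant to the text.

    image_features: list of image-token vectors
    text_features:  list of text-token vectors (same dimension)
    """
    # Score each image token by its best match over all text tokens.
    scores = [
        max(dot(img, txt) for txt in text_features)
        for img in image_features
    ]
    # Rank token indices by descending score and keep the top num_query.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_query]

# Toy features (hypothetical): 4 image tokens and 2 text tokens, dimension 2.
image_features = [[1.0, 0.0], [0.0, 1.0], [0.25, 0.25], [-1.0, 0.0]]
text_features = [[1.0, 0.0], [0.0, 2.0]]

top_idx = select_queries(image_features, text_features, num_query=2)
print(top_idx)
```

Taking the maximum over text tokens means an image token is kept if it matches any single word in the prompt strongly, rather than needing to match the prompt as a whole.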


[1] Zhang, Hao, et al. “DINO: DETR with improved denoising anchor boxes for end-to-end object detection.” arXiv preprint arXiv:2203.03605 (2022).

[2] Liu, Shilong, et al. “Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection.” arXiv preprint arXiv:2303.05499 (2023).