Grounding DINO Explained
Open-set detection relies on learning language-aware region embeddings, so that each region can be classified into novel categories in a language-aware semantic space.[1]
TBC
The success of Grounding DINO [2] is attributed to the effective fusion of the vision and language modalities from an early stage of the model. The fusion takes place in three components:
- Feature enhancement
- Language-guided query selection
- Cross-modality decoder for cross-modality fusion
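To make the second component more concrete: as described in the Grounding DINO paper [2], language-guided query selection scores every image token by its best dot-product match against the text tokens and keeps the top-scoring tokens to initialise the decoder queries. The snippet below is a minimal sketch of that idea in plain Python rather than a tensor library. The function names (`dot`, `select_queries`) and the toy feature vectors are illustrative assumptions, not part of the official implementation.

```python
def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_queries(image_features, text_features, num_query):
    """Return the indices of the num_query image tokens that best match the text.

    image_features: list of image-token feature vectors
    text_features:  list of text-token feature vectors (same dimension)
    """
    # Score each image token by its highest similarity to any text token.
    scores = [
        max(dot(img, txt) for txt in text_features)
        for img in image_features
    ]
    # Rank image-token indices by score, best first, and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_query]


# Toy example (hypothetical values): 4 image tokens, 2 text tokens, dimension 2.
image_features = [
    [1.0, 0.0],  # token 0
    [0.0, 1.0],  # token 1
    [0.1, 0.1],  # token 2
    [0.9, 0.8],  # token 3
]
text_features = [
    [1.0, 0.0],
    [0.0, 0.5],
]

selected = select_queries(image_features, text_features, num_query=2)
print(selected)  # [0, 3]
```

In this toy case tokens 0 and 3 are selected because they align most strongly with the first text token. In a real model the same scoring would run over batched tensors of encoder outputs, and the selected indices would be used to gather the features that seed the decoder queries.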
References
[1] Zhang, Hao, et al. “DINO: DETR with improved denoising anchor boxes for end-to-end object detection.” arXiv preprint arXiv:2203.03605 (2022).
[2] Liu, Shilong, et al. “Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection.” arXiv preprint arXiv:2303.05499 (2023).