Natural language querying of visual information is fundamental to many real-world applications, motivating a broad range of vision-language tasks. Text-Image Retrieval (TIR) retrieves an image based on an image-level description, while Referring Expression Comprehension (REC) localizes a target object in an image using an instance-level description. However, neither can fully handle real-world demands where users often search for specific object instances across large galleries and expect both the relevant image and the exact object location.
We introduce a new task, Referring Expression Instance Retrieval (REIR), which jointly supports instance-level retrieval and localization based on fine-grained referring expressions. To support this task, we construct REIRCOCO, a large-scale benchmark with high-quality expressions generated by prompting vision-language foundation models on MSCOCO and RefCOCO.
We also propose CLARE (Contrastive Language-Instance Alignment with Relation Experts), a dual-stream model designed for REIR. It pairs a vision branch that extracts instance-level object features with a language branch enhanced by a Mix of Relation Experts (MORE) module. Retrieval and localization are jointly optimized with the Contrastive Language-Instance Alignment (CLIA) objective. Experiments show that CLARE outperforms strong baselines on REIR and generalizes well to TIR and REC.
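To make the alignment idea concrete, below is a minimal sketch of a symmetric instance-text contrastive loss in the style of CLIA. The function name `clia_style_loss`, the temperature value, and the symmetric InfoNCE formulation are illustrative assumptions; the paper's exact objective may differ.

```python
import numpy as np

def clia_style_loss(instance_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized embeddings.

    instance_emb: (N, D) instance-level visual features.
    text_emb:     (N, D) referring-expression features; row i of each
                  matrix is assumed to be a matched pair.
    NOTE: illustrative sketch, not the paper's exact CLIA loss.
    """
    # L2-normalize both embedding sets so dot products are cosine similarities.
    i = instance_emb / np.linalg.norm(instance_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = i @ t.T / temperature        # (N, N); matched pairs on the diagonal
    idx = np.arange(len(logits))

    def xent(l):
        # Numerically stable cross-entropy with diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average instance-to-text and text-to-instance directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned embeddings the diagonal dominates and the loss approaches zero; mismatched instance-expression pairs push it up, which is what drives joint retrieval and localization training.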
@article{hao2025referring,
title={Referring Expression Instance Retrieval and A Strong End-to-End Baseline},
author={Hao, Xiangzhao and Zhu, Kuan and Guo, Hongyu and Guo, Haiyun and Tang, Ming and Wang, Jinqiao},
journal={arXiv preprint arXiv:2506.18246},
year={2025}
}