Referring Expression Instance Retrieval and A Strong End-to-End Baseline

Xiangzhao Hao1, Kuan Zhu1, Hongyu Guo1, Haiyun Guo1, Ning Jiang2,
Quan Lu2, Ming Tang1, Jinqiao Wang1
1Institute of Automation, Chinese Academy of Sciences
2Mashang Consumer Finance Co., Ltd
Teaser

Comparison of three vision-language tasks and their datasets. REIR enables end-to-end instance-level retrieval and localization, addressing the limitations of Text-Image retrieval and Referring Expression Comprehension.

Introduction

Natural language querying of visual information is fundamental to many real-world applications, motivating a broad range of vision-language tasks. Text-Image Retrieval (TIR) retrieves an image based on an image-level description, while Referring Expression Comprehension (REC) localizes a target object in an image using an instance-level description. However, neither can fully handle real-world demands where users often search for specific object instances across large galleries and expect both the relevant image and the exact object location.

We introduce a new task, Referring Expression Instance Retrieval (REIR), which jointly supports instance-level retrieval and localization based on fine-grained referring expressions. To support this task, we construct REIRCOCO, a large-scale benchmark with high-quality expressions generated via prompting vision-language foundation models on MSCOCO and RefCOCO.

We also propose CLARE (Contrastive Language-Instance Alignment with Relation Experts), a dual-stream model designed for REIR. It leverages a vision branch for extracting instance-level object features and a language branch enhanced by a Mix of Relation Experts (MORE) module. Retrieval and localization are jointly optimized using a novel CLIA contrastive alignment objective. Experiments show CLARE outperforms strong baselines on REIR and generalizes well to TIR and REC tasks.

Dataset

Overall Structure

Overview of the REIRCOCO dataset construction pipeline. REIRCOCO is a large-scale benchmark specifically designed for instance-level retrieval and localization. It features uniquely aligned referring expressions for over 215,000 object instances in 30,000+ images, totaling 613,000 fine-grained descriptions. The dataset is constructed through a two-stage pipeline: In the generation stage, GPT-4o is prompted with structured inputs—including bounding boxes, category labels, captions, and object context—to generate diverse and referentially unique expressions. In the filtering stage, DeepSeek-VL verifies expression quality, retaining only unambiguous, grounded, and semantically accurate descriptions. This ensures that each expression matches exactly one object instance, making REIRCOCO highly suitable for both retrieval and localization tasks.
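The two-stage generate-then-filter pipeline described above can be sketched as follows. This is a minimal illustration, not the actual construction code: `call_gpt4o` and `call_deepseek_vl` are hypothetical stand-ins for the GPT-4o generation and DeepSeek-VL verification calls, and the instance fields mirror the structured inputs named in the caption.

```python
# Hypothetical sketch of REIRCOCO's two-stage pipeline: generation with a
# structured prompt, then filtering for unambiguous, grounded expressions.

def build_prompt(instance):
    """Structured input: bounding box, category, caption, and object context."""
    return (
        f"Image caption: {instance['caption']}\n"
        f"Target category: {instance['category']}\n"
        f"Target box (x, y, w, h): {instance['bbox']}\n"
        f"Context objects: {instance['context']}\n"
        "Write a referring expression that uniquely identifies the target."
    )

def call_gpt4o(prompt):
    # Assumption: the real API returns several candidate expressions.
    return ["the red mug on the left edge of the wooden table"]

def call_deepseek_vl(image, expression, bbox):
    # Assumption: the verifier keeps an expression only if it is
    # unambiguous, grounded, and semantically accurate for this box.
    return True

def generate_and_filter(instance, image=None):
    candidates = call_gpt4o(build_prompt(instance))
    return [e for e in candidates
            if call_deepseek_vl(image, e, instance["bbox"])]

instance = {"caption": "a kitchen table with two mugs",
            "category": "cup", "bbox": (12, 40, 60, 80),
            "context": ["table", "cup"]}
kept = generate_and_filter(instance)
```

Because the filter enforces that each surviving expression matches exactly one instance, every (expression, box) pair can serve as a positive for both retrieval and localization training.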

Method

Overall Structure

Overview of the proposed CLARE framework. CLARE (Contrastive Language-Instance Alignment with Relation Experts) is a dual-stream architecture designed for instance-level retrieval and localization. It processes vision and language in parallel while maintaining strong cross-modal alignment. On the visual side, a SigLIP encoder and a Deformable-DETR-based object extractor generate dense, context-aware object features. The language side uses a referring expression encoder enhanced by the Mix of Relation Experts (MORE) module, which injects semantic, spatial, and relational cues. Cross-image instance-level alignment is supervised by the CLIA objective, which extends SigLIP’s contrastive loss for object-level understanding. Focal Loss is further used to enhance discriminative instance selection within each image. Together, these components enable high-precision retrieval and localization across large-scale galleries.
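The CLIA objective extends SigLIP's pairwise sigmoid loss from image-text pairs to object-expression pairs. A minimal NumPy sketch of that idea, under the assumption of L2-normalized features and fixed temperature/bias (learnable in SigLIP), might look like this; it is an illustration of the loss family, not the paper's exact implementation:

```python
import numpy as np

def clia_sigmoid_loss(obj_feats, txt_feats, match, t=10.0, b=-10.0):
    """
    SigLIP-style pairwise sigmoid loss at the instance level (sketch).
    obj_feats: (N, D) L2-normalized object embeddings from the vision branch.
    txt_feats: (M, D) L2-normalized referring-expression embeddings.
    match:     (N, M) binary matrix; match[i, j] = 1 iff expression j
               refers to object instance i (unique by construction).
    t, b:      temperature and bias (learnable in SigLIP; fixed here).
    """
    logits = t * obj_feats @ txt_feats.T + b   # (N, M) pair scores
    z = 2.0 * match - 1.0                      # +1 for positives, -1 for negatives
    # -log sigmoid(z * logits), written as log(1 + exp(-z * logits))
    loss = np.log1p(np.exp(-z * logits))
    return loss.mean()

rng = np.random.default_rng(0)
obj = rng.normal(size=(4, 8))
obj /= np.linalg.norm(obj, axis=1, keepdims=True)
txt = obj.copy()            # toy case: perfectly aligned object/expression pairs
match = np.eye(4)
loss = clia_sigmoid_loss(obj, txt, match)
```

Because every object-expression pair gets its own sigmoid term, negatives come from all other instances in the batch, across images, which is what enables cross-image instance-level alignment; the Focal Loss mentioned above would act separately, on instance selection within each image.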

Results

REIR Performance Comparison

CLARE achieves state-of-the-art performance on the REIR benchmark. As illustrated in Table 1, CLARE consistently outperforms all two-stage baselines across various IoU thresholds and ranking metrics. Unlike conventional pipelines that separate retrieval and localization—leading to error accumulation—CLARE integrates both tasks into a single, unified framework. This end-to-end design enables more precise cross-modal alignment, improves robustness, and significantly reduces inference cost.
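The end-to-end design can be made concrete with a small sketch of REIR inference: one expression embedding is scored against every object instance in the gallery, so the ranking simultaneously decides which image to retrieve and which box localizes the target. Names and data here are illustrative only.

```python
import numpy as np

def reir_search(query_feat, gallery_feats, gallery_meta, top_k=1):
    """
    Sketch of unified REIR inference.
    query_feat:    (D,) L2-normalized expression embedding.
    gallery_feats: (G, D) L2-normalized embeddings of all gallery instances.
    gallery_meta:  list of (image_id, bbox), aligned with gallery_feats rows.
    Returns the top_k (image_id, bbox, score) triples from one ranking,
    so retrieval and localization cannot accumulate errors separately.
    """
    scores = gallery_feats @ query_feat            # cosine similarity
    order = np.argsort(-scores)[:top_k]
    return [(gallery_meta[i][0], gallery_meta[i][1], float(scores[i]))
            for i in order]

# Toy gallery: three instances spread over three images.
gallery_feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
gallery_meta = [("img_0", (0, 0, 10, 10)),
                ("img_1", (5, 5, 20, 20)),
                ("img_2", (1, 1, 4, 4))]
result = reir_search(np.array([0.0, 1.0]), gallery_feats, gallery_meta)
```

A two-stage pipeline would instead rank whole images first and localize afterwards, which is where the error accumulation noted above comes from.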

REC Benchmark Comparison

CLARE demonstrates competitive results on REC benchmarks (RefCOCO, RefCOCO+, RefCOCOg). Despite its dual-stream architecture, CLARE matches or outperforms strong one-stream and MLLM-based methods. Its success stems from a powerful instance-level contrastive alignment strategy, which enables effective semantic and spatial matching across the batch. CLARE excels on RefCOCO and RefCOCOg, where relational and spatial cues are crucial, and remains robust even on RefCOCO+, where such cues are reduced—showcasing strong generalization ability across diverse REC settings.

Qualitative Results


BibTeX

@article{hao2025referring,
  title={Referring Expression Instance Retrieval and A Strong End-to-End Baseline},
  author={Hao, Xiangzhao and Zhu, Kuan and Guo, Hongyu and Guo, Haiyun and Tang, Ming and Wang, Jinqiao},
  journal={arXiv preprint arXiv:2506.18246},
  year={2025}
}