📄 About Me

I am a master’s student at the Institute of Automation, Chinese Academy of Sciences (CASIA), affiliated with the National Laboratory of Pattern Recognition. I am supervised by Prof. Haiyun Guo and work under the leadership of Prof. Jinqiao Wang. I am also a member of the Zidongtaichu Foundation Model Research Center and am currently an intern on the Wenxin (ERNIE Bot) team at Baidu.

My research centers on two directions:

  • Multimodal Retrieval: I explore reasoning-enhanced retrieval paradigms, fine-grained instance-level retrieval, and composed image retrieval. Representative works include TRACE, PLUME, REIR/CLARE, UniFGVC, WISER, and ReCALL.
  • Unified Understanding-Generation Models: I investigate how to leverage generation capabilities to assist understanding in unified multimodal models, with applications in spatial reasoning and degraded image understanding. Representative works include COOPER and CLEAR.

🔥 News

  • 2026.02:  🎉🎉 Our paper “COOPER” was accepted to CVPR 2026!
  • 2026.02:  🎉🎉 Our paper “WISER” was accepted to CVPR 2026!
  • 2026.02:  🎉🎉 Our paper “ReCALL” was accepted to CVPR 2026!
  • 2025.07:  🎉🎉 Our paper “Referring Expression Instance Retrieval and A Strong End-to-End Baseline” was accepted to ACM MM 2025!
  • 2025.07: Started a research internship with the Baidu Wenxin (ERNIE Bot) team.
  • 2025.01: Joined Zidongtaichu for a research internship on foundation models.

📝 Publications

CVPR 2026

COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

Zefeng Zhang*, Xiangzhao Hao*, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, Tingwen Liu

  • We propose a cooperative perception-reasoning unified framework for spatial intelligence, where the model autonomously generates auxiliary visual information (e.g., depth maps) as part of a multimodal chain-of-thought.
  • We design an SFT+GRPO two-stage framework with a Cooperative Perception-Reasoning Reward, achieving a 6.91% average improvement on spatial reasoning benchmarks.
CVPR 2026

WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang

Code

  • We propose WISER, a training-free framework for zero-shot composed image retrieval with wider search, deeper thinking, and adaptive fusion.
  • WISER achieves a 45% relative improvement on CIRCO mAP@5 and a 57% relative improvement on CIRR Recall@1 over prior training-free methods, even surpassing many training-based approaches.
CVPR 2026

ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval

Tianyu Yang, Chenwei He, Xiangzhao Hao, Tianyue Wang, Jiarui Guo, Haiyun Guo, Leigang Qu, Jinqiao Wang, Tat-Seng Chua

Code

  • We propose ReCALL, an iterative training framework for MLLM-based composed image retrieval with hard negative mining and foundation model data augmentation.
  • Effectively recalibrates capability degradation in MLLMs for composed image retrieval tasks.
ACM MM 2025

Referring Expression Instance Retrieval and A Strong End-to-End Baseline

Xiangzhao Hao, Kuan Zhu, Hongyu Guo, Haiyun Guo, Ning Jiang, Quan Lu, Ming Tang, Jinqiao Wang

Project

  • We propose a novel multimodal task, Referring Expression Instance Retrieval (REIR), which aims to retrieve and localize a specific object instance from a gallery of images based on a natural language description.
  • We construct REIRCOCO, the first large-scale dataset for this task (30K+ images, 215K+ instances, 613K+ descriptions), and propose CLARE, an end-to-end dual-stream model.
TMM

UniFGVC: Universal Fine-Grained Vision Classification via Multimodal Retrieval

Hongyu Guo, Xiangzhao Hao*, Jiarui Guo, Haiyun Guo, Jinqiao Wang, Tat-Seng Chua

  • We reformulate few-shot fine-grained visual classification as a multimodal retrieval problem and design CDV-Captioner for CoT-guided discriminative descriptions.
  • UniFGVC surpasses the existing few-shot SOTA by 5.52% across 12 FGVC benchmarks and also outperforms multiple fully-supervised MLLM methods.

📋 Under Review

Under Review

TRACE: Reasoning-Guided Representation Learning for Universal Multimodal Retrieval

Xiangzhao Hao, Shijie Wang, Tianyu Yang, Tianyue Wang, Haiyun Guo, Jinqiao Wang

  • We propose a “reason-then-encode” retrieval paradigm. TRACE integrates task-adaptive CoT generation with discriminative representation learning in MLLMs, with a difficulty-aware routing strategy that autonomously decides whether to activate reasoning.
  • We build the large-scale M-BEIR-CoT dataset and achieve SOTA on the M-BEIR benchmark, with strong zero-shot transfer on 13 unseen datasets.
Under Review

PLUME: Latent Reasoning Based Universal Multimodal Embedding

Chenwei He*, Xiangzhao Hao*, Tianyu Yang, Yuxiang Ma, Yuheng Jia, Lingxiang Wu, Chaoyang Zhao, Haiyun Guo, Jinqiao Wang

Code | Project

  • We propose PLUME, which internalizes explicit CoT into latent reasoning trajectories with only 8 latent steps, achieving a 30.3x inference speedup over explicit CoT methods.
  • PLUME achieves an overall score of 61.6 on the MMEB-v2 benchmark (78 tasks), outperforming UME-R1 and VLM2Vec-V2, with significant gains on video (+1.9) and visual document (+3.6) tasks.

💻 Internships

  • 2025.07 - Present, Baidu - Wenxin (ERNIE Bot) Team, Beijing, China.
    • Participated in ERNIE Bot 5.0 pretraining: data processing, cleaning pipeline, and benchmark evaluation.
    • Worked on pretraining of the ernie-one unified understanding-generation model: architecture survey, development of data pipelines for thinking-then-generation and interleaved generation, and continued-pretraining experiments.
  • 2025.01 - 2026.06, Zidongtaichu - Foundation Model Research Center, Beijing, China.
    • Trained the image-text retrieval module of the Zidongtaichu embedding model.
    • Contributed to a local retrieval project, enabling efficient multimodal early warning in the Shiyukunchuan Large Model.

📖 Education

  • 2023.09 - 2026.06 (expected), M.Eng. in Pattern Recognition, Institute of Automation, University of Chinese Academy of Sciences. GPA: 3.76/4.00.
  • 2019.09 - 2023.06, B.Eng. in Computer Science and Technology (Second major: Mathematics and Applied Mathematics), School of Intelligence and Computing, Tianjin University. Ranked 6/139, GPA: 3.87/4.00.

💡 Patents

  • Zhu Kuan, Guo Haiyun, Hao Xiangzhao, Tang Ming, Wang Jinqiao. Multi-turn Image-Text Understanding and Localization Method and Device Based on Unified Multimodal and Multi-form Representations. CN202411282777.1 [P]. 2024-10-18.
  • Zhu Kuan, Guo Haiyun, Hao Xiangzhao, Tang Ming, Wang Jinqiao. Image-Text Information Processing Method, Apparatus, Device, Storage Medium, and Program Product. CN202411297843.2 [P]. 2025-02-11.

🎖 Honors and Awards