Unified multimodal models (UMMs) that integrate both visual understanding and generation have achieved remarkable success. However, their performance degrades substantially when encountering degraded images common in real-world scenarios (e.g., noise, blur, low resolution). We propose CLEAR, a framework that unlocks the generative potential of UMMs to enhance degraded image understanding. CLEAR introduces an interleaved reasoning paradigm where the model learns to adaptively decide whether to invoke image restoration before answering, through a three-stage training pipeline: (1) Supervised Fine-Tuning with corruption-aware interleaved reasoning data, (2) Bridge Training that maps denoised VAE latents directly into the LLM's understanding space via a latent representation bridge, and (3) Interleaved GRPO with multi-reward optimization (accuracy, format, decision, and latent quality rewards) that jointly optimizes reasoning, generation, and adaptive restoration decisions. We also introduce MMD-Bench, a comprehensive benchmark covering 16 corruption types at 3 severity levels. Experiments show that CLEAR significantly improves robustness to image degradation while maintaining strong performance on clean images, reducing the clean-to-degraded performance drop by 27% compared to the backbone model.
Built on the BAGEL unified multimodal framework, CLEAR uses a three-stage training pipeline: (1) Supervised Fine-Tuning on corruption-aware interleaved reasoning data, (2) Bridge Training for the latent representation bridge, and (3) Interleaved GRPO with multi-reward optimization. The interleaved reasoning format is expressed with dedicated `<think>`, `<image_restore>`, and `<answer>` tokens, illustrated in the sketch below.
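A minimal sketch of what one corruption-aware interleaved trace might look like when assembled for SFT. Only the token names come from the paper; the serialization and the helper below are illustrative assumptions, not the released data format.

```python
# Hypothetical illustration of an interleaved reasoning trace.
# Token names mirror <think>/<image_restore>/<answer>; the exact
# serialization is an assumption for illustration only.

def build_trace(question: str, reasoning: str, needs_restore: bool, answer: str) -> str:
    """Assemble a corruption-aware interleaved trace as a single training string."""
    parts = [question, f"<think>{reasoning}</think>"]
    if needs_restore:
        # The model emits <image_restore> to request a restored image
        # before answering; restored image tokens would be interleaved
        # here at training time.
        parts.append("<image_restore>")
    parts.append(f"<answer>{answer}</answer>")
    return "\n".join(parts)

print(build_trace(
    question="What is written on the sign?",
    reasoning="The image is heavily blurred; restoring it first should help.",
    needs_restore=True,
    answer="STOP",
))
```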
A key challenge in using generative restoration for understanding is bridging the generated latent representations back into the LLM's input space. Prior approaches decode the generated latents to pixels and re-encode them, incurring significant overhead. CLEAR instead proposes a latent representation bridge that directly maps denoised VAE latent tokens into the LLM's understanding space.
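The bridge's architecture is not detailed here; the sketch below assumes a simple learned projection (LayerNorm + MLP) from denoised VAE latent tokens into the LLM's hidden size. All dimensions and the module name are placeholders, not CLEAR's actual implementation.

```python
import torch
import torch.nn as nn

class LatentBridge(nn.Module):
    """Hypothetical latent representation bridge: maps denoised VAE latent
    tokens directly into the LLM's understanding (embedding) space, avoiding
    a decode-to-pixels / re-encode round trip. Dimensions are placeholders."""

    def __init__(self, vae_dim: int = 16, llm_dim: int = 3584, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vae_dim),
            nn.Linear(vae_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, denoised_latents: torch.Tensor) -> torch.Tensor:
        # denoised_latents: (batch, h, w, vae_dim) latent grid from the diffusion head
        b, h, w, c = denoised_latents.shape
        tokens = denoised_latents.reshape(b, h * w, c)  # flatten to a token sequence
        return self.proj(tokens)                         # (batch, h*w, llm_dim)

bridge = LatentBridge()
fake_latents = torch.randn(1, 32, 32, 16)
print(bridge(fake_latents).shape)  # torch.Size([1, 1024, 3584])
```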
Results under Hard degradation. R-Bench-Dis is an existing degraded-image benchmark; the remaining six are from MMD-Bench.
| Method | MMBench | MM-Vet | MMVP | CV-Bench | MMStar | RealWorldQA | R-Bench-Dis | AVG |
|---|---|---|---|---|---|---|---|---|
| **Closed-source models** | | | | | | | | |
| GPT-4o-mini | 67.02 | 50.91 | 64.00 | 59.87 | 45.93 | 58.95 | 61.21 | 58.27 |
| GPT-4.1-mini | 76.08 | 51.88 | 71.00 | 74.96 | 60.73 | 72.41 | 72.52 | 68.51 |
| Gemini-2.5-Flash | 79.33 | 66.55 | 72.33 | 76.01 | 62.00 | 69.15 | 72.72 | 71.16 |
| **Open-source unified models** | | | | | | | | |
| Emu3 | 53.71 | 21.51 | 65.00 | 58.34 | 42.06 | 52.55 | 55.15 | 49.76 |
| Janus-Pro | 55.57 | 31.33 | 52.66 | 66.75 | 41.53 | 43.52 | 49.09 | 48.64 |
| Bagel | 67.88 | 45.09 | 65.66 | 64.81 | 55.53 | 58.43 | 61.64 | 60.15 |
| **CLEAR variants (Bagel backbone)** | | | | | | | | |
| Text-only CoT | 63.62 | 48.30 | 70.33 | 64.18 | 56.93 | 53.98 | 62.82 | 60.02 |
| CLEAR-SFT | 72.06 | 47.56 | 70.33 | 70.51 | 57.67 | 60.13 | 65.65 | 63.42 |
| CLEAR-RL | 72.52 | 51.97 | 71.33 | 72.25 | 60.67 | 61.05 | 67.07 | 65.26 |
Clean and Hard scores are averaged over the six MMD-Bench benchmarks. Drop = Clean − Hard.
| Method | Clean | Hard | Drop (↓) |
|---|---|---|---|
| Bagel | 66.86 | 59.57 | 7.29 |
| CLEAR-SFT | 69.34 | 63.04 | 6.30 |
| CLEAR-RL | 70.27 | 64.96 | 5.31 |
CLEAR adaptively decides whether to invoke image restoration based on the severity of degradation: for mildly degraded images, the model answers directly, while for heavily degraded images it triggers `<image_restore>` before answering.
As degradation severity increases from Low to High, CLEAR triggers image restoration more frequently, demonstrating the learned adaptive decision-making optimized through the decision reward in Interleaved GRPO.
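A hedged sketch of how the four reward terms could be combined per rollout. Only the set of components (accuracy, format, decision, latent quality) comes from the paper; the individual scoring rules and the weights in `clear_reward` are illustrative assumptions.

```python
# Illustrative combination of the four GRPO reward terms. Weights and the
# per-term definitions are assumptions, not the paper's exact formulation.

def clear_reward(pred: str, gold: str, trace_well_formed: bool,
                 restored: bool, is_heavily_degraded: bool,
                 latent_quality: float,
                 w=(1.0, 0.2, 0.3, 0.3)) -> float:
    r_acc = 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0
    r_fmt = 1.0 if trace_well_formed else 0.0
    # Decision reward: restore when the input is heavily degraded, skip otherwise.
    r_dec = 1.0 if restored == is_heavily_degraded else 0.0
    # Latent quality reward, e.g. similarity of restored vs. clean latents, in [0, 1].
    r_lat = latent_quality if restored else 0.0
    wa, wf, wd, wl = w
    return wa * r_acc + wf * r_fmt + wd * r_dec + wl * r_lat

print(clear_reward("stop", "STOP", trace_well_formed=True, restored=True,
                   is_heavily_degraded=True, latent_quality=0.8))
```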
We introduce MMD-Bench (Multimodal Model Degradation Benchmark), a comprehensive evaluation benchmark covering 16 corruption types across 4 categories (Capture, Transmission, Environment, Post-processing) at 3 severity levels (Low, Mid, High). MMD-Bench is constructed by applying controlled degradations to 6 existing VLM benchmarks.
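The corruption pipeline itself is not reproduced here; below is a minimal sketch of how severity-controlled degradations of this kind can be applied with Pillow/NumPy. The three corruption functions and their parameter grids per severity level are assumptions for illustration, not MMD-Bench's actual settings.

```python
import numpy as np
from PIL import Image, ImageFilter

# Severity-controlled degradations in the spirit of MMD-Bench (Low/Mid/High).
# Parameter grids below are placeholder assumptions, not the benchmark's values.
SEVERITY = {"low": 0, "mid": 1, "high": 2}

def gaussian_noise(img: Image.Image, level: str) -> Image.Image:
    sigma = [8, 20, 40][SEVERITY[level]]
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def gaussian_blur(img: Image.Image, level: str) -> Image.Image:
    radius = [1.5, 3.0, 6.0][SEVERITY[level]]
    return img.filter(ImageFilter.GaussianBlur(radius))

def low_resolution(img: Image.Image, level: str) -> Image.Image:
    factor = [2, 4, 8][SEVERITY[level]]
    w, h = img.size
    small = img.resize((max(1, w // factor), max(1, h // factor)), Image.BILINEAR)
    return small.resize((w, h), Image.BILINEAR)
```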
```bibtex
@misc{hao2026clearunlockinggenerativepotential,
  title={CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models},
  author={Xiangzhao Hao and Zefeng Zhang and Zhenyu Zhang and Linhao Yu and Yao Chen and Yiqian Zhang and Haiyun Guo and Shuohuan Wang and Yu Sun},
  year={2026},
  eprint={2604.04780},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.04780},
}
```