CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Xiangzhao Hao1*, Zefeng Zhang1*, Zhenyu Zhang1, Linhao Yu1, Yao Chen1, Yiqian Zhang1, Haiyun Guo1, Shuohuan Wang1, Yu Sun1
1Baidu Inc.
*Equal Contribution
Figure 1. Existing multimodal models suffer significant performance drops on degraded images. CLEAR learns to adaptively invoke image restoration before answering when degradation is severe.

Abstract

Unified multimodal models (UMMs) that integrate both visual understanding and generation have achieved remarkable success. However, their performance degrades substantially when encountering degraded images common in real-world scenarios (e.g., noise, blur, low resolution). We propose CLEAR, a framework that unlocks the generative potential of UMMs to enhance degraded image understanding. CLEAR introduces an interleaved reasoning paradigm where the model learns to adaptively decide whether to invoke image restoration before answering, through a three-stage training pipeline: (1) Supervised Fine-Tuning with corruption-aware interleaved reasoning data, (2) Bridge Training that maps denoised VAE latents directly into the LLM's understanding space via a latent representation bridge, and (3) Interleaved GRPO with multi-reward optimization (accuracy, format, decision, and latent quality rewards) that jointly optimizes reasoning, generation, and adaptive restoration decisions. We also introduce MMD-Bench, a comprehensive benchmark covering 16 corruption types at 3 severity levels. Experiments show that CLEAR significantly improves robustness to image degradation while maintaining strong performance on clean images, reducing the clean-to-degraded performance drop by 27% compared to the backbone model.

Method

CLEAR builds on the BAGEL unified multimodal framework and is trained with a three-stage pipeline:

  • Stage 1 — SFT: Corruption-aware supervised fine-tuning teaches the model to reason about degradation using interleaved <think> / <image_restore> / <answer> tokens.
  • Stage 2 — Bridge Training: A latent representation bridge maps denoised VAE latents directly into the LLM’s token space, avoiding costly decode-reencode.
  • Stage 3 — Interleaved GRPO: Group Relative Policy Optimization with four reward signals (accuracy, format, decision, latent quality) jointly optimizes the model’s reasoning, generation, and adaptive restoration decisions.
Figure 2. Overview of the CLEAR training pipeline. Stage 1 (SFT) trains corruption-aware interleaved reasoning with CE + MSE + Distill losses. Stage 3 (Interleaved GRPO) uses multi-reward optimization combining standard GRPO and Flow-GRPO.
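At inference time, the three stages above yield a model that interleaves reasoning, optional restoration, and answering. The control flow can be sketched in Python with a toy stub standing in for the real BAGEL-based model; every method name below is an illustrative assumption, not the actual API:

```python
class StubModel:
    """Toy stand-in: flags restoration only when degradation looks severe."""

    def think(self, image, question):
        # Stage-1 SFT teaches the real model to emit interleaved
        # <think> / <image_restore> tokens; here we fake that decision.
        if image.get("severity", 0.0) > 0.5:
            return "<think>heavy degradation detected</think><image_restore>"
        return "<think>image is readable</think>"

    def restore_latents(self, image):
        return {"latents": "denoised"}      # real model: denoised VAE latents

    def bridge(self, latents):
        return latents                      # real model: latent bridge (Stage 2)

    def encode(self, image):
        return image                        # real model: ViT/VAE visual tokens

    def answer(self, visual_tokens, question):
        return "answer"


def run(image, question, model):
    thought = model.think(image, question)
    if "<image_restore>" in thought:        # severe: restore before answering
        tokens = model.bridge(model.restore_latents(image))
    else:                                   # mild: answer directly
        tokens = model.encode(image)
    return model.answer(tokens, question)
```

The key point is the branch: restoration is a conditional step the model itself chooses, not a fixed preprocessing stage.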

Latent Representation Bridge

A key challenge in using generative restoration for understanding is bridging the generated latent representations back into the LLM’s input space. Prior approaches decode to pixels and re-encode, incurring significant overhead. CLEAR proposes a latent representation bridge that directly maps denoised VAE tokens into the LLM’s understanding space.

Figure 3. Left: Decode-Reencode requires full VAE decode + ViT/VAE re-encoding. Right: Our Latent Representation Bridge directly maps denoised latents into the LLM’s token space.
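At its core, the bridge is a learned projection from the VAE latent dimension into the LLM hidden dimension, applied per latent token, so no pixel-space decode and re-encode is needed. A dependency-free sketch of the shape bookkeeping (the actual bridge is a trained module; the random initialization here is purely illustrative):

```python
import random

def make_bridge(d_vae, d_llm, seed=0):
    # Weight matrix W with shape (d_llm, d_vae); in CLEAR's Stage 2 this
    # mapping is trained, not randomly initialized as here.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_vae)] for _ in range(d_llm)]

def bridge_tokens(latents, W):
    # Map each denoised VAE latent token (dim d_vae) into the LLM's
    # hidden space (dim d_llm) with a single matrix multiply per token.
    return [[sum(w * x for w, x in zip(row, tok)) for row in W]
            for tok in latents]

latents = [[0.1] * 16 for _ in range(4)]          # 4 latent tokens, VAE dim 16
llm_tokens = bridge_tokens(latents, make_bridge(16, 64))  # 4 tokens, dim 64
```

Compared with decode-reencode, the cost is one small projection per token instead of a full VAE decode plus ViT/VAE re-encoding of the restored image.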

Main Results

Results under Hard degradation. R-Bench-Dis is an existing degraded-image benchmark; the remaining six are from MMD-Bench.

| Method | MMBench | MM-Vet | MMVP | CV-Bench | MMStar | RealWorldQA | R-Bench-Dis | AVG |
|---|---|---|---|---|---|---|---|---|
| *Closed-source models* | | | | | | | | |
| GPT-4o-mini | 67.02 | 50.91 | 64.00 | 59.87 | 45.93 | 58.95 | 61.21 | 58.27 |
| GPT-4.1-mini | 76.08 | 51.88 | 71.00 | 74.96 | 60.73 | 72.41 | 72.52 | 68.51 |
| Gemini-2.5-Flash | 79.33 | 66.55 | 72.33 | 76.01 | 62.00 | 69.15 | 72.72 | 71.16 |
| *Open-source unified models* | | | | | | | | |
| Emu3 | 53.71 | 21.51 | 65.00 | 58.34 | 42.06 | 52.55 | 55.15 | 49.76 |
| Janus-Pro | 55.57 | 31.33 | 52.66 | 66.75 | 41.53 | 43.52 | 49.09 | 48.64 |
| Bagel | 67.88 | 45.09 | 65.66 | 64.81 | 55.53 | 58.43 | 61.64 | 60.15 |
| *CLEAR variants (Bagel backbone)* | | | | | | | | |
| Text-only CoT | 63.62 | 48.30 | 70.33 | 64.18 | 56.93 | 53.98 | 62.82 | 60.02 |
| CLEAR-SFT | 72.06 | 47.56 | 70.33 | 70.51 | 57.67 | 60.13 | 65.65 | 63.42 |
| CLEAR-RL | 72.52 | 51.97 | 71.33 | 72.25 | 60.67 | 61.05 | 67.07 | 65.26 |

Robustness Analysis

Clean and Hard scores are averaged over the six MMD-Bench benchmarks. Drop = Clean − Hard.

| Method | Clean | Hard | Drop (↓) |
|---|---|---|---|
| Bagel | 66.86 | 59.57 | 7.29 |
| CLEAR-SFT | 69.34 | 63.04 | 6.30 |
| CLEAR-RL | 70.27 | 64.96 | 5.31 |

Qualitative Results

CLEAR adaptively decides whether to invoke image restoration based on the severity of degradation. For mildly degraded images, the model answers directly. For heavily degraded images, the model triggers <image_restore> before answering.

Figure 4. Left: Easy degradation — the model answers directly without restoration. Right: Hard degradation — the model triggers restoration before answering correctly.
Figure 5. Additional examples showing CLEAR’s restoration-assisted reasoning pipeline. The model detects degradation, invokes image restoration, and then answers correctly.

Adaptive Restoration Behavior

As degradation severity increases from Low to High, CLEAR triggers image restoration more frequently, demonstrating the learned adaptive decision-making optimized through the decision reward in Interleaved GRPO.
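The decision reward can be pictured as rewarding agreement between the model's restore-or-skip choice and the actual degradation severity, combined with the other three signals into one scalar per rollout. A hedged sketch (the weights and the exact decision rule are our assumptions, not the paper's formulation):

```python
def decision_reward(restored: bool, severity: str) -> float:
    # Reward restoring on heavily degraded inputs and answering
    # directly on mild ones (assumed rule for illustration).
    should_restore = severity == "High"
    return 1.0 if restored == should_restore else 0.0

def total_reward(acc: float, fmt: float, dec: float, latent_q: float,
                 w=(1.0, 0.2, 0.3, 0.3)) -> float:
    # Weighted sum of the four Interleaved GRPO signals:
    # accuracy, format, decision, latent quality (weights are hypothetical).
    return w[0] * acc + w[1] * fmt + w[2] * dec + w[3] * latent_q
```

Under such a reward, unnecessary restoration on mild inputs is penalized, which is consistent with the triggering-rate trend in Figure 6.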

Figure 6. Generation triggering rate (%) and inference time across six benchmarks at Low, Mid, and High degradation levels.

MMD-Bench

We introduce MMD-Bench (Multimodal Model Degradation Benchmark), a comprehensive evaluation benchmark covering 16 corruption types across 4 categories (Capture, Transmission, Environment, Post-processing) at 3 severity levels (Low, Mid, High). MMD-Bench is constructed by applying controlled degradations to 6 existing VLM benchmarks.

Figure 7. Visualization of 16 corruption types at 3 severity levels used in MMD-Bench, organized by category: Capture (blue), Transmission (yellow), Environment (green), Post-processing (red).
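As a concrete illustration of severity-controlled corruption, here is a minimal sketch of one Capture-category degradation (Gaussian noise) applied at the three levels; the sigma-per-severity mapping is an assumption, not MMD-Bench's actual parameterization:

```python
import random

# Hypothetical noise strengths for the three MMD-Bench severity levels.
SEVERITY_SIGMA = {"Low": 5.0, "Mid": 15.0, "High": 30.0}

def add_gaussian_noise(pixels, severity, seed=0):
    """Corrupt a flat list of 8-bit pixel values with Gaussian noise,
    clamping the result back to the valid [0, 255] range."""
    rng = random.Random(seed)
    sigma = SEVERITY_SIGMA[severity]
    return [min(255, max(0, round(p + rng.gauss(0.0, sigma))))
            for p in pixels]
```

Applying the same construction across 16 corruption types and 6 source benchmarks yields the full MMD-Bench grid of Figure 7.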

BibTeX

@misc{hao2026clearunlockinggenerativepotential,
      title={CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models},
      author={Xiangzhao Hao and Zefeng Zhang and Zhenyu Zhang and Linhao Yu and Yao Chen and Yiqian Zhang and Haiyun Guo and Shuohuan Wang and Yu Sun},
      year={2026},
      eprint={2604.04780},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.04780},
}