CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Xiangzhao Hao1*, Zefeng Zhang1*, Zhenyu Zhang1, Linhao Yu1, Yao Chen1, Yiqian Zhang1, Haiyun Guo1, Shuohuan Wang1, Yu Sun1
1Baidu Inc.
*Equal Contribution
Figure 1. Existing multimodal models suffer significant performance drops on degraded images. CLEAR learns to adaptively invoke image restoration before answering when degradation is severe.

Abstract

Unified multimodal models (UMMs) that integrate both visual understanding and generation have achieved remarkable success. However, their performance degrades substantially when encountering degraded images common in real-world scenarios (e.g., noise, blur, low resolution). We propose CLEAR, a framework that unlocks the generative potential of UMMs to enhance degraded image understanding. CLEAR introduces an interleaved reasoning paradigm where the model learns to adaptively decide whether to invoke image restoration before answering, through a three-stage training pipeline: (1) Supervised Fine-Tuning with corruption-aware interleaved reasoning data, (2) Bridge Training that maps denoised VAE latents directly into the LLM's understanding space via a latent representation bridge, and (3) Interleaved GRPO with multi-reward optimization (accuracy, format, decision, and latent quality rewards) that jointly optimizes reasoning, generation, and adaptive restoration decisions. We also introduce MMD-Bench, a comprehensive benchmark covering 16 corruption types at 3 severity levels. Experiments show that CLEAR significantly improves robustness to image degradation while maintaining strong performance on clean images, reducing the clean-to-degraded performance drop by 27% compared to the backbone model.

Method

CLEAR builds on the BAGEL unified multimodal framework and is trained with a three-stage pipeline:

  • Stage 1 — SFT: Corruption-aware supervised fine-tuning teaches the model to reason about degradation using interleaved <think> / <image_restore> / <answer> tokens.
  • Stage 2 — Bridge Training: A latent representation bridge maps denoised VAE latents directly into the LLM’s token space, avoiding costly decode-reencode.
  • Stage 3 — Interleaved GRPO: Group Relative Policy Optimization with four reward signals (accuracy, format, decision, latent quality) jointly optimizes the model’s reasoning, generation, and adaptive restoration decisions.
Figure 2. Overview of the CLEAR training pipeline. Stage 1 (SFT) trains corruption-aware interleaved reasoning with CE + MSE + Distill losses. Stage 3 (Interleaved GRPO) uses multi-reward optimization combining standard GRPO and Flow-GRPO.
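At inference time, the three stages above yield a model that interleaves reasoning, optional restoration, and answering. The control flow can be sketched in Python with a toy stub standing in for the real BAGEL-based model; every method name below is an illustrative assumption, not the actual API:

```python
class StubModel:
    """Toy stand-in: flags restoration only when degradation looks severe."""

    def think(self, image, question):
        # Stage-1 SFT teaches the real model to emit interleaved
        # <think> / <image_restore> tokens; here we fake that decision.
        if image.get("severity", 0.0) > 0.5:
            return "<think>heavy degradation detected</think><image_restore>"
        return "<think>image is readable</think>"

    def restore_latents(self, image):
        return {"latents": "denoised"}      # real model: denoised VAE latents

    def bridge(self, latents):
        return latents                      # real model: latent bridge (Stage 2)

    def encode(self, image):
        return image                        # real model: ViT/VAE visual tokens

    def answer(self, visual_tokens, question):
        return "answer"


def run(image, question, model):
    thought = model.think(image, question)
    if "<image_restore>" in thought:        # severe: restore before answering
        tokens = model.bridge(model.restore_latents(image))
    else:                                   # mild: answer directly
        tokens = model.encode(image)
    return model.answer(tokens, question)
```

The key point is the branch: restoration is a conditional step the model itself chooses, not a fixed preprocessing stage.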

Latent Representation Bridge

A key challenge in using generative restoration for understanding is bridging the generated latent representations back into the LLM’s input space. Prior approaches decode to pixels and re-encode, incurring significant overhead. CLEAR proposes a latent representation bridge that directly maps denoised VAE tokens into the LLM’s understanding space.

Figure 3. Left: Decode-Reencode requires full VAE decode + ViT/VAE re-encoding. Right: Our Latent Representation Bridge directly maps denoised latents into the LLM’s token space.
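At its core, the bridge is a learned projection from the VAE latent dimension into the LLM hidden dimension, applied per latent token, so no pixel-space decode and re-encode is needed. A dependency-free sketch of the shape bookkeeping (the actual bridge is a trained module; the random initialization here is purely illustrative):

```python
import random

def make_bridge(d_vae, d_llm, seed=0):
    # Weight matrix W with shape (d_llm, d_vae); in CLEAR's Stage 2 this
    # mapping is trained, not randomly initialized as here.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_vae)] for _ in range(d_llm)]

def bridge_tokens(latents, W):
    # Map each denoised VAE latent token (dim d_vae) into the LLM's
    # hidden space (dim d_llm) with a single matrix multiply per token.
    return [[sum(w * x for w, x in zip(row, tok)) for row in W]
            for tok in latents]

latents = [[0.1] * 16 for _ in range(4)]          # 4 latent tokens, VAE dim 16
llm_tokens = bridge_tokens(latents, make_bridge(16, 64))  # 4 tokens, dim 64
```

Compared with decode-reencode, the cost is one small projection per token instead of a full VAE decode plus ViT/VAE re-encoding of the restored image.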

Main Results

Results under Hard degradation. R-Bench-Dis is an existing degraded-image benchmark; the remaining six are from MMD-Bench.

| Method | MMBench | MM-Vet | MMVP | CV-Bench | MMStar | RealWorldQA | R-Bench-Dis | AVG |
|---|---|---|---|---|---|---|---|---|
| *Closed-source models* | | | | | | | | |
| GPT-4o-mini | 67.02 | 50.91 | 64.00 | 59.87 | 45.93 | 58.95 | 61.21 | 58.27 |
| GPT-4.1-mini | 76.08 | 51.88 | 71.00 | 74.96 | 60.73 | 72.41 | 72.52 | 68.51 |
| Gemini-2.5-Flash | 79.33 | 66.55 | 72.33 | 76.01 | 62.00 | 69.15 | 72.72 | 71.16 |
| *Open-source unified models* | | | | | | | | |
| Emu3 | 53.71 | 21.51 | 65.00 | 58.34 | 42.06 | 52.55 | 55.15 | 49.76 |
| Janus-Pro | 55.57 | 31.33 | 52.66 | 66.75 | 41.53 | 43.52 | 49.09 | 48.64 |
| Bagel | 67.88 | 45.09 | 65.66 | 64.81 | 55.53 | 58.43 | 61.64 | 60.15 |
| *CLEAR variants (Bagel backbone)* | | | | | | | | |
| Text-only CoT | 63.62 | 48.30 | 70.33 | 64.18 | 56.93 | 53.98 | 62.82 | 60.02 |
| CLEAR-SFT | 72.06 | 47.56 | 70.33 | 70.51 | 57.67 | 60.13 | 65.65 | 63.42 |
| CLEAR-RL | 72.52 | 51.97 | 71.33 | 72.25 | 60.67 | 61.05 | 67.07 | 65.26 |

Robustness Analysis

Clean and Hard scores are averaged over the six MMD-Bench benchmarks. Drop = Clean − Hard.

| Method | Clean | Hard | Drop (↓) |
|---|---|---|---|
| Bagel | 66.86 | 59.57 | 7.29 |
| CLEAR-SFT | 69.34 | 63.04 | 6.30 |
| CLEAR-RL | 70.27 | 64.96 | 5.31 |

Qualitative Results

CLEAR adaptively decides whether to invoke image restoration based on the severity of degradation. For mildly degraded images, the model answers directly. For heavily degraded images, the model triggers <image_restore> before answering.

Figure 4. Left: Easy degradation — the model answers directly without restoration. Right: Hard degradation — the model triggers restoration before answering correctly.
Figure 5. Additional examples showing CLEAR’s restoration-assisted reasoning pipeline. The model detects degradation, invokes image restoration, and then answers correctly.

Adaptive Restoration Behavior

As degradation severity increases from Low to High, CLEAR triggers image restoration more frequently, demonstrating the learned adaptive decision-making optimized through the decision reward in Interleaved GRPO.
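The decision reward can be pictured as rewarding agreement between the model's restore-or-skip choice and the actual degradation severity, combined with the other three signals into one scalar per rollout. A hedged sketch (the weights and the exact decision rule are our assumptions, not the paper's formulation):

```python
def decision_reward(restored: bool, severity: str) -> float:
    # Reward restoring on heavily degraded inputs and answering
    # directly on mild ones (assumed rule for illustration).
    should_restore = severity == "High"
    return 1.0 if restored == should_restore else 0.0

def total_reward(acc: float, fmt: float, dec: float, latent_q: float,
                 w=(1.0, 0.2, 0.3, 0.3)) -> float:
    # Weighted sum of the four Interleaved GRPO signals:
    # accuracy, format, decision, latent quality (weights are hypothetical).
    return w[0] * acc + w[1] * fmt + w[2] * dec + w[3] * latent_q
```

Under such a reward, unnecessary restoration on mild inputs is penalized, which is consistent with the triggering-rate trend in Figure 6.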

Figure 6. Generation triggering rate (%) and inference time across six benchmarks at Low, Mid, and High degradation levels.

MMD-Bench

We introduce MMD-Bench (Multimodal Model Degradation Benchmark), a comprehensive evaluation benchmark covering 16 corruption types across 4 categories (Capture, Transmission, Environment, Post-processing) at 3 severity levels (Low, Mid, High). MMD-Bench is constructed by applying controlled degradations to 6 existing VLM benchmarks.

Figure 7. Visualization of 16 corruption types at 3 severity levels used in MMD-Bench, organized by category: Capture (blue), Transmission (yellow), Environment (green), Post-processing (red).
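As a concrete illustration of severity-controlled corruption, here is a minimal sketch of one Capture-category degradation (Gaussian noise) applied at the three levels; the sigma-per-severity mapping is an assumption, not MMD-Bench's actual parameterization:

```python
import random

# Hypothetical noise strengths for the three MMD-Bench severity levels.
SEVERITY_SIGMA = {"Low": 5.0, "Mid": 15.0, "High": 30.0}

def add_gaussian_noise(pixels, severity, seed=0):
    """Corrupt a flat list of 8-bit pixel values with Gaussian noise,
    clamping the result back to the valid [0, 255] range."""
    rng = random.Random(seed)
    sigma = SEVERITY_SIGMA[severity]
    return [min(255, max(0, round(p + rng.gauss(0.0, sigma))))
            for p in pixels]
```

Applying the same construction across 16 corruption types and 6 source benchmarks yields the full MMD-Bench grid of Figure 7.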

BibTeX

@misc{hao2026clearunlockinggenerativepotential,
      title={CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models},
      author={Xiangzhao Hao and Zefeng Zhang and Zhenyu Zhang and Linhao Yu and Yao Chen and Yiqian Zhang and Haiyun Guo and Shuohuan Wang and Yu Sun},
      year={2026},
      eprint={2604.04780},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.04780},
}