Apr 28, 2026 | arXiv:2604.25457

GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution

Fabio D'Oronzio, Federico Putamorsi, Leonardo Zini, Marcella Cornia, L. Baraldi

AI Summary

This paper introduces GramSR, a one-step diffusion-based super-resolution framework that replaces text conditioning with dense visual features extracted from the low-resolution input by a pre-trained DINOv3 encoder. GramSR employs a three-stage LoRA architecture in which pixel-level, semantic-level, and texture-level modules are trained sequentially, each with its own loss, to address degradation removal, perceptual detail enhancement, and texture preservation, respectively. Experiments on standard SR benchmarks show that GramSR outperforms existing one-step diffusion-based methods in structural fidelity and texture realism.

Key Contribution

Replacing text conditioning with dense, spatially aligned visual features from a pre-trained DINOv3 encoder improves structural fidelity and texture realism over existing one-step diffusion-based super-resolution methods.

Abstract

Despite recent advances, single-image super-resolution (SR) remains challenging, especially in real-world scenarios with complex degradations. Diffusion-based SR methods, particularly those built on Stable Diffusion, leverage strong generative priors but commonly rely on text conditioning derived from semantic captioning. Such textual descriptions provide only high-level semantics and lack the spatially aligned visual information required for faithful restoration, leading to a representation gap between abstract semantics and spatially aligned visual details. To address this limitation, we propose GramSR, a one-step diffusion-based SR framework that replaces text conditioning with dense visual features extracted from the low-resolution input using a pre-trained DINOv3 encoder. GramSR adopts a three-stage LoRA architecture, where pixel-level, semantic-level, and texture-level LoRA modules are trained sequentially. The pixel-level module focuses on degradation removal using $\ell_2$ loss, the semantic-level module enhances perceptual details via LPIPS and CSD losses, and the texture-level module enforces feature correlation consistency through a Gram matrix loss computed from DINOv3 features. At inference, independent guidance scales enable flexible control over degradation removal, semantic enhancement, and texture preservation. Extensive experiments on standard SR benchmarks demonstrate that GramSR consistently outperforms existing one-step diffusion-based methods, achieving superior structural fidelity and texture realism. The code for this work is available at: https://github.com/aimagelab/GramSR.
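To make the texture-level objective concrete, the following is a minimal sketch of a Gram matrix loss over encoder features, written in plain Python for readability. It is not the authors' implementation: the function names `gram_matrix` and `gram_loss`, the normalization by the number of tokens, and the mean-squared comparison are all assumptions made for illustration, and the feature maps are stand-ins for what a DINOv3 encoder would produce.

```python
from typing import List

Matrix = List[List[float]]


def gram_matrix(features: Matrix) -> Matrix:
    """Channel-by-channel correlations of an (N tokens x C channels) feature map.

    Assumption: the Gram matrix is normalized by the number of tokens.
    """
    n_tokens = len(features)
    n_channels = len(features[0])
    return [
        [
            sum(row[i] * row[j] for row in features) / n_tokens
            for j in range(n_channels)
        ]
        for i in range(n_channels)
    ]


def gram_loss(pred_features: Matrix, target_features: Matrix) -> float:
    """Mean squared difference between two Gram matrices (hypothetical form)."""
    g_pred = gram_matrix(pred_features)
    g_target = gram_matrix(target_features)
    n_channels = len(g_pred)
    total = sum(
        (g_pred[i][j] - g_target[i][j]) ** 2
        for i in range(n_channels)
        for j in range(n_channels)
    )
    return total / (n_channels * n_channels)


# Toy features: 2 tokens, 2 channels each.
restored = [[1.0, 0.0], [0.0, 2.0]]
reference = [[1.0, 0.0], [0.0, 0.0]]
loss = gram_loss(restored, reference)
```

Because the Gram matrix sums over tokens, it discards spatial position and keeps only how strongly feature channels co-occur, which is why a loss of this kind is commonly associated with texture statistics rather than exact pixel alignment.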

Architecture Design (Transformers, SSMs, MoE), Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 48

Year: 2026

Venue: N/A
