1 Introduction

Text-to-image retrieval (TIR) aims to return the most relevant images from a gallery given a textual query. Recent progress in this task has been largely driven by vision–language models (VLMs) (Jia et al., 2021; Yang et al., 2022; Li et al., 2023; Yang et al., 2025; Lu et al., 2025a; Huang et al., 2025), which learn joint representations of text and images through large-scale pretraining on web-scale image–text pairs (Schuhmann et al., 2021; 2022; Liu et al., 2023). These models significantly narrow the semantic gap between modalities and achieve strong alignment across diverse benchmarks (Ilharco et al., 2021; Singh et al., 2022; Li et al., 2024; Lu et al., 2025b; Dong et al., 2026).

Despite these advances, retrieval performance often degrades in realistic scenarios where user queries are very short (typically just one or two words, e.g., “a dog”). Short queries encode only limited semantics, which results in large, ambiguous search subspaces and less discriminative results. This issue becomes more pronounced in large-scale galleries, where underspecified queries yield many candidate matches and cause semantic collisions among visually diverse results.

Another limitation of existing retrieval systems is their singular focus on semantic alignment. Naïve retrieval approaches simply return the top-$k$ images with the highest similarity scores, overlooking other critical aspects of user satisfaction such as aesthetics, interestingness, or popularity (Yi et al., 2023; Abdullahu and Grabner, 2024; Wang et al., 2025). In practice, retrieval quality is context-dependent: art students may prefer visually inspiring images, architects may seek unique and creative references, and shoppers may favor popular or visually appealing products. However, conventional retrieval systems lack mechanisms for steering retrieval toward these quality dimensions.

To address these limitations, we introduce the task of quality-controllable retrieval (QCR). Formally, given a frozen VLM and a short textual query, the objective is to retrieve images that not only align semantically but also satisfy user-specified quality requirements. This setting is feasible because short queries occupy a broad region of the embedding space that contains images of varying perceptual quality. With appropriate conditioning, this region can be partitioned into perceptually distinct subsets, enabling fine-grained, quality-aware retrieval. In this work, we define retrieval quality along two widely applicable dimensions: relevance (semantic consistency) (Cherti et al., 2023) and aesthetics (visual appeal) (Yi et al., 2023). For each image in the gallery, we construct auxiliary annotations consisting of a textual description, a relevance score, and an aesthetic score. We discretize these continuous scores into categorical quality levels and associate each description with its corresponding quality condition.

The central challenge is how to steer retrieval results toward specific quality levels given short queries. We propose a simple yet effective solution: quality-conditioned query completion (QCQC). QCQC enriches short queries with quality-aware details by leveraging a generative large language model (LLM). Trained on the quality-augmented dataset, the LLM learns to append appropriate descriptive phrases that capture both semantic and quality-related attributes. By conditioning on distinct quality levels, QCQC generates targeted query completions that steer retrieval toward specific regions of the embedding space.
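To make this conditioning setup concrete, the following is a minimal sketch, under assumptions not stated in the text, of how auxiliary annotations could be turned into quality-conditioned training pairs for the completion model. The quantile-based binning, the level names, the prompt template, and the example records are illustrative placeholders, not the actual construction used in this work.

```python
import numpy as np

# Auxiliary annotations: one (query, description, relevance score, aesthetic score) per gallery image.
# The records below are illustrative placeholders, not actual dataset entries.
annotations = [
    {"query": "a dog", "description": "a golden retriever running on a sunlit beach", "relevance": 0.91, "aesthetic": 6.8},
    {"query": "a dog", "description": "a blurry photo of a dog indoors",               "relevance": 0.55, "aesthetic": 3.1},
    {"query": "a dog", "description": "a dog sketch on lined notebook paper",          "relevance": 0.74, "aesthetic": 4.9},
]

def discretize(scores, levels=("low", "medium", "high")):
    """Bin continuous scores into categorical quality levels via quantiles.
    The binning rule is an assumption; the paper only states that scores are discretized."""
    edges = np.quantile(scores, np.linspace(0, 1, len(levels) + 1)[1:-1])
    return [levels[int(np.searchsorted(edges, s))] for s in scores]

rel_levels = discretize([a["relevance"] for a in annotations])
aes_levels = discretize([a["aesthetic"] for a in annotations])

# Each description is paired with its quality condition, yielding (prompt, target) pairs
# on which a generative LLM could be finetuned for quality-conditioned query completion.
for ann, rel, aes in zip(annotations, rel_levels, aes_levels):
    prompt = f"Query: {ann['query']} | relevance: {rel} | aesthetics: {aes} | Complete the query:"
    target = ann["description"]
    print(prompt, "->", target)
```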
This quality-conditioned completion capability is particularly valuable in practice, as users often struggle to articulate quality preferences or may lack a clear understanding of what constitutes “high” or “low” quality within a dataset. By modeling how textual descriptions vary across quality levels, our approach bridges this gap and enables more controllable, quality-aware retrieval through conditioned query completion. Our key contributions can be summarized as follows:

• A new problem: we introduce quality-controllable retrieval, a new setting in which retrieval can be explicitly conditioned on user-defined quality requirements.

• A general solution: we propose QCQC, a generative query completion framework that leverages LLMs to enrich short queries with quality-aware descriptive details.

• Validation: we conduct extensive experiments showing that QCQC effectively steers retrieval outcomes according to quality preferences and is compatible with multiple VLMs.

2 Preliminaries

2.1 Motivation

We study the problem of text-to-image retrieval, where the goal is to return the desired images from a large gallery given a set of natural-language queries. Specifically, let $\mathcal{Q} \coloneqq \{Q_{1},\dots,Q_{m}\}$ denote a collection of $m$ text queries and $\mathcal{I} \coloneqq \{I_{1},\dots,I_{n}\}$ an image gallery of size $n$. We consider a state-of-the-art VLM as the retrieval backbone, equipped with a text encoder $g:\mathcal{Q}\to\mathbb{R}^{d}$ and an image encoder $f:\mathcal{I}\to\mathbb{R}^{d}$, both producing $d$-dimensional normalized embeddings. Given a query set $\mathcal{Q}$, the system returns the top-$\eta$ relevant images according to

$$\mathcal{X} \coloneqq \mathrm{sort}\big(f(\mathcal{I}),\ g(\mathcal{Q}),\ \eta\big), \qquad (1)$$

where $\mathcal{X}\subseteq\mathcal{I}$ denotes the top-$\eta$ matches for the queries $\mathcal{Q}$. The $\mathrm{sort}$ function typically operates on the similarity scores $\bm{S}\in\mathbb{R}^{m\times n}$ with $\bm{S}_{ij}\coloneqq g(Q_{i})^{\top}f(I_{j})$.

Although modern VLMs achieve strong cross-modal alignment, retrieval performance deteriorates in realistic scenarios where user queries are usually very short (typically just one or two words, e.g., “a dog”). Given such short queries, a naïve retrieval system faces several challenges. ① Semantic ambiguity: a few words can refer to a wide range of possible images, leading to a large and diffuse search subspace with less discriminative retrieval results. ② Semantic collisions: short queries tend to yield close similarity scores for visually diverse images; these collisions confuse ranking and are particularly problematic in large-scale galleries where many candidate images match the vague query. ③ Lack of quality control: the quality of retrieved images is not explicitly enforced during retrieval. At best, one can apply post-retrieval filtering, but the system itself provides no mechanism to ensure that high-quality results consistently appear among the top matches. These issues highlight a fundamental gap between the expressive capacity of modern VLMs and the underspecified nature of user queries, motivating the need for query enrichment and controllable retrieval mechanisms.
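As a reference point for Eq. (1), here is a minimal sketch of the baseline similarity-based ranking. The encoder calls are stubbed with random normalized vectors standing in for $g(\mathcal{Q})$ and $f(\mathcal{I})$, since no particular backbone VLM is assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d, eta = 4, 1000, 512, 10   # number of queries, gallery size, embedding dim, top-eta

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for g(Q) and f(I): in practice these come from the frozen VLM's text/image encoders.
A = l2_normalize(rng.standard_normal((m, d)))   # text embeddings g(Q_i)
C = l2_normalize(rng.standard_normal((n, d)))   # image embeddings f(I_j)

# Similarity scores S_ij = g(Q_i)^T f(I_j); the sort in Eq. (1) operates on these.
S = A @ C.T                                     # shape (m, n)

# Top-eta gallery indices per query, ranked by descending similarity.
topk = np.argsort(-S, axis=1)[:, :eta]
print(topk.shape)                               # (4, 10)
```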
2.2 Problem Setting

To address the above limitations, we propose to enrich short queries with additional descriptive details that potentially capture more distinguishable attributes of images. Formally, let $h$ denote a query completion function that maps $\mathcal{Q}$ to enriched queries $h(\mathcal{Q})$. Retrieval is then performed as

$$\widetilde{\mathcal{X}} \coloneqq \mathrm{sort}\big(f(\mathcal{I}),\ g(h(\mathcal{Q})),\ \eta\big), \qquad (2)$$

where $h(\mathcal{Q})$ augments the short queries with contextual details. The enriched queries are expected to capture not only object categories but also additional information such as pose, scene, action, and fine-grained attributes. To be effective, the completion function should be aware of the retrieval gallery, so that it generates meaningful context rather than irrelevant content. To achieve this, we implement $h$ using a generative large language model ($\mathtt{LLM}$).

However, simply training the $\mathtt{LLM}$ on image descriptions is insufficient, since this cannot guarantee that retrieval results satisfy user expectations of quality. Instead, we partition the textual descriptions into non-overlapping quality levels $\mathcal{C}$ that reflect different image quality categories. We then finetune the $\mathtt{LLM}$ with these quality levels, enabling it to generate query completions conditioned on quality preferences. This yields the formulation of our quality-controllable retrieval (QCR):

$$\widetilde{\mathcal{X}} \coloneqq \mathrm{sort}\big(f(\mathcal{I}),\ g(\mathtt{LLM}(\mathcal{Q}\,;\,\mathcal{C})),\ \eta\big), \qquad (3)$$

where $\mathtt{LLM}(\mathcal{Q};\mathcal{C})$ expands the short queries based on the specified quality constraint $\mathcal{C}$. The extended queries thus steer retrieval toward images that align with the desired quality criteria. This approach offers several practical benefits. ① Flexibility: it requires no modification to pretrained VLMs and remains compatible with any VLM. ② Transparency: the generated query completions are human-readable, allowing users to review and select preferred options. ③ Controllability: the $\mathtt{LLM}$ can produce different query completions according to distinct quality conditions $\mathcal{C}$, enabling explicit quality control during retrieval. In the following section, we provide theoretical justification for why enriching short queries may improve retrieval performance.
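Putting Eqs. (2) and (3) together, the sketch below shows how a quality-conditioned completion could be slotted in front of the same ranking step. The `complete_query` stub, its phrase table, and the random stand-in encoders are assumptions for illustration, not the actual finetuned model or its interface.

```python
import numpy as np

def complete_query(query: str, quality: dict) -> str:
    """Placeholder for LLM(Q; C): appends quality-aware descriptive details to a short query.
    A real system would generate the suffix; this phrase table is purely illustrative."""
    suffix = {
        ("high", "high"): "captured in sharp focus with warm golden-hour lighting",
        ("high", "low"):  "in a cluttered, dimly lit snapshot",
    }.get((quality["relevance"], quality["aesthetics"]), "in a typical everyday scene")
    return f"{query}, {suffix}"

def retrieve(text_embed, image_embeds, eta):
    """sort(f(I), g(.), eta): rank gallery images by similarity to the (enriched) query."""
    scores = image_embeds @ text_embed
    return np.argsort(-scores)[:eta]

# Placeholder embeddings; in practice f and g are the frozen VLM's encoders.
rng = np.random.default_rng(1)
image_embeds = rng.standard_normal((1000, 512))
image_embeds /= np.linalg.norm(image_embeds, axis=1, keepdims=True)

def g(text: str):
    # Stub text encoder: returns a random unit vector regardless of input.
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

enriched = complete_query("a dog", {"relevance": "high", "aesthetics": "high"})
top_eta = retrieve(g(enriched), image_embeds, eta=10)
print(enriched)
print(top_eta)
```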
2.3 Theoretical Analysis

We model query completion as a structured perturbation and analyze its effect on the similarity matrix $\bm{S}$ through the lens of rank variation under perturbations. Let $\mathcal{Q}^{+}=\{Q_{1}^{+},\dots,Q_{m}^{+}\}\coloneqq h(\mathcal{Q})$ denote the queries extended by $h$, where $Q_{i}^{+}\coloneqq Q_{i}+\mathrm{suffix}_{i}$ for all $i\in\{1,\dots,m\}$, and $\mathrm{suffix}_{i}$ denotes the additional descriptive details appended to query $Q_{i}$. Let $\bm{C}\in\mathbb{R}^{n\times d}$ be the image embedding matrix with rows $\bm{c}_{j}\coloneqq f(I_{j})\in\mathbb{R}^{d}$ for all $j\in\{1,\dots,n\}$, and let $\bm{A},\bm{B}\in\mathbb{R}^{m\times d}$ be two text embedding matrices with a strict one-to-one pairing of rows, where $\bm{a}_{i}\coloneqq g(Q_{i})\in\mathbb{R}^{d}$ and $\bm{b}_{i}\coloneqq g(Q_{i}^{+})\in\mathbb{R}^{d}$ for all $i\in\{1,\dots,m\}$. Let $r\coloneqq\mathrm{rank}(\bm{A})$ be the rank of $\bm{A}$, let $\sigma_{r}(\bm{A})$ be the smallest nonzero singular value of $\bm{A}$, and let $\bm{A}=\bm{U}\bm{\Sigma}\bm{V}^{\top}$ denote its singular value decomposition (SVD). We then partition the right singular vectors as $\bm{V}=\big[\,\bm{V}_{S}\ \ \bm{V}_{\perp}\,\big]$, where $\bm{V}_{S}\in\mathbb{R}^{d\times r}$ and $\bm{V}_{\perp}\in\mathbb{R}^{d\times(d-r)}$ satisfy $\mathrm{span}(\bm{V}_{S})=\mathcal{R}(\bm{A})$ and $\mathrm{span}(\bm{V}_{\perp})=\mathcal{R}(\bm{A})^{\perp}$, with $\mathcal{R}(\bm{A})\coloneqq\mathrm{span}\{\bm{a}_{1}^{\top},\dots,\bm{a}_{m}^{\top}\}\subseteq\mathbb{R}^{d}$ the row space of $\bm{A}$.

Definition 1. We define the perturbation matrix $\bm{\Delta}\coloneqq\bm{B}-\bm{A}\in\mathbb{R}^{m\times d}$; the score matrices $\bm{S}_{A}\coloneqq\bm{A}\bm{C}^{\top}\in\mathbb{R}^{m\times n}$ and $\bm{S}_{B}\coloneqq\bm{B}\bm{C}^{\top}\in\mathbb{R}^{m\times n}$ for the queries $\mathcal{Q}$ and $\mathcal{Q}^{+}$; $\bm{A}_{S}\coloneqq\bm{A}\bm{V}_{S}$, $\bm{\Delta}_{S}\coloneqq\bm{\Delta}\bm{V}_{S}$, $\bm{\Delta}_{\perp}\coloneqq\bm{\Delta}\bm{V}_{\perp}$, $\bm{C}_{S}\coloneqq\bm{C}\bm{V}_{S}$, $\bm{C}_{\perp}\coloneqq\bm{C}\bm{V}_{\perp}$; $\bm{X}\coloneqq(\bm{A}_{S}+\bm{\Delta}_{S})\bm{C}_{S}^{\top}$ and $\bm{Y}\coloneqq\bm{\Delta}_{\perp}\bm{C}_{\perp}^{\top}$; $\mathcal{U}\coloneqq\mathrm{col}(\bm{X})$; and $\bm{P}\coloneqq\bm{P}_{X}$ the orthogonal projector onto $\mathcal{U}$.

Lemma 1. If $\mathrm{rank}(\bm{X}_{I})=r$ and $\|\bm{X}_{I}$ […]
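To make Definition 1 concrete, the following numeric check uses random stand-in embeddings and arbitrary dimensions. It verifies an identity that is not stated in the excerpt but follows directly from the definitions and the orthogonality of $\bm{V}$, namely that the perturbed score matrix decomposes exactly as $\bm{S}_{B}=\bm{X}+\bm{Y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 8, 6                     # arbitrary small sizes for illustration

A = rng.standard_normal((m, d))       # embeddings of the short queries g(Q_i)
B = rng.standard_normal((m, d))       # embeddings of the enriched queries g(Q_i^+)
C = rng.standard_normal((n, d))       # image embeddings f(I_j)

Delta = B - A                         # perturbation induced by query completion
S_A, S_B = A @ C.T, B @ C.T           # score matrices before/after completion

# SVD of A; V_S spans the row space R(A), V_perp its orthogonal complement.
U, sing, Vt = np.linalg.svd(A)
r = int(np.sum(sing > 1e-10))         # rank(A)
V_S, V_perp = Vt[:r].T, Vt[r:].T

A_S      = A @ V_S
Delta_S  = Delta @ V_S
Delta_p  = Delta @ V_perp
C_S, C_p = C @ V_S, C @ V_perp

X = (A_S + Delta_S) @ C_S.T           # part of the perturbed scores inside R(A)
Y = Delta_p @ C_p.T                   # part contributed outside R(A) by the suffixes

# Since A V_perp = 0 and [V_S V_perp] is orthogonal, S_B decomposes exactly as X + Y.
print(np.allclose(S_B, X + Y))        # True
```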