$\hat{\mathbb{V}}=\{\hat{V}_{k}\}_{k=1}^{K}$. Each component $\hat{V}_{k}$ within this set is now harmonized to the text embedding dimension, making the entire hierarchical collection ready for effective integration into the LLM.

Adaptive Gating Fusion (AGF). With the visual tokens aligned, the next step is their effective integration. Rather than using a simple summation, which can disrupt the LLM's state, we propose an adaptive injection mechanism. This approach acknowledges that the LLM's hidden state $h$ already contains contextual information and thus requires a more nuanced update. Accordingly, for each injection layer $L_{t}$ in the LLM, a gating module dynamically assesses the relevance of the new visual features $\hat{V}_{k}\in\hat{\mathbb{V}}$ based on the current decoding context $h_{t}$. The gate's logic is driven by cross-attention. Two learnable query vectors, $q_{v}$ and $q_{h}$, act as probes to distill the essence of the new visual information and the existing hidden state:

$$\hat{V}_{\text{att}} = \text{MultiHeadAttention}(q_{v}, \hat{V}_{k}, \hat{V}_{k}), \qquad (6)$$
$$h_{\text{att}} = \text{MultiHeadAttention}(q_{h}, h_{t}, h_{t}). \qquad (7)$$

The resulting context vectors are fused by concatenation and passed through a gate controller (a linear layer followed by a Sigmoid activation) to yield a dynamic weight $W\in[0,1]$:

$$W = \text{Sigmoid}(\text{Gate}([\hat{V}_{\text{att}}; h_{\text{att}}])). \qquad (8)$$

This weight governs the selective update of the hidden state. To ensure precision, we use a binary mask that isolates the positions of visual tokens within $h_{t}$.
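The gating mechanism of Eqs. (6)–(9) can be sketched as a small PyTorch module. This is a minimal illustrative implementation, not the authors' code: the class name `AdaptiveGatingFusion`, the hyperparameters, and the assumption that the aligned visual features $\hat{V}_{k}$ are zero-padded at non-visual positions are all our own conventions.

```python
# Hypothetical sketch of Adaptive Gating Fusion (AGF); names and
# shapes are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class AdaptiveGatingFusion(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        # Learnable probe queries q_v and q_h (one query vector each).
        self.q_v = nn.Parameter(torch.randn(1, 1, d_model))
        self.q_h = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_h = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate controller: linear layer on [V_att; h_att], Sigmoid applied in forward.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_t, v_k, mask):
        # h_t:  (B, T, d) decoder hidden state at the injection layer
        # v_k:  (B, T, d) aligned visual features, assumed zero at
        #       non-visual positions
        # mask: (B, T, 1) binary mask marking visual-token positions
        B = h_t.size(0)
        v_att, _ = self.attn_v(self.q_v.expand(B, -1, -1), v_k, v_k)  # Eq. (6)
        h_att, _ = self.attn_h(self.q_h.expand(B, -1, -1), h_t, h_t)  # Eq. (7)
        w = torch.sigmoid(self.gate(torch.cat([v_att, h_att], dim=-1)))  # Eq. (8)
        # Eq. (9): preserve non-visual positions, update visual ones.
        return h_t * (1 - mask) + (h_t * mask + w * v_k)
```

Because $\hat{V}_{k}$ is zero outside the visual span, the update of Eq. (9) leaves non-visual positions of $h_{t}$ exactly unchanged, which is easy to verify by comparing input and output at masked-out positions.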
The non-visual portions are preserved, while the visual portions are updated via a weighted sum with the new features:

$$h^{\prime}_{t} = h_{t}\odot(1-\text{mask}) + \left(h_{t}\odot\text{mask} + W\odot\hat{V}_{k}\right), \qquad (9)$$

where $\odot$ denotes the element-wise product. By processing all the $\hat{V}_{k}$ in $\hat{\mathbb{V}}$, this fusion process enriches the hidden state $h_{t}$ with a hierarchical representation of the visual input. The gating and update cycle is repeated at designated injection points throughout the LLM's decoder. This enables an iterative refinement of the model's visual understanding, allowing it to "re-examine" visual evidence at varying granularities throughout the generation process.

4 Experiments

This section presents a comprehensive empirical validation of our Cross-Layer Injection (CLI) framework. We first detail the experimental setup, including our implementation of CLI on two distinct VLM architectures and the benchmarks used for evaluation. We then present the main results, demonstrating that CLI consistently and significantly outperforms strong baselines and competing fusion strategies across 18 diverse benchmarks. Finally, we conduct a series of in-depth ablation studies and analyses to dissect the specific contributions of CLI's core components and validate our "many-to-many" design philosophy.

4.1 Experiment Setup

Model Configuration. To demonstrate the versatility and general applicability of CLI, we integrate it into two distinct VLM architectures, LLaVA-OneVision [DBLP:journals/tmlr/