Vision Mamba's ImageNet accuracy jumps to 83.5% thanks to a simple trick: adding separator tokens to enable pretraining on 4x longer sequences.
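The blurb doesn't spell out how the separator tokens work, so here is a minimal sketch under my own assumptions: several images are packed into one long training sequence, with a learnable separator token marking each image boundary. `SeparatorPacker` and all shapes are illustrative, not from the paper.

```python
# Assumed sketch: pack multiple images into one long sequence, inserting a
# learnable separator token at each image boundary. Not the paper's code.
import torch
import torch.nn as nn

class SeparatorPacker(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # One learnable separator token, reused at every boundary.
        self.sep = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, images_tokens: list[torch.Tensor]) -> torch.Tensor:
        # images_tokens: list of (batch, n_patches, dim) patch embeddings.
        batch = images_tokens[0].shape[0]
        sep = self.sep.expand(batch, 1, -1)
        pieces = []
        for i, toks in enumerate(images_tokens):
            if i > 0:
                pieces.append(sep)  # mark where one image ends and the next begins
            pieces.append(toks)
        return torch.cat(pieces, dim=1)  # one packed, much longer sequence

# Usage: pack four 196-token images into a single sequence.
packer = SeparatorPacker(dim=192)
imgs = [torch.randn(8, 196, 192) for _ in range(4)]
packed = packer(imgs)
print(packed.shape)  # torch.Size([8, 787, 192]) -- 4*196 patches + 3 separators
```

Packing four 196-token images this way yields one 787-token input, which is how a 4x-longer pretraining sequence could be assembled.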
Instruction-guided visual modulation with iGVLM unlocks finer-grained reasoning in LVLMs: by dynamically adapting visual representations to the textual task at hand, it outperforms static vision encoders.
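As a rough illustration of instruction-guided modulation, here is a FiLM-style sketch: the instruction embedding predicts a per-channel scale and shift applied to every visual token. This is an assumed formulation, not iGVLM's published architecture; `InstructionModulator` and all dimensions are hypothetical.

```python
# Hedged sketch of instruction-conditioned feature modulation (FiLM-style).
# Module name and shapes are assumptions, not iGVLM's actual design.
import torch
import torch.nn as nn

class InstructionModulator(nn.Module):
    def __init__(self, text_dim: int, vis_dim: int):
        super().__init__()
        # Predict per-channel scale (gamma) and shift (beta) from the instruction.
        self.to_gamma_beta = nn.Linear(text_dim, 2 * vis_dim)

    def forward(self, vis_feats: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, n_tokens, vis_dim); instr_emb: (batch, text_dim)
        gamma, beta = self.to_gamma_beta(instr_emb).chunk(2, dim=-1)
        # The same instruction-derived modulation is applied to every visual token.
        return vis_feats * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

vis = torch.randn(2, 256, 1024)    # tokens from a frozen vision encoder
instr = torch.randn(2, 768)        # pooled instruction embedding
modulated = InstructionModulator(768, 1024)(vis, instr)
print(modulated.shape)             # torch.Size([2, 256, 1024])
```

The design point is that the vision tower's output is no longer static: the same image produces different features depending on what the instruction asks about.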
Image-text models can achieve superior performance by fusing modalities during training only, then discarding the fusion module at inference for efficiency.
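One plausible realization, sketched under my own assumptions rather than taken from the paper: a dual encoder trained with a cross-modal fusion head that supplies an auxiliary matching loss, then runs fusion-free at inference.

```python
# Assumed sketch of train-time-only fusion: a fusion head adds an auxiliary
# training signal and is simply never invoked at inference, so deployment
# costs only the two unimodal encoders.
import torch
import torch.nn as nn

class DualEncoderWithTrainTimeFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.image_enc = nn.Linear(2048, dim)   # stand-in image tower
        self.text_enc = nn.Linear(768, dim)     # stand-in text tower
        # Fusion module: consulted only during training.
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                 batch_first=True)
        self.match_head = nn.Linear(dim, 1)     # auxiliary image-text matching score

    def forward(self, img_feat, txt_feat, use_fusion: bool = True):
        img = self.image_enc(img_feat)
        txt = self.text_enc(txt_feat)
        if use_fusion:
            # Training: fuse the two embeddings and score the pair.
            fused = self.fusion(torch.stack([img, txt], dim=1))
            aux_logit = self.match_head(fused.mean(dim=1))
            return img, txt, aux_logit          # contrastive + auxiliary fusion loss
        return img, txt                         # inference: fusion module never runs

model = DualEncoderWithTrainTimeFusion()
img_e, txt_e = model(torch.randn(4, 2048), torch.randn(4, 768), use_fusion=False)
```

At inference the fusion and matching layers are dead weight that can be dropped from the checkpoint entirely, leaving a plain dual-encoder retrieval model.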