Northeastern University, City University of Hong Kong
Forget end-to-end training: teaching models to master individual visual tools *before* tackling complex reasoning unlocks surprisingly strong performance.
Current visual grounding models struggle to identify objects from their contextual roles and the intentions behind a query, exposing a critical gap in their capacity for true scene understanding.
MLLMs that ace standard Referring Expression Comprehension benchmarks still stumble when faced with images designed to eliminate shortcuts, revealing a surprising lack of robust visual reasoning.
Bridging the gap between open-source and enterprise T2I models, Fine-T2I offers a massive, meticulously curated dataset that unlocks significant gains in generation quality and instruction following through fine-tuning.