Shanghai Jiao Tong University
SEGA3D achieves an impressive 8.3 mIoU improvement over previous methods, redefining the standards for 3D vision-language segmentation.
Spatial intelligence in MLLMs can be dramatically enhanced without any architectural modifications or retraining, thanks to a novel collaborative cognitive mapping approach.
Current video MLLMs struggle to grasp fleeting visual events, with top models barely surpassing 39% accuracy on critical momentary tasks.
LLM-powered agents can now produce surprisingly strong photographs in complex 3D environments, suggesting a path towards embodied AI with aesthetic awareness.
Visual degradations can cripple the spatial reasoning abilities of even state-of-the-art MLLMs, but targeted finetuning can restore, and even surpass, human-level performance.
LVLMs can achieve SOTA visual reasoning by learning to "see" in a way that optimizes for reasoning, even if it means deviating from strict geometric accuracy.
Small, open-source LLMs can now outperform larger, closed-source models in complex industrial design tasks by learning to orchestrate CAD/CAE tools within a reinforcement learning framework.