RL fine-tuning of LMMs for tool use can cause structured output formats to collapse because of strong pretrained tool priors, but a surprisingly simple fix, targeted format rewards combined with frame-budget randomization, can restore stability and boost performance.
Today's visual generation models are often evaluated on the wrong criteria, which inflates performance claims and masks critical failures in spatial reasoning, temporal consistency, and causal understanding.
Current research-agent benchmarks miss critical flaws: MiroEval reveals that process quality is a reliable predictor of research outcomes, and that multimodal tasks expose weaknesses invisible to output-level metrics.