Open-weight Omni models struggle with binding accuracy, achieving only 41.55% on a new counterfactual benchmark, highlighting a critical gap in long-video comprehension.
LLMs ace semantic similarity in medical QA, but VB-Score reveals they're failing to extract key medical entities, especially when answering questions about chronic conditions affecting older and minority populations.
Current research agents still struggle with retrieval robustness and hallucination control, even when evaluated in a static, verifiable research environment.
Current XAI evaluations can be fooled: a new metric reveals that even small input variations can drastically change explanations, undermining trust in pattern recognition systems.