Search papers, labs, and topics across Lattice.
University of Texas at San Antonio
3
0
6
LLMs ace semantic similarity in medical QA, but VB-Score reveals they're failing to extract key medical entities, especially when answering questions about chronic conditions affecting older and minority populations.
Current research agents still struggle with retrieval robustness and hallucination control, even when evaluated in a static, verifiable research environment.
Current XAI evaluations can be fooled: this new metric reveals that even small input variations can cause explanations to drastically change, undermining trust in pattern recognition systems.