BITS Pilani, India
LLM judges can be subtly manipulated by framing the consequences of their decisions, leading to biased evaluations even when the content being judged remains constant.
LLM judges are far less reliable on individual examples than aggregate metrics suggest: up to 67% of documents show judgment inconsistencies, and some criteria like fluency are essentially unjudgeable.
LLMs hit a hard wall in algebraic reasoning, choking on problems with just 20-30 parallel branches regardless of model size, suggesting an architectural bottleneck, not just a capacity issue.
LLMs struggle to master even simple board games like Ludo, agreeing with optimal game-theory strategies less than half the time and exhibiting inconsistent behavior based on prompt framing.
LLMs can reliably judge the correctness of time-series explanations, even when their own explanations are wrong, opening the door to reference-free evaluation.