Even top LLM judges struggle to reliably determine whether specific constraints in complex instructions have been violated, especially when violations are partial or absent, revealing critical blind spots in current evaluation methods.
Over half of video understanding benchmark samples are solvable without watching the video, and current models barely outperform random guessing on the rest.