4 papers published across 1 lab.
Forget expensive downstream evaluations: token-level statistics from expert-written solutions can reliably forecast LLM performance with 10,000x less compute.
Current video understanding models struggle with long-horizon robustness and non-speech audio, as revealed by the new OmniPro benchmark designed for comprehensive omni-modal proactive evaluation.
LLMs can now be rigorously tested on their ability to generate correct chip design rule checking (DRC) scripts, thanks to a new benchmark that scores scripts based on execution, not just code similarity.
Current personal assistant agents struggle to anticipate and act on unstated user needs in long, complex workflows, revealing a critical gap between task completion and genuine proactivity.