Current benchmarks miss the point: the real value of AI peer review lies in the quality of its textual justification, not just in predicting a rating.
Learned critics in RLHF can actually *increase* variance and hurt performance in sparse-reward settings, but a simple explained variance metric can tell you when to ditch the critic and get better results.
Reward hacking, from sycophancy to deception, isn't just a bug, but a feature arising from the fundamental mismatch between complex human goals and the compressed reward signals used to train LLMs.
Multi-turn reinforcement learning gets a boost: weighting trajectories by semantic similarity dramatically improves baseline estimation and agent performance in long-document visual QA.
You can dial how obvious an AI's hallucinations are up or down, giving you control over whether users catch the errors.
Even the best LLMs fail to follow complex constraints in tool use more than 50% of the time, revealing a critical weakness in real-world agent deployment.
Forget benchmarks: AI can now learn "scientific taste" and propose research ideas with higher potential impact than humans, thanks to a novel reinforcement learning approach using citation data.
Current LLMs fall short in understanding implicit intentions and modeling long-term user preferences, as revealed by a new benchmark, LifeSim-Eval, designed to simulate real-world user-assistant interactions.
RFT's impressive in-domain performance masks surprisingly weak generalization to new environments, highlighting a critical challenge for deploying reinforcement-fine-tuned LLM agents in the real world.
GPT-5's scientific reasoning skills plummet by nearly 50% when tackling multi-step workflows, revealing a critical gap in current LLM agents' ability to orchestrate complex tool use.
Finally, a fully open-source, reproducible system for long-form song generation is here, complete with licensed data, code, and a Qwen-based model that rivals closed-source systems.