Search papers, labs, and topics across Lattice.
28 papers published across 2 labs.
Stop committing to a single policy in offline-to-online RL: adaptively select and fine-tune policies based on predicted performance to maximize returns under interaction budgets.
Distributional regret bounds, which quantify the probability of exceeding different regret levels, are now achievable with a UCBVI-style algorithm, confirming a long-standing conjecture for multi-armed bandits.
Self-distillation can be more effective than learning from an external teacher, but only if you optimize for preference gaps instead of blindly matching the teacher's output distribution.
GRPO's credit assignment failures—treating all tokens as equally important and misaligning step-level rewards—can be overcome with a self-supervised approach that mines the model's intrinsic information flow.
Finally, a way to train LLM agents to reason step-by-step without needing humans to check every intermediate thought.
Stop committing to a single policy in offline-to-online RL: adaptively select and fine-tune policies based on predicted performance to maximize returns under interaction budgets.
Distributional regret bounds, which quantify the probability of exceeding different regret levels, are now achievable with a UCBVI-style algorithm, confirming a long-standing conjecture for multi-armed bandits.
Self-distillation can be more effective than learning from an external teacher, but only if you optimize for preference gaps instead of blindly matching the teacher's output distribution.
GRPO's credit assignment failures—treating all tokens as equally important and misaligning step-level rewards—can be overcome with a self-supervised approach that mines the model's intrinsic information flow.
Finally, a way to train LLM agents to reason step-by-step without needing humans to check every intermediate thought.
RL can unlock better compositional generalization than supervised fine-tuning by directly optimizing for correct outcomes, especially on complex tasks where supervised models overfit.
MLLMs can overcome self-referential bias and improve visual grounding by actively exploring and correcting their cognitive deficiencies, guided by token-level epistemic uncertainty.
LLM multi-agent systems can achieve significantly higher accuracy at a fraction of the cost by learning to selectively delegate tasks instead of relying on rigid orchestration.
Decomposing robot swarm state representations unlocks effective cooperation even with computationally-limited agents.
Current reward models often *prefer* socially undesirable responses, revealing a critical gap in LLM alignment beyond instruction following.
LLMs can learn to play multi-agent games far better by recursively modeling the reasoning of other players, leading to a 22% performance boost.
Current reward models are surprisingly bad at judging story quality, achieving only 66% accuracy in selecting human-preferred narratives – a gap closed by a new, purpose-built reward model.
Forget stilted, unconvincing VR characters: EBM-RL's novel reward decomposition finally makes video-grounded role-playing dialogue feel immersive.
LLMs can get up to 6x more logically consistent without human feedback, simply by fusing NLI scores into the DPO training loop.
Expert alignment is hard not just because of model limitations, but because human subjective evaluation is a moving target.
RFT's Achilles heel? This benchmark reveals how fragile reinforcement fine-tuning is, and introduces an automated system to catch and fix training failures before they tank your LLM.
Ditch the Bradley-Terry model: a game-theoretic approach to diffusion alignment unlocks better text-to-image generation by directly optimizing for Nash equilibrium in human preferences.
Stop rewarding reasoning that just looks good – reward reasoning that actually *helps* the downstream model solve the task.
Learn to build and evaluate your own NLP pipeline, from tokenization to RLHF, using open-weight models and reproducible research practices.
Multi-turn RL agents can learn far more effectively by explicitly monitoring and controlling uncertainty at both the token and turn levels, leading to more stable training and higher performance.
Turns out, nobody's explicitly RL-training LLM agents when to *stop* in multi-agent systems, despite its critical role in efficiency and cost.
LLMs' persistent hallucinations aren't just about lacking knowledge, but about lacking the self-awareness to know what they *don't* know, suggesting uncertainty expression is key to building trustworthy AI.
Forget short-horizon RL: Odysseus proves VLMs can master 100+ turn decision-making in complex games, outperforming state-of-the-art models by 3x.
Red-teaming LLMs just got more robust: Stable-GFN sidesteps GFN's notorious instability, unlocking more diverse and effective attacks.
LVLMs are better at spotting their own mistakes than generating correct answers in the first place, and this self-awareness can be exploited to reduce hallucinations.
Generalist robot policies can achieve 95% success rates on real-world manipulation tasks by continually learning from a fleet of robots, even in the face of distribution shifts and long-tail failures.
Current code reward models are myopic, mostly rewarding functional correctness, but Themis-RM learns to score code across multiple criteria and languages, opening the door to more nuanced and useful code generation.
LLMs can reason better and generate more diverse outputs by projecting negative samples onto a positive subspace during reinforcement learning.