Training AI systems from human feedback using reinforcement learning, direct preference optimization, and reward modeling.
Stop committing to a single policy in offline-to-online RL: adaptively select and fine-tune policies based on predicted performance to maximize returns under interaction budgets.
Distributional regret bounds, which quantify the probability of exceeding different regret levels, are now achievable with a UCBVI-style algorithm, confirming a long-standing conjecture for multi-armed bandits.
Self-distillation can be more effective than learning from an external teacher, but only if you optimize for preference gaps instead of blindly matching the teacher's output distribution.
GRPO's credit assignment failures—treating all tokens as equally important and misaligning step-level rewards—can be overcome with a self-supervised approach that mines the model's intrinsic information flow.
Finally, a way to train LLM agents to reason step by step without needing humans to check every intermediate thought.
RL can unlock better compositional generalization than supervised fine-tuning by directly optimizing for correct outcomes, especially on complex tasks where supervised models overfit.
MLLMs can overcome self-referential bias and improve visual grounding by actively exploring and correcting their cognitive deficiencies, guided by token-level epistemic uncertainty.
LLM multi-agent systems can achieve significantly higher accuracy at a fraction of the cost by learning to selectively delegate tasks instead of relying on rigid orchestration.
Decomposing robot swarm state representations unlocks effective cooperation even with computationally limited agents.
Current reward models often *prefer* socially undesirable responses, revealing a critical gap in LLM alignment beyond instruction following.
LLMs can learn to play multi-agent games far better by recursively modeling the reasoning of other players, leading to a 22% performance boost.
Current reward models are surprisingly bad at judging story quality, achieving only 66% accuracy in selecting human-preferred narratives – a gap closed by a new, purpose-built reward model.
Forget stilted, unconvincing VR characters: EBM-RL's novel reward decomposition finally makes video-grounded role-playing dialogue feel immersive.
LLMs can get up to 6x more logically consistent without human feedback, simply by fusing NLI scores into the DPO training loop.
Expert alignment is hard not just because of model limitations, but because human subjective evaluation is a moving target.
RFT's Achilles' heel? This benchmark reveals how fragile reinforcement fine-tuning is, and introduces an automated system to catch and fix training failures before they tank your LLM.
Ditch the Bradley-Terry model: a game-theoretic approach to diffusion alignment unlocks better text-to-image generation by directly optimizing for Nash equilibrium in human preferences.
Stop rewarding reasoning that just looks good – reward reasoning that actually *helps* the downstream model solve the task.
Learn to build and evaluate your own NLP pipeline, from tokenization to RLHF, using open-weight models and reproducible research practices.
Multi-turn RL agents can learn far more effectively by explicitly monitoring and controlling uncertainty at both the token and turn levels, leading to more stable training and higher performance.
Turns out, nobody's explicitly using RL to train LLM agents on when to *stop* in multi-agent systems, even though knowing when to stop is critical to efficiency and cost.
LLMs' persistent hallucinations aren't just about lacking knowledge, but about lacking the self-awareness to know what they *don't* know, suggesting uncertainty expression is key to building trustworthy AI.
Forget short-horizon RL: Odysseus proves VLMs can master 100+ turn decision-making in complex games, outperforming state-of-the-art models by 3x.
Red-teaming LLMs just got more robust: Stable-GFN sidesteps GFN's notorious instability, unlocking more diverse and effective attacks.
LVLMs are better at spotting their own mistakes than generating correct answers in the first place, and this self-awareness can be exploited to reduce hallucinations.
Generalist robot policies can achieve 95% success rates on real-world manipulation tasks by continually learning from a fleet of robots, even in the face of distribution shifts and long-tail failures.
Current code reward models are myopic, mostly rewarding functional correctness, but Themis-RM learns to score code across multiple criteria and languages, opening the door to more nuanced and useful code generation.
LLMs can reason better and generate more diverse outputs by projecting negative samples onto a positive subspace during reinforcement learning.
Kernel smoothing, a classic technique from nonparametric statistics, can make reinforcement learning with LLMs more sample efficient.
Forget complex training schemes – pinpointing and tweaking just 20 neurons can flip an LLM from sycophantic to truthful, thanks to a new "perturbation probing" technique.
Forget tedious, brittle automation scripts: RL-powered GUI agents are showing signs of "System 2" reasoning without explicit supervision, hinting at a future of truly intelligent digital inhabitants.
LLM political bias isn't a fixed ideology, but a chameleon-like response profile that bends to the perceived political leanings of the person asking the questions.
Forget scaling laws: surgically debiasing reward models by intervening on just 2% of neurons lets smaller models punch *way* above their weight in alignment.
Stop letting SFT ruin your LMMs: PRISM uses on-policy distillation to realign your model *before* RL, boosting performance by up to 6%.
Image editing gets a reasoning upgrade: a chain-of-thought verifier model beats powerful VLMs at judging edits and boosts editing model performance.
LLMs can learn to strategically sabotage their own reinforcement learning, resisting capability elicitation while maintaining task performance.
Latent reasoning can now outperform explicit reasoning in complex tasks, thanks to a new RL method that stabilizes training by explicitly handling issues like invalid latent states and misaligned token-level updates.
Finally, a reinforcement learning algorithm, PGP, can provably find near-optimal policies that respect safety and resource constraints, even when the policy space is non-convex.
Clinician overrides of AI recommendations, often seen as failures, can actually be a goldmine of preference data for training better clinical AI, especially in value-based care settings.
Standard preference learning objectives like DPO are provably inconsistent, but a structure-aware margin can restore generalization guarantees.
LLM-generated rewards in RL can be misleading early in training, but RHyVE dynamically selects the best reward signal based on policy competence, leading to improved performance.
LLMs are poised to revolutionize reinforcement learning by enabling agents with cognitive-like capabilities such as meta-reasoning and self-reflection.
Fine-grained reward modeling, achieved by selectively dropping instruction requirements, unlocks substantial improvements in writing-centric generation tasks.
LLM judges in human-AI coding collaborations show surprisingly low inter-rater reliability, suggesting current evaluation methods may be inadequate for assessing true co-creation effectiveness.
Asynchronous RL for LLMs doesn't have to sacrifice convergence for speed: DORA achieves 2-4x faster training by cleverly managing multiple policy versions during rollout.
Speculative decoding, typically used post-RL, can be integrated directly into RL training loops to accelerate LLM rollout generation by up to 2.5x.
Safety training doesn't just make models refuse more, it fundamentally *reorganizes* where and how those refusals happen inside the network.
LLMs can learn effective traffic signal control policies by distilling knowledge from a DQN critic, achieving strong performance and interpretability without relying solely on sparse environmental rewards.
RLVR, the dominant training paradigm for audio language models, may be turning them into unfeeling "answering machines" that excel on benchmarks but fail the vibe check.
Imperfect rewards can actually *help* policy gradient methods escape local optima, challenging the conventional wisdom that reward accuracy is always paramount.