Training AI systems from human feedback using reinforcement learning, direct preference optimization, and reward modeling.
Forget hand-crafted rewards: MotionVL uses VLMs and LLMs to automatically generate task-aligned reward functions for humanoid robot RL, leading to more human-like and robust motion.
Training LLMs with objectives that conflict between the final output and the reasoning process can significantly degrade the monitorability of Chain-of-Thought, making oversight more difficult.
Robots get a 33% speed boost and become significantly more adaptable when you let LLMs handle the reasoning and RL handle the movements.
Learned approval in MONA can eliminate reward hacking, but at the cost of significantly under-optimizing for the intended task, revealing a critical trade-off in safe RL.
Stop rewarding all LLM-generated candidates equally: ShapE-GRPO uses Shapley values to fairly distribute credit within sets, leading to better training and faster convergence.
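ShapE-GRPO's internals aren't spelled out here, but the underlying idea — Shapley values as a principled way to split credit within a candidate set — can be sketched exactly for a small set. The value function below is a stand-in assumption (a set is worth 1 if it contains a correct answer), not the paper's:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: each player's marginal contribution to
    value(), averaged over all coalitions with the standard weights."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                # |S|! * (n - |S| - 1)! / n!  for coalition S not containing p
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(set(coalition) | {p}) - value(set(coalition)))
        phi[p] = total
    return phi

# Stand-in value function: a candidate set is worth 1 iff it contains
# at least one correct answer ("a" or "b" here).
correct = {"a", "b"}
v = lambda S: 1.0 if S & correct else 0.0

# By symmetry, "a" and "b" split the credit evenly; "c" contributes nothing.
phi = shapley_values(["a", "b", "c"], v)
```

Under a uniform split, all three candidates would receive equal reward; the Shapley split gives the redundant-but-correct candidates equal partial credit and the incorrect one none.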
Reward LLMs for verifiable reasoning steps, not just correct answers, to get more reliable multi-step logic.
Stop cobbling together memory-augmented agents: MemFactory offers a unified "Lego-like" framework that streamlines training and boosts performance by up to 14.8%.
Stochastic negative sampling in Direct Preference Optimization (DPO) dramatically improves multimodal sequential recommendation, suggesting that carefully curated "wrong" answers are key to preference learning.
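For context, the DPO objective being modified scores each (chosen, rejected) pair by the policy's log-probability margin over a frozen reference model; the paper's contribution concerns how the rejected response is sampled, which this generic sketch does not cover (function name and inputs are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen y_w, rejected y_l) pair,
    given policy and frozen-reference log-probs of each response."""
    # Implicit reward margin: how much more the policy prefers y_w
    # over y_l, relative to the reference model's preference.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin), written as softplus(-margin) for clarity;
    # minimized by widening the margin in favor of y_w.
    return math.log(1.0 + math.exp(-margin))

loss = dpo_loss(logp_w=-10.0, logp_l=-12.0,
                ref_logp_w=-10.5, ref_logp_l=-11.0)
```

Stochastic negative sampling changes which y_l enters this loss on each step, rather than the loss itself.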
Correcting errors early in the diffusion process matters more than fixing them later: Stepwise-Flow-GRPO leverages this insight to dramatically improve RL-based flow model training.
Unlock $\sqrt{N}$ regret in offline policy learning, even with complex policy classes, by trading off policy and environment complexity.
Stop handcuffing student diffusion models to their teachers: framing distribution matching as a reward unlocks more stable and performant distillation via RL techniques.
Forget hand-designed RL algorithms – LLMs can evolve competitive learners from scratch, even when forced to invent completely new update rules.
Stop assuming a single utility function: modeling preferences as a mixture of archetypes unlocks better Bayesian optimization in complex, many-objective spaces.
Even with corrupted human feedback, surprisingly tight guarantees for multi-agent reinforcement learning are possible.
LLMs can reason more accurately and concisely when RL is guided by token-level entropy, pinpointing and exploring "forks in the road" during the reasoning process.
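As a concrete illustration of the signal involved (a generic Shannon-entropy sketch, not the paper's implementation): a position where the next-token distribution is flat is a "fork in the road" worth exploring, while a peaked distribution marks a committed step.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution (confident continuation) vs. a flat one (a fork):
low = token_entropy([0.97, 0.01, 0.01, 0.01])   # near-deterministic step
high = token_entropy([0.25, 0.25, 0.25, 0.25])  # maximal uncertainty, log(4)
```

Entropy-guided RL in this spirit would concentrate exploration (or entropy bonuses) on the high-entropy positions rather than uniformly across the sequence.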
Get 80% of your oracle feedback for free: ROVED leverages vision-language embeddings to drastically reduce the need for human preferences in reinforcement learning.
Forget hand-crafting prototypes for interpretable RL: this method learns them directly from the data, matching the performance of expert-designed systems.
LLMs can better adapt to diverse preferences by explicitly separating stable personal traits from situational factors, leading to significant performance gains, especially when preferences shift across episodes.
Reward hacking isn't a bug to fix, but an inevitable consequence of how we evaluate AI, and it gets exponentially worse as agents gain more tools.
Adversarial fine-tuning can now bypass Constitutional AI safety measures with almost no performance penalty, enabling models to provide detailed instructions on dangerous topics like CBRN warfare.
Robots can now learn complex manipulation tasks from scratch using only video and language, bypassing the need for hand-engineered reward functions, demonstrations, or even task-specific tuning.
Claude's Constitution doesn't create a neutral AI, but instead bakes in the values of Northern European and Anglophone cultures, creating a value floor that's hard to shift.
Even state-of-the-art LLMs like GPT-4o and Claude 3.5 still exhibit varying degrees of sycophancy depending on the input language, revealing persistent cultural and linguistic biases.
Over-refusal isn't just a misapplication of a global "no" switch; it's deeply intertwined with how LLMs represent and execute specific tasks.
Forget auxiliary encoders and handcrafted losses: LVRPO uses reinforcement learning to directly align language and vision, boosting performance across a range of multimodal tasks.
Unlock better hardware designs: RTLSeek's diversity-oriented RL lets LLMs explore a wider range of Verilog implementations, boosting both correctness and design options.
Agentic coding models can achieve near-SOTA performance by specializing in distinct coding domains before unifying them via on-policy distillation.
Training domain-specific coding LLMs with realistic environments and large-scale RL can yield substantial gains in practical software engineering tasks.
LMs can learn to generate multiple plausible answers in a single forward pass, outperforming traditional single-answer models on tasks requiring distributional reasoning and offering a compute-efficient alternative to best-of-k sampling.
LLMs can reason through chains of thought 2.5x longer and achieve 8% higher accuracy on complex math problems by optimizing for token-level influence on future trajectory behavior.
Injecting demonstrations with a carefully annealed probability can drastically improve exploration in RLVR, even for tasks requiring novel reasoning or domain-specific knowledge.
Observational user feedback, often dismissed as too noisy and biased, can actually power effective RLHF with the right causal modeling, achieving a 49.2% gain on WildGuardMix.
Scale up offline policy training for diffusion LLMs without breaking the bank: dTRPO slashes trajectory computation costs while boosting performance up to 9.6% on STEM tasks.
Forget prompt engineering – LSE trains LLMs to self-edit their own contexts at test time, outperforming even GPT-5 and Claude Sonnet 4.5 in Text-to-SQL and question answering.
Forget random data mixing: MOSAIC uses failure analysis to intelligently curate training data, leading to better safety, less over-refusal, and improved instruction following, all at once.
Unleashing an LLM's inner creativity or laser-sharp logic is now as simple as turning a knob, thanks to a new distribution-matching method that avoids heuristic rewards.
LLMs surprisingly prioritize norm adherence over personal incentives in business scenarios, challenging assumptions about goal-driven behavior.
Training multi-turn LLM agents just got easier: ProRL Agent offers a scalable, API-driven rollout service that streamlines RL training across diverse tasks.
LLM post-training pipelines can be configured with 10x less compute using AutoPipe, a budget-aware framework that learns from historical runs and predicts performance from early training signals.
Human oversight can be systematically integrated into LLM-based text generation to improve accessibility, creating a traceable and auditable process.
Achieve significant reasoning gains in frozen LLMs (+22.4%) without retraining by adaptively routing reward model guidance at the token level during inference.
Forget fixed decoding strategies – RL can learn a lightweight policy to adapt LLM sampling *at test time*, boosting summarization quality by up to 88% without retraining the LLM.
Learning from ranked preferences alone can be surprisingly difficult: even with access to the full ranking of actions, standard online learning guarantees break down unless the environment is sufficiently stable.
Forget static models: this adaptive framework slashes stock price prediction error by dynamically routing data through specialized pathways based on real-time market regime detection.
Aligning diffusion models with just 100 carefully selected samples can beat state-of-the-art preference optimization methods trained on thousands, and converge up to 220x faster.
Stripping away the complexity of GRPO reveals that simple REINFORCE with group relative advantage can actually *improve* LLM reasoning, challenging the assumption that sophisticated loss functions are always better.
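The stripped-down recipe rests on one statistic — standardizing each sampled completion's reward against its prompt group — which a plain REINFORCE update then uses to weight log-probabilities. A minimal sketch of that advantage (variable names are illustrative):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward against the
    group of completions sampled from the same prompt. REINFORCE
    then scales each completion's log-prob by its advantage."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct and two incorrect completions for one prompt:
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

No learned critic, clipping, or KL term appears here; the group baseline alone turns raw rewards into a centered learning signal.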
Human-AI teams often fail not because AI is inaccurate, but because humans miscalibrate their reliance on it, highlighting the need for readiness metrics beyond accuracy.
Greedy off-policy learning, optimal in theory, can fail spectacularly when supplies are limited, but a simple fix—prioritizing items with high *relative* reward—can restore performance.
Low-resource language models can get a major boost in translation quality and tokenization efficiency by using reinforcement learning to directly enforce structural constraints like sequence length and linguistic well-formedness during training.
Achieve topologically coherent coronary vessel segmentation by directly optimizing for geometric structure, rather than pixel-wise accuracy, using preference-based learning.