29 papers published across 4 labs.
Forget hand-crafted rewards: MotionVL uses VLMs and LLMs to automatically generate task-aligned reward functions for humanoid robot RL, leading to more human-like and robust motion.
Training LLMs with objectives that put the final output in tension with the reasoning process can significantly degrade the monitorability of Chain-of-Thought, making oversight more difficult.
Robots get a 33% speed boost and become significantly more adaptable when you let LLMs handle the reasoning and RL handle the movements.
Learned approval in MONA can eliminate reward hacking, but at the cost of significantly under-optimizing for the intended task, revealing a critical trade-off in safe RL.
Stop rewarding all LLM-generated candidates equally: ShapE-GRPO uses Shapley values to fairly distribute credit within sets, leading to better training and faster convergence.
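The Shapley-value credit assignment named in the ShapE-GRPO blurb can be sketched generically: each candidate's credit is its marginal contribution to the set's reward, averaged over all orderings. The toy set reward below is a hypothetical stand-in, not the paper's actual objective.

```python
import itertools
import math

def shapley_values(items, value_fn):
    """Exact Shapley values by averaging each item's marginal
    contribution over all orderings (feasible only for small sets)."""
    n = len(items)
    credit = {item: 0.0 for item in items}
    for perm in itertools.permutations(items):
        coalition = []
        prev = value_fn(coalition)
        for item in perm:
            coalition.append(item)
            cur = value_fn(coalition)
            credit[item] += cur - prev
            prev = cur
    return {item: c / math.factorial(n) for item, c in credit.items()}

def set_reward(coalition):
    # Hypothetical value function: "a" and "b" are interchangeable
    # candidates, so a set containing both earns no extra reward.
    has_ab = 1.0 if ("a" in coalition or "b" in coalition) else 0.0
    has_c = 2.0 if "c" in coalition else 0.0
    return has_ab + has_c

credit = shapley_values(["a", "b", "c"], set_reward)
# The redundant pair splits its credit (0.5 each) while "c" keeps its
# full contribution (2.0) -- unlike uniform per-candidate rewards.
```

Note the efficiency property: the credits sum exactly to the reward of the full set, which is what makes this a fair redistribution rather than a rescaling.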
Reward LLMs for verifiable reasoning steps, not just correct answers, to get more reliable multi-step logic.
Stop cobbling together memory-augmented agents: MemFactory offers a unified "Lego-like" framework that streamlines training and boosts performance by up to 14.8%.
Stochastic negative sampling in Direct Preference Optimization (DPO) dramatically improves multimodal sequential recommendation, suggesting that carefully curated "wrong" answers are key to preference learning.
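The DPO objective referenced in the blurb above is standard; the negative-sampling strategy only changes which rejected response is fed into it. A minimal sketch of the per-pair loss, with scalar sequence log-probabilities standing in for real model outputs:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * margin)), where the margin
    is the policy's chosen-vs-rejected log-prob gap minus the reference
    model's gap on the same pair."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A harder negative, one the current policy still assigns high probability, shrinks the margin and raises the loss, which is why the choice of "wrong" answers carries so much of the learning signal.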
Correcting errors early in the diffusion process matters more than fixing them later: Stepwise-Flow-GRPO leverages this insight to dramatically improve RL-based flow model training.
Unlock $\sqrt{N}$ regret in offline policy learning, even with complex policy classes, by trading off policy and environment complexity.
Stop handcuffing student diffusion models to their teachers: framing distribution matching as a reward unlocks more stable and performant distillation via RL techniques.
Forget hand-designed RL algorithms – LLMs can evolve competitive learners from scratch, even when forced to invent completely new update rules.
Stop assuming a single utility function: modeling preferences as a mixture of archetypes unlocks better Bayesian optimization in complex, many-objective spaces.
Surprisingly tight guarantees are possible for multi-agent reinforcement learning even when the human feedback is corrupted.
LLMs can reason more accurately and concisely when RL is guided by token-level entropy, pinpointing and exploring "forks in the road" during the reasoning process.
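The "forks in the road" in the entropy-guided RL blurb above are simply decoding steps where the next-token distribution is high-entropy. A minimal sketch, with hypothetical per-step distributions in place of real model logits:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def find_forks(step_distributions, threshold=1.0):
    """Indices of decoding steps whose entropy exceeds a threshold --
    the high-uncertainty points an entropy-guided signal would target."""
    return [i for i, probs in enumerate(step_distributions)
            if token_entropy(probs) > threshold]

steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step: low entropy
    [0.25, 0.25, 0.25, 0.25],  # fork: maximal entropy, log(4) ~ 1.39
    [0.90, 0.05, 0.03, 0.02],  # confident again
]
forks = find_forks(steps)  # only the uniform step crosses threshold=1.0
```

The threshold here is arbitrary; in practice it would be tuned, or the entropy used directly as a continuous weight on the exploration bonus.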
Get 80% of your oracle feedback for free: ROVED leverages vision-language embeddings to drastically reduce the need for human preferences in reinforcement learning.
Forget hand-crafting prototypes for interpretable RL: this method learns them directly from the data, matching the performance of expert-designed systems.
LLMs can better adapt to diverse preferences by explicitly separating stable personal traits from situational factors, leading to significant performance gains, especially when preferences shift across episodes.
Reward hacking isn't a bug to fix, but an inevitable consequence of how we evaluate AI, and it gets exponentially worse as agents gain more tools.
Adversarial fine-tuning can now bypass Constitutional AI safety measures with almost no performance penalty, enabling models to provide detailed instructions on dangerous topics like CBRN warfare.
Robots can now learn complex manipulation tasks from scratch using only video and language, bypassing the need for hand-engineered reward functions, demonstrations, or even task-specific tuning.
Claude's Constitution doesn't create a neutral AI, but instead bakes in the values of Northern European and Anglophone cultures, creating a value floor that's hard to shift.
Even state-of-the-art LLMs like GPT-4o and Claude 3.5 still exhibit varying degrees of sycophancy depending on the input language, revealing persistent cultural and linguistic biases.
Over-refusal isn't just a misapplication of a global "no" switch; it's deeply intertwined with how LLMs represent and execute specific tasks.
Forget auxiliary encoders and handcrafted losses: LVRPO uses reinforcement learning to directly align language and vision, boosting performance across a range of multimodal tasks.
Unlock better hardware designs: RTLSeek's diversity-oriented RL lets LLMs explore a wider range of Verilog implementations, boosting both correctness and design options.
Agentic coding models can achieve near-SOTA performance by specializing in distinct coding domains before unifying them via on-policy distillation.
Training domain-specific coding LLMs with realistic environments and large-scale RL can yield substantial gains in practical software engineering tasks.
LMs can learn to generate multiple plausible answers in a single forward pass, outperforming traditional single-answer models on tasks requiring distributional reasoning and offering a compute-efficient alternative to best-of-k sampling.
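The best-of-k baseline that the last blurb compares against is easy to state: sample k candidates, score each, keep the best, at the cost of k forward passes where a multi-answer model spends one. A sketch with hypothetical candidates and a hypothetical scorer:

```python
def best_of_k(sample_fn, score_fn, k):
    """Best-of-k baseline: draw k candidates and keep the top scorer.
    Costs k model calls, vs. one pass for a model emitting an answer set."""
    candidates = [sample_fn() for _ in range(k)]
    return max(candidates, key=score_fn)

# Hypothetical answers and scorer; in practice sample_fn would call the
# model and score_fn would be a reward model or verifier.
answers = ["Paris", "Lyon", "Marseille"]
score = {"Paris": 0.9, "Lyon": 0.3, "Marseille": 0.2}
it = iter(answers)
pick = best_of_k(lambda: next(it), score.get, k=3)  # deterministic demo
```

The compute argument in the blurb is exactly this k-to-1 ratio: a single forward pass that yields a calibrated set of answers replaces the k sampled rollouts plus the external scorer.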