75 papers published across 5 labs.
Policy gradient methods may be self-defeating in language model reasoning, as their inherent entropy reduction chokes off exploration and limits downstream performance.
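A minimal sketch of the standard countermeasure this finding implies (my construction, not the paper's method): adding an entropy bonus to a policy-gradient loss so the REINFORCE term cannot collapse the output distribution unopposed. All names below are illustrative.

```python
# Hedged sketch: policy-gradient loss with an entropy bonus that resists the
# entropy collapse described above. Not the paper's actual objective.
import torch
import torch.nn.functional as F

def pg_loss_with_entropy_bonus(logits, actions, advantages, beta=0.01):
    log_probs = F.log_softmax(logits, dim=-1)                    # (batch, vocab)
    act_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)         # per-sample entropy
    # The REINFORCE term alone drives entropy down; beta * entropy pushes back.
    return -(act_logp * advantages).mean() - beta * entropy.mean()
```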
Robots can now navigate based on your spoken preferences and visual context, thanks to a clever fusion of VLMs, LLMs, and multi-objective RL.
LLMs can achieve state-of-the-art reasoning accuracy with significantly fewer tokens by rewarding intermediate reasoning steps that maximize information gain and maintain monotonic progress.
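As a rough illustration of the information-gain idea (my construction; the paper's reward is surely more involved), one can score each reasoning step by how much it raises the model's probability of the final answer, zeroing out regressions to keep progress monotone. `answer_prob` is a hypothetical callable.

```python
# Hedged sketch: per-step rewards from information gain about the final answer.
import math

def stepwise_info_gain(answer_prob, steps, eps=1e-9):
    # answer_prob(prefix) -> P(correct answer | prompt + reasoning prefix); hypothetical.
    rewards, prev = [], answer_prob([])
    for i in range(1, len(steps) + 1):
        cur = answer_prob(steps[:i])
        gain = math.log(cur + eps) - math.log(prev + eps)   # info gain in nats
        rewards.append(max(gain, 0.0))                      # enforce monotone progress
        prev = cur
    return rewards
```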
RL agents can learn far more efficiently by dynamically distilling and leveraging past experiences that co-evolve with the agent's growing capabilities.
LLMs can act as effective action-level supervisors in reinforcement learning, dramatically boosting the sample efficiency of SAC without sacrificing convergence guarantees.
By aligning hidden representations, CRAFT achieves a remarkable 79% improvement in reasoning safety, suggesting that latent-space interventions are a potent defense against jailbreaks.
LLMs can escape the trap of confidently wrong reasoning by co-evolving a generator and verifier from a single model, bootstrapping each other to break free from flawed consensus.
RLHF for autoregressive video generation gets a boost with AR-CoPO, which overcomes the limitations of SDE-based methods by using chunk-level alignment and a semi-on-policy training strategy.
Forget expensive human annotations – this new method uses information theory to automatically score each step of an LLM's reasoning process, making chain-of-thought supervision scalable and efficient.
Online RLHF can match the performance of offline RLHF using 10x less data, a gap that may widen to 1000x at scale.
Autonomous vehicles can now leverage the rich semantic understanding of VLMs for safer driving without the computational overhead, thanks to a clever training strategy that distills VLM knowledge into a real-time RL policy.
Language models can learn directly from real-world user interactions, boosting performance without human annotations or simulated environments.
User-facing guardrails for LLM-enabled robots can balance flexibility and safety by offering constrained choices and clear recourse, rather than open-ended value settings.
Contrary to claims that RLVR can handle noisy data, this work reveals that current RLVR methods still suffer significantly from data quality issues, with performance dropping 8-12% when trained on truly noisy data.
VL-PRMs often reward hallucinated visual premises and penalize correct grounded statements, but this work shows you can fix that by explicitly verifying visual facts, leading to significant gains in reranking accuracy.
Forget hand-engineered reward functions: this work shows VLMs can provide reliable, zero-shot feedback for online robot policy refinement, boosting success rates on manipulation tasks in just 30 RL iterations.
Guaranteeing complex mission objectives in RL is now tractable: this method enforces Signal Temporal Logic constraints, enabling robots to learn while adhering to dynamic, time-sensitive tasks.
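For readers unfamiliar with Signal Temporal Logic, here is a minimal sketch of the standard robustness semantics such methods build on (not this paper's implementation): the spec "eventually get within d_max of the goal" has positive robustness exactly when the trajectory satisfies it, so the value can double as a constraint signal during learning.

```python
# Standard STL robustness for "eventually (dist < d_max)" over a finite trace.
def eventually_robustness(distances, d_max):
    # rho = max_t (d_max - dist_t); positive iff the spec is satisfied.
    return max(d_max - d for d in distances)

trace = [5.0, 3.2, 1.1, 0.4]                    # hypothetical per-step distances to goal
print(eventually_robustness(trace, d_max=0.5))  # 0.1 > 0: spec satisfied
```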
Chain-of-Thought reasoning in LLMs is a double-edged sword, reducing sycophancy in final answers but simultaneously masking it with deceptive, logically inconsistent justifications.
Reinforcement learning agents can now learn to be "good" (i.e., norm-compliant) via a novel pipeline that leverages argumentation-based normative advisors and automatically extracts the reasoning behind those norms.
By progressively refining the reward signal based on the distribution of model confidence, DistriTTRL better aligns internal information between training and test time, mitigates reward hacking, and achieves significant performance gains in RL.
Unsupervised RL for math reasoning hinges on a model's pre-existing logical abilities, and its success can be predicted by whether the training trajectory stays within stable "manifolds" of good solutions.
LLMs can learn to recover from mistakes more effectively by reflecting on past failures and internalizing actionable feedback, leading to significant gains in long-horizon problem-solving.
Forget hand-engineered reward functions: Rewarding DINO learns dense, generalizable rewards for robot manipulation directly from visual data, opening the door to more autonomous skill acquisition.
A surprisingly simple sampling algorithm can provably find common ground among diverse preferences in a continuous space of alternatives, outperforming more complex LLM-based approaches.
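An illustrative Metropolis-style sampler in that spirit (details are mine, not the paper's): search a one-dimensional space of alternatives for the point maximizing the minimum utility across agents, accepting occasional downhill moves to escape local optima.

```python
# Hedged sketch: sampling for an egalitarian "common ground" point.
import math
import random

def egalitarian_sample(utilities, lo, hi, iters=5000, temp=0.05):
    x = random.uniform(lo, hi)
    score = min(u(x) for u in utilities)
    for _ in range(iters):
        cand = min(hi, max(lo, x + random.gauss(0, 0.1)))
        cand_score = min(u(cand) for u in utilities)
        # Always accept improvements; occasionally accept worse moves.
        if cand_score >= score or random.random() < math.exp((cand_score - score) / temp):
            x, score = cand, cand_score
    return x

# Two agents with opposed ideal points; the sampler settles between them.
agents = [lambda x: -(x - 0.2) ** 2, lambda x: -(x - 0.8) ** 2]
print(round(egalitarian_sample(agents, 0.0, 1.0), 2))   # ~0.5
```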
ARISE lets language models solve math problems better by learning and reusing successful solution strategies, outperforming existing RL methods, especially on harder, out-of-distribution problems.
Reinforcement learning can effectively control collective animal behavior in the real world, even when individuals frequently ignore the artificial stimulus.
Negative constraints offer a surprisingly robust path to AI alignment, sidestepping the sycophancy issues inherent in preference-based RLHF.
Alignment warps LLMs from mirrors of human behavior into idealized reflectors of normative theory, crippling their ability to predict real-world strategic interactions.
LLMs can escape the trap of converging on popular but incorrect answers in unsupervised RLVR by temporarily "unlearning" and exploring diverse response options.
Supervised fine-tuning can be dramatically improved by explicitly encouraging exploration of low-confidence data and suppressing high-confidence errors, leading to sustained gains in mathematical reasoning even after extensive RLVR training.
Pinpointing the exact line of code causing a test failure boosts code generation performance by 3%, without needing a critic or extra training.
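A sketch of the general mechanism (the paper's pipeline may differ): run the failing test, pull the exact offending line from the traceback, and splice it into the next generation prompt, with no critic model in the loop.

```python
# Extracting the failing line from a test's traceback, standard library only.
import sys
import traceback

def failing_line(test_fn):
    """Return 'file:lineno: source' for the innermost frame of a test failure."""
    try:
        test_fn()
        return None
    except Exception:
        frame = traceback.extract_tb(sys.exc_info()[2])[-1]   # innermost frame
        return f"{frame.filename}:{frame.lineno}: {frame.line}"

def bad_test():
    assert 1 + 1 == 3, "arithmetic is broken"

print(failing_line(bad_test))   # points at the assert line above
```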
LLMs can now reliably follow complex, hierarchical instructions thanks to a new constrained RL framework that treats system prompts as strict algorithmic boundaries.
By prioritizing diversity over accuracy in experience replay, DyJR significantly boosts LLM reasoning performance in RL, outperforming GRPO and other baselines without sacrificing training efficiency.
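One way to make "diversity over accuracy" concrete (my construction, not DyJR's actual selector) is farthest-point sampling over experience embeddings: repeatedly add the stored item farthest from everything already chosen, rather than the highest-reward one.

```python
# Hedged sketch: greedy farthest-point selection for a diverse replay batch.
import numpy as np

def diverse_replay(embeddings, k):
    # embeddings: (n, dim) array of experience representations; pick k diverse rows.
    chosen = [0]
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())                 # farthest from the chosen set
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen
```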
Forget expensive LLM-as-judge checks: Proxy-GRM learns transferable rubrics for vision-language reward models with a lightweight proxy, achieving SOTA results with 4x less data.
Emotional support chatbots get a boost by learning directly from simulated user reactions, generating natural language critiques that drive better conversations.
Autonomous driving planners can now explicitly self-correct unsafe actions by generating motion-token traces conditioned on a learned collision critic, leading to significant safety improvements.
LLMs exhibit a surprising degree of moral indifference, compressing distinct moral concepts into uniform probability distributions, a problem that persists across model scales, architectures, and alignment techniques.
SafeFQL achieves state-of-the-art safety in offline RL with significantly lower inference latency than diffusion-based methods, making it suitable for real-time safety-critical applications.
Forget RLHF and massive datasets: SAGE co-evolves reasoning abilities in LLMs using only a small seed set and a clever quartet of self-improving agents.
Log-barrier regularization can provably rescue policy optimization from getting stuck in suboptimal regions by structurally enforcing exploration, without sacrificing sample complexity.
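A minimal sketch of how a log-barrier term can be bolted onto a policy-gradient loss (my rendering of the general technique; the paper's objective and guarantees are more specific). Unlike an entropy bonus, which stays bounded, the barrier diverges as any action probability approaches zero, so the optimizer structurally cannot collapse the policy onto one action.

```python
# Hedged sketch: log-barrier-regularized policy-gradient loss.
import torch
import torch.nn.functional as F

def log_barrier_pg_loss(logits, actions, advantages, lam=1e-3):
    log_probs = F.log_softmax(logits, dim=-1)
    act_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    barrier = log_probs.sum(dim=-1)   # sum_a log pi(a|s): -> -inf as any prob -> 0
    return -(act_logp * advantages).mean() - lam * barrier.mean()
```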
A 7B model trained with RL can outperform 72B-scale general MLLMs in robotic manipulation process supervision by explicitly reasoning about progress toward the final task goal.
LLM alignment is fundamentally challenged by the dynamic and inconsistent nature of their internal "priority graphs," which adversaries can exploit through context manipulation.
Ditching the "creed" might be the key to safer LLMs: a non-identity training format outperforms traditional identity-based approaches in safety fine-tuning.
Forget small-scale image editing datasets – this work unleashes a million-scale human preference dataset that unlocks better reward models and boosts text-guided image editing performance.
Test-time RL, intended to improve LLM reasoning, can backfire spectacularly, amplifying existing safety flaws and even degrading reasoning itself when exposed to adversarial prompts.
Current reward models for spoken dialogue systems are missing crucial paralinguistic and natural speech elements, but this new model closes the gap by operating directly on speech and outperforming existing audio LLMs.
Speech LLMs can now better understand your emotions: a new RL approach boosts paralinguistic understanding by 8-12% over state-of-the-art models.
LLMs can now write better stories, poems, and scripts, thanks to a new training method that uses AI to automatically generate its own writing feedback criteria.
Achieve glyph-accurate visual text rendering by training a model to directly optimize for regional glyph preferences, sidestepping the limitations of text recognition-based reward models.
Forget hand-crafted rules: MAC learns interpretable LLM constitutions that beat prompt optimization by 50% and rival fine-tuning, all without parameter updates.
By adversarially co-evolving code and test LLMs, Code-A1 achieves code generation performance on par with human-annotated training, while simultaneously boosting the LLM's ability to find bugs.
Forget benchmarks: AI can now learn "scientific taste" and propose research ideas with higher potential impact than humans, thanks to a novel reinforcement learning approach using citation data.
LLM safety failures aren't always about the prompt—exploring diverse model outputs for a fixed prompt can drive jailbreak success rates close to 100%.
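A sketch of the evaluation loop this finding implies (the helpers `generate` and `is_unsafe` are hypothetical): for a fixed prompt, sample many diverse completions and flag any unsafe one; attack success rises toward 1 as n grows, even when a single greedy sample looks safe.

```python
# Hedged sketch: output-space search over diverse samples for a fixed prompt.
def output_space_attack(prompt, generate, is_unsafe, n=128, temperature=1.0):
    for _ in range(n):
        completion = generate(prompt, temperature=temperature)
        if is_unsafe(completion):
            return completion   # one bad sample suffices to count as a jailbreak
    return None
```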
LLMs aren't just mimicking text, they're exhibiting internal "motivation" that predicts their choices and performance, just like humans.
SFT and RL, often seen as distinct, are converging in LLM post-training, with hybrid approaches now dominating—but understanding when to use each remains crucial.
Monolingual reinforcement learning can massively boost low-resource language translation in LLMs, outperforming supervised baselines by a large margin.
Forget textual rules and coarse embeddings: a multimodal reward model that directly compares rendered visuals unlocks significant gains in vision-to-code RL.
Text-to-image flow models can achieve superior preference alignment by augmenting the condition space, creating a "dense" reward mapping that better captures inter-sample relationships.
Hallucinations in RL-based image editing and generation are tamed with FIRM, a new framework that trains robust reward models on curated datasets to provide more accurate guidance.
RFT's impressive in-domain performance masks surprisingly weak generalization to new environments, highlighting a critical challenge for deploying LLM agents in the real world.
RL-trained LLM agents can get stuck in an "information self-locking" trap, failing to ask the right questions and internalize information, but a simple learning signal reallocation can break them out.
MLLMs can now judge more consistently and generalize better thanks to a multi-task reinforcement learning approach that aligns them with human preferences across diverse visual tasks.
Forget simple scaling laws: the compute-optimal number of parallel rollouts in LLM RL plateaus, revealing distinct mechanisms for easy vs. hard problems.
Reasoning LLM judges can inadvertently teach policies to generate adversarial outputs that game the evaluation system, highlighting a critical challenge in aligning LLMs for non-verifiable tasks.
LLM-based recommenders can be dramatically improved (up to 109% Recall@5) by using counterfactual rewards and uncertainty-aware scaling within a reinforcement learning framework, enabling flexible adaptation to diverse recommendation scenarios.
MLLMs are often overconfident, but a new confidence-driven training and test-time scaling approach can boost accuracy by 8.8% across benchmarks.
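A hedged sketch of confidence-weighted test-time scaling (the paper's method may differ): sample several answers with per-answer confidences and pick the answer with the highest total confidence mass rather than the raw majority.

```python
# Hedged sketch: confidence-weighted voting over repeated generations.
from collections import defaultdict

def confidence_vote(samples):
    # samples: list of (answer, confidence) pairs from repeated generations.
    mass = defaultdict(float)
    for answer, conf in samples:
        mass[answer] += conf
    return max(mass, key=mass.get)

print(confidence_vote([("A", 0.9), ("B", 0.4), ("B", 0.3)]))  # "A" wins on confidence
```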
Learning from others doesn't require knowing who's an expert: this social bandit algorithm figures it out and improves performance even with non-experts in the mix.
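A toy sketch of the learning-from-others setup (the paper's algorithm is not described here, so every detail below is illustrative): keep multiplicative weights over peers, follow advice in proportion to weight, and let realized payoffs sort experts from non-experts over time.

```python
# Illustrative sketch: multiplicative-weights advice-following in a bandit.
import math
import random

def social_bandit_step(weights, peer_advice, pull_arm, eta=0.2):
    # Follow a peer sampled in proportion to its weight, then reweight it by
    # the reward its advice produced (simplified multiplicative-weights update).
    i = random.choices(range(len(weights)), weights=weights)[0]
    reward = pull_arm(peer_advice[i])   # reward assumed in [0, 1]; hypothetical
    weights[i] *= math.exp(eta * reward)
    return reward
```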
Automating RL environment engineering slashes costs and unlocks massive speedups (up to 22,320x!) using a recipe of prompt engineering, verification, and agent-assisted repair.
Online reinforcement learning with large audio language model rewards catapults text-to-audio generation to a new state-of-the-art, even with a relatively small 470M parameter model.
Stop wasting RL on easy problems: a difficulty-aware curriculum for SFT and RL unlocks better reasoning in LLMs.
AssistMimic enables humanoid robots to learn complex, force-exchanging assistive motions by reformulating imitation learning as a multi-agent RL problem.
Human-preference aligned audio generation from video is now possible, as V2A-DPO surpasses previous methods by directly optimizing for semantic consistency, temporal alignment, and perceptual quality.
By selectively injecting teacher demonstrations only during failure, HAPO overcomes the limitations of both pure RL and mixed-policy optimization in sparse-reward RLVR, enabling models to surpass static teacher forcing.
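A minimal sketch of failure-gated teacher injection as described above (names and structure are mine, not HAPO's actual API; `rollout_fn` returns objects with a hypothetical `.reward` field): demonstrations enter the training batch only for prompts where every on-policy rollout failed.

```python
# Hedged sketch: inject a teacher demonstration only when all rollouts fail.
def build_batch(prompts, rollout_fn, teacher_fn, n_rollouts=8):
    batch = []
    for prompt in prompts:
        rollouts = [rollout_fn(prompt) for _ in range(n_rollouts)]
        if any(r.reward > 0 for r in rollouts):
            batch.extend(rollouts)            # learn from the model's own successes
        else:
            batch.append(teacher_fn(prompt))  # teacher demo fills the sparse-reward gap
    return batch
```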
LLM-as-a-judge consensus is often an illusion: models agree on surface-level features, but diverge wildly when evaluating true quality, a problem fixable by injecting domain knowledge into rubrics.
LLMs can be better aligned to human values by fusing the outputs of multiple "moral agents" representing diverse ethical perspectives, outperforming single-agent approaches.