Search papers, labs, and topics across Lattice.
66 papers published across 9 labs.
Stop wasting RL on easy problems: a difficulty-aware curriculum for SFT and RL unlocks better reasoning in LLMs.
AssistMimic enables humanoid robots to learn complex, force-exchanging assistive motions by reformulating imitation learning as a multi-agent RL problem.
Human-preference aligned audio generation from video is now possible, as V2A-DPO surpasses previous methods by directly optimizing for semantic consistency, temporal alignment, and perceptual quality.
By selectively injecting teacher demonstrations only during failure, HAPO overcomes the limitations of both pure RL and mixed-policy optimization in sparse-reward RLVR, enabling models to surpass static teacher forcing.
LLM-as-a-judge consensus is often an illusion: models agree on surface-level features, but diverge wildly when evaluating true quality, a problem fixable by injecting domain knowledge into rubrics.
LLMs can be better aligned to human values by fusing the outputs of multiple "moral agents" representing diverse ethical perspectives, outperforming single-agent approaches.
Stop training LLMs on lucky guesses: this new RL method uses the model's own in-context learning ability to identify and upweight high-quality reasoning traces, leading to better performance.
Get 6x the RLHF alignment for your LLM with a new active learning pipeline that focuses on annotating the most informative response pairs.
Recommendation welfare can provably exceed what any learner-measurable treatment policy achieves when downstream actors hold private information, forcing a critical re-evaluation of learning objectives in bandit settings with noncompliance.
A new video-based reward model beats GPT-5.2 and Gemini-3 Pro at evaluating computer-using agents, offering a scalable, model-agnostic alternative to traditional methods.
Escape the CVAE decoder bottleneck: SPAARS unlocks better offline-to-online RL by safely exploring a latent space, then seamlessly switching to raw actions.
Forget expensive human annotations: RubiCap uses LLM-generated rubrics to train image captioning models via RL, achieving superhuman performance and even improving VLM pretraining.
LLMs can be steered away from hallucination and toward more robust reasoning by using contrastive learning to capture the shared structure of successful reasoning paths, separating sound chains from flawed ones even when the final answer happens to be correct.
Forget hand-engineered reward functions: Reward-Zero uses language embeddings to give RL agents an intrinsic "sense of completion," dramatically improving sample efficiency and generalization.
Forget finetuning on curated datasets – OpenClaw-RL lets agents learn directly and continuously from *every* interaction, turning user replies, tool outputs, and even GUI changes into valuable RL signals.
Unlock multimodal interleaved generation in existing vision-language models without large interleaved datasets using a novel reinforcement learning approach with hybrid rewards.
LLMs trained with reinforcement learning from verifiable rewards (RLVR) become overconfident in incorrect answers, but a simple fix—decoupling reasoning and calibration objectives—can restore proper calibration without sacrificing accuracy.
LLM-based judges, widely used for automated evaluation, are riddled with diverse biases that can be significantly reduced through bias-aware training using RL and contrastive learning.
Humans are surprisingly forgiving of robot mistakes: slips and freezes are penalized more harshly than other errors, and some mistakes are even interpreted as successes.
Users prefer robots that learn their preferences using CMA-ES-IG because it suggests more perceptually distinct and informative behaviors to rank.
LLMs can switch between reasoning and factual answering on the fly, without retraining, simply by conditioning on specific token prefixes.
LLM agents can learn to continuously adapt and improve in complex environments by reflecting on past experiences and explicitly storing/retrieving reusable lessons, leading to substantial performance gains.
Mitigate the brittleness of RLHF by explicitly controlling for disagreement and tail risk during inference, without retraining, using a KL-robust optimization framework.
Forget noisy, biased LLM evaluators: CDRRM distills preference insights into compact rubrics, letting a frozen judge model leapfrog fully fine-tuned baselines with just 3k training samples.
Humanoid robots can now maintain balance under complex external forces without force/torque sensors, thanks to a force-adaptive RL policy that learns to anticipate and compensate for disturbances.
Intrinsic reward signals in unsupervised RL for LLMs inevitably collapse due to sharpening of the model's prior, but external rewards grounded in computational asymmetries offer a path to sustained scaling.
Instead of imitating reflections, LLM agents can be trained to reason about action quality by rewarding correct judgments between alternative actions, leading to improved performance and generalization.
Human and AI feedback in RLHF are surprisingly susceptible to "choice blindness," where manipulated preferences often go unnoticed, undermining the reliability of alignment signals.
Concave multi-objective RL suffers from a previously unaddressed gradient bias that doubles the sample complexity, but this can be fixed with multi-level Monte Carlo or, surprisingly, vanishes entirely with smooth scalarization functions.
By "imagining" new scenarios and asking "What if this were the true preference?", CRED actively designs environments and trajectories to expose differences between competing reward functions, dramatically improving preference learning.
Achieve better token efficiency in LLM policy optimization by using a novel FiberPO objective whose Jacobian is block-diagonal over trajectories and reduces to identity on-policy.
Jumpstart your research agent: synthetic tool-use plans overcome exploration bottlenecks and boost performance by up to 6% on multi-hop reasoning tasks.
Strategic data curation using a dual-consensus approach beats brute-force training on large noisy datasets for process reward modeling in biological reasoning.
Forget massive datasets – targeted training on a smaller, carefully curated dataset of challenging competitive programming problems yields 3x faster gains in code generation performance.
By rethinking RLHF, MicroCoder-GRPO enables smaller code generation models to rival larger counterparts, achieving significant performance gains and revealing 34 training insights.
Finally, you can use human feedback and other real-world, non-differentiable rewards to fine-tune fast, few-step diffusion models.
LLM agents can learn to solve complex, long-horizon tasks much more effectively by using themselves as post-hoc critics to refine Q-values through hindsight reasoning.
Forget external debuggers: ReflexiCoder teaches LLMs to self-reflect and self-correct code, rivaling GPT-5.1 in performance while slashing inference costs by 40%.
Recursive self-improvement can boost performance by 18% in code and 17% in reasoning, but only if you can keep it from going off the rails – SAHOO provides the guardrails.
StableDRL tames the wild instability of applying reinforcement learning to diffusion language models, enabling more reliable post-training optimization.
Train one RL agent to handle a whole family of reward functions, unlocking robust and adaptable policies without the complexity of multi-task training.
Weak LLMs, when strategically leveraged via confidence-based sample weighting, can not only drastically cut preference alignment costs but also surpass the performance of models trained on full human-labeled datasets.
RLHF's reliance on gradient-based alignment inherently limits its depth, causing it to focus on early tokens and neglect later, potentially harmful, contextual dependencies.
Grounding reward learning in natural language rationales makes policies 2x more robust to spurious correlations and distribution shifts.
Ditch the planner-tracker hierarchy: RL can directly control spherical robots for efficient point-to-point navigation, even transferring from sim-to-real with high stability.
A quadrupedal robot learns to climb steep slopes by "feeling" its own instability, using a learned Tumble Stability Margin to proactively avoid falls.
Reinforcement learning with audio-text semantic rewards can overcome confirmation bias in test-time adaptation, leading to more robust ASR in noisy and accented speech environments.
Debate between AI models hits a phase transition: it's useless when they know the same things, but becomes essential as their knowledge diverges.
Forget scaling laws: reinforcement fine-tuning with verifiable rewards lets a 4B parameter model beat an 8B parameter model on challenging 3D scene understanding tasks.
PPO's fixed clipping hurts exploration by squashing high-reward, low-probability actions, but BandPO fixes this with probability-aware bounds that boost performance.
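For readers unfamiliar with the mechanism this teaser alludes to, here is a minimal NumPy sketch of PPO's fixed clipped surrogate alongside a hypothetical probability-aware band; the widening rule shown is only an illustration of the idea, not BandPO's actual formula.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Standard PPO surrogate: the fixed clip range [1-eps, 1+eps] caps how far
    # the new policy can move, which also squashes high-advantage actions that
    # the old policy assigned low probability.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(ratio * advantage, clipped).mean()

def probability_aware_clip_loss(ratio, advantage, old_prob, eps=0.2):
    # Hypothetical probability-aware band (not BandPO's actual formula):
    # widen the clip range for actions the old policy considered unlikely,
    # so rare high-reward actions are not squashed as aggressively.
    width = np.minimum(eps / np.sqrt(old_prob + 1e-8), 1.0)
    clipped = np.clip(ratio, 1.0 - width, 1.0 + width) * advantage
    return -np.minimum(ratio * advantage, clipped).mean()
```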
Current judge models for instruction-following are surprisingly unreliable, but a new benchmark exposes their flaws and offers a path to better alignment.
LLMs get stuck in their ways: even explicit corrections can't break their rigid adherence to initial (incorrect) reasoning paths in multi-turn interactions, but a novel RL approach can fix it.
By explicitly modeling the latent human evaluation process, VRM offers a more robust reward model, sidestepping the pitfalls of spurious correlations that plague traditional methods.
A 4B parameter SLM can now rival frontier agent performance in complex tool-use environments, thanks to a novel reinforcement finetuning framework that teaches it how to strategically acquire context and execute actions.
By disentangling state representation from policy optimization, DSRM-HRL breaks the accuracy-fairness tradeoff in recommender systems, achieving state-of-the-art fairness without sacrificing utility.
Achieve over 2x training speedup for LLM reasoning without sacrificing accuracy by dynamically pruning Group Relative Policy Optimization (GRPO) with a novel importance sampling correction.
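For context, the group-relative advantage at the core of GRPO fits in a few lines; the paper's dynamic pruning and importance-sampling correction are not reproduced here.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    # GRPO's group-relative advantage: sample several responses per prompt,
    # score each with a (verifiable) reward, then normalize every response's
    # reward against the mean and std of its own group.
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled responses to one prompt, two judged correct.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [ 1., -1., -1.,  1.]
```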
Unlock 2x faster reinforcement learning by distilling group feedback into actionable language refinements that guide exploration.
TaxonRL doesn't just beat humans at bird identification; it shows its work, revealing a transparent reasoning process that could revolutionize how we trust AI in complex visual tasks.
Learn a critic for coding agents from human-in-the-loop interaction traces alone, sidestepping the need for dense, verifiable rewards.
LLM safety can be bypassed with optimal transport, revealing that refusal mechanisms are surprisingly localized within a few layers.
Flow matching's advantage in RL isn't distributional modeling, but rather its ability to correct value estimates iteratively and learn more adaptable features, leading to significant performance gains in challenging online settings.
Ditch hard clipping: GIPO's Gaussian-weighted importance sampling offers a smoother, more stable RL policy optimization, especially when dealing with stale or limited data.
Multilingual LLMs can be made significantly more reliable by directly optimizing for crosslingual consistency using a DPO-inspired method that requires no explicit reward model.
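For reference, the standard DPO objective such a method builds on needs no reward model at all; the crosslingual pairing rule in the usage example below is an assumption for illustration, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO loss: only per-sequence log-probs of the policy and a
    # frozen reference model are needed, no explicit reward model.
    margin = (pi_logp_chosen - ref_logp_chosen) - (pi_logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Assumed crosslingual pairing for illustration: the response that stays
# consistent across languages is 'chosen', the one that drifts is 'rejected'.
chosen, rejected = torch.tensor([-12.3]), torch.tensor([-15.1])
ref_chosen, ref_rejected = torch.tensor([-13.0]), torch.tensor([-14.2])
print(dpo_loss(chosen, rejected, ref_chosen, ref_rejected))
```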
HALyPO stabilizes human-robot collaboration by directly certifying the convergence of decentralized policy learning in parameter space, sidestepping the oscillations that plague standard MARL approaches.
LLMs struggle to maintain consistent personalization as conversations lengthen and preferences become less explicit, suggesting current models fall short of truly adaptive personal assistants.
Forget inspecting final outputs: LLMs telegraph their reward-hacking intentions internally, early in the generation process, via distinctive activation patterns.