42 papers published across 6 labs.
Aligning diffusion models with just 100 carefully selected samples can beat state-of-the-art preference optimization methods trained on thousands, and converge up to 220x faster.
Training domain-specific coding LLMs with realistic environments and large-scale RL can yield substantial gains in practical software engineering tasks.
LMs can learn to generate multiple plausible answers in a single forward pass, outperforming traditional single-answer models on tasks requiring distributional reasoning and offering a compute-efficient alternative to best-of-k sampling.
LLMs can reason through chains of thought 2.5x longer and achieve 8% higher accuracy on complex math problems by optimizing for token-level influence on future trajectory behavior.
Injecting demonstrations with a carefully annealed probability can drastically improve exploration in RLVR, even for tasks requiring novel reasoning or domain-specific knowledge.
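A minimal sketch of the annealing idea as I read it (the linear schedule, batch format, and helper names below are illustrative assumptions, not taken from the paper): with a probability that decays over training, a rollout in the RL batch is swapped for a known-good demonstration, so early exploration is guided toward verifier-passing trajectories.

```python
import random

def demo_injection_prob(step: int, total_steps: int,
                        p_start: float = 0.5, p_end: float = 0.0) -> float:
    """Linearly anneal the probability of injecting a demonstration."""
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + (p_end - p_start) * frac

def build_rlvr_batch(prompts, policy_sample, demo_lookup, step, total_steps):
    """Mix on-policy rollouts with demonstrations at an annealed rate.

    policy_sample(prompt) -> completion sampled from the current policy
    demo_lookup(prompt)   -> a known-good (verifier-passing) completion, or None
    """
    p = demo_injection_prob(step, total_steps)
    batch = []
    for prompt in prompts:
        demo = demo_lookup(prompt)
        if demo is not None and random.random() < p:
            batch.append((prompt, demo, True))                    # injected demonstration
        else:
            batch.append((prompt, policy_sample(prompt), False))  # on-policy rollout
    return batch
```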
Observational user feedback, often dismissed as too noisy and biased, can actually power effective RLHF with the right causal modeling, achieving a 49.2% gain on WildGuardMix.
Scale up offline policy training for diffusion LLMs without breaking the bank: dTRPO slashes trajectory computation costs while boosting performance by up to 9.6% on STEM tasks.
Forget prompt engineering – LSE trains LLMs to self-edit their own contexts at test time, outperforming even GPT-5 and Claude Sonnet 4.5 in Text-to-SQL and question answering.
Forget random data mixing: MOSAIC uses failure analysis to intelligently curate training data, leading to better safety, less over-refusal, and improved instruction following, all at once.
Unleashing an LLM's inner creativity or laser-sharp logic is now as simple as turning a knob, thanks to a new distribution-matching method that avoids heuristic rewards.
LLMs surprisingly prioritize norm adherence over personal incentives in business scenarios, challenging assumptions about goal-driven behavior.
Training multi-turn LLM agents just got easier: ProRL Agent offers a scalable, API-driven rollout service that streamlines RL training across diverse tasks.
LLM post-training pipelines can be configured with 10x less compute using AutoPipe, a budget-aware framework that learns from historical runs and predicts performance from early training signals.
Human oversight can be systematically integrated into LLM-based text generation to improve accessibility, creating a traceable and auditable process.
Achieve significant reasoning gains in frozen LLMs (+22.4%) without retraining by adaptively routing reward model guidance at the token level during inference.
Forget fixed decoding strategies – RL can learn a lightweight policy to adapt LLM sampling *at test time*, boosting summarization quality by up to 88% without retraining the LLM.
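One way to picture such a test-time decoding controller (the feature set, candidate grid, and class below are assumptions for illustration, not the paper's interface): a tiny learned policy maps cheap per-input features to sampling parameters such as temperature and top-p, trained with a bandit-style REINFORCE update against any scalar quality score, while the LLM itself stays frozen.

```python
import numpy as np

class DecodingPolicy:
    """Tiny linear policy that picks (temperature, top_p) per input."""

    def __init__(self, n_features: int,
                 candidates=((0.3, 0.9), (0.7, 0.95), (1.0, 1.0)), lr: float = 0.1):
        self.candidates = candidates                    # discrete sampling settings
        self.W = np.zeros((len(candidates), n_features))
        self.lr = lr

    def _probs(self, features: np.ndarray) -> np.ndarray:
        logits = self.W @ features
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

    def act(self, features: np.ndarray):
        """Sample a candidate setting for this input."""
        probs = self._probs(features)
        idx = np.random.choice(len(self.candidates), p=probs)
        return idx, self.candidates[idx]

    def update(self, features: np.ndarray, idx: int, reward: float, baseline: float = 0.0):
        """REINFORCE: push up the log-prob of settings that scored well."""
        probs = self._probs(features)
        grad = -probs[:, None] * features[None, :]      # d log-softmax / dW for all rows
        grad[idx] += features
        self.W += self.lr * (reward - baseline) * grad
```

At inference the controller just calls `act` on the input's features and hands the chosen temperature/top-p to the frozen LLM's sampler.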
Learning from ranked preferences alone can be surprisingly difficult: even with access to the full ranking of actions, standard online learning guarantees break down unless the environment is sufficiently stable.
Forget static models: this adaptive framework slashes stock price prediction error by dynamically routing data through specialized pathways based on real-time market regime detection.
Stripping away the complexity of GRPO reveals that simple REINFORCE with group relative advantage can actually *improve* LLM reasoning, challenging the assumption that sophisticated loss functions are always better.
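The group-relative advantage itself is simple; here is a minimal sketch (the tensor layout and the omission of clipping or KL terms are my simplifications): sample several completions per prompt, subtract the group's mean reward (optionally dividing by its std), and weight each completion's summed log-probability by that advantage.

```python
import torch

def group_relative_reinforce_loss(logprobs: torch.Tensor,
                                  rewards: torch.Tensor,
                                  mask: torch.Tensor,
                                  normalize_std: bool = True) -> torch.Tensor:
    """Plain REINFORCE with a group-relative baseline.

    logprobs: [G, T] per-token log-probs of G sampled completions for one prompt
    rewards:  [G]    scalar reward per completion (e.g. verifier 0/1)
    mask:     [G, T] 1 for generated tokens, 0 for padding
    """
    advantage = rewards - rewards.mean()
    if normalize_std:
        advantage = advantage / (rewards.std() + 1e-6)
    seq_logprob = (logprobs * mask).sum(dim=-1)               # [G]
    # Maximize advantage-weighted log-likelihood -> minimize its negative.
    return -(advantage.detach() * seq_logprob).mean()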
Human-AI teams often fail not because AI is inaccurate, but because humans miscalibrate their reliance on it, highlighting the need for readiness metrics beyond accuracy.
Greedy off-policy learning, optimal in theory, can fail spectacularly when supplies are limited, but a simple fix—prioritizing items with high *relative* reward—can restore performance.
Low-resource language models can get a major boost in translation quality and tokenization efficiency by using reinforcement learning to directly enforce structural constraints like sequence length and linguistic well-formedness during training.
Achieve topologically coherent coronary vessel segmentation by directly optimizing for geometric structure, rather than pixel-wise accuracy, using preference-based learning.
LRMs can be made more efficient and accurate by strategically adjusting their output length based on task difficulty, leading to a better accuracy-length trade-off.
Forget struggling with cryptic SQL: a new LLM fine-tuned with human preferences generates comments so good that they beat Qwen3-14B by up to 13% on standard metrics.
Decomposing GUI agent trajectories into verifiable milestones and auditing the evidence chain yields a 10% boost in RL training performance, outperforming single-judge reward systems.
Forget hand-crafting agents: Memento-Skills lets a generalist LLM agent autonomously design and improve specialized agents through experience, achieving substantial gains on complex benchmarks.
On-policy reward modeling with LLM judges not only unlocks significant performance gains on complex mathematical reasoning tasks, but also generalizes to improve performance on simpler numerical and multiple-choice benchmarks.
A 30B MoE model can now achieve Gold Medal-level performance in IMO, IOI, and ICPC, rivaling frontier models with 20x more parameters.
Imagine a single algorithm that dominates in both predictable and chaotic ranking scenarios – this paper delivers it for multi-dueling bandits.
Skip the expensive reward model: RewardFlow distills sparse task rewards into dense, state-level signals by propagating credit through the topology of LLM reasoning trajectories.
Aligning rewards with sub-goals and emphasizing key trajectory segments with hindsight information significantly improves multi-turn agentic RL, outperforming existing methods on complex tasks.
Robots can now navigate based on your spoken preferences and visual context, thanks to a clever fusion of VLMs, LLMs, and multi-objective RL.
LLMs can achieve state-of-the-art reasoning accuracy with significantly fewer tokens by rewarding intermediate reasoning steps that maximize information gain and maintain monotonic progress.
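One reading of that step reward (the probability estimator and the monotonicity handling below are illustrative assumptions, not the paper's definitions): score each reasoning step by how much it raises an estimate of the probability that the final answer will be correct, and give nothing for steps that move that estimate backwards.

```python
from typing import Callable, List

def stepwise_progress_rewards(prefixes: List[str],
                              p_correct: Callable[[str], float]) -> List[float]:
    """Reward each step by its (non-negative) gain in estimated success probability.

    prefixes[t] is the reasoning trace truncated after step t;
    p_correct(prefix) is any estimator of P(final answer correct | prefix),
    e.g. a value head or an LLM judge mapped to [0, 1].
    """
    rewards = []
    prev = p_correct("")                       # prior before any reasoning
    for prefix in prefixes:
        cur = p_correct(prefix)
        rewards.append(max(cur - prev, 0.0))   # only monotone progress is rewarded
        prev = max(prev, cur)                  # keep the baseline non-decreasing
    return rewards
```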
RL agents can learn far more efficiently by dynamically distilling and leveraging past experiences that co-evolve with the agent's growing capabilities.
LLMs can act as effective action-level supervisors in reinforcement learning, dramatically boosting the sample efficiency of SAC without sacrificing convergence guarantees.
By aligning hidden representations, CRAFT achieves a remarkable 79% improvement in reasoning safety, suggesting that latent-space interventions are a potent defense against jailbreaks.
LLMs can escape the trap of confidently wrong reasoning by co-evolving a generator and verifier from a single model, bootstrapping each other to break free from flawed consensus.
RLHF for autoregressive video generation gets a boost with AR-CoPO, which overcomes the limitations of SDE-based methods by using chunk-level alignment and a semi-on-policy training strategy.
Forget expensive human annotations – this new method uses information theory to automatically score each step of an LLM's reasoning process, making chain-of-thought supervision scalable and efficient.
Online RLHF can match the performance of offline RLHF with 10x less data, and potentially 1000x less at scale.
Autonomous vehicles can now leverage the rich semantic understanding of VLMs for safer driving without the computational overhead, thanks to a clever training strategy that distills VLM knowledge into a real-time RL policy.