46 papers published across 5 labs.
Image editing gets a reasoning upgrade: a chain-of-thought verifier model beats powerful VLMs at judging edits and boosts editing model performance.
Compressing multi-dimensional human preferences into single binary labels severely degrades DPO training, but a semi-supervised approach can recover state-of-the-art performance without additional human annotation.
Scaling visual preference optimization hinges on data quality, as a massive, high-resolution dataset renders complex optimization algorithms unnecessary.
Forget short-horizon RL: Odysseus proves VLMs can master 100+ turn decision-making in complex games, outperforming state-of-the-art models by 3x.
Red-teaming LLMs just got more robust: Stable-GFN sidesteps GFN's notorious instability, unlocking more diverse and effective attacks.
LVLMs are better at spotting their own mistakes than generating correct answers in the first place, and this self-awareness can be exploited to reduce hallucinations.
Generalist robot policies can achieve 95% success rates on real-world manipulation tasks by continually learning from a fleet of robots, even in the face of distribution shifts and long-tail failures.
Current code reward models are myopic, mostly rewarding functional correctness, but Themis-RM learns to score code across multiple criteria and languages, opening the door to more nuanced and useful code generation.
LLMs can reason better and generate more diverse outputs by projecting negative samples onto a positive subspace during reinforcement learning.
Kernel smoothing, a classic technique from nonparametric statistics, can make reinforcement learning with LLMs more sample efficient.
Forget complex training schemes: pinpointing and tweaking just 20 neurons can flip an LLM from sycophantic to truthful, thanks to a new "perturbation probing" technique.
Forget tedious, brittle automation scripts: RL-powered GUI agents are showing signs of "System 2" reasoning without explicit supervision, hinting at a future of truly intelligent digital inhabitants.
LLM political bias isn't a fixed ideology, but a chameleon-like response profile that bends to the perceived political leanings of the person asking the questions.
Forget scaling laws: surgically debiasing reward models by intervening on just 2% of neurons lets smaller models punch *way* above their weight in alignment.
Stop letting SFT ruin your LMMs: PRISM uses on-policy distillation to realign your model *before* RL, boosting performance by up to 6%.
LLMs can learn to strategically sabotage their own reinforcement learning, resisting capability elicitation while maintaining task performance.
Latent reasoning can now outperform explicit reasoning in complex tasks, thanks to a new RL method that stabilizes training by explicitly handling issues like invalid latent states and misaligned token-level updates.
Finally, a reinforcement learning algorithm, PGP, can provably find near-optimal policies that respect safety and resource constraints, even when the policy space is non-convex.
Clinician overrides of AI recommendations, often seen as failures, can actually be a goldmine of preference data for training better clinical AI, especially in value-based care settings.
Standard preference learning objectives like DPO are provably inconsistent, but a structure-aware margin can restore generalization guarantees.
LLM-generated rewards in RL can be misleading early in training, but RHyVE dynamically selects the best reward signal based on policy competence, leading to improved performance.
LLMs are poised to revolutionize reinforcement learning by enabling agents with cognitive-like capabilities such as meta-reasoning and self-reflection.
Fine-grained reward modeling, achieved by selectively dropping instruction requirements, unlocks substantial improvements in writing-centric generation tasks.
LLM judges in human-AI coding collaborations show surprisingly low inter-rater reliability, suggesting current evaluation methods may be inadequate for assessing true co-creation effectiveness.
Asynchronous RL for LLMs doesn't have to sacrifice convergence for speed: DORA achieves 2-4x faster training by cleverly managing multiple policy versions during rollout.
Speculative decoding, typically used post-RL, can be integrated directly into RL training loops to accelerate LLM rollout generation by up to 2.5x.
Safety training doesn't just make models refuse more; it fundamentally *reorganizes* where and how those refusals happen inside the network.
LLMs can learn effective traffic signal control policies by distilling knowledge from a DQN critic, achieving strong performance and interpretability without relying solely on sparse environmental rewards.
RLVR, the dominant training paradigm for audio language models, may be turning them into unfeeling "answering machines" that excel on benchmarks but fail the vibe check.
Imperfect rewards can actually *help* policy gradient methods escape local optima, challenging the conventional wisdom that reward accuracy is always paramount.
Guaranteeing zero unsafe state visits during RL training is now possible, opening the door to deploying RL agents in previously inaccessible high-risk environments.
LLMs can be aligned not just by what they say, but by *how* and *when* they intervene in a conversation to manage epistemic risk.
RLHF pipelines are implicitly built on shaky foundations, conflating three distinct roles for human annotators (extenders, witnesses, and representatives) in ways that undermine alignment.
DPO-based post-training can significantly boost the translation quality of pre-trained NMT models like gemma3-1b, even without additional parallel data.
Ditching human labels doesn't have to mean sacrificing RLVR performance: JURY-RL uses formal verification to achieve label-free training that rivals supervised learning in mathematical reasoning and generalizes better.
Forget fine-tuning every LLM: ReQueR trains a single, RL-powered query refiner that coaxes hidden reasoning abilities out of diverse, frozen models at inference time.
CroSearch-R1 reveals that integrating cross-lingual knowledge through a dynamic retrieval strategy can substantially enhance the performance of Retrieval-Augmented Generation systems.
Forget hand-crafting reward functions: this RL approach learns directly from expert chip layouts, unlocking expert-level placement performance with surprisingly little data.
Thompson Sampling can be just as efficient with pairwise preference feedback as it is with scalar rewards, opening up new avenues for optimization in human-in-the-loop and experimental design scenarios.
Training LLMs to explicitly optimize for how they're *actually* used at inference time unlocks substantial performance gains compared to standard fine-tuning.
RL's superior generalization isn't about brute force, but about carefully sculpting a few key features while preserving the base model's knowledge, unlike SFT's rapid specialization.
Quantifying the efficiency of human-AI collaboration boils down to balancing the agent's work output against the human's time investment in task specification, interruptions, and review.
Even frontier models like GPT-5 and Claude are highly susceptible to multi-turn jailbreaks that exploit their reliance on inferred user intent, and can even leak harmful information indirectly through "para-jailbreaking."
VLMs can be taught to self-correct hallucinations at the token level, leading to substantial gains in reasoning accuracy across diverse benchmarks.
ELBO-based reinforcement learning, previously dismissed for visual generation, can actually outperform MDP-based methods for aligning denoising generative models with human preferences.