68 papers published across 5 labs.
Compressing multi-dimensional human preferences into single binary labels severely degrades DPO training, but a semi-supervised approach can recover state-of-the-art performance without additional human annotation.
Scaling visual preference optimization hinges on data quality, as a massive, high-resolution dataset renders complex optimization algorithms unnecessary.
Test-time RL's vulnerability to noisy pseudo-labels is amplified by group-relative advantage estimation, but can be mitigated with a surprisingly simple debiasing and denoising approach.
Thompson Sampling can be just as efficient with pairwise preference feedback as it is with scalar rewards, opening up new avenues for optimization in human-in-the-loop and experimental design scenarios.
Training LLMs to explicitly optimize for how they're *actually* used at inference time unlocks substantial performance gains compared to standard fine-tuning.
RL's superior generalization isn't about brute force, but about carefully sculpting a few key features while preserving the base model's knowledge, unlike SFT's rapid specialization.
Quantifying the efficiency of human-AI collaboration boils down to balancing the agent's work output against the human's time investment in task specification, interruptions, and review.
Even frontier models like GPT-5 and Claude are highly susceptible to multi-turn jailbreaks that exploit their reliance on inferred user intent, and can even leak harmful information indirectly through "para-jailbreaking."
VLMs can be taught to self-correct hallucinations at the token level, leading to substantial gains in reasoning accuracy across diverse benchmarks.
ELBO-based reinforcement learning, previously dismissed for visual generation, can actually outperform MDP-based methods for aligning denoising generative models with human preferences.
Multi-task RL agents solving related underwater navigation tasks rely on a surprisingly small fraction of their weights (1.5%) to differentiate between tasks.
AI's assumption that users always know what they want leads to "Fantasia interactions," where systems provide superficially helpful but ultimately misaligned assistance, demanding a new approach to alignment research.
Forget expensive real-world robot training: Hi-WM lets humans directly edit a robot's simulated reality, turning world models into powerful, reusable playgrounds for failure recovery.
Ditch the fixed trade-offs: ParetoSlider lets you smoothly navigate competing generative goals in diffusion models at inference time, without retraining.
LLMs can reason more effectively by directly tracking their own belief in the correct answer throughout the reasoning process, enabling more targeted policy updates.
Geometry-aware optimization can dramatically improve LLM alignment by ensuring fairer trade-offs among conflicting human values.
A novel framework ensures that freight negotiations remain competitive and compliant with pricing dynamics, achieving high agreement rates without sacrificing decision transparency.
R2IF improves function-calling accuracy by up to 34.62%, bridging the gap between reasoning and decision-making in LLMs.
Stop guessing what feels good: this system learns personalized vibration preferences from just 40 pairwise comparisons.
LLMs can guide their own self-play, leading to superhuman performance with smaller models and less compute.
LLMs can learn to reason more effectively by breaking down the reasoning process and optimizing each step individually.
Forget meticulously annotating subtasks – SuperIgor lets language models self-learn to generate and refine instruction-following plans through RL feedback.
Naively applying RL to code generation models can *hurt* cross-language transfer, but a clever pre-training trick using "parallel programs" unlocks better generalization.
Ditch the language priors: SSL-R1 unlocks verifiable rewards for MLLM reinforcement learning directly from images, using self-supervision to solve visual puzzles.
Quantizing user preferences into discrete tokens unlocks personalized multimodal content generation with improved consistency between modalities.
Forget external teachers – the best way to boost your RL model's performance is to learn from its future self.
A 7B parameter model, guided by a novel RL framework, can now generate multi-page websites that rival the functionality of a 671B parameter model, while surpassing it in visual appeal.
Stop guessing what explanations users want: PREF-XAI learns personalized explanations by directly modeling user preferences over rule-based explanations.
Learned critics in RLHF can actually *increase* variance and hurt performance in sparse-reward settings, but a simple explained variance metric can tell you when to ditch the critic and get better results.
LLMs aren't just wrong sometimes, they *know* they're wrong and agree with you anyway, thanks to a surprisingly compact "sycophancy-lying circuit" that evades current alignment techniques.
Forget reward model fitting: these primal-dual policy gradient methods offer provably safe and convergent RLHF in infinite horizon settings.
Forget noisy samples, RL can now directly optimize the *gradients* of diffusion distillation, leading to SOTA few-step image generation.
Freezing most of your critic network and only training a tiny LoRA adapter can dramatically improve off-policy RL performance and stability.
Aggregate LLM benchmarks mislead on individual preferences: model rankings correlate near-zero for over half of users.
Noisy multimodal preference datasets are holding back reward model performance, but DT2IT-MRM offers a scalable curation strategy that achieves state-of-the-art results.
LLMs are drowning in verbal tics—sycophantic openers and pseudo-empathetic affirmations—and this "alignment tax" significantly reduces perceived naturalness.
Stop blasting your diffusion models with a single, static reward signal: fine-grained credit assignment across denoising steps and objectives unlocks better image and video generation.
Forget expensive human feedback loops: a VLM-powered reward function can efficiently align image editing diffusion models with human preferences.
Forget expensive human annotation: this self-play method lets LLMs bootstrap their own training signals for open-ended tasks by generating rubrics to evaluate their own outputs.
People can't tell the difference between AI-generated and human-written health advice online, raising serious trust and transparency concerns for online health communities.
Bipedal soccer robots can now autonomously recover from falls in under a second thanks to a novel RL framework.
Decomposing complex, verifiable rewards in LVLM reinforcement fine-tuning provably accelerates convergence and improves generalization, offering a principled alternative to monolithic reward optimization.
Social intelligence may require more than just reasoning power: a 7B model trained with SAVOIR beats GPT-4o and Claude-3.5-Sonnet on social interaction tasks.
FUSE achieves verification quality on par with semi-supervised methods, all without needing any labeled data.
Preference optimization objectives, despite their diversity, can be steered towards disentangled dynamics that avoid suppressing the chosen response alongside the rejected one, simply by satisfying a "disentanglement band" condition.
Generalization in LLMs hinges on training reward saturation dynamics, with reasoning faithfulness emerging as a critical predictor of success under weak supervision.
RL fine-tuning can *hurt* reasoning performance when your base LLM is already too good, unless you force it to explore more diverse solutions.
LLMs can learn to explore beyond their initial latent space and achieve substantial gains in mathematical reasoning by unifying offline teacher guidance and online reinforcement learning with a specialized reward modeling lens.
Reward models optimized for single-step generation can fail spectacularly when integrated into multi-stage LLM pipelines, but pipeline-aware training can fix this.
GRPO's Achilles' heel in deep search is its coarse advantage assignment, but CalibAdv offers a way to surgically correct it, boosting both performance and training stability.
Shifting from token-level to step-level optimization could redefine how we train LLMs for complex, multi-turn interactions.
Achieve more human-like negotiation from dialogue agents by explicitly modeling and reasoning about emotions with interpretable chain-of-thought prompting.
Users can make decisions with 20% higher accuracy using Decisive's approach to preference elicitation from unstructured documents.
Multi-level preference alignment in SignDPO significantly reduces semantic drift, outperforming traditional gloss-free models and challenging gloss-based benchmarks.
Forget expensive, error-prone math problems: PDDL planning offers a surprisingly effective and scalable route to training better Process Reward Models for LLM reasoning.
Forget parallel corpora: CodePivot shows you can train a 7B model to beat behemoth LLMs at multilingual code transpilation by pivoting through Python and using a clever RL reward.
Tool-calling agents trained with just an 8B language model outperform traditional methods that depend on expensive resources, reshaping the landscape of tool learning.
R-CAI can generate high-quality toxic data while improving semantic coherence, revolutionizing how we approach adversarial data synthesis for AI safety.
LVLMs hallucinate in predictable bursts, and this self-rewarding decoding strategy slashes those errors in half.
Forget full-history prompting: this work shows you can slash token costs by 98% while boosting tool-calling accuracy by explicitly modeling and refining latent user preferences.
Rewarding *correct* answers in multimodal reasoning can actually *worsen* reasoning quality, but a simple groupwise ranking of solution trajectories significantly boosts reliability.
Instead of just preventing harmful actions by LM agents, we can now steer them back from the brink using human-aligned recovery plans, significantly improving safety after a mistake.
Current red-teaming efforts miss the forest for the trees: ARES reveals that safety failures often stem from a systemic breakdown between the LLM *and* the reward model, not just the LLM itself.
Jailbreaking LLMs isn't a monolith: seemingly equivalent levels of harmful compliance can mask drastically different internal mechanisms and vulnerabilities, with RLVR surprisingly preserving much of the original model's safety awareness.
Forget supervised fine-tuning: RL alone can unlock high-quality chain-of-thought reasoning in audio-language models, even starting from a model with no prior CoT capability.
RL fine-tuning of discrete diffusion models can be made dramatically more stable and effective by treating the final denoised sample as the action and reconstructing trajectories using the forward diffusion process.
Flipping relevance labels via LLM-generated complementary instructions boosts instruction-following retrieval by 45%, proving that targeted data synthesis beats brute-force scaling.
Forget generic chatbots – this fine-tuning method lets LLMs craft review responses that are not only more accurate but also better aligned with human preferences, all while avoiding the dreaded over-cautious tone.