100 papers published across 12 labs.
LLMs can reason more efficiently by triaging queries and applying deep thought only when truly needed, thanks to a new coarse-to-fine inference framework.
Stop wasting RL on easy problems: a difficulty-aware curriculum for SFT and RL unlocks better reasoning in LLMs.
Forget rigid pipelines and static prompts: Nurture-First Development lets domain experts grow AI agents through conversation, turning tacit knowledge into reusable assets.
Reasoning rerankers don't magically fix fairness issues in search, preserving the biases of their input rankings despite boosting relevance.
Autonomous driving's next leap hinges on reasoning, not just perception, yet current LLM-based approaches are too slow for real-time control.
Forget brittle KG traversals: MDER-DR's entity-centric summaries and decomposed queries boost multi-hop QA accuracy by up to 66% over standard RAG.
Achieve up to 12x greater sample efficiency in reasoning tasks by relaxing strict imitation constraints in on-policy distillation, enabling smaller models to match the performance of much larger ones.
Can a dedicated research program keep a smaller, local LLM competitive against global giants in the rapidly evolving AI landscape?
A 7B model, guided by verifiable execution rewards, can now rival the code reasoning of models more than four times its size.
Unlock massive multilingual reasoning data: the Multilingual Reasoning Gym enables parallel data generation across 14 languages, opening doors for training and evaluating multilingual reasoning models at scale.
LLM agents can now learn from their mistakes and successes in complex tasks, improving performance by up to 28.5% by extracting and applying structured learnings from past execution trajectories.
By forecasting compact world dynamics before taking action, DynVLA leapfrogs traditional CoT methods to achieve more informed and physically grounded autonomous driving decisions.
Uncover the hidden causal chains inside your LLM with Causal Concept Graphs, which outperform existing methods for reasoning by explicitly modeling concept dependencies.
LLMs can be made better software engineers by pre-training them to reconstruct the messy, iterative development process that led to the final, clean code in repositories.
Clinicians using HeartAgent, a cardiology-specific agent system, improved diagnostic accuracy by 26.9% and explanatory quality by 22.7% compared to unaided experts.
Multilingual math reasoning just got a serious upgrade: mAceReason-Math offers a meticulously translated and cleaned dataset of challenging problems across 14 languages, purpose-built for RLVR training.
Clinical AI can achieve clinician-level diagnostic accuracy and continuous improvement via a self-evolving framework that actively learns from clinical experience.
By fusing language model reasoning with diffusion-based trajectory generation, KnowDiffuser leapfrogs existing autonomous driving planners on the nuPlan benchmark.
Forget scaling reasoning: this work shows that scaling visual perception using code-grounded data is the real key to unlocking MLLMs' STEM abilities.
By grounding LLMs in a hybrid knowledge base and using a Chain of Verification approach, PharmGraph-Auditor turns unreliable LLM generators into transparent reasoning engines for prescription auditing.
Explicitly teaching LVLMs to reason step-by-step with reinforcement learning unlocks state-of-the-art performance on multimodal object-entity relation extraction.
LLMs can now autonomously retrieve relevant memories from a database using specialized tools, significantly improving performance on long-term conversational question answering.
LVLMs can be jailbroken by "Reasoning-Oriented Programming," which chains together harmless visual inputs to trigger harmful reasoning, much like return-oriented programming in traditional security exploits.
An AI agent can triage remote patient monitoring data with higher sensitivity than individual clinicians, suggesting a path to scalable and cost-effective patient monitoring.
Reasoning unlocks factual knowledge in LLMs, but beware: hallucinated reasoning steps can poison the well.
LLMs can now emulate debuggers, stepping through code and setting breakpoints, opening the door to more interactive and controllable neural program execution.
Stop training LLMs on lucky guesses: this new RL method uses the model's own in-context learning ability to identify and upweight high-quality reasoning traces, leading to better performance.
LLMs that ace standard coding benchmarks spectacularly fail at esoteric languages, revealing a reliance on memorization rather than true reasoning.
By communicating in a shared latent space, Latent-DARM lets you combine the global planning of diffusion models with the fluency of autoregressive models, boosting reasoning accuracy by up to 14% while slashing token usage.
LLM agents can now achieve a +41pp boost in first-try success and 100% accuracy in 2-way logistics compositions by using PRECEPT's novel combination of retrieval, memory, and prompt evolution.
LLMs can evolve surprisingly effective, interpretable Python planners that rival state-of-the-art classical planners, at a fraction of the computational cost.
LLMs often choose moral consistency over basic common sense, especially when the contradiction is committed by the main character in a narrative.
LLMs struggle to generate diverse and specific connections between concepts, even with high token budgets and "thinking" prompts, revealing a gap in creative associative reasoning.
Forget brittle multi-hop reasoning: TaSR-RAG's taxonomy-guided triple matching boosts RAG performance by 14% without costly graph construction.
Chain-of-Agents can reason more accurately over long contexts by processing information chunks in an order determined by Chow-Liu dependency trees, rather than relying on default or semantic similarity.
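The Chow-Liu construction behind that ordering is standard enough to sketch. A minimal Python illustration, assuming a precomputed matrix of pairwise mutual-information estimates between chunks (the paper's estimator and agent interface are not given in this summary): build a maximum spanning tree over the chunk-dependency graph, then read chunks off in tree order.

```python
# Hedged sketch: order context chunks by estimated dependency strength
# via the Chow-Liu construction (maximum spanning tree over pairwise
# mutual information), rather than by document position.

def chow_liu_order(mi, root=0):
    """mi: symmetric matrix of pairwise mutual-information estimates.
    Returns a chunk ordering from a maximum spanning tree (Prim's
    algorithm) traversed breadth-first from `root`."""
    n = len(mi)
    in_tree = {root}
    edges = []  # (parent, child) pairs of the spanning tree
    while len(in_tree) < n:
        # attach the strongest dependency crossing the cut
        p, c = max(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: mi[e[0]][e[1]],
        )
        in_tree.add(c)
        edges.append((p, c))
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
    # breadth-first traversal of the tree gives the processing order
    order, queue = [], [root]
    while queue:
        node = queue.pop(0)
        order.append(node)
        queue.extend(children.get(node, []))
    return order

# toy pairwise-dependency matrix for four chunks
mi = [[0.0, 0.9, 0.1, 0.2],
      [0.9, 0.0, 0.7, 0.1],
      [0.1, 0.7, 0.0, 0.3],
      [0.2, 0.1, 0.3, 0.0]]
print(chow_liu_order(mi))  # [0, 1, 2, 3]
```

Prim's algorithm suffices at this scale; the point is only that traversal order follows estimated dependencies instead of the default or semantic-similarity orderings.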
LLM-powered recommendation agents can now autonomously investigate and bridge information gaps, leading to better recommendations, thanks to a new tool-augmented reasoning framework.
Even a single error from a conditional independence oracle can prevent unique identification of a Bayesian network structure, no matter how tightly graph parameters such as treewidth are bounded.
A 4B parameter model can now beat much larger models at social reasoning, thanks to a new RL framework that aligns model reasoning trajectories with human cognition.
Retrieval-augmented agents get a serious reasoning boost by explicitly evaluating their own retrieval quality at each step, leading to state-of-the-art performance on multi-hop question answering.
LLMs that dominate in strategic reasoning often choke in real-time zero-sum games, revealing a critical strategy-execution gap that current benchmarks miss.
Stop letting sparse rewards bottleneck your VLN agent: SACA disentangles failed trajectories into valid prefixes and divergence points for dense supervision, unlocking SOTA performance.
LLMs' attention patterns subtly shift with emotional tone, and explicitly accounting for these shifts during training improves reading comprehension even on neutral datasets.
Pathology MLLMs can now better incorporate diagnostic standards during reasoning, thanks to a new memory architecture inspired by how human pathologists process information.
Text-only foundation models can perform surprisingly well on complex 3D spatial reasoning tasks, rivaling multimodal models, when equipped with a structured spatial representation derived from 3D reconstruction.
By cleverly "self-rephrasing" LLM outputs, this work coaxes reasoning LLMs to handle audio inputs without sacrificing their chain-of-thought abilities.
Achieve up to 11x navigation performance gains in functional buildings by explicitly encoding and exploiting a priori spatial knowledge.
LLMs can now tackle complex table QA with 20%+ accuracy gains, thanks to a multi-agent framework that decomposes queries and orchestrates reasoning between specialized database and knowledge graph agents.
A new process reward model acts as a universal geospatial verifier, scaling the performance of both specialized and general-purpose VLMs in remote sensing.
LLMs can get a 27.8% boost in mathematical reasoning by fusing a hardware-efficient optimal control layer directly into their architecture, enabling planning before prediction.
AutoAgent dynamically evolves agent cognition and memory to achieve superior performance in complex, dynamic environments, without requiring external retraining.
LLMs can be steered away from hallucination and towards more robust reasoning by using contrastive learning to capture the shared structure of successful reasoning paths, penalizing hallucinated steps even when the final answer is correct.
LLM reasoning research is inadvertently paving a dangerous path towards AI situational awareness and strategic deception, demanding a re-evaluation of current safety measures.
Achieve more efficient reasoning in Transformers without increasing test-time cost by using training-only techniques that guide attention and dynamically adjust sharpness.
Human-AI interaction isn't just augmentation, it's a new cognitive entity with its own emergent "vibe," demanding we rethink epistemology and education.
By explicitly optimizing for both reasoning structure and chemical consistency, Logos offers a pathway to reliable and interpretable AI systems for molecular science, outperforming larger models with a fraction of the parameters.
Unlock multimodal interleaved generation in existing vision-language models without large interleaved datasets using a novel reinforcement learning approach with hybrid rewards.
LLMs get *more* honest when they have time to reason, defying human tendencies and revealing surprising insights about their internal representational geometry.
LLMs trained with reinforcement learning from verifiable rewards (RLVR) become overconfident in incorrect answers, but a simple fix—decoupling reasoning and calibration objectives—can restore proper calibration without sacrificing accuracy.
Mixture-of-Experts models might be hiding more of their reasoning than we thought, thanks to a newly quantified "opaque serial depth" metric.
Looping helps transformers think harder on math problems, while memory lets them remember more commonsense facts, and combining both beats simply scaling up layers.
LLMs may secretly be better at information retrieval than embedding similarity suggests, but current datasets are too "short-sighted" to prove it.
LLMs can switch between reasoning and factual answering on the fly, without retraining, simply by conditioning on specific token prefixes.
LLM agents can learn to continuously adapt and improve in complex environments by reflecting on past experiences and explicitly storing/retrieving reusable lessons, leading to substantial performance gains.
Even the most advanced LLMs stumble when asked to reason over a large, heterogeneous document corpus, achieving only 34% accuracy on the new OfficeQA Pro benchmark despite direct access to the relevant documents.
LLMs can reliably guide industrial maintenance decisions when constrained by deterministic evidence construction and rule-based verification, even with incomplete and heterogeneous data.
LALMs can now better capture the nuances of human emotion, moving beyond single-label predictions with a new ambiguity-aware training framework that aligns model outputs with the full spectrum of human perception.
Despite the buzz around Tiny Recursive Models, directly adapting their refinement mechanism into autoregressive architectures yields no reliable performance boost, suggesting the original TRM's success may stem from other factors.
SuperInvesting, a specialized AI system, significantly outperforms general-purpose LLMs like GPT and Gemini on a new financial intelligence benchmark, suggesting domain-specific architectures are crucial for reliable investment research.
LLM-powered agents can now better recover from errors by classifying failure types and retrieving relevant historical contexts.
Multimodal LLMs struggle with math because they *see* poorly, but a multi-agent system focused on visual evidence can dramatically improve their perception and reasoning.
Current multimodal math models struggle with visual interpretation, symbol alignment, and consistent reasoning, highlighting the need for a unified "Perception-Alignment-Reasoning" framework.
LLMs can now reliably extract complex, n-ary drug combinations from biomedical text, surpassing previous methods that were limited to binary interactions.
MLLMs can gain a surprising boost in 3D spatial reasoning simply by encoding 3D geometric attributes of objects as textual references indexed by unique IDs.
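The mechanism in that line lends itself to a tiny sketch: serialize each object's 3D attributes as a textual record keyed by a unique ID, so the model can refer to objects by ID while reasoning about space. Field names and the upstream detection pipeline here are illustrative assumptions, not the paper's format.

```python
# Hedged sketch: textual 3D references indexed by unique object IDs.
# The attribute schema below is invented for illustration.
objects = [
    {"id": "obj_1", "category": "chair", "center": (1.2, 0.0, 3.4),
     "size": (0.5, 0.9, 0.5)},
    {"id": "obj_2", "category": "table", "center": (1.0, 0.0, 2.1),
     "size": (1.4, 0.7, 0.8)},
]

def to_spatial_prompt(objects):
    lines = ["Objects in the scene (positions in meters, x/y/z):"]
    for o in objects:
        cx, cy, cz = o["center"]
        w, h, d = o["size"]
        lines.append(
            f"[{o['id']}] {o['category']}: center=({cx:.1f}, {cy:.1f}, "
            f"{cz:.1f}), size=({w:.1f} x {h:.1f} x {d:.1f})"
        )
    return "\n".join(lines)

print(to_spatial_prompt(objects))
```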
Swapping variables in mathematical formulas during graph contrastive learning surprisingly improves retrieval accuracy by preserving crucial algebraic relationships.
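As a hedged illustration of that augmentation (the paper's graph encoder and exact swapping scheme are not specified in this one-liner): a consistent, single-pass variable permutation rewrites a formula while preserving its algebraic structure, yielding a structure-preserving positive pair for contrastive training.

```python
import random
import re

# Hedged sketch: variable-swap augmentation for formulas. A permutation
# applied consistently in one pass keeps the algebraic relationships
# intact, so the original and swapped formulas should embed nearby.
# The single-letter variable set is an illustrative assumption.
VARS = set("abcxyz")

def swap_variables(formula, rng=random):
    present = sorted(set(re.findall(r"[a-z]", formula)) & VARS)
    shuffled = present[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(present, shuffled))
    # single-pass substitution so swaps don't cascade
    return re.sub(r"[a-z]", lambda m: mapping.get(m.group(), m.group()), formula)

print(swap_variables("a*x + b*x + c"))  # e.g. "x*c + a*c + b"
```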
Stop wasting compute: CODA dynamically adjusts reasoning depth based on problem difficulty, slashing token costs by 60% on easy tasks while boosting performance on hard ones.
Even the best open-weight LLMs still fail on nearly two-thirds of questions requiring reasoning over scientific tables, highlighting a persistent "execution bottleneck" in translating strategy to action.
Forget token counting: this work introduces a semantic prior based on surprisal to compress LLM reasoning traces, achieving better accuracy and fluency than heuristic length penalties.
A rigorous proof establishes the correctness of CoPPar Tree, guaranteeing consistency in parallel computations.
Forget bigger models: clever prompt engineering with explicit decision rules crushes fine-tuning and embeddings for word sense disambiguation.
Current LLM benchmarks hide critical reasoning failures in long, multimodal documents, which BRIDGE exposes through step-level evaluation.
Hallucinations in multimodal reasoning models are linked to high-entropy transition words, and can be reduced by decoding with probability-weighted continuous embeddings rather than discrete tokens during these uncertain states.
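A minimal sketch of that decoding idea, assuming a single-step interface and an invented entropy threshold (how the original work detects uncertain states is not given in this summary): at a high-entropy step, feed back the probability-weighted average of token embeddings instead of committing to one discrete token.

```python
import torch

# Hedged sketch: continuous-embedding decoding at uncertain steps.
# Tensor shapes and the threshold are illustrative assumptions.
def next_input_embedding(logits, embedding_table, entropy_threshold=3.0):
    probs = torch.softmax(logits, dim=-1)                 # (vocab,)
    entropy = -(probs * torch.log(probs + 1e-9)).sum()
    if entropy > entropy_threshold:
        # uncertain step: expected embedding under the token distribution
        return probs @ embedding_table                    # (dim,)
    # confident step: ordinary discrete decoding
    return embedding_table[probs.argmax()]

logits = torch.randn(32_000)
table = torch.randn(32_000, 4096)
print(next_input_embedding(logits, table).shape)  # torch.Size([4096])
```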
LLMs still fail to follow complex instructions that entangle content, formatting, control flow, and real-world constraints, despite progress on simpler benchmarks.
Skip the expensive supervised fine-tuning: this RL-only method teaches LLMs to use tools by showing them how in-context, then gradually removing the crutches until they're tool-using pros, zero-shot.
LLMs can now safely navigate the complexities of acupuncture clinical decision support, thanks to a neuro-symbolic framework that slashes safety violations from 8.5% to zero.
Human-AI collaboration using LLMs and symbolic solvers just cracked a notoriously hard problem in combinatorial design theory, finding a tight lower bound on Latin square imbalance.
LLMs can slash inference costs by 80% without sacrificing accuracy, simply by learning to recognize when their own reasoning is shaky and needs a second opinion.
LLMs can significantly boost micro-expression recognition by reasoning about subtle facial muscle movements when guided by structured visual and relational prompts.
LLMs can achieve state-of-the-art results on complex reasoning tasks with far fewer parameters by iteratively excavating and reasoning over external knowledge.
Achieve up to 52.5% compression in LLM chain-of-thought reasoning *while improving* accuracy by dynamically calibrating CoT length.
By decomposing RAG along the document axis with specialized agents, SPD-RAG achieves state-of-the-art performance on multi-document QA while slashing API costs by over 60%.
Particle filtering reveals a fundamental limit to inference-time sampling methods for LLMs, suggesting that simply increasing the number of samples has diminishing returns.
Forget fuzzy language: CoCo uses executable code as Chain-of-Thought to generate images with unprecedented control and precision, blowing away existing methods on complex scenes.
Continuous reasoning in latent space crushes explicit reasoning for multilingual tasks, especially when training data is scarce.
Strategic data curation using a dual-consensus approach beats brute-force training on large noisy datasets for process reward modeling in biological reasoning.
LLMs still struggle with complex legal reasoning, as evidenced by their difficulty in solving Islamic inheritance cases, even with a new dataset designed to support step-by-step reasoning.
LLMs can orchestrate human input to UAVs, dramatically improving mission success rates while minimizing human interaction.
LLMs can generate better recommendations if they pause to verify their reasoning steps, rather than reasoning in one long chain.
Robots get smarter at in-context learning by "thinking" visually about future trajectories, leading to better generalization and success rates in manipulation tasks.
VLMs can't count blocks because they lack a view-consistent spatial interface, but decomposing scenes into orthographic projections fixes it.
Chain-of-Thought prompting doesn't always improve LLMs' ability to solve discrete optimization problems, and surprisingly, "disordered" datasets can sometimes boost performance on simpler tasks.