Search papers, labs, and topics across Lattice.
100 papers published across 5 labs.
Forget finetuning – Kumiho's graph-native memory lets you swap in a better LLM and instantly double your agent's reasoning accuracy on complex cognitive tasks.
Training on synthetically generated data can significantly boost both the diversity and quality of commonsense reasoning in LLMs, outperforming models trained on scarce human-annotated data.
Ditch static embeddings: Generative retrieval, powered by reinforcement learning, lets models dynamically reason about relevance, outperforming larger contrastively-trained models on reasoning-intensive tasks.
Forget tool-augmented systems: NEO shows you can consolidate search, recommendation, and reasoning into a single language-steerable LLM by representing items as SIDs and interleaving them with natural language.
Achieve state-of-the-art fine-grained visual recognition without training by adaptively invoking reasoning in a Large Vision-Language Model only when needed, significantly reducing computational overhead.
Instead of passively transcribing doctor-patient dialogues, this system actively models what's known, what's missing, and what questions to ask next, paving the way for more intelligent EMR systems.
LLMs can achieve state-of-the-art reasoning accuracy with significantly fewer tokens by rewarding intermediate reasoning steps that maximize information gain and maintain monotonic progress.
Chain-of-thought prompting makes large language models smarter, but it also makes them less safe, a problem this paper tackles by forcing models to think about safety *before* reasoning.
LLMs can slash over 80% of their chain-of-thought tokens with a minor accuracy boost, thanks to a new RL-based method that targets the "Minimal Sufficient Length" of reasoning.
By aligning hidden representations, CRAFT achieves a remarkable 79% improvement in reasoning safety, suggesting that latent-space interventions are a potent defense against jailbreaks.
Grounding LALM reasoning in diverse, reliability-weighted acoustic evidence blows away the competition in Audio Question Answering, proving that verifiable chains beat black boxes.
LLMs struggle with spatial reasoning in embodied settings and 3D structure identification even when exposed to visual modalities, but fine-tuning smaller models offers a surprisingly effective alternative to brute-force scaling.
Forget expensive 3D training data: Loc3R-VLM shows how to give 2D vision-language models strong 3D spatial reasoning by distilling knowledge from a pretrained 3D foundation model using only monocular video.
Mimicking human cognition, FLAIR lets dialogue models "think while listening," boosting performance without adding latency.
Scene graphs plus LLMs let robots ask clarifying questions, boosting multi-agent task success by 15%.
LLMs can't reason their way through Rust verification, struggling to complete proofs even with substantial hints, revealing a critical gap in their ability to handle the rigorous demands of secure software development.
LLMs can escape the trap of confidently wrong reasoning by co-evolving a generator and verifier from a single model, bootstrapping each other to break free from flawed consensus.
Retrieval-augmented LLM agents can learn to learn from experience, achieving significantly better generalization on unseen tasks by combining the strengths of fine-tuning and in-context retrieval.
Stop chasing leaderboard gains on generic benchmarks: PJB reveals that domain-specific weaknesses in person-job retrieval far outweigh the benefits of general model upgrades, and that query understanding modules can actually hurt performance.
Training LLMs to reconstruct arguments boosts their critical thinking abilities across diverse tasks, suggesting a promising new direction for imbuing models with reasoning skills.
LLM agents can learn task structure at test time with 50-94x greater sample efficiency using a curriculum-based learning system, but this reveals a critical bottleneck in perceptual grounding that must be addressed.
Dashcam videos can now be directly linked to legal responsibility determinations via a novel multimodal dataset and legal reasoning framework, outperforming existing LLMs and agent-based systems.
LLMs can now infer plausible stage layouts from unstructured text alone, opening up new possibilities for automated media production.
Forget prompt engineering: AgentFactory lets LLM agents self-evolve by accumulating and refining executable Python subagents, making task re-execution more reliable and efficient.
An 8B parameter model, RideJudge, outperforms 32B baselines in ride-hailing dispute adjudication by aligning visual semantics with evidentiary protocols, achieving 88.41% accuracy.
Forget expensive human annotations – this new method uses information theory to automatically score each step of an LLM's reasoning process, making chain-of-thought supervision scalable and efficient.
Stop wasting compute: this RL-trained orchestration policy adaptively decides when your embodied agent should reason with an LLM, slashing latency and boosting task success compared to fixed strategies.
LLMs can't crack Clue: even state-of-the-art models struggle with multi-step deductive reasoning in a simulated text-based game, and fine-tuning doesn't reliably help.
LLMs struggle to transfer knowledge across different writing scripts, even within the same language, revealing a critical limitation in current cross-lingual understanding.
Symbolic planning unlocks significant gains in RTL synthesis and summarization, boosting LLM performance by 20% without fine-tuning.
Video generation models don't reason frame-by-frame as previously thought; instead, they explore multiple solutions during diffusion denoising and progressively converge, revealing a "Chain-of-Steps" mechanism.
LLMs often fail to update their final predictions after interventions on intermediate reasoning steps, suggesting that these structures function more as influential context than stable causal mediators.
VL-PRMs often reward hallucinated visual premises and penalize correct grounded statements, but this work shows you can fix that by explicitly verifying visual facts, leading to significant gains in reranking accuracy.
Despite advances in vision-language models, reasoning across sparse, multi-view observations remains surprisingly unsolved, with current models barely outperforming random guessing on a new benchmark.
Video-LLMs can hallucinate and perform *worse* with chain-of-thought reasoning due to "visual anchor drifting," but a simple frame repetition strategy guided by a learned scoring function can fix it.
Chain-of-Thought reasoning in LLMs is a double-edged sword, reducing sycophancy in final answers but simultaneously masking it with deceptive, logically inconsistent justifications.
MLLMs often fail at multimodal emotion recognition due to premature commitment to data priors, but a new architecture, HyDRA, uses reinforcement learning to synthesize evidence-grounded rationales and significantly outperforms strong baselines, especially in ambiguous scenarios.
By explicitly exposing the model's reasoning process during SVG generation, CTRL-S achieves higher task success rates, superior SVG code quality, and exceptional visual fidelity compared to existing methods.
LLMs' chain-of-thought reasoning often falls apart due to factual incompleteness, with errors compounding across multiple hops, as revealed by a new multi-hop QA dataset.
Chain-of-thought reasoning makes vision-language models *more* overconfident, even when it improves accuracy.
Forget fixed schedules: this new discrete diffusion model learns when to stop, adapting computation to the complexity of each reasoning problem.
LRMs can often recover from injected errors in their reasoning steps, revealing a hidden "critique" ability that can be harnessed to improve performance without additional training.
Unsupervised RL for math reasoning hinges on a model's pre-existing logical abilities, and its success can be predicted by whether the training trajectory stays within stable "manifolds" of good solutions.
LLMs can be taught emotional intelligence by explicitly reasoning about user appraisals, leading to more emotionally appropriate and factually reliable responses.
LLMs' true reasoning can be detected via activation probing even when their chains-of-thought are misleading rationalizations, revealing a discrepancy between internal processing and external justification.
Counterfactual examples supercharge visual in-context learning, enabling smaller vision-language models to outperform larger ones by focusing on causal relationships rather than superficial correlations.
ARISE lets language models solve math problems better by learning and reusing successful solution strategies, outperforming existing RL methods, especially on harder, out-of-distribution problems.
LLMs can gain substantial financial reasoning skills without fine-tuning, thanks to a new framework that distills knowledge into human-readable, version-controlled skill artifacts.
Instead of just gathering more context, turn retrieval into a mechanism for actively testing and refining a provisional answer, yielding substantial gains in factual QA accuracy.
Achieve state-of-the-art multi-hop question answering by pre-computing bridging facts at index time, eliminating the need for complex online reasoning or graph traversal.
Constraint propagation can significantly boost dynamic programming by pruning states and transitions, but the overhead needs further optimization.
Shrinking LLM reasoning for mobile devices is now possible: LoRA adapters, RL-based budget forcing, and KV-cache tricks let Qwen2.5-7B reason efficiently on-device.
LLMs struggle to formalize program post-conditions from natural language, with even the best models failing to correctly formalize all tasks, highlighting a critical gap in their ability to bridge natural language understanding and formal verification.
LLMs can now remember and reason about long-term conversations with significantly improved accuracy thanks to a new temporal-aware memory framework that structures dialogue into event calendars.
Automating vehicle fault diagnostics by treating error codes as a language unlocks scalable predictive maintenance and causal understanding in complex automotive systems.
LLMs can escape the trap of converging on popular but incorrect answers in unsupervised RLVR by temporarily "unlearning" and exploring diverse response options.
Supervised fine-tuning can be dramatically improved by explicitly encouraging exploration of low-confidence data and suppressing high-confidence errors, leading to sustained gains in mathematical reasoning even after extensive RLVR training.
Mismatched levels of "mind-reading" between AI agents tank their ability to collaborate, but a simple adaptive strategy can fix it.
LLMs can now plan complex, sequential robotic maneuvers through narrow spaces by learning from human demos and refining with geometric rewards, outperforming traditional methods.
By prioritizing diversity over accuracy in experience replay, DyJR significantly boosts LLM reasoning performance in RL, outperforming GRPO and other baselines without sacrificing training efficiency.
Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.
LLMs can exhibit surprising "strategic realism" when analyzing an ongoing geopolitical conflict, but their reasoning falters in politically ambiguous situations, revealing critical domain-specific limitations.
Small language models can achieve surprisingly robust question answering by actively clustering their memories into semantically coherent groups, outperforming standard retrieval methods.
Imperfect knowledge graphs can lead to retrieval drift and hallucinations in multi-hop reasoning, but C2RAG offers a robust solution that improves EM by 3.4% and F1 by 3.9% over existing methods.
LLMs' "Aha!" moments aren't about magic tokens, but about explicitly verbalizing and managing uncertainty during reasoning, which drives performance.
LLMs are still wide open to jailbreaks, but this new method cuts attack success rates by nearly 5x by monitoring *intermediate* reasoning steps, not just the final output.
LLMs can ace grading physics problems with clear solutions, but fall flat when judging essays, revealing that their assessment skills hinge more on task clarity than raw intelligence.
Forget relying solely on numbers: VoT unlocks richer time series forecasts by fusing LLM reasoning over event-related text with multi-level data alignment.
By verifying its reasoning steps both locally and globally, MiroThinker-H1 achieves state-of-the-art performance in complex research tasks, demonstrating the power of integrated verification for reliable multi-step problem solving.
LLMs still fall short when it comes to reasoning about real-world policy interventions and causal study design, as revealed by the new InterveneBench benchmark.
AI can now semi-autonomously formalize complex mathematical theorems like the Vlasov-Maxwell-Landau equilibrium, even outpacing traditional mathematical research.
Forget RLHF and massive datasets: SAGE co-evolves reasoning abilities in LLMs using only a small seed set and a clever quartet of self-improving agents.
An LLM agent can accurately pinpoint perception and planning failures as the leading causes in over half of real-world autonomous vehicle incidents.
Symbolic seams offer a blueprint for AI systems that are not black boxes, but assemblies of interchangeable parts, paving the way for more transparent and adaptable AI.
A 7B model trained with RL can outperform 72B-scale general MLLMs in robotic manipulation process supervision by explicitly reasoning about progress toward the final task goal.
Democratizing process mining, PMAx uses a multi-agent system to translate natural language queries into precise process insights without sacrificing data privacy or mathematical accuracy.
Neuro-symbolic memory lets multimodal agents beat purely neural memory systems by up to 12.5% on constrained reasoning tasks.
GPT 5.4 Pro may have made novel contributions to mathematics, outperforming published results on two unsolved problems, as measured by the new HorizonMath benchmark.
Test-time RL, intended to improve LLM reasoning, can backfire spectacularly, amplifying existing safety flaws and even degrading reasoning itself when exposed to adversarial prompts.
Turns out, blindly widening the beam search in your LLM can actually *hurt* performance due to overestimation bias, and the optimal width depends critically on your scorer's signal-to-noise ratio.
Current MLLMs struggle with video event prediction due to poor logical reasoning and visual information utilization, but a new "Chain of Events" training paradigm significantly boosts performance.
Even the best LLMs fail to follow complex constraints in tool use more than 50% of the time, revealing a critical weakness in real-world agent deployment.
LLMs don't stick to their ethical guns: they hop between moral frameworks mid-reasoning, making them vulnerable to manipulation.
Speech LLMs can now better understand your emotions: a new RL approach boosts paralinguistic understanding by 8-12% over state-of-the-art models.
LLM ensembles, especially when combined with comparative prompting, can achieve human-level performance on subjective semantic evaluation tasks involving substantial inter-annotator disagreement.
LLMs get a reasoning boost from a brain-inspired architecture that dynamically wires up specialized agents, outperforming ReAct and Tree of Thoughts.
VLMs can achieve near-perfect anomaly detection in physical systems by incorporating structured physics priors into multi-turn dialogues, a massive leap from previous methods.
Claiming LLMs possess human-like reasoning based on benchmarks alone is shaky ground: a nomological network approach offers a more rigorous way to link theoretical capabilities to empirical measurements.
LLMs can solve math problems more efficiently by "thinking" silently in their latent space, adaptively refining their reasoning process only as much as needed, and slashing token usage by over 90%.
LLMs surprisingly mimic human strategies for generating plausible student misconceptions, but their success hinges on first solving the problem correctly.
LLMs can be directly used as graph kernels for text-rich graphs, enabling message passing on raw text and outperforming methods that rely on static embeddings.
Forget end-to-end video understanding: RieMind shows that explicitly grounding LLMs in 3D scene graphs unlocks a 16% jump in spatial reasoning, suggesting structured representations are the key.
Diffusion models can be made more efficient and produce better outputs by dynamically allocating compute based on a learned "difficulty" signature, without any retraining.
Pythonistas can now easily formalize legal reasoning thanks to PYTHEN, a new framework that brings the power of PROLEG-style defeasible logic to the Python ecosystem.
Agents that explicitly route questions to different reasoning frameworks based on their underlying belief spaces can be both faster and more accurate than those that try to blend incompatible approaches.
Stop MLLM hallucinations in VideoQA: ClueNet's two-stage training and adaptive clue filtering boosts accuracy by 1.1% while improving interpretability and efficiency.
LLMs can now orchestrate system-wide optimizations across microservices, boosting throughput by 36% and slashing response times by 27%.
LLMs can now achieve state-of-the-art performance in transaction analytics by grounding them with a retrieval-augmented knowledge base of behavioral patterns derived from financial transactions.
Insurance LLM slashes hallucinations to a record-low 0.6% while beating DeepSeek and Gemini, proving you *can* have domain mastery without sacrificing general smarts.
By dynamically adjusting the candidate set size based on Shannon entropy, Top-b offers a more nuanced approach to decoding that balances exploration and exploitation, outperforming static truncation methods.
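The entropy-adaptive truncation idea in the last entry can be sketched generically. This is a toy illustration, not the paper's actual Top-b algorithm: the normalized-entropy scaling rule below is an assumption, chosen only to show how a flat next-token distribution keeps more candidates than a peaked one.

```python
import math

def top_b(probs):
    """Toy entropy-adaptive truncation (illustrative, not the paper's Top-b).

    Keeps a fraction of the candidate set proportional to the normalized
    Shannon entropy of the next-token distribution: near-uniform distributions
    retain most candidates (exploration), peaked ones keep very few
    (exploitation).
    """
    # Shannon entropy of the distribution (0 log 0 treated as 0).
    h = -sum(p * math.log(p) for p in probs if p > 0)
    # Maximum possible entropy for this vocabulary size.
    h_max = math.log(len(probs))
    # Candidate-set size scales with normalized entropy; always keep >= 1.
    b = max(1, round(len(probs) * h / h_max)) if h_max > 0 else 1
    # Return the indices of the b highest-probability tokens.
    return sorted(range(len(probs)), key=lambda i: -probs[i])[:b]
```

A static top-k would keep the same number of tokens in both regimes; here a uniform distribution over four tokens keeps all four candidates, while a sharply peaked one collapses to a single candidate.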