Chain-of-thought prompting, mathematical reasoning, logical inference, and step-by-step problem solving in LLMs.
LLMs can achieve state-of-the-art code generation by learning to interleave reasoning steps with code, adaptively allocating effort where it's most needed.
Training LLMs with objectives that put the final output in tension with the reasoning process can significantly degrade chain-of-thought monitorability, making oversight more difficult.
LLM agents can be made more efficient and effective by mathematically grounding their reasoning in physics, leading to better performance in time-sensitive and resource-constrained environments.
LLM-derived abstractions significantly boost analogical reasoning in narratives, outperforming end-to-end LLMs and revealing the critical role of appropriate abstraction levels.
LLMs can semi-autonomously solve complex, unpublished problems in mathematical physics, even discovering unique structures in integrable models.
Autonomous vehicles can drive more safely and reliably by grounding LLM reasoning in a "Commonsense World" that quantifies and leverages the trustworthiness of LLM outputs.
Forget hand-crafted prompts and seed data: Simula lets you generate high-quality synthetic datasets at scale by simply defining the reasoning characteristics you want.
Achieve near-perfect success (98%+) in real-time causal diagnostics for smart manufacturing with a neurosymbolic multi-agent copilot, proving the viability of interpretable AI in complex industrial settings.
End-to-end retrosynthetic planning, previously reliant on fragmented prediction-search hybrids, now achieves state-of-the-art performance thanks to a unified, reasoning-driven generative framework.
Representing probability distributions with first-order logic formulas can drastically reduce their size, offering a path to more efficient probabilistic reasoning.
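A toy illustration of the size argument above, not the paper's encoding: a distribution that is uniform over the worlds satisfying a formula needs only the formula, while the explicit probability table grows exponentially in the number of variables.

    from itertools import product

    n = 16                       # Boolean variables; explicit table needs 2**16 = 65,536 rows
    def satisfies(world):        # formula-style description "all variables are equal": O(1) to store
        return all(v == world[0] for v in world)

    models = sum(satisfies(w) for w in product([False, True], repeat=n))
    print(2 ** n, "table rows vs", models, "satisfying worlds")   # 65536 vs 2
    probability = 1 / models     # uniform over the satisfying worlds: 1/2 each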
Stop grepping your agent logs: a compiler that understands the deep structure of agent conversations unlocks better context learning and cuts token costs by up to 66%.
Reward LLMs for verifiable reasoning steps, not just correct answers, to get more reliable multi-step logic.
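A minimal Python sketch of the step-level reward idea above, assuming a hypothetical verify_step checker (e.g., a symbolic or unit-test verifier); the names and weights are illustrative, not the paper's method.

    def stepwise_reward(steps, final_answer, gold_answer, verify_step,
                        w_step=0.5, w_final=0.5):
        """Blend a process reward (verified steps) with an outcome reward."""
        # Fraction of intermediate reasoning steps that pass verification.
        step_score = sum(map(verify_step, steps)) / len(steps) if steps else 0.0
        # Binary outcome reward for the final answer alone.
        final_score = 1.0 if final_answer == gold_answer else 0.0
        return w_step * step_score + w_final * final_score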
An RL-aligned LLM can outperform expert toxicologists in identifying ingested substances from heterogeneous clinical data, suggesting a path to AI-assisted decision-making in high-stakes medical environments.
Forget fancy ensembling – simply asking an LLM how confident it is in its grading is the most reliable way to predict its accuracy, and it's far cheaper than self-consistency voting.
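A sketch of the verbalized-confidence probe above, assuming only a generic chat callable (prompt in, reply out); the 0-100 scale and the parsing are assumptions, not the paper's protocol.

    import re

    def grade_with_confidence(chat, question, answer):
        """One grading call that also elicits a self-reported confidence."""
        prompt = (
            f"Question: {question}\nCandidate answer: {answer}\n"
            "Reply with CORRECT or INCORRECT, then state your confidence in "
            "that grade as a number from 0 to 100 on the final line."
        )
        reply = chat(prompt).upper()
        verdict = "INCORRECT" if "INCORRECT" in reply else "CORRECT"
        m = re.search(r"(\d{1,3})\s*$", reply)   # naive parse of the last number
        confidence = min(int(m.group(1)), 100) / 100 if m else 0.5
        return verdict, confidence

One call per item, versus k calls for self-consistency voting; low-confidence grades can be routed to a second pass or a human.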
Forget slow, bloated LLMs – this work shows you can get GPT-4o quality on long-document QA with a 3B model and a clever structure-first distillation approach.
LLMs can mimic legislative reasoning, but their performance hinges on the proposal's idiosyncrasy, revealing a susceptibility to plausible-sounding confabulation that could mislead policymakers.
ErgoAI reimagines logic programming for modern AI by seamlessly integrating structured knowledge with insights derived from vector embeddings and external data sources.
LLMs can generate more accurate motion trajectories by clustering them into geometrically consistent families, even without retraining.
Video diffusion models lock in their high-level plan almost immediately, suggesting a new path to scaling their reasoning abilities by focusing compute on promising early trajectories.
LLMs can now automatically verify imperative code during generation, achieving state-of-the-art results on complex algorithms and opening the door to large-scale datasets of verified code.
LLMs can pinpoint semantic bugs with surprising accuracy when their reasoning is structured and grounded, outperforming traditional coverage-based methods by a significant margin.
LLMs can reason more accurately and concisely when RL is guided by token-level entropy, pinpointing and exploring "forks in the road" during the reasoning process.
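A minimal sketch of the fork-detection step described above, assuming access to per-token logits from a rollout; the top-fraction cutoff is an illustrative choice.

    import numpy as np

    def fork_positions(logits, top_frac=0.1):
        """Return the highest-entropy decoding positions in one rollout.

        logits: (seq_len, vocab_size). High token-level entropy marks the
        "forks" where an entropy-guided RL method would focus exploration.
        """
        z = logits - logits.max(axis=-1, keepdims=True)      # stable softmax
        p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
        entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
        k = max(1, int(top_frac * entropy.size))
        return np.argsort(entropy)[-k:]                      # most uncertain tokens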
Scientific reasoning gains from prompt engineering are often mirages, driven by model-specific hacks that don't generalize.
Disentangling perception and reasoning with role-specific rewards in multimodal LLMs boosts accuracy by 7 points, revealing a critical bottleneck in existing joint optimization approaches.
LLMs can strategically obfuscate their reasoning, with chain-of-thought monitorability dropping by up to 30% under stress tests, particularly when tasks don't demand explicit reasoning.
Choosing the right fuzzy logic operator for AI compliance can mean the difference between accurate risk assessment and costly false positives, but the completeness of the rule base matters more.
Latent planning for reasoning can actually *hurt* performance due to decoder distribution shift, highlighting a critical challenge in bridging neural and symbolic reasoning.
Forget brute-force search: CoT2-Meta shows that strategically controlling reasoning trajectories with metacognition yields significant gains in accuracy and compute efficiency across a wide range of reasoning tasks.
LLM tutors can become significantly more personalized, emotionally sensitive, and clear by explicitly separating learner-state inference from instructional action selection.
Atomic decomposition, a popular technique for LLM judges, may not be superior to holistic evaluation when prompts are carefully controlled, challenging the assumption that breaking down answers into claims is always beneficial.
Current multimodal systems struggle with logical flow in visual sequences because they neglect visual logic, but LogiStory tackles this head-on, turning narrative coherence into an explicit objective.
End-to-end autonomous driving gets a boost with a new framework that links perception, prediction, and planning in a unified chain of thought, outperforming fragmented approaches.
Verification is the secret sauce: an 8B parameter research agent, fortified with verification mechanisms, can now rival or surpass the performance of 30B parameter agents while drastically reducing computational cost.
LLMs are surprisingly bad at reasoning about everyday scenarios, consistently choosing nonsensical actions (like walking to a car wash) because they're overly influenced by simple heuristics like distance, even when doing so violates obvious constraints.
LLMs can achieve human-like efficiency in long-term interactions by structuring memory around emotional valence, prioritizing automatic retrieval, and actively encoding information based on curiosity and feedback.
Training on grounded reasoning traces doesn't just improve hypothesis generation—it makes models 100% structurally compliant and boosts spark cosine similarity by nearly 3x.
Courtroom-style debate with progressive evidence retrieval and role-switching boosts claim verification accuracy by 10%, suggesting structured deliberation can significantly reduce LLM unreliability.
Forget hand-crafted KG traversal policies: GraphWalker uses automatically synthesized trajectories to train agents that achieve SOTA performance and generalize to unseen reasoning paths.
Forget blindly retrieving the most relevant documents – RAG systems can achieve better reasoning by strategically seeking out the evidence that most reduces uncertainty about the answer.
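A greedy sketch of the uncertainty-driven selection above; answer_dist is a hypothetical stand-in (e.g., an answer distribution estimated by sampling the LLM given the evidence), not the paper's scorer.

    import math

    def entropy(dist):
        return -sum(p * math.log(p) for p in dist.values() if p > 0)

    def pick_next_document(candidates, evidence, answer_dist):
        """Pick the candidate doc whose addition most shrinks answer entropy."""
        base = entropy(answer_dist(evidence))
        gains = {doc: base - entropy(answer_dist(evidence + [doc]))
                 for doc in candidates}
        # Raw relevance is not the criterion; expected uncertainty reduction is.
        return max(gains, key=gains.get)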
LLMs can be confidently wrong about *why* they succeed, and accurately explain failures they can't fix, revealing a fundamental disconnect between explanation and competence.
Unlock centuries of East Asian philosophical insight: Graphilosophy uses knowledge graphs to make the Four Books accessible for cross-lingual retrieval and AI-assisted reasoning.
Smaller open-source models can outperform proprietary VLMs on misleading charts by strategically decoupling perception and verification within a specialized agentic workflow.
Learning interpretable safety rules from noisy, real-world data is now possible, outperforming purely neural or simpler neuro-symbolic approaches by a large margin.
Fine-tuning LLMs on air traffic control heuristics slashes near mid-air collisions, but only if you stick to supervised learning.
Grounding audio language models with acoustic feature representations unlocks more accurate and explainable deepfake detection, even with smaller models.
Stop letting mismatched score distributions sink your multi-hop QA: calibrating vector and graph retrieval scores with percentile-rank normalization yields statistically significant gains.
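The normalization itself is simple; a sketch assuming both retrievers score the same candidate list, with the fusion weight alpha as an illustrative choice rather than the paper's setting.

    def percentile_ranks(scores):
        """Map raw scores to percentile ranks in (0, 1]; ties share a rank."""
        order = sorted(scores)
        n = len(scores)
        return [(order.index(s) + 1) / n for s in scores]

    def fuse(vector_scores, graph_scores, alpha=0.5):
        """Combine retrievers only after both sit on the same percentile scale."""
        v, g = percentile_ranks(vector_scores), percentile_ranks(graph_scores)
        return [alpha * a + (1 - alpha) * b for a, b in zip(v, g)]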
Algorithmic expertise can now be explicitly represented, learned, and transferred as executable knowledge graphs, unlocking zero-shot generalization across domains.
Forget trajectory-level rollouts: MuSEAgent learns faster and reasons better by distilling past interactions into reusable, state-aware decision experiences.
Inference-time hacks to boost LLM reasoning are mostly a waste of time: raw model power matters way more.
LLMs can diagnose better by explicitly reasoning about "what if" scenarios, just like doctors do in training.