Chain-of-thought prompting, mathematical reasoning, logical inference, and step-by-step problem solving in LLMs.
GRPO's credit assignment failures—treating all tokens as equally important and misaligning step-level rewards—can be overcome with a self-supervised approach that mines the model's intrinsic information flow.
Finally, a way to train LLM agents to reason step-by-step without needing humans to check every intermediate thought.
Forget retraining: this model learns interpretable logical rules from data in a zero-shot manner by encoding literals with domain-agnostic statistical properties.
RL can unlock better compositional generalization than supervised fine-tuning by directly optimizing for correct outcomes, especially on complex tasks where supervised models overfit.
AI can now discover and suggest genuinely novel mathematical inequalities, hinting at its potential for breakthroughs beyond traditional theorem proving.
Forget dumb context stuffing: LongSeeker shows that letting agents strategically *edit* their own memory makes them far more reliable at web search tasks.
LLM multi-agent systems can achieve significantly higher accuracy at a fraction of the cost by learning to selectively delegate tasks instead of relying on rigid orchestration.
Hallucination detection can be nearly as effective with a single forward pass as with expensive multi-sample methods.
Think-Aloud data doesn't just improve cognitive model fit; it fundamentally reshapes the discovered model structure, revealing cognitive mechanisms undetectable from behavior alone.
Coordinating LLM agents with evolving knowledge graphs, rather than just text, unlocks superior scientific ideation, beating state-of-the-art systems on multiple benchmarks.
LLMs can learn to play multi-agent games far better by recursively modeling the reasoning of other players, leading to a 22% performance boost.
Transformers with average attention can natively execute arithmetic circuits, suggesting a new architectural direction for reasoning and computation.
LLMs can get up to 6x more logically consistent without human feedback, simply by fusing NLI scores into the DPO training loop.
LLMs can leapfrog current network troubleshooting benchmarks by explicitly encoding structured diagnostic policies, rather than relying on free-form deliberation.
Small LLMs paired with symbolic solvers can outperform larger zero-shot LLMs on formal reasoning tasks, but still struggle with multilingual inputs.
Tool-using SQL agents can learn to be more efficient and accurate by getting feedback on *how* they reason, not just *what* they output.
State-of-the-art temporal knowledge graph reasoning is now possible by jointly modeling historical evidence and evolutionary dynamics, unlocking complementary predictive signals.
Achieve 8x token reduction in million-token document understanding without sacrificing accuracy by having the LLM actively search for relevant information like a foraging animal.
LLMs can now formulate significantly better penetration testing strategies, outperforming even GPT-5, thanks to a novel reasoning framework and targeted fine-tuning.
Unleashing geospatial reasoning on a torrent of unlabeled remote sensing data, RemoteZero rivals supervised methods by having models verify their own reasoning rather than relying on human-annotated coordinates.
Video-LLMs aren't failing at perception; they're being tricked by their own assumptions, and a new dataset and reasoning chain can fix that.
Standard retriever evaluations hide critical weaknesses in agentic search systems, but a new benchmark and training method expose and address these flaws.
LLMs struggle with causal reasoning when noise is introduced, but explicitly modeling causal graphs can dramatically improve performance and generalization.
A hierarchical agent that separates visual and textual contexts drastically improves multi-step reasoning on complex charts, outperforming monolithic MLLMs.
LLMs' own self-judgments, when logically linked to their response features, can significantly improve hallucination detection.
Language models can play the counterexample game, but their philosophical reasoning hits diminishing returns fast, and they're far more lenient judges than humans.
Stop rewarding reasoning that just looks good – reward reasoning that actually *helps* the downstream model solve the task.
Neural retrievers, despite their success on standard benchmarks, fail spectacularly when forced to reason about set-theoretic constraints, revealing a reliance on spurious correlations rather than true compositional understanding.
Rose-SQL achieves state-of-the-art multi-turn Text-to-SQL performance with small models, outperforming larger fine-tuned models without any training.
LLMs struggle with multimodal STEM problems, but a simple dialogue-based intervention can fix 82% of their mistakes without retraining.
LLM-based vulnerability repair can be significantly improved by focusing on root cause analysis, leading to more robust and less superficial patches than current methods.
LLMs struggle to formally verify real-world code, but KVerus's self-adaptive approach closes the gap, enabling verification of complex, evolving Rust systems with significantly improved success rates.
LLM safety filters, which rely on semantic pattern matching, can be bypassed at scale by encoding harmful prompts as coherent mathematical problems, revealing a fundamental vulnerability.
Guaranteeing safe robot navigation in unstructured environments just got easier: translate human language rules into formal logic, ground them with VLMs, and let the robot navigate.
LLMs alone can't reliably fly drone swarms from natural language commands; task-specific tools and runtime guardrails are essential for real-world cyber-physical system control.
LLMs can't reliably orchestrate multi-step manufacturing workflows, but this physics-grounded multi-agent system can, boosting tool execution success by 87.5% while ensuring traceable, risk-aware decisions.
RAG's reputation for being ineffective on reasoning tasks doesn't hold up: retrieving the right data – intermediate "thinking traces" – unlocks substantial performance gains, even for state-of-the-art models.
LLMs can now collaboratively pinpoint root causes in microservices using a tree-structured search, but production environments reveal the limitations of this approach when faced with polyglot stacks and inconsistent logging.
LVLMs can achieve SOTA visual reasoning by learning to "see" in a way that optimizes for reasoning, even if it means deviating from strict geometric accuracy.
Turns out, nobody's explicitly RL-training LLM agents on when to *stop* in multi-agent systems, despite how critical stopping is to efficiency and cost.
Forget brittle orchestration layers – LLMs can internalize complex reasoning as a learnable "HeavySkill" that rivals external agentic frameworks.
LLMs can't reliably count beyond a small number of steps, revealing a surprising brittleness in their ability to execute seemingly simple procedures despite fluent performance on complex tasks.
LLMs can reason better and generate more diverse outputs by projecting negative samples onto a positive subspace during reinforcement learning.
Kernel smoothing, a classic technique from nonparametric statistics, can make reinforcement learning with LLMs more sample efficient.
Multi-agent workflows can produce correct answers despite significant internal divergence caused by information contamination, revealing a critical blind spot in current verification methods.
LLMs can synthesize formal safety rules from natural language goals, offering a path to more robust and verifiable AI systems in safety-critical domains.
Unsupervised knowledge injection via fuzzy logic lets image classifiers reason about concepts they were never explicitly trained on, boosting accuracy and generalization.
Iteratively exploring a corpus graph during reranking can substantially boost reasoning-intensive retrieval performance, even with weaker rerankers, offering a surprisingly effective alternative to computationally expensive retriever improvements.
Forget static imitation learning: LaST-R1 unlocks near-perfect robotic manipulation (99.8% success) by adaptively reasoning about physical dynamics *before* acting, then refining with RL.
Latent reasoning can now outperform explicit reasoning in complex tasks, thanks to a new RL method that stabilizes training by explicitly handling issues like invalid latent states and misaligned token-level updates.