100 papers published across 5 labs.
LLM agents discover and assess the risks of skills more effectively when those skills are encoded in a structured format that makes scheduling, execution structure, and logic explicit, rather than left as unstructured text.
Decomposing image editing tasks into meta-tasks and aligning model reasoning with editing behavior unlocks surprising generalization to unseen editing operations.
LLMs can parrot CAN bus data, but CAN-QA reveals they fail at the temporal reasoning and multi-condition inference needed for real-world vehicle security forensics.
Forget expensive per-task search: agentic workflows can be synthesized in a single LLM pass by transferring learned structural priors, slashing optimization costs by 3 orders of magnitude.
LLMs harbor surprisingly nuanced and pervasive mental health stigma, revealed only by dissecting their reasoning steps, not just their final answers.
RL's superior generalization isn't about brute force, but about carefully sculpting a few key features while preserving the base model's knowledge, unlike SFT's rapid specialization.
LLMs fail to reliably track source trustworthiness in Turkish evidential marking, unlike humans, highlighting a critical gap in their ability to perform nuanced reasoning based on source reliability.
LLMs can now formalize natural language with significantly higher fidelity, thanks to a clever roundtrip verification method that self-diagnoses and repairs translation errors.
Small language models can achieve reasoning performance rivaling larger models, even under tight token budgets, by using a lightweight "guidance track" to strategically prune and refine their chain-of-thought reasoning.
Dependency-controlled context and explicit evidence sufficiency criteria are key to preventing premature stopping and improving the consistency of enterprise research outputs.
LLMs still can't pass history class: even state-of-the-art models struggle with complex historical reasoning, as revealed by a new benchmark based on the Chinese Imperial Examination.
Scaling up LLMs doesn't guarantee expertise: Korean-specific models beat larger global models on a new meteorology benchmark, exposing critical gaps in multimodal reasoning and cultural understanding.
Building AI tutors in the real world is hard: outdated tech, spotty internet, and curriculum gaps can derail even the best-designed systems.
Quantum education gets a boost: specialized LLM agents in a classroom setting not only improve tutoring reliability but also reveal hidden curriculum gaps.
LLMs can now audit cross-chain smart contracts with expert-level precision, achieving 95% coverage of vulnerable projects by explicitly mirroring human reasoning processes.
LLMs can find and fix bugs in complex codebases far better when structured as a team of reasoning agents, outperforming existing methods by a large margin.
Separating geometry from logic with fuzzy path constraints yields motion planning specifications that are both more intuitive for humans and more amenable to learning from demonstrations.
GraphRAG's black-box reasoning gets a spotlight: XGRAG reveals how specific knowledge graph components influence LLM outputs, boosting explanation quality by 14.81% over standard RAG explainability methods.
VLMs can be taught to self-correct hallucinations at the token level, leading to substantial gains in reasoning accuracy across diverse benchmarks.
Stop relying on LLMs to "hallucinate" reasoning paths: SEARCH-R uses a fine-tuned Llama3.1-8B model and dependency tree-based retrieval to navigate multi-hop question answering more reliably.
LLMs can't handle the truth: SLIDERS beats GPT-4.1 on long-context QA by sidestepping the context window entirely.
Highlighting pivotal evidence can boost LLM performance without altering the original context, leading to substantial improvements in reasoning tasks.
Training a single model across text, images, video, 3D geometry, and hidden representations unlocks "Context Unrolling," where the model reasons across modalities to improve reasoning fidelity.
Inductive biases make machine learning models better at spotting mechanistic reasoning in student discussions, even when those students are tackling new problems.
LLMs can now reason across long conversations without breaking the bank: StructMem slashes token usage and API calls while boosting temporal reasoning.
LLMs generate better features when you make them think harder: CoFEE enforces cognitive behaviors like backward chaining and subgoal decomposition, boosting feature quality by 15% while slashing costs.
Forget memorizing table headers: TaNOS unlocks surprisingly robust numerical reasoning by pre-training on operation sketches and correctness-guaranteed programs.
LLMs can plan complex trips far more effectively when their reasoning is structured as a "forest" of parallel behavior trees, each handling a subtask and coordinated globally.
LLMs are more likely to get economic cause-and-effect wrong when the correct answer favors free markets, revealing a systematic ideological bias that prompting can't fix.
Test-time RL's vulnerability to noisy pseudo-labels is amplified by group-relative advantage estimation, but can be mitigated with a surprisingly simple debiasing and denoising approach.
LLMs can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own game-playing logic.
Forget flat numerical compression: GS-Quant unlocks better knowledge graph completion by generating discrete codes that mirror the hierarchical nature of human reasoning.
Ditch the fixed interface: DiffMAS unlocks surprisingly large gains in multi-agent reasoning by jointly optimizing latent communication, outperforming text-based and prior latent methods by a wide margin.
A novel logic-based approach makes inferring complex, temporally extended events from timestamped data tractable, even in the messy real world of medical records.
LLMs can be both faster and smarter: pre-learned reasoning skills cut down token usage while boosting accuracy on coding and math problems.
Forget chain-of-thought prompting: iterative refinement guided by structured verbal critique from a stronger LLM can achieve SOTA reasoning performance without any training.
Unseen token generalization in transformers isn't just about copying; it's fundamentally limited by a representational collapse in the unembedding space.
Finally, a practical implementation for globally-optimal repair-based semantics allows for querying inconsistent prioritized data with theoretical guarantees.
Forget rigid workflows: HiCrew's planning layer dynamically orchestrates agents for video understanding, adapting roles and execution paths to the nuances of each question.
Hybrid architectures that combine attention and recurrence can maintain reasoning performance as task complexity increases, while transformers see a sharp performance drop-off.
Forget hand-annotated visual reasoning datasets: VG-CoT leverages a fully automated pipeline to generate grounded, step-by-step reasoning, enabling scalable and cost-efficient training of more trustworthy LVLMs.
VLMs' struggles with abstract visual reasoning aren't primarily due to weak reasoning, but rather a representational bottleneck in extracting the right symbolic information from pixels.
Even the most advanced LLMs like GPT-5.2 and Gemini-3 stumble on complex optimization problems, achieving only 27% accuracy on a new benchmark spanning stochastic, dynamic, and game optimization.
Structured graph memory can outperform full-context prompting for cross-session LLM reasoning, but optimizing for specific reasoning skills can hurt overall performance.
Frozen LLMs, when fused with spatial scene encodings, can effectively reason about vehicle trajectories, opening new avenues for integrating language-based reasoning into autonomous driving systems.
Scientific reasoning gets a visual upgrade: S1-VL lets models "think with images" by writing and executing Python code to manipulate visuals during multi-step problem solving.
Spatial reasoning gets a boost: a new framework dynamically orchestrates vision-language agents at test time, outperforming fixed-pipeline approaches by adapting to the reliability of different spatial cues.
Expert knowledge, encoded in a Bayesian network, can dramatically improve the accuracy of autonomous robotic triage systems operating in chaotic, data-scarce environments.
Current multimodal LLMs still struggle to integrate information and reason critically when assessed on real scientific papers, despite progress on isolated tasks.
LLMs can now directly predict geographic coordinates with high accuracy, even for vague locations and complex regions, bypassing the need for traditional geocoding pipelines.
LLMs can now automatically generate formal specifications for real-world programs with high precision and recall, thanks to a novel specification refinement mechanism that leverages program mutations.
LLMs can write better stories if they plan the plot on a graph first.
Fine-tuning a single LLM to both reason about and predict future occupations surprisingly beats using two separate fine-tuned LLMs, one per task.
LLMs can reason better when they're not forced to answer in English, and a new RL method leverages this quirk to boost performance across reasoning tasks.
LLMs that ace math exams can still be stumped by problems crafted by other LLMs, revealing a surprising gap between solving and problem-posing abilities.
Finally, a structured argumentation framework that doesn't break basic logical rules!
Lithology classification gets a reasoning upgrade: GeoMind's agentic workflow beats static methods by grounding decisions in geological evidence and constraints.
Open-source MLLMs can now achieve state-of-the-art accuracy on complex tabular reasoning tasks, even outperforming models 18x their size, by explicitly penalizing visual hallucinations and shortcut guessing through process-supervised RL.
LLMs can reason more effectively by directly tracking their own belief in the correct answer throughout the reasoning process, enabling more targeted policy updates.
Identifying causal effects can now be achieved in quasi-polynomial time, transforming the feasibility of causal inference in complex datasets.
LLMs can pinpoint mental states but falter at predicting dialogue trajectories, revealing a critical gap in their reasoning capabilities.
R2IF improves function-calling accuracy by up to 34.62%, bridging the gap between reasoning and decision-making in LLMs.
Forget one-shot generation: Mol-Debate's iterative debate loop unlocks state-of-the-art molecular design by dynamically reconciling semantic intent with structural feasibility.
Achieve more reliable and interpretable virtual cell perturbation predictions by combining knowledge-driven multimodal modeling with evidence retrieval.
Even the best large vision-language models struggle with multi-image reasoning, scoring only 50% on a new benchmark designed to challenge their capabilities.
LLMs can learn to reason more effectively by breaking down the reasoning process and optimizing each step individually.
Ontology augmentation transforms LLMs into robust reasoning agents, significantly boosting performance in complex planning tasks.
LLMs can now perform feature model analysis with near-solver accuracy directly from semi-formal blueprints, unlocking early validation in software product line scoping.
Machine intelligence can transform high-stakes decision-making by enhancing situational awareness and reducing uncertainty, ultimately fostering greater accountability.
Open-source LLMs running on commodity hardware can rival proprietary models on complex actuarial reasoning tasks, but only if you use an LLM judge instead of multiple-choice questions to evaluate them.
LLMs can overcome flawed initial hypotheses and achieve state-of-the-art reasoning by proactively identifying and resolving missing information before committing to a solution.
Standard LLMs can now perform complex bimanual robot manipulation tasks with impressive success rates, all without any task-specific training.
LLMs can autonomously navigate the notoriously complex task of alloy phase diagram construction, outperforming traditional ML methods and even exhibiting complementary strengths when combined with domain-specific models.
LLMs can generate better features from tabular data when deployed as a multi-agent system with explicit memory of past procedures, feedback, and concepts.
Medical VQA models can now reason more reliably thanks to a new framework that disentangles true causal effects from spurious correlations by jointly tackling observable and unobservable confounders.
Forget prompting LLMs to directly predict hundreds of fields: a two-stage approach with a stable intermediate JSON summary and a deterministic compiler achieves strong performance on CRF filling while being language-agnostic.
LLMs' reasoning chains are surprisingly fragile at logical connectives, but targeted interventions at these "forking points" can dramatically improve accuracy more efficiently than brute-force methods.
Solving NP-hard combinatorial optimization problems like QAP just got a whole lot faster, thanks to a novel MCMC finetuning approach that achieves near-zero optimality gaps.
Smaller LLMs can achieve superior optimization performance by inheriting structured knowledge distilled from the memories of larger models, without any training.
LLMs can achieve state-of-the-art unsupervised multimodal entity linking by reasoning over diverse evidence types, including graph-based neighborhood information.
Reasoning across languages doesn't have to break the bank: a new framework slashes token costs by over 50% while maintaining accuracy, especially boosting performance in low-resource languages.
MLLMs still struggle with the spatiotemporal reasoning needed to understand surgical videos, even with chain-of-thought prompting.
MLLMs still struggle to integrate diverse data for clinical reasoning, as evidenced by their poor performance on a new ophthalmology benchmark spanning image quality assessment to diagnosis.
Unleashing the full potential of multimodal LLMs requires reasoning directly in the visual latent space, and this paper shows how to do it with stable policy optimization.
Deterministic decoding can outperform stochastic self-consistency in constrained domains by systematically exploring high-probability reasoning traces, leading to better performance with less computation.
Video-ToC drastically improves video understanding by forcing Video LLMs to focus on relevant visual cues, leading to state-of-the-art performance and reduced hallucinations.
LVLMs can self-detect and correct object hallucinations by focusing on specific image regions, offering a simple, training-free fix.
LLMs can learn to play complex games far more effectively by co-evolving a skill bank with a decision-making agent, enabling consistent long-horizon decision-making.
VLMs are often functionally blind, exploiting language priors instead of truly "seeing" visual data, and this problem paradoxically *worsens* as language models scale.
Reasoning LLMs can now produce well-calibrated confidence estimates without labels or repeated sampling, unlocking more reliable real-world deployment.
Forget tedious manual adjustments: SmartPhotoCrafter automatically enhances photos by reasoning about image quality and generating targeted edits.
TNNs, a promising alternative to GNNs, can express precisely the binary classifiers definable in topological counting logic, revealing their superior expressive power.
LLMs can learn to reason over complex text-rich networks in a zero-shot manner using reinforcement learning alone, outperforming methods relying on supervised fine-tuning or distillation.
Winning Mafia against human players requires more than just brute force: Revac-8 shows how combining memory, social network analysis, and adaptive communication can outwit even the most deceptive opponents.
Autoformalization gets a major upgrade: DSR's neuro-symbolic approach leverages operator trees to outperform end-to-end LLMs, proving that structured representations are key to bridging human and formal mathematics.
Small language models can achieve strong performance in specialized scientific domains like quantum field theory with targeted fine-tuning and synthetic data generation.
Teaching LLMs to perform arithmetic on images unlocks a new level of grounded reasoning, paving the way for robots that can understand and manipulate the world more like humans.
LLMs can achieve near-perfect tool use accuracy and minimal hallucination when reasoning about financial time series, but only if they're allowed to delegate to external tools.
LLMs can now reason far better in low-resource domains, thanks to a new method that aligns their thinking with high-resource domains using "reasoning representation alignment."