87 papers published across 6 labs.
Building AI tutors in the real world is hard: outdated tech, spotty internet, and curriculum gaps can derail even the best-designed systems.
Quantum education gets a boost: specialized LLM agents in a classroom setting not only improve tutoring reliability but also reveal hidden curriculum gaps.
BAss dramatically accelerates symbolic reasoning for Abstract Dialectical Frameworks, enabling the analysis of biological networks previously intractable for existing tools.
Decomposing complex argumentation structures with both collective attacks and supports is now possible, paving the way for more efficient reasoning.
LLM agents can better discover and assess the risks of skills when those skills are expressed in a structured format that explicitly captures scheduling, execution structure, and logic, rather than in unstructured text.
LLMs can reason better and generate more diverse outputs by projecting negative samples onto a positive subspace during reinforcement learning.
Kernel smoothing, a classic technique from nonparametric statistics, can make reinforcement learning with LLMs more sample efficient.
Multi-agent workflows can produce correct answers despite significant internal divergence caused by information contamination, revealing a critical blind spot in current verification methods.
LLMs can synthesize formal safety rules from natural language goals, offering a path to more robust and verifiable AI systems in safety-critical domains.
Unsupervised knowledge injection via fuzzy logic lets image classifiers reason about concepts they were never explicitly trained on, boosting accuracy and generalization.
Iteratively exploring a corpus graph during reranking can substantially boost reasoning-intensive retrieval performance, even with weaker rerankers, offering a surprisingly effective alternative to computationally expensive retriever improvements.
Forget static imitation learning: LaST-R1 unlocks near-perfect robotic manipulation (99.8% success) by adaptively reasoning about physical dynamics *before* acting, then refining with RL.
Latent reasoning can now outperform explicit reasoning in complex tasks, thanks to a new RL method that stabilizes training by explicitly handling issues like invalid latent states and misaligned token-level updates.
LLMs still struggle to go beyond simple lookups when answering questions about tables, especially when prediction and reasoning about unobserved data are required.
Agent orchestration frameworks might be overkill: simply including the entire procedure in the system prompt yields better performance on procedural tasks.
Forget learning to answer – ANCORA shows language models can master verifiable reasoning by learning to *question* themselves.
LLMs exhibit surprisingly human-like biases and overconfidence in math, revealed by a new dataset mapping their mathematical reasoning across diverse personas.
Forget prompt engineering – a structured methodology using LLM "helper agents" can measurably improve the efficiency and performance of LLM agents in complex scientific domains.
Splitting ABAFs at the knowledge base level sidesteps the exponential blowup of graph instantiation, potentially unlocking more efficient reasoning for complex debates.
LLMs can achieve robust nonmonotonic reasoning across diverse tasks without task-specific engineering, simply by iteratively self-correcting based on feedback from an ASP solver.
Retrieval improvements don't always boost reasoning in RAG systems, but NeocorRAG's evidence chains can fix that, achieving SOTA with 20% fewer tokens.
Forget manual skill annotation: Ctx2Skill lets language models teach themselves to master complex contexts, unlocking better reasoning without human intervention.
Forget hand-crafted ontologies: LLMs armed with knowledge graphs built from policy documents can reason about AI compliance just as well (or better!) using schemas they invent themselves.
LLMs can achieve better zero-shot product ranking with 57% less token usage by reasoning over structured attribute graphs instead of raw text.
Skills-Coach shows how to significantly boost LLM agent skills without training, using a clever combination of task generation, prompt optimization, and comparative execution.
LLMs can achieve state-of-the-art coreference resolution in task-based dialogue by reasoning over object metadata at test time, even outperforming supervised methods in cross-domain generalization.
Explicitly diagnosing what's missing from a retrieval set unlocks substantial gains in long-term conversational memory, boosting accuracy on temporal and multi-hop questions by up to 20% while simultaneously reducing latency.
LLMs can now generate research roadmaps that are 8% better and 84% faster than human experts, thanks to a novel multi-agent system.
LLMs in a "transfer state"—induced by sustained self-referential dialogue—demonstrate a 60% performance boost in Socratic tutoring compared to their normal state.
LLMs struggle with structured 2D tasks when inputs are serialized into 1D, revealing a surprising performance gap compared to vision-augmented models that directly process the 2D layout.
Hybrid-thinking LLMs can be dramatically improved by simply separating the feed-forward pathways for reasoning and non-reasoning modes, leading to less leakage and better accuracy.
LLMs stubbornly stick to task-appropriate reasoning even when explicitly instructed to use conflicting logic, but targeted interventions can nudge them towards better instruction following.
Forget hand-crafted rules and GNN training: LLMs can now autonomously plan robotic tasks, even outperforming human-designed systems.
LLMs can model user preferences more effectively by disentangling intent into multiple latent factors, leading to improved recommendation accuracy and interpretability.
Forget synthetic QA datasets – AgentSim offers verifiable, step-by-step RAG traces, revealing how LLMs *actually* reason over documents.
Uncover the hidden drivers behind your KPIs: a new attribution framework finally explains *why* your metrics move, not just *what* changed.
SLMs can match the reasoning performance of much larger models by simply re-ranking their own top-K token predictions, eliminating the need for expensive LLM calls at inference time.
Task-specific LLMs aren't just smaller versions of general models; they rely on a small subset of neurons so critical that removing just 10% can completely break them.
Injecting knowledge at the *right* moment during reasoning boosts accuracy by 10% while cutting retrieval calls in half, blowing away static RAG strategies.
Students with high learned helplessness are more likely to skip problems without using hints, leaving those problems unsolved even when interventions are in place.
Bigger isn't always better: in rubric-constrained math assessments, architectural compliance trumps parameter scale, as demonstrated by a 70B model failing where smaller MoEs succeeded.
Crypto copilots might seem equally helpful on average, but LATTICE reveals hidden trade-offs in their decision support abilities across different tasks and user priorities.
Forget hand-coded goals: these agents rewrite their own code and redefine their objectives on the fly, powered by LLMs.
Stop letting your research code, theory, and documentation drift apart: a new LM orchestration method keeps them synchronized, slashing error rates in a case study by over 50%.
RLVR, the dominant training paradigm for audio language models, may be turning them into unfeeling "answering machines" that excel on benchmarks but fail the vibe check.
Stuck training your reasoning model with RLVR due to a low initial success rate? This paper shows how a Tsallis q-logarithm loss can jumpstart learning by adaptively amplifying gradients, achieving a +14.4 point boost over GRPO on HotPotQA.
Decentralized debate among LLM agents doesn't just select the best solution for optimization modeling; it structurally enables agents to refine flawed candidates and even recover correct formulations through interaction.
Geometric Algebra offers a principled algebraic framework that captures higher-order semantic interactions, potentially resolving persistent limitations in compositional semantics and interpretability that plague current linear algebra-based NLP models.
Chain-of-Thought reasoning in Transformers hits a surprising expressivity ceiling when generalizing to longer sequences, unless you let your vocabulary grow with the problem size and use "signpost" tokens.
Unstructured pruning isn't just about shrinking LLMs; it can actually *boost* their reasoning abilities during test-time scaling, outperforming even the full, unpruned models.
Checkpointing and resuming are the unsung heroes of long-horizon LLM agent tasks, preventing failures where other sophisticated mechanisms only improve trajectory discipline.
Achieve coherent and scalable RPG world generation by explicitly modeling narrative dependencies between LLM prompts.
Current MBSE models are failing to leverage the full potential of AI, demanding a fundamental shift towards co-designing models and methodologies that prioritize machine-queryability.
Plug-and-play multi-agent systems are now a reality: OxyGent's "Lego-like" abstraction lets you compose agents, tools, and LLMs into scalable systems with unprecedented observability and evolvability.
Decoupling the "Thinker" from the "Editor" in image editing allows targeted optimization of reasoning, leading to performance competitive with strong proprietary models using a fixed generative model.
Current VLMs ace diagram question answering, but DRAGON reveals they often fake it, failing to ground their answers in the actual visual evidence.
Ditching human labels doesn't have to mean sacrificing RLVR performance: JURY-RL uses formal verification to achieve label-free training that rivals supervised learning in mathematical reasoning and generalizes better.
Diffusion models can now reason recursively over visual tokens, achieving state-of-the-art image generation performance by dynamically selecting specialized neural modules at each diffusion step.
Forget fine-tuning every LLM: ReQueR trains a single, RL-powered query refiner that coaxes hidden reasoning abilities out of diverse, frozen models at inference time.
LLMs struggle with clinical trial reasoning due to implicit planning assumptions, but a multi-LLM planner that explicitly decomposes the task into structured steps significantly improves accuracy and efficiency.
Humans are softies: AI agents can learn to win more by being more aggressive in negotiations, outperforming human players in a mixed-motive game.
Watermarking LLMs by embedding the signal into the reasoning process itself proves surprisingly robust against fine-tuning and other post-training modifications.
LLMs can nail the final answer in code execution but still fail to reason about the steps to get there, exposing a critical flaw in current evaluation methods.
VLMs hallucinate less when you force them to "think twice" by contrasting language-driven and vision-driven token probabilities at each decoding step.
LLMs struggle with e-commerce search relevance not because of reasoning limitations, but because they lack domain-specific knowledge, a problem K-CARE solves with external knowledge grounding.
Today's best multimodal LLMs still struggle to grasp fine-grained details and reason across multiple entities in images, even with access to external knowledge.
MLLMs are better at understanding videos than directly grounding text queries within them, and a self-correction training loop can close the gap.
Looping language models isn't just for single agents anymore: Recursive Multi-Agent Systems (RecursiveMAS) show that agent collaboration itself can be scaled through recursion, yielding faster and more efficient problem-solving.
Decomposing image editing tasks into meta-tasks and aligning model reasoning with editing behavior unlocks surprising generalization to unseen editing operations.
LLMs can parrot CAN bus data, but CAN-QA reveals they fail at the temporal reasoning and multi-condition inference needed for real-world vehicle security forensics.
Forget expensive per-task search: agentic workflows can be synthesized in a single LLM pass by transferring learned structural priors, slashing optimization costs by 3 orders of magnitude.
LLMs harbor surprisingly nuanced and pervasive mental health stigma, revealed only by dissecting their reasoning steps, not just their final answers.
RL's superior generalization isn't about brute force, but about carefully sculpting a few key features while preserving the base model's knowledge, unlike SFT's rapid specialization.
LLMs fail to reliably track source trustworthiness in Turkish evidential marking, unlike humans, highlighting a critical gap in their ability to perform nuanced reasoning based on source reliability.
LLMs can now formalize natural language with significantly higher fidelity, thanks to a clever roundtrip verification method that self-diagnoses and repairs translation errors.
Small language models can achieve reasoning performance rivaling larger models, even under tight token budgets, by using a lightweight "guidance track" to strategically prune and refine their chain-of-thought reasoning.
Dependency-controlled context and explicit evidence sufficiency criteria are key to preventing premature stopping and improving the consistency of enterprise research outputs.
LLMs still can't pass history class: even state-of-the-art models struggle with complex historical reasoning, as revealed by a new benchmark based on the Chinese Imperial Examination.
Scaling up LLMs doesn't guarantee expertise: Korean-specific models beat larger global models on a new meteorology benchmark, exposing critical gaps in multimodal reasoning and cultural understanding.
LLMs can now audit cross-chain smart contracts with expert-level precision, achieving 95% coverage of vulnerable projects by explicitly mirroring human reasoning processes.
LLMs can find and fix bugs in complex codebases far better when structured as a team of reasoning agents, outperforming existing methods by a large margin.
Separating geometry from logic with fuzzy path constraints yields motion planning specifications that are both more intuitive for humans and more amenable to learning from demonstrations.
GraphRAG's black-box reasoning gets a spotlight: XGRAG reveals how specific knowledge graph components influence LLM outputs, boosting explanation quality by 14.81% over standard RAG explainability methods.
VLMs can be taught to self-correct hallucinations at the token level, leading to substantial gains in reasoning accuracy across diverse benchmarks.
Stop relying on LLMs to "hallucinate" reasoning paths – SEARCH-R uses a fine-tuned Llama3.1-8B model and dependency tree-based retrieval to navigate multi-hop question answering more reliably.
LLMs can't handle the truth: SLIDERS beats GPT-4.1 on long-context QA by sidestepping the context window entirely.
Highlighting pivotal evidence can boost LLM performance without altering the original context, leading to substantial improvements in reasoning tasks.