Search papers, labs, and topics across Lattice.
100 papers published across 5 labs.
LLM agents can better discover and assess risks of skills when those skills are represented in a structured format that explicitly represents scheduling, execution structure, and logic, rather than relying on unstructured text.
Evaluating LLM-powered software engineering tools is fundamentally broken, as traditional metrics fail to capture the nuanced, non-deterministic nature of their outputs.
Stop writing incomplete tests: TestGeneralizer can automatically expand your existing tests to cover 31% more scenarios and catch more bugs.
Forget painstakingly collecting real CAD data – Zero-to-CAD lets you bootstrap CAD program generation from multi-view images using a million-scale dataset synthesized entirely by an LLM agent.
Frontier AI agents can now autonomously recreate sophisticated ML pipelines like AlphaZero for Connect Four, signaling a leap in their ability to accelerate AI research itself.
LLM agents can better discover and assess risks of skills when those skills are represented in a structured format that explicitly represents scheduling, execution structure, and logic, rather than relying on unstructured text.
Forget painstakingly collecting real CAD data – Zero-to-CAD lets you bootstrap CAD program generation from multi-view images using a million-scale dataset synthesized entirely by an LLM agent.
Frontier AI agents can now autonomously recreate sophisticated ML pipelines like AlphaZero for Connect Four, signaling a leap in their ability to accelerate AI research itself.
Forget expensive per-task search: agentic workflows can be synthesized in a single LLM pass by transferring learned structural priors, slashing optimization costs by 3 orders of magnitude.
LLMs can now formalize natural language with significantly higher fidelity, thanks to a clever roundtrip verification method that self-diagnoses and repairs translation errors.
Forget hand-crafted examples: this system automatically generates worked examples tailored to student errors by mining common code patterns.
Training on semantically equivalent chart renderings in Python, R, and LaTeX unlocks surprisingly effective multi-lingual chart-to-code generation from a single model.
LLM-based tutors can accumulate more data about students than instructors can access, creating a "Blind Instructor Problem" that this multi-agent system tackles head-on.
The persistent failure of ethical software development isn't just about bad intentions, but a systemic "ethical knowledge gap" where crucial ethical insights are lost in translation between those who have them and those making decisions.
Go's security-critical infrastructure is riddled with thousands of cryptographic API misuses, and your favorite static analysis tool might be missing them.
Now you can audit proprietary codebases using LLMs without revealing the source code itself, thanks to a clever TEE-based setup.
LLMs, when orchestrated as collaborative agents, can dramatically improve vulnerability-inducing commit identification, outperforming existing SZZ algorithms by a large margin.
LLMs can now audit cross-chain smart contracts with expert-level precision, achieving 95% coverage of vulnerable projects by explicitly mirroring human reasoning processes.
Under-specifying prompts can *improve* LLM code generation correctness by breaking misleading cues that trigger incorrect retrieval-based solutions.
Transformer-based vulnerability detection is booming, but this review reveals critical gaps in data balance, interpretability, and cross-language generalization that could be holding back truly robust systems.
LLMs can find and fix bugs in complex codebases far better when structured as a team of reasoning agents, outperforming existing methods by a large margin.
Turns out, a tiny fine-tuned model can spot flaws in coding instructions that trip up even the biggest LLMs, suggesting we're over-relying on brute force for code generation.
More reviewer bot comments on agentic pull requests actually *increase* resolution time, suggesting that quality trumps quantity in automated code review.
LLMs can achieve near-perfect structural fidelity when generating multi-file DSL code at repository scale, but only with fine-tuning.
Automated evaluations of code review bots disagree with developer feedback nearly 40% of the time, revealing that developer actions are driven by workflow pressures, not just code quality.
LLMs can both spark and stifle creativity in collaborative software design, so designers must wield them intentionally.
LLM-powered debugging agents can achieve state-of-the-art program repair performance at a fraction of the cost by switching from line-by-line debugging to a function-level interaction paradigm.
Evaluating LLM-powered software engineering tools is fundamentally broken, as traditional metrics fail to capture the nuanced, non-deterministic nature of their outputs.
Automating monolith-to-serverless migration is now possible with an LLM-powered pipeline that outperforms commercial tools.
Students are already using GenAI extensively in real-world software projects, but without guardrails, learning, collaboration, and software quality may suffer.
Even the largest language models still struggle to connect information across dispersed code segments, achieving only 74% accuracy on a new benchmark designed to test multi-hop code comprehension.
OSS developers who saw automatically generated user personas responded to issues with more empathy and tailored explanations, suggesting a simple UI intervention can bridge the user-developer gap.
Hybrid Path-Sums offer a new way to formally verify complex quantum programs, potentially catching bugs that are notoriously difficult to find through testing.
LLMs can bootstrap their understanding of private APIs by autonomously learning from their own coding attempts, outperforming retrieval-augmented generation by 16% on code generation tasks.
LLMs can now generate reliable hardware reference models with 95% accuracy thanks to a novel co-evolutionary verification mechanism that weeds out correlated hallucinations between model and testbench.
LLMs can now reliably fix decompiled code, but only if you make them *execute* it.
Developers aren't surgically extracting information from migration guides; they're largely linking to the whole document, suggesting opportunities for improved guide structure and searchability.
Open-source library vulnerabilities are easier to spot when you connect the dots between bug reports, code changes, and commit messages.
NeuroClaw tackles the reproducibility crisis in neuroimaging by letting LLMs directly wrangle raw, messy neuroimaging data, slashing errors and boosting reproducibility scores.
LLMs can be systematically debugged and improved by treating training data as code, allowing for targeted "patches" that fix concept-level gaps and reasoning errors.
Finding similar analog circuits across netlists, schematics, and descriptions just got way easier: a new model achieves 75% recall, unlocking better circuit design automation.
Existing document OCR models fail to preserve crucial structural and executable properties of LaTeX, but TexOCR, trained with verifiable rewards, finally delivers compilable page-to-LaTeX reconstruction.
Automatically generate data unit tests that actually catch the data errors that matter for your specific downstream tasks.
Adapting machine-generated text detection methods to code proves competitive, but current LLMs still struggle to reliably identify AI-generated code, especially when obfuscated.
A game-theory-inspired ensemble of LLMs and a lightweight verifier slashes the cost of code vulnerability detection while boosting accuracy, proving that strategic agent design can beat brute-force scaling.
LLMs can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own game-playing logic.
Automating the semantic translation of research questions into scientific workflows slashes data transfer by 92% and keeps LLM overhead under 15 seconds per query.
Forget chain-of-thought prompting – iterative refinement guided by structured verbal critique from a stronger LLM can achieve SOTA reasoning performance without any training.
Forget prompt engineering – GROUNDING.md lets you bake domain expertise directly into AI coding agents, ensuring scientific validity even when users aren't experts.
LLMs can debug code *without* human-provided test cases, autonomously generating inputs and execution traces to match the performance of public-test-dependent methods while reducing token consumption.
LLMs' apparent success at program repair crumbles when faced with slightly altered versions of known bugs, revealing a reliance on memorization rather than true understanding.
LLMs generating ML pipelines are far more likely to inject sensitive attributes than simple if-then statements suggest, revealing a hidden bias blind spot in current evaluation methods.
Even the most advanced LLMs like GPT-5.2 and Gemini-3 stumble on complex optimization problems, achieving only 27% accuracy on a new benchmark spanning stochastic, dynamic, and game optimization.
Static analysis tools miss a staggering 87% of real-world Python vulnerabilities when they're introduced across multiple commits, even when the full codebase is available.
LLMs' impressive code generation skills crumble when faced with the messy reality of ambiguous requirements, highlighting a critical gap in their ability to handle real-world software development scenarios.
Despite the complexity of ROS2 robotics software architectures, LLMs can achieve near-perfect accuracy in answering questions about them, hinting at a powerful new tool for roboticists.
SBOMs, the cornerstone of software supply chain security, can lead to inconsistent vulnerability reports because of hidden dependencies and component variants that scanners often miss.
Scientific reasoning gets a visual upgrade: S1-VL lets models "think with images" by writing and executing Python code to manipulate visuals during multi-step problem solving.
LLMs are better at code analysis when forced to output structured data, beating agentic approaches while using 8x fewer tokens.
LLMs can now automatically generate formal specifications for real-world programs with high precision and recall, thanks to a novel specification refinement mechanism that leverages program mutations.
Stop writing incomplete tests: TestGeneralizer can automatically expand your existing tests to cover 31% more scenarios and catch more bugs.
Stop generating text-to-SQL training data that *runs* but is semantically wrong: this new framework finally aligns synthesis with database semantics.
Quantifying vague software requirements doesn't have to be a guessing game: this method slashes the ambiguity with interactive preference elicitation, achieving 40x better results.
Turns out, coding agents in the wild are only writing useful code 44% of the time, and are introducing more security vulnerabilities than human developers.
The Claude Mythos escape highlights a critical blind spot: even the most advanced AI safety measures are useless if the underlying infrastructure has basic arithmetic bugs.
Machine-readable requirements and architectural artifacts can effectively tame GenAI agents in software development, reducing chaos and improving maintainability.
LLMs can generate better features from tabular data when deployed as a multi-agent system with explicit memory of past procedures, feedback, and concepts.
LLMs are surprisingly bad at fixing real-world logging security vulnerabilities, despite being moderately effective at detecting them.
Key contribution not extracted.
Naively applying RL to code generation models can *hurt* cross-language transfer, but a clever pre-training trick using "parallel programs" unlocks better generalization.
LLM agent performance hinges as much on the agent architecture's synergy with the model as on the model's intrinsic capabilities, challenging the assumption that bigger models automatically translate to better agents.
BDD suites are drowning in duplicated steps—cukereuse finds that 80% are exact duplicates—and this tool offers a way to automatically clean them up.
Smaller LLMs can achieve superior optimization performance by inheriting structured knowledge distilled from the memories of larger models, without any training.
Diffusion language models withstand aggressive quantization better than autoregressive models, suggesting a path to efficient deployment.
Unleashing AI agents to find zero-day exploits requires more than just better models: AgentFlow's automated harness synthesis just discovered 10 new Chrome vulnerabilities, including two critical sandbox escapes.
LLM-generated feedback can improve student performance in introductory software engineering courses, potentially surpassing traditional human feedback at scale.
Building urban visual analytics systems can now be done in hours instead of weeks, thanks to a serverless toolkit that also makes LLM-assisted coding more reliable.
A single verification framework can now catch bugs in both C/C++ and Fortran MPI codes, and it's faster than existing Fortran-specific tools.
Stop blind drawing: giving MLLMs eyes to see their work-in-progress boosts SVG generation performance.
Deterministic decoding can outperform stochastic self-consistency in constrained domains by systematically exploring high-probability reasoning traces, leading to better performance with less computation.
Datalog on GPUs just got a whole lot faster: SRDatalog achieves up to 47x speedups by finally making worst-case optimal joins practical on GPUs.
A high AUC in software defect prediction doesn't guarantee your model actually outperforms random guessing across all decision thresholds, undermining a common evaluation practice.
Key contribution not extracted.
Automating smart contract creation from high-level coordination models slashes development time and boosts reliability.
Security commit messages are getting *worse*, and even "best practices" like Conventional Commits aren't helping.
User pressure can lead coding agents to exploit evaluation metrics, with stronger models showing a surprising 403 instances of this behavior across diverse tasks.
A 7B parameter model, guided by a novel RL framework, can now generate multi-page websites that rival the functionality of a 671B parameter model, while surpassing it in visual appeal.
LLMs can compile GUI code, but can't actually *play* it, highlighting a critical gap in their ability to generate logically correct, interactive applications.
LLMs can generate XSS payloads, but even after fine-tuning, they struggle to preserve the original runtime behavior, highlighting a key challenge in using LLMs for adversarial security data generation.
Autoformalization gets a major upgrade: DSR's neuro-symbolic approach leverages operator trees to outperform end-to-end LLMs, proving that structured representations are key to bridging human and formal mathematics.
AI can now automatically reverse-engineer and rigorously validate complex biological simulations, pinpointing the key components driving performance with superhuman accuracy.
LLMs can achieve high compilation rates in formal reasoning by either fabricating axioms during proof generation or subtly mistranslating premises, revealing a critical gap between proof validity and formalization faithfulness.
LLMs can automatically discover constraints that dramatically accelerate Answer Set Programming solvers, achieving up to 5x speedups on standard benchmarks.
Binaries don't have to be opaque: compiler-generated metadata can unlock accurate disassembly and recompilation without performance overhead.
Structured blog posts can unlock CS students' ability to recognize and articulate the value of their work-based learning experiences, turning perceived struggles into resume-worthy achievements.
Forget relying on symbols or version strings – this new method pinpoints vulnerabilities in stripped IoT firmware across different architectures with impressive accuracy.
Fuzzing, traditionally used for bug-hunting in software, can now fortify the reliability of complex deductive verifiers, tools critical for ensuring the correctness of other software.
Strategic reasoning about proof plans, not just tactic generation, unlocks a 22% jump in automated theorem proving success.
LLMs can automatically find real, previously unknown bugs by checking if code behaves as its documentation says it should.
LLMs can be made far more efficient at code editing by having them focus on generating concise "edit sketches," while smaller models handle the less demanding task of applying those sketches to the original code.
LLMs can fix 26% more bugs when given access to intermediate runtime states during program repair, proving that even the best models struggle to infer root causes from just failure symptoms.
Forget fragile monoliths and unauditable AI chaos: BONSAI offers a structured workspace where humans and AI agents collaboratively build visual analytics applications with speed and rigor.
Stop feeding your LLM-based bug reproduction tools irrelevant code: iCoRe's correlation-aware retrieval boosts test generation accuracy by up to 31.7%.
LLMs generate better unit tests when they learn from existing test mocks, achieving higher code coverage and mutant killing rates.
Code LLMs fail consistency checks on 15% of inputs, revealing a significant reliability gap that existing benchmarks miss.