100 papers published across 3 labs.
Functional logic programs can be efficiently implemented in purely functional languages like Haskell, achieving performance gains over existing Curry compilers by using a novel monadic interface with memoization.
Meta's risk assessment of its Code World Model (CWM) gives it a clean bill of health, concluding it poses no *new* catastrophic risks beyond those already present in the AI landscape.
Current code reward models are myopic, mostly rewarding functional correctness, but Themis-RM learns to score code across multiple criteria and languages, opening the door to more nuanced and useful code generation.
LLMs can now generate 3D objects from text that are syntactically correct and geometrically consistent 70% of the time, thanks to retrieval-augmented code synthesis.
Transformers struggle to extrapolate to syntactically novel programs in program synthesis, even with significant compute scaling, suggesting current approaches are bottlenecked by a lack of training diversity.
LLMs can synthesize formal safety rules from natural language goals, offering a path to more robust and verifiable AI systems in safety-critical domains.
LLMs struggle to complete RTL code, and their performance hinges on the grammatical structure of the missing code and the prompting strategy used.
LLMs can learn to safely leverage external memory for code debugging by explicitly modeling and penalizing the risk of false-positive memory injection.
Template engine bugs often manifest as silent failures with unexpected or blank outputs, and fixing them frequently requires changes to host-side logic, not just the template itself.
LLMs still can't reliably reverse engineer stripped binaries, and REBench offers a standardized, fair-by-construction benchmark to finally measure progress.
Achieve near-perfect attribution of Android residential proxy malware by fusing graph kernel features with binary capabilities, even amidst code reuse and obfuscation.
Angular apps are riddled with hidden design flaws: this study surfaces 11 common "code smells" and shows how to automatically sniff them out.
Newcomers beware: the odds of your "good first issue" pull request getting merged have plummeted nearly 20% in the last year.
Recovering type information from untyped GPU register files is the key to enabling effective binary analysis, unlocking reverse engineering and security analysis of proprietary GPU code.
Today's best multimodal agents still fall into "blind execution" traps when building websites from ambiguous, non-expert user instructions, highlighting a critical gap in intent recognition and adaptive interaction.
Forget learning to answer – ANCORA shows language models can master verifiable reasoning by learning to *question* themselves.
Domain-adapting LLMs for EDA requires explicit RAG scenario training to prevent performance degradation, and QA augmentation during corpus construction further boosts performance.
Real-world Text-to-SQL systems can now be continuously evaluated and improved in production, even without access to database schemas or ground-truth queries.
Popular terminal-agent benchmarks are riddled with flaws, with over 15% of tasks being easily reward-hackable, undermining their ability to accurately assess LLM capabilities.
Text-to-SQL models can get a 36% accuracy boost and run 2.2x faster by exploiting the predictable patterns in real-world query workloads.
Domain knowledge, usually helpful, can actually *hurt* LLMs tackling complex engineering design modularization, revealing a fundamental tension between semantic priors and structural optimization.
MLLMs can ace circuit-to-code generation by cheating with identifier semantics, even when the circuit diagram is blank.
General-purpose coding agents may ace scientific visualization tasks, but their computational cost is a steep price compared to the efficiency of domain-specific agents, highlighting a crucial trade-off in LLM agent design.
LLMs can achieve robust nonmonotonic reasoning across diverse tasks without task-specific engineering, simply by iteratively self-correcting based on feedback from an ASP solver.
Despite advances in LLMs, even syntactically correct outputs often fail to achieve the intended state transitions when translating natural language into executable Ethereum transactions, revealing a critical gap in "reasoning-to-execution" capabilities.
Automating the translation of economic intuitions into executable computational experiments is now possible, potentially accelerating the pace of economic research.
LLMs can now reliably generate IC verification testbenches, not by writing HDL directly, but by orchestrating a novel hybrid approach that combines LLM-driven planning with template-based HDL generation.
LLMs trained with ScaleBox, a new high-fidelity code verification system, substantially outperform those trained with heuristic matching, suggesting current RLHF methods are bottlenecked by verification quality.
LLMs can edit code 30% faster and cheaper without sacrificing accuracy, simply by learning to choose between generating full code and structure-aware diffs.
LLMs trained on raw code text learn surface-level cues that trigger false positives when detecting vulnerabilities in other languages, but simply feeding them ASTs at inference time can dramatically reduce these errors.
You can steal secrets from locally fine-tuned LLMs by backdooring their model code, even bypassing common defenses like differential privacy and code audits.
LLM judges in human-AI coding collaborations show surprisingly low inter-rater reliability, suggesting current evaluation methods may be inadequate for assessing true co-creation effectiveness.
Code dataset watermarking gets a stealthy upgrade: PuzzleMark hides watermarks in variable names based on code complexity, making them nearly undetectable while guaranteeing perfect verification.
Forget end-to-end automation: Pragmos shows how LLMs can actually *improve* business process modeling by collaborating with humans in a structured, step-by-step workflow.
Turns out, even in the age of AI, good old-fashioned communication and teamwork are still the bedrock of successful agile software development.
"Utility" code, intended to be broadly useful and reusable, is actually 2.75x more likely to be involved in a vulnerability than other code.
Turns out, the best template for documenting architectural decisions depends on whether you value conciseness (Nygard) or structural detail (MADR).
Defining "hero developers" in open-source projects is more nuanced than previously thought: technical prowess doesn't guarantee social engagement, and vice versa, impacting bug-fixing success in surprising ways.
Applying traditional technology acceptance models like UTAUT to GenAI reveals critical gaps in our understanding of how software engineers perceive and adopt these transformative tools.
Mitigating long-tail distributions in code datasets boosts API recommendation reliability by up to 10% using an ensemble of models that strategically reject low-confidence predictions.
Automating CUTLASS kernel synthesis and auto-tuning lets you get 2.79x speedups on real models like MiniGPT just by having an LLM rewrite your PyTorch.
LLMs don't automatically win at study screening for software engineering systematic literature reviews (SLRs): their performance is highly variable, sensitive to input data, and not consistently better than classical models.
LLMs can help find functionally identical smart contracts even when the original code lacks comments, opening the door to better vulnerability detection and code reuse.
Quantum programs can achieve seemingly high structural coverage, yet this bears little relation to their actual fault detection capability, echoing a cautionary tale from classical software testing.
Reproducibility issues plague over 20% of Defects4J, a widely used benchmark for automated program repair, casting doubt on the validity of many APR evaluations.
Replaying CI failures in embedded systems is now possible at scale: PhantomRun reconstructs over 90% of failing builds, opening the door to systematic debugging and failure analysis.
You can slash false positives in PyPI malware detection by 82% while simultaneously reducing feature dimensionality by 50% using a carefully tuned deep learning approach.
Security testing is fragmented: program analysis and adaptive testing operate largely in isolation, missing opportunities to leverage structural insights for more effective vulnerability detection.
AI agents and humans exhibit over 10 distinct repair behaviors when performing urgent hot fixes, suggesting opportunities for targeted human-automation collaboration.
Automated testing of dynamic resource management frameworks in HPC is now possible, catching faults earlier and simplifying maintenance.
Building agents that can reliably automate complex, multi-step workflows over local files and tools just got a whole lot easier.
Unlock 3x higher throughput in your data center by easily converting MPI applications to malleable jobs with a new library.
LLMs still struggle to generate complete, internally structured classes from specifications, with even the best models failing more than half the time on a new benchmark designed to avoid data contamination.
Task-specific LLMs aren't just smaller versions of general models; they rely on a small subset of neurons so critical that removing just 10% can completely break them.
Over-reliance on AI code generation isn't just making developers lazy, it's creating a dangerous "Epistemological Debt" that could trigger systemic software failures.
The rise of agentic AI coding systems doesn't spell the end for SaaS, but it *does* fundamentally alter the economics of building in-house, creating a hybrid governance model that blends code ownership with dependence on external AI infrastructure.
Defend against hardware Trojans in LLM-generated RTL code by structurally and semantically verifying training data, without needing to alter the underlying LLM.
Code-level security audits miss vulnerabilities arising from specification requirements, but SPECA finds them by reasoning directly from natural language specs.
Code stylometry, often overlooked, can significantly boost vulnerability detection, improving F1 scores by up to 48% on key benchmarks.
LLMs fail to generate secure cryptographic code the vast majority of the time, with 57% of compiled samples containing exploitable vulnerabilities like nonce reuse.
UK computer science grads may be over-indexed on database management while woefully unprepared for the software design and planning skills that industry actually needs.
Strict modular testing in EvoSuite tanks coverage, but relaxing target method isolation and prioritizing relevant call chains can boost coverage by 15%.
Forget LLMs, simple process metrics like code age and developer activity are the real MVPs for predicting bugs that slip into production Python code.
Forget hand-coded goals: these agents rewrite their own code and redefine their objectives on the fly, powered by LLMs.
CS education risks irrelevance if it continues to prioritize rote coding skills over the systems-level thinking needed to build and manage complex AI-driven systems.
Automated program repair still struggles in real-world CI environments, succeeding in less than 20% of cases, even with the best LLMs.
Post-release software bugs aren't just about code complexity; they're a symptom of code age, frequent modification, and high churn, demanding a shift in testing focus.
Stop letting your research code, theory, and documentation drift apart: a new LM orchestration method keeps them synchronized, slashing error rates in a case study by over 50%.
Resource-oriented smart contract languages like Move cut security code by 60%, suggesting a path to safer DeFi even if it means writing more code.
Unlock verification artifact reuse across languages by representing programs as typed, attributed graphs that capture both structure and semantics.
Enforcing classical test-driven development principles directly within prompt orchestration enables more reliable and reproducible code generation from LLMs.
Agentic AI has exploded in software engineering, achieving a 40x performance leap on SWE-bench in just 18 months, signaling a fundamental shift from code generation to AI-driven delegated execution.
LLMs in software engineering are mostly used for automation, not decision support, and suffer from reproducibility issues, revealing a critical gap in human-centered integration and transparency.
Forget hype, focus on human oversight: this study reveals practical, actionable recommendations for actually integrating LLMs into software development workflows responsibly.
Get significantly higher test coverage from your BDD scenarios by automatically translating them into formal models.
LLMs can now generate more complete and up-to-date code documentation 3x faster while using 85% fewer tokens, thanks to a novel knowledge graph representation of code repositories.
Smaller models get a bigger speed boost from Speculative Decoding on software engineering tasks, challenging the assumption that larger models always benefit more from inference acceleration techniques.
Stop manually juggling MBSE models and OCL constraints: this framework uses Asset Administration Shells to automate validation and interpretation.
Current AI models are surprisingly inept at real-world data visualization tasks, failing more than half the time on a new benchmark designed to mimic enterprise workflows.
SkillSynth's skill graph approach lets you explicitly control the diversity of execution trajectories during terminal task synthesis, leading to more effective agent training.
Slash your LLM's carbon footprint by up to 81% without sacrificing performance using a compression pipeline inspired by carbon taxation.
Decentralized debate among LLM agents doesn't just select the best solution for optimization modeling; it structurally enables agents to refine flawed candidates and even recover correct formulations through interaction.
Forget scaling laws: for code classification and vulnerability detection, the *right* code-specialized PLM matters more than GNN architecture or PLM size in PLM-GNN hybrids.
Transformer-based language models don't always win: simpler, TF-IDF-based models surprisingly outperform them in fault localization using industrial bug reports.
Multi-agent code editing with structured failure feedback boosts task success by 17%, suggesting a promising path to more reliable LLM-driven code manipulation.
LLM-generated API tests can be *less* effective when refined against faulty code, especially when requirements are vague, suggesting that blindly incorporating SUT behavior isn't always beneficial.
LLMs might sound good at designing networked systems, but they're surprisingly bad at avoiding configurations that violate basic constraints, highlighting the need for structured reasoning frameworks like Kepler.
Software vulnerability detection gets a serious upgrade: aligning code with developer comments boosts F1 scores by up to 27% compared to traditional code-only methods.
Current MBSE models are failing to leverage the full potential of AI, demanding a fundamental shift towards co-designing models and methodologies that prioritize machine-queryability.
LLMs can now automatically design and execute experiments to resolve debates between cognitive science theories, even discovering the models and experiments themselves.
Securing AI-native enterprise systems demands a shift from traditional software validation to dynamic formal verification of stochastic agent behavior, as demonstrated by a Semantic Gateway that uncovers 100% of unauthorized state transitions.
Evolving interpretable control policies for multi-task robots is now possible: MATPG leverages genetic programming to create a single agent that masters multiple continuous control tasks.
Agentic AI systems can confidently generate plausible but wrong scientific results, even when given domain-specific context, highlighting a critical challenge for their integration into research workflows.
Text-to-SQL models can now achieve significantly higher accuracy by grouping and ranking SQL candidates based on execution results, then strategically resampling when the initial pool is lacking.
Coding agents can now evolve their own harnesses to outperform human-designed ones, thanks to a novel observability-driven approach.
Students already believe foundational theory is relevant to their careers, so adding real-world examples may not be the best way to increase student buy-in.
Organizational coupling in microservices isn't just about architecture – it's heavily influenced by the "Connector" roles bridging organizational silos, suggesting targeted interventions are possible.
Unlock expert developer reasoning: a new dataset distills complex GitHub issue discussions into structured trajectories, revealing the collaborative problem-solving process behind open-source software.
LLMs can nail the final answer in code execution but still fail to reason about the steps to get there, exposing a critical flaw in current evaluation methods.
Research software engineers (RSEs) aren't just coders; a strong collective identity shapes their professional wellbeing, revealing a crucial social dimension in software engineering.