Search papers, labs, and topics across Lattice.
100 papers published across 7 labs.
Even closely related microcontrollers exhibit drastically different SRAM PUF performance under varying temperatures, underscoring the need for careful hardware selection.
Unlock geometric algebra's performance potential in neural networks and spatial computing by compiling directly from multi-way relationships, eliminating manual specialization and ensuring geometric correctness.
Seemingly idle LLM inference fleets can be secretly broken, and this simulator helps you find out why before you buy.
Multi-party function secret sharing just got a whole lot more practical: a new DDH-based scheme slashes key sizes by up to 10x.
LLM GPU fleets can be analytically optimized into a two-pool architecture with gateway-layer compression, slashing costs by up to 82% without sacrificing latency.
Pre-trained models unlock surprisingly aggressive quantization in federated learning, slashing communication costs by 40% without sacrificing accuracy on MNIST and CIFAR-100.
Sedna, a promising consensus protocol, is surprisingly vulnerable to cartel attacks that can stall block production and extract MEV, but a clever bounty mechanism can restore its security.
Network coding, often overlooked in robotics, can drastically improve the reliability and timeliness of multi-robot communication, outperforming traditional retransmission methods in safety-critical scenarios.
Quantum computers could finally unlock the full potential of machine learning for drug discovery by directly generating the quantum chemistry data that classical computers struggle to produce.
Federated recommendation systems can now better adapt to evolving user preferences without sacrificing privacy, thanks to a novel approach that retains historical knowledge and transfers insights between similar users.
YouTube's platform defenses are a house of cards: circumventing one control often triggers a cascade of failures, demanding constant architectural adaptation for large-scale content replication.
Ergodic control lets swarms of robots cooperatively manufacture micro-patterned surfaces, unlocking scalable production of materials with enhanced physical properties.
Forget buying new GPUs – clever context-length routing can boost your LLM inference energy efficiency by 2.5x, dwarfing the 1.7x gain from upgrading to a B200.
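The routing idea above can be sketched in a few lines: send each request to the most energy-efficient pool whose context window fits it. The pool names, window sizes, and joules-per-token figures below are purely illustrative assumptions, not numbers from the paper.

```python
def route(prompt_tokens, pools):
    """Pick the most energy-efficient pool whose context window fits
    the request; fall back to the largest window otherwise.
    `pools` maps a (hypothetical) pool name to (max_context, joules_per_token).
    """
    fitting = {name: p for name, p in pools.items() if prompt_tokens <= p[0]}
    if not fitting:
        # Nothing fits: send to the pool with the largest window.
        return max(pools, key=lambda name: pools[name][0])
    # Among fitting pools, choose the cheapest per token.
    return min(fitting, key=lambda name: fitting[name][1])

pools = {
    "short-ctx-pool": (4096, 0.4),     # small window, cheap per token
    "long-ctx-pool": (131072, 1.0),    # big window, expensive per token
}
print(route(1000, pools))    # -> short-ctx-pool
print(route(50000, pools))   # -> long-ctx-pool
```

The energy win comes from short requests never occupying hardware provisioned for long contexts.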
Automatically tracking causality across actors exposes hidden behavioral violations in real-world Erlang systems, without requiring manual code modifications.
NNVMC's promise for solving quantum many-body problems is currently bottlenecked by surprisingly mundane issues: low-intensity elementwise operations and data movement on GPUs.
Achieve up to 2.4x speedup over OpenBLAS on RISC-V by using MLIR and xDSL to generate optimized RVV code, finally unlocking the potential of RISC-V vector extensions.
Forget painstakingly tuning quantization for each LLM – RAMP learns a quantization policy that generalizes across architectures, often outperforming target-specific training.
Lossless compression can actually *speed up* LLM inference on GPUs, not just shrink model size, thanks to ZipServ's hardware-aware design.
Forget centralized control: this algorithm lets swarms of robots build complex shapes with only local communication and no global positioning.
Achieve significant latency and energy savings in memory systems with an RL-based controller that also provides insights into *why* its decisions are optimal.
Ditch backprop's limitations: this synthesizable RTL implementation brings predictive coding networks to life in fully distributed hardware.
Running robotic manipulation workloads entirely onboard kills robot batteries, but offloading to the cloud tanks accuracy due to network latency, revealing a critical compute placement trade-off.
SpiderCam shatters power consumption barriers for FPGA-based 3D cameras, achieving sub-Watt operation while maintaining real-time performance.
By federating distributional critics and using a Wasserstein barycenter trust region, TR-FedDistRL avoids the dangerous "mean-smearing" that can make federated RL unsafe in critical applications.
Independent sampling of graph partitions is now a practical alternative to MCMC, offering a new path for generating diverse redistricting plans.
Secure enclave updates and migrations, previously missing from RISC-V TEEs, are now practical thanks to a novel toolkit that adds minimal overhead.
Finally, a software energy profiler achieves both high accuracy and cross-platform portability, enabling practical algorithmic energy optimization across diverse languages and hardware.
Ditch the polar decomposition: MUD offers a surprisingly simple and efficient alternative for momentum whitening, speeding up transformer training by up to 50% compared to AdamW and Muon.
Even without architectural modifications, a new gradient inversion attack, ARES, can reconstruct high-fidelity training samples in federated learning, exposing a significant privacy risk.
Reproducibility in hardware reverse engineering is shockingly low, with only 4% of evaluated artifacts from 187 papers yielding reproducible results.
Federated Computing as Code lets you enforce data sovereignty in federated systems with cryptographic guarantees, moving beyond runtime policies and trust assumptions.
LLM serving systems can boost Time-To-First-Token (TTFT) attainment by up to 2.4x simply by prioritizing network flows based on a novel approximation of Least-Laxity-First scheduling.
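Least-Laxity-First prioritizes the flow with the least slack before its deadline is missed: laxity = deadline − now − remaining service time. A minimal sketch of that ordering rule, with made-up field names and timings (the paper's actual approximation is not reproduced here):

```python
import heapq

def llf_order(flows, now):
    """Order flow indices by laxity (slack before the TTFT deadline
    is blown). Smaller laxity = more urgent = scheduled first.
    Each flow is a dict with hypothetical fields:
      deadline  - absolute TTFT deadline (seconds)
      remaining - estimated remaining transfer time (seconds)
    """
    heap = [(f["deadline"] - now - f["remaining"], i)
            for i, f in enumerate(flows)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, i = heapq.heappop(heap)
        order.append(i)
    return order

flows = [
    {"deadline": 1.0, "remaining": 0.2},   # laxity 0.8
    {"deadline": 0.5, "remaining": 0.4},   # laxity 0.1 -> most urgent
    {"deadline": 2.0, "remaining": 0.1},   # laxity 1.9
]
print(llf_order(flows, now=0.0))  # -> [1, 0, 2]
```

Note the flow with the earliest deadline is not automatically first; a late-deadline flow with a long remaining transfer can still be more urgent.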
Forget slow, single-SSD paging: Swarm unlocks 2.7x higher bandwidth for LLM KV-cache offloading by exploiting stable co-activation patterns to parallelize I/O across multiple SSDs.
ROS 2's real-time performance gets a major boost with ReDAG-RT, a user-space scheduler that cuts deadline misses by up to 30% without touching the core ROS 2 API.
Biased compression, previously overlooked in distributed learning with gradient coding, can actually boost performance when combined with error feedback to mitigate straggler effects and reduce communication costs.
Forget wrestling with 5G/6G testbeds – Plaza6G lets you design and run wireless experiments with natural language, thanks to an LLM-powered assistant.
Fine-tune 123B+ parameter models on a single RTX 4090 with SlideFormer, a system that supports models up to 6x larger and batch sizes up to 8x larger.
Achieve sub-microsecond decoding-feedback latency in a scalable, open-source QEC system, bringing fault-tolerant quantum computation closer to reality.
Achieve near-linear scaling and 40x speedup for MP2 calculations on large molecules by unleashing multi-GPU parallelism for local correlation methods.
Visual SLAM loop closure just got a whole lot faster: FastLoop achieves up to 3x speedups by unleashing the power of GPU parallelism.
An existing debugging tool, the Arm Embedded Trace Macrocell (ETM), can be surprisingly repurposed to create a portable and effective hardware-assisted memory bandwidth regulator.
Resource-consumption vulnerabilities in LLMs can degrade service availability and undermine economic sustainability, demanding a systematic approach to understanding and mitigating them.
A novel DRL approach can extend XR device battery life by 163% without sacrificing real-time responsiveness, offering a practical solution to the energy-latency trade-off in immersive applications.
Forget stiff, piecewise designs: this soft robot arm achieves 4x faster dynamic task execution than previous approaches, proving that high-performance control and full compliance *can* coexist.
GitOps can transform CTF management, enabling automated deployments, enhanced collaboration, and cost-effective scaling.
A serverless, peer-to-peer messaging system achieves end-to-end encryption and data minimization, demonstrating a practical alternative to centralized messaging platforms.
Hooking the filesystem-specific `xfs_file_open` callback in ROFBS can significantly reduce ransomware damage on XFS filesystems, outperforming other generic file-open hooks.
A novel MARL algorithm, DS-PPO, enables multi-satellite systems to maximize user sum-rate despite outdated channel state information, offering a practical solution for robust global connectivity.
Forget hand-tuned defenses: a meta-learned aggregation strategy dynamically shields federated learning from a wide range of Byzantine attacks, even ones it's never seen before.
Forget relying on pretrained models or complex aggregation schemes: FederatedFactory achieves near-centralized performance in federated learning with extreme data heterogeneity by simply swapping generative priors.
A pragma-based OpenACC acceleration strategy delivers a 5x speedup and 3x energy reduction for the ECsim Particle-In-Cell code, proving its readiness for exascale plasma simulations.
Resource-constrained Arabic AI development can compete with systems built at far greater scale, as demonstrated by Fanar 2.0's performance gains using 8x fewer pre-training tokens than its predecessor.
Achieve energy-consistent parallel simulations of robotic systems with provable passivity guarantees, even with limited computational resources, by using a novel iterative coupling scheme.
Stop wasting precious GPU memory: this new cache-semantic hash table library achieves up to 3.9 billion key-value lookups per second, outperforming standard approaches by up to 9.4x.
A new 32B code LLM trained specifically for industrial tasks crushes existing models on specialized domains like chip design and GPU kernel optimization, while remaining competitive on general coding benchmarks.
Forget kinematic tree approximations: Kamino unlocks high-fidelity, massively parallel robot simulations with closed kinematic chains directly on GPUs.
Enterprises can regain control over network access in the age of MAC address randomization using a RADIUS-based framework that maintains persistent device identity without OS modifications.
A simple orthogonal rotation of the activation space makes LLMs virtually immune to bit-flip attacks, even against targeted single-point faults.
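The intuition behind the rotation defense: multiplying the weights by a random orthogonal matrix (and undoing it on the activations) leaves the layer's function unchanged, but diffuses any "critical" weight across an entire row, so no single stored value is worth flipping. A minimal NumPy sketch under that reading (dimensions and the planted outlier are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d))
W[0, 0] = 100.0                     # planted "critical" outlier weight
x = rng.standard_normal(d)

# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Store rotated weights; rotating the incoming activations undoes it,
# so the layer computes exactly the same output.
W_rot = W @ Q
assert np.allclose(W @ x, W_rot @ (Q.T @ x))

# The outlier is now spread over a whole row of W_rot: no single
# stored value dominates, so a targeted single-bit flip has far
# less leverage than flipping W[0, 0] directly.
print(np.abs(W).max(), np.abs(W_rot).max())
```

In a real deployment the rotation would be folded into adjacent layers so it costs nothing at inference time; this sketch only shows the equivalence and the outlier diffusion.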
Always-on hardware Trojans leave persistent statistical signatures in EM emissions that can be detected without a golden reference, even differentiating between workload-correlated and independent Trojans.
IC verification just got a whole lot easier: SAMSEM can segment metal lines in SEM images with surprisingly low error rates, even on unseen ICs.
Vectorizing Verilog designs slashes memory consumption by over 50% in formal verification, even without changing the underlying hardware.
UAV swarms can achieve near-optimal cooperative deployment and generalize to new team sizes using a communication-aware MARL approach, even with limited communication and partial observability.
Parallelizing sequential computations like RNNs is now more feasible thanks to new scalable and stable parallel Newton methods, along with a theoretical understanding of when such parallelization provably accelerates computation.
Blindly applying GPU optimizations to homomorphic encryption can leave nearly 2x performance on the table, as the best strategy hinges on CKKS parameters and GPU architecture.
Replay-driven validation slashes CPU-GPU integration time in chiplet architectures, enabling full system boot and workload execution in a single quarter.
Binary neural networks can now be trained effectively in federated settings, offering a path to low-cost, privacy-preserving edge inference without sacrificing accuracy.
Inference time can reveal the GPU models behind black-box LLM APIs, offering a way to estimate their hidden energy costs.
Sampling the wrong data in differentially private queries can inflate error by 10x, but a new method slashes that overhead by sampling aggregation units instead of users.
Now you can predict the structure of biomolecular assemblies exceeding 30,000 residues, thanks to a new context parallelism framework that shatters previous memory constraints.
Federated reinforcement learning can now handle heterogeneous, adversarial IoT environments with near-zero deadline violations, thanks to a novel decentralized framework that transfers knowledge across silos.
Worried about compromised cloud environments skewing your endpoint auditing? vCause offers a verifiable causality analysis system with negligible overhead.
Forget complex combinators: a simple multiplication trick can slash LLM latency by 92% and boost throughput by 21%, outperforming production schedulers.
Overcome resource constraints in federated learning by enabling clients to train spiking neural networks of varying sizes and aggregate their knowledge effectively.
Achieve faster, Byzantine-robust distributed learning by combining double momentum with variance reduction, eliminating the need for large batch sizes.
Achieve up to 50% energy savings and 80% latency reduction in edge-based object detection by intelligently balancing load across heterogeneous devices, at the cost of only a minor accuracy trade-off.
For spacecraft-bound neural networks, a new bit-serial matrix multiplication accelerator, bitSMM, delivers impressive GOPS/W on both FPGA and ASIC, promising efficient on-board inference.
Achieve near-ideal GPU sharing without kernel hacks: DetShare guarantees semantic and performance determinism through GPU coroutines and lightweight context migration.
Cuckoo filters on GPUs can now achieve performance rivaling append-only Bloom filters, thanks to a novel lock-free architecture and memory access optimization strategy that closes the gap between static and dynamic approximate membership query structures.
Multi-agent LLM systems can slash synchronization costs by up to 95% by borrowing cache coherence strategies from chip design.
LLMs can run up to 35% faster on chiplet architectures thanks to a new lossless exponent compression technique that slashes inter-chiplet communication overhead.
Interpretable machine learning unlocks holistic, data-driven design of SSDs, enabling continuous architectural advancements across memory generations.
LLMs can now scale depth more effectively: a new attention mechanism recovers diluted features in deeper layers, boosting performance with negligible overhead.
Exact sampling in large-vocabulary decoding can be sped up by 19% simply by fusing it into the LM-head matmul, turning a bandwidth bottleneck into a lightweight epilogue.
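One plausible mechanism behind such a fusion is the Gumbel-max trick: an exact softmax sample equals argmax(logits + Gumbel noise), and an argmax can be maintained as a running max while the LM-head matmul streams through vocabulary tiles, so the full logit vector never has to be written back to memory. A NumPy sketch of that equivalence (tile size and shapes are illustrative assumptions, not the paper's kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, tile = 1000, 64, 128          # vocab size, hidden dim, tile width

h = rng.standard_normal(d)          # final hidden state
W = rng.standard_normal((V, d))     # LM-head weight matrix
g = rng.gumbel(size=V)              # Gumbel noise, drawn up front

# Reference: materialize all logits, then take one exact sample via
# Gumbel-max (argmax(logits + g) is distributed as softmax(logits)).
ref = int(np.argmax(W @ h + g))

# "Fused" version: sweep the vocab in tiles, keeping only a running
# max, so the logits live only tile-by-tile in fast memory.
best, best_idx = -np.inf, -1
for start in range(0, V, tile):
    scores = W[start:start + tile] @ h + g[start:start + tile]
    j = int(np.argmax(scores))
    if scores[j] > best:
        best, best_idx = scores[j], start + j

assert best_idx == ref
```

The sample is exact, not approximate: both paths compute the same argmax over the same perturbed logits.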
Database tuning just got easier: DOT dynamically identifies and optimizes key parameters on-the-fly, outperforming existing methods without the need for costly warm-up phases.
Forget exotic attention mechanisms – MobileLLM-Flash achieves up to 1.8x faster LLM prefill on mobile CPUs by smartly pruning and adapting existing architectures for on-device use.
Squeezing federated learning through bandwidth-constrained networks? This routing and pruning method boosts accuracy by 12% while slashing latency by 28%.
MONET reveals the potential for significant hardware architecture improvements by modeling and optimizing neural network training, a domain often overshadowed by inference-centric design.
SALT offers a surprisingly effective way to personalize and harden split computing models in closed environments, using a lightweight adapter that outperforms full fine-tuning while slashing training costs.
Cykas lets long-running distributed jobs start and end sooner by cleverly shifting causal delivery enforcement from senders to receivers.
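Receiver-side causal delivery is a classic pattern: messages carry the sender's vector clock, and the receiver buffers out-of-order arrivals until each message's causal predecessors have been delivered, so senders never have to block. A minimal sketch of the standard delivery condition (this illustrates the general technique, not Cykas's specific protocol):

```python
class CausalReceiver:
    """Buffers incoming messages and delivers them in causal order.
    Each message carries the sender's vector clock at send time."""

    def __init__(self, n_nodes):
        self.clock = [0] * n_nodes   # count of delivered msgs per sender
        self.buffer = []
        self.delivered = []

    def _deliverable(self, sender, vc):
        # Next-in-sequence from this sender, and every message it
        # causally depends on from other senders is already delivered.
        return vc[sender] == self.clock[sender] + 1 and all(
            vc[k] <= self.clock[k] for k in range(len(vc)) if k != sender)

    def receive(self, sender, vc, msg):
        self.buffer.append((sender, vc, msg))
        progress = True
        while progress:                 # drain everything now unblocked
            progress = False
            for item in list(self.buffer):
                s, v, m = item
                if self._deliverable(s, v):
                    self.buffer.remove(item)
                    self.clock[s] += 1
                    self.delivered.append(m)
                    progress = True

r = CausalReceiver(2)
# Message "b" from node 0 causally follows "a" but arrives first.
r.receive(0, [2, 0], "b")
r.receive(0, [1, 0], "a")
print(r.delivered)  # -> ['a', 'b']
```

Delivery order respects causality even though the network reordered the messages, and the sender paid nothing for the guarantee.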
FPGAs can beat GPUs at dynamically allocating computation for LLM inference, thanks to a new architecture that fuses operations, uses mixed precision, and caches KV values on-chip.
Neuromorphic systems can achieve deterministic computation despite temporal stochasticity by enforcing charge conservation, enabling a direct mapping to quantized ANNs.
Hybrid Mamba-Transformer models can get 4x faster time to first token and 1.4x higher throughput by disaggregating prefill and decode phases onto specialized accelerator packages.
Twin-field QKD slashes the infrastructure complexity of quantum-secured blockchains from quadratic to linear scaling, paving the way for practical, long-distance deployments.
Domain skew in federated learning can be tamed by decoupling and calibrating domain-specific features, leading to more consistent and generalizable global models.
Oblivis enables practical, privacy-preserving database queries in cloud and edge settings, achieving up to 10^6x speedups over standard Oblivious Transfer methods.
Stop wasting compute: Sharing KV caches across tasks and time can make Vision-Language-Action models run 3.7x faster.
CacheLib, a popular caching engine, buckles under dynamic multi-tenant workloads, revealing critical limitations in adaptability and fairness that demand a rethink of its design.
Rule-based electromigration checks are no longer sufficient; physics-based models are ready for prime time, but several open problems must be solved to enable their practical adoption in integrated circuit design.
Optimizing committee configurations with mixed integer programming can boost transaction throughput in trusted parallel BFT systems by up to 21%, outperforming randomized assignment.
Achieve near-optimal power-efficient deep learning inference on edge devices without the need for expensive and repeated offline profiling, thanks to a novel online optimization method.