Search papers, labs, and topics across Lattice.
72 papers published from 1 lab.
Pythonistas rejoice: aggregate programming, a powerful paradigm for distributed systems, finally gets a first-class, easy-to-use library in your favorite language.
Automating detector design with AI can dramatically speed up scientific discovery by intelligently exploring complex parameter spaces.
LLMs' skewed matrix shapes need not hamstring systolic array performance: SISA's partitioned architecture achieves up to 8.52x speedup and 93% EDP reduction compared to monolithic arrays.
FL systems are far more vulnerable to backdoor attacks using realistic, semantically aligned triggers (like sunglasses) than evaluations with simple corner-patch triggers had suggested.
Commodity CPUs can be retrofitted with hardware-backed control flow attestation using hardware performance counters, enabling runtime attack detection in TEEs.
Now, clients can actually *verify* that their data has been removed from a federated learning model, even when the server is untrusted.
Achieve structured IPC and practical message movement in modular services with CNS, a lightweight hybrid event fabric that bridges in-process and inter-node communication with minimal overhead.
Stop averaging prototypes blindly: FedDBP uses Fisher information to intelligently fuse local prototypes, significantly boosting performance in heterogeneous federated learning.
Guaranteeing safety in multi-agent systems with dynamic networks doesn't have to sacrifice performance: this plug-and-play protocol ensures recoverable safety even when agents join/leave or network topologies shift.
Achieve HPC acceleration by emulating FP64 operations with INT8 precision on GPUs, proving that you can boost performance *and* accuracy.
Quantum circuit compilation, a major bottleneck, can be sped up by over 15x with minimal overhead using a new parallelization technique validated on 8000 large-scale, configurable random circuits.
Datacenter simulations can now combine multiple independent models to better predict performance and climate impact, addressing limitations of single-model approaches.
Unexplained P99.9 latency spikes in Apache Pulsar could be due to a previously undocumented Linux kernel page cache writeback interaction inside BookKeeper's ForceWriteThread, even with dedicated NVMe drives.
Sometimes, knowing less (limiting computation to polynomial time) can let you decide *more* in distributed systems, especially with universal certificates.
Dataflow networks can achieve significant energy savings without sacrificing throughput by strategically powering down actors during idle periods, a balance efficiently discovered using a novel "Hop and Skip" exploration strategy.
Pinpointing performance bottlenecks in large-scale AI training just got 100x faster, thanks to a new system that watches the whole stack without slowing things down.
Finally, a gem5-integrated simulator that accurately models CXL memory expansion for LLMs, capturing real-world effects like cache pollution.
Achieve up to 4.17x speedup in DRL training by intelligently partitioning tasks across CPUs, FPGAs, and AI Engines on AMD Versal ACAP, demonstrating the power of hardware-aware algorithm design.
Unlock 600,000x faster TSV design by replacing computationally expensive full-wave simulations with physics-informed graph neural networks.
Calculating excited states of molecules with thousands of atoms, previously a computational bottleneck, is now practical on a single GPU thanks to a new implementation of TDDFT-risp.
Optimized LoRaWAN gateway placement hinges on the channel model used, with ray tracing offering higher fidelity but at a significant computational cost.
Unbalanced class prevalence, not just disjoint label sets, is the dominant factor hindering federated learning performance under label-space heterogeneity.
Compromised 5G networks can be weaponized with chained, undetectable command and control channels, enabling attacks that bypass existing security measures.
Second-order federated learning can be made robust and practical: FedRCO overcomes instability issues and outperforms first-order methods in non-IID settings.
Differentiable Power-Flow unlocks scalable, gradient-based optimization for power grid management, outperforming traditional methods and enabling new applications like real-time contingency analysis.
Federated learning can overcome data sparsity and privacy concerns to improve livestock growth prediction using real-world farm data.
Agentic RL rollouts are bottlenecked by long-tail trajectory generation, but Heddle's trajectory-centric approach achieves 2.5x higher throughput.
FedDES achieves instance-level personalization in federated learning by dynamically selecting and weighting peer models with a GNN, leading to significant performance gains in heterogeneous environments.
Guaranteeing robust distributed GenAI inference at the edge requires trust-aware routing, and G-TRAC achieves this with sub-millisecond routing latency.
Quantum-proofing your 5G core doesn't have to break the bank: a sidecar proxy can add post-quantum cryptography with a predictable 50ms latency hit.
Lightweight DisCNNs offer a surprisingly efficient route to object detection by exploiting monotonic relationships between network outputs and feature presence.
RDMA failover can be made significantly more efficient and correct by selectively retransmitting only the requests that were actually lost during a link failure, avoiding redundant retransmissions and semantic violations.
Squeezing loop control down to <10% of array resources unlocks near-zero-overhead parallel loop acceleration on Tightly Coupled Processor Arrays.
Forget CPUs and GPUs: MCPT-Solver uses spintronics and Bayesian inference to create a hardware random number generator that dramatically accelerates Monte Carlo particle transport simulations.
LLMs can now automatically evolve and optimize GPU kernels, beating both hand-tuned kernels and those produced by proprietary models like Gemini and Claude.
By cleverly repurposing an unused sign bit, IF4 achieves superior quantization performance compared to NVFP4 without increasing bit-width.
Forget slow rotations: IsoQuant's quaternion-based approach outpaces RotorQuant in LLM KV cache compression, delivering up to 6x speedups on synthetic data.
Blockchain-based federated learning can be made practical by using multi-task peer prediction to overcome the computational bottleneck of contribution measurement.
Bitcoin can be more than just digital gold: BitSov proposes a composable architecture for a censorship-resistant internet, anchored to Bitcoin's blockchain, that could reshape how we build decentralized applications.
Backdoor defenses can be baked into the pre-training phase of federated learning, achieving state-of-the-art attack mitigation with minimal impact on clean accuracy.
FedBBA slashes backdoor attack success rates to as low as 1.1% in federated learning, leaving existing defenses in the dust.
Achieve secure outsourced decision tree evaluation without any communication between servers, unlocking faster and more scalable MLaaS deployments.
Flow-matching generative models can simultaneously defend against poisoning attacks and preserve privacy in federated learning, outperforming existing methods in accuracy and robustness.
Pinpointing root causes in distributed systems just got easier: Lumos automatically exposes the computational history of bugs with low overhead, even with limited bug occurrences.
Forget hand-coding adapters: this middleware uses LLMs to automatically bridge REST APIs, GraphQL endpoints, and IoT devices with a 90% success rate.
Real-time 3D occupancy mapping for edge devices is now possible under a 6mW power budget thanks to Gleanmer, a novel SoC.
Cloud databases are leaving performance on the table: optimizing kernel-space I/O can yield up to 9x speedups without requiring kernel or database patches.
Securing and accelerating Slurm cluster access is now possible without rewriting existing tools, thanks to a lightweight proxy that adds granular permissions and caching.
LLM inference bottlenecks aren't just compute-bound: heterogeneous GPU-FPGA systems can slash memory processing overheads by up to 2x while simultaneously reducing energy consumption.
Distributed vertex coloring can now be solved in near-optimal $\tilde{O}(\log^4 \log n)$ rounds, closing the gap with the theoretical lower bound and exponentially improving performance for graphs with small maximum degree.
Deploying transformers in real-time just got a whole lot faster: this work achieves up to 64x speedups on GPUs while maintaining accuracy through a novel hybrid precision approach.
Forget fixed memory budgets: dynamically allocating exemplar storage across federated clients boosts performance in class-incremental learning for heterogeneous medical data.
Intra-warp load imbalance, a major bottleneck in GPU-accelerated Electronic Design Automation, can be eliminated through warp-level parallel orchestration, leading to significant speedups in static timing analysis.
Content-oblivious networks can count and simulate message passing far more efficiently than previously thought, shrinking the pulse complexity from $O(n^3)$ to $O(n \log^2 n)$ for counting and $O(b)$ per process for message simulation.
Achieve strong, controllable privacy in federated biomedical AI without sacrificing performance, thanks to a lightweight key-embedded implicit neural representation.
Save time and resources: predict federated learning performance *before* deployment by quantifying dataset and client complexity.
A space-tailored OS improves task completion over Kubernetes by nearly 100%, thanks to smarter resource awareness in fragmented, network-constrained environments.
Differentiable optimization can supercharge classical ILP solvers, slashing runtime by 10x on combinatorial scheduling problems.
Open-source RISC-V microcontrollers are now easier to build, thanks to a streamlined design and fully open RTL-to-GDS flow.
Achieve high-speed, low-latency object detection in autonomous systems by integrating spiking neural networks and dynamic image signal processing on an FPGA.
Training large models without communication overhead is now practical: OptINC uses optical interconnects to perform gradient averaging and quantization directly in the network.
Forget GPU-centric All-Reduce: SCIN's switch-based architecture slashes latency by up to 8.7x and boosts LLaMA-2 performance by 34% through in-network quantization.
Achieve up to 32.1% energy-delay product improvement in high-speed adders by co-optimizing prefix topology and standard cell mapping, outperforming commercial synthesis tools.
Forget relying on centralized trust: a decentralized witnessing-zone architecture can boost sensor data trustworthiness against fabricated events.
Optimizing OpenFOAM with GPU ports and selective-memory techniques slashes energy consumption by 28% and iteration time by 72% compared to purely hardware-focused approaches.
Apple's own vDSP FFT library gets smoked by a new implementation that's 29% faster, thanks to a clever two-tier memory model exploiting the GPU's register file and threadgroup memory.
Ternary LLMs can run up to 62x faster on CPU and 1.9x faster on CUDA with RSR-core, a new engine that finally brings theoretically fast low-bit matrix multiplication to practical hardware.
Switching HPC schedulers mid-lifecycle doesn't have to break everything: a carefully staged transition can dramatically improve queue times and user adoption.
Propagating mega-constellations is now 1500x faster thanks to a JAX-based SGP4 reimplementation, making large-scale collision avoidance tractable.
Current blockchain scalability solutions often fall short of meeting the stringent real-time demands of IoT applications, highlighting the need for adaptive and AI-driven approaches.
Multimodal federated learning can finally handle the messy reality of missing data with BLOSSOM's block-wise personalization, boosting performance by up to 37.7% compared to naive aggregation.
Multi-chiplet architectures can unlock significant speedups and memory savings for low-batch MoE inference by dynamically scheduling expert computations across high-bandwidth die-to-die links.