Distributed training, model parallelism, AI accelerator design, and large-scale compute infrastructure.
Pythonistas rejoice: aggregate programming, a powerful paradigm for distributed systems, finally gets a first-class, easy-to-use library in your favorite language.
Automating detector design with AI can dramatically speed up scientific discovery by intelligently exploring complex parameter spaces.
LLMs' skewed matrix shapes need not hamstring systolic array performance: SISA's partitioned architecture achieves up to 8.52x speedup and 93% EDP reduction compared to monolithic arrays.
FL systems are far more vulnerable to backdoor attacks using realistic, semantically aligned triggers (like sunglasses) than evaluations built around simple corner patches had suggested.
Commodity CPUs can be retrofitted with hardware-backed control flow attestation using hardware performance counters, enabling runtime attack detection in TEEs.
Now, clients can actually *verify* that their data has been removed from a federated learning model, even when the server is untrusted.
Achieve structured IPC and practical message movement in modular services with CNS, a lightweight hybrid event fabric that bridges in-process and inter-node communication with minimal overhead.
Stop averaging prototypes blindly: FedDBP uses Fisher information to intelligently fuse local prototypes, significantly boosting performance in heterogeneous federated learning.
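The Fisher-weighted fusion idea fits in a few lines. The sketch below is illustrative, not FedDBP's actual algorithm: `fisher_weighted_fusion` is an assumed name, and the inputs are assumed to be per-client diagonal Fisher estimates that replace the uniform weights of a plain prototype mean.

```python
import numpy as np

def fisher_weighted_fusion(prototypes, fishers, eps=1e-8):
    """Fuse per-client class prototypes with per-dimension Fisher weights
    instead of a blind mean."""
    P = np.stack(prototypes)                       # (n_clients, d)
    F = np.stack(fishers)                          # (n_clients, d)
    W = F / (F.sum(axis=0, keepdims=True) + eps)   # normalize weights per dimension
    return (W * P).sum(axis=0)                     # (d,) fused prototype

# A client whose diagonal Fisher is large in a dimension (i.e. whose estimate
# there is informative) dominates the fused prototype in that dimension.
protos = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
fisher = [np.array([9.0, 1.0]), np.array([1.0, 9.0])]
fused = fisher_weighted_fusion(protos, fisher)     # ≈ [0.9, 0.9]
```

Here each client wins the dimension it is confident about, which is exactly where uniform averaging would wash out the signal.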
Guaranteeing safety in multi-agent systems with dynamic networks doesn't have to sacrifice performance: this plug-and-play protocol ensures recoverable safety even when agents join/leave or network topologies shift.
Achieve HPC acceleration by emulating FP64 operations with INT8 precision on GPUs, proving that you can boost performance *and* accuracy.
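The underlying trick, known from Ozaki-scheme-style emulation, splits each FP64 operand into a sum of scaled INT8 slices, multiplies slices exactly in integer arithmetic, and accumulates the scaled products. A minimal NumPy sketch of that idea, not the paper's implementation (`split_int8` and `int8_matmul_fp64` are illustrative names; production kernels scale per row and run the integer products on tensor cores):

```python
import numpy as np

def split_int8(M, num_slices=6):
    """Split a float64 matrix into INT8 slices: M ≈ sum_k slices[k] * scales[k]."""
    slices, scales, R = [], [], M.astype(np.float64).copy()
    for _ in range(num_slices):
        amax = np.abs(R).max()
        if amax == 0.0:
            break
        s = amax / 127.0
        q = np.round(R / s).astype(np.int8)
        slices.append(q)
        scales.append(s)
        R = R - q.astype(np.float64) * s   # residual carries the remaining precision
    return slices, scales

def int8_matmul_fp64(A, B, num_slices=6):
    """Emulate an FP64 matmul by summing exact INT8 slice products."""
    As, sa = split_int8(A, num_slices)
    Bs, sb = split_int8(B, num_slices)
    C = np.zeros((A.shape[0], B.shape[1]))
    for qa, xa in zip(As, sa):
        for qb, xb in zip(Bs, sb):
            # INT8 x INT8 with INT32 accumulation is exact, mirroring tensor-core IMMA
            C += (qa.astype(np.int32) @ qb.astype(np.int32)) * (xa * xb)
    return C

rng = np.random.default_rng(0)
A, B = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
C = int8_matmul_fp64(A, B)   # agrees with A @ B to well below 1e-9
```

Each slice shrinks the residual by more than two orders of magnitude, so six slices recover roughly FP64-level accuracy while every multiply is an integer operation.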
Quantum circuit compilation, a major bottleneck, can be sped up by over 15x with minimal overhead using a new parallelization technique validated on 8000 large-scale, configurable random circuits.
Datacenter simulations can now combine multiple independent models to better predict performance and climate impact, addressing limitations of single-model approaches.
Unexplained P99.9 latency spikes in Apache Pulsar could be due to a previously undocumented Linux kernel page cache writeback interaction inside BookKeeper's ForceWriteThread, even with dedicated NVMe drives.
Sometimes, knowing less (limiting computation to polynomial time) can let you decide *more* in distributed systems, especially with universal certificates.
Dataflow networks can achieve significant energy savings without sacrificing throughput by strategically powering down actors during idle periods, a balance efficiently discovered using a novel "Hop and Skip" exploration strategy.
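The core decision behind idle power-down is a break-even test: gating an actor pays off only when the idle gap is long enough to amortize the sleep/wake transition energy. A minimal sketch of that reasoning (function names and figures are illustrative; the paper's "Hop and Skip" search over gating schedules is not shown):

```python
def gating_saves_energy(idle_s, p_idle_w, p_sleep_w, transition_j):
    """Power down an actor for an idle interval only if the energy saved
    exceeds the cost of the sleep/wake transitions."""
    saved_j = (p_idle_w - p_sleep_w) * idle_s
    return saved_j > transition_j

def breakeven_idle_s(p_idle_w, p_sleep_w, transition_j):
    """Shortest idle interval for which power-gating pays off."""
    return transition_j / (p_idle_w - p_sleep_w)

# An actor idling at 1.0 W that sleeps at 0.1 W, paying 0.45 J for the
# sleep/wake round trip, should only be gated for idle gaps above 0.5 s.
threshold_s = breakeven_idle_s(1.0, 0.1, 0.45)   # 0.5
```

Throughput is preserved as long as gating is restricted to gaps above this threshold plus the wake latency, which is the trade-off an exploration strategy has to search.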
Pinpointing performance bottlenecks in large-scale AI training just got 100x faster, thanks to a new system that watches the whole stack without slowing things down.
Finally, a gem5-integrated simulator that accurately models CXL memory expansion for LLMs, capturing real-world effects like cache pollution.
Achieve up to 4.17x speedup in DRL training by intelligently partitioning tasks across CPUs, FPGAs, and AI Engines on AMD Versal ACAP, demonstrating the power of hardware-aware algorithm design.
Unlock 600,000x faster TSV design by replacing computationally expensive full-wave simulations with physics-informed graph neural networks.
Calculating excited states of molecules with thousands of atoms, previously a computational bottleneck, is now practical on a single GPU thanks to a new implementation of TDDFT-risp.
Optimized LoRaWAN gateway placement hinges on the channel model used, with ray tracing offering higher fidelity but at a significant computational cost.
Unbalanced class prevalence, not just disjoint label sets, is the dominant factor hindering federated learning performance under label-space heterogeneity.
Compromised 5G networks can be weaponized with chained, undetectable command and control channels, enabling attacks that bypass existing security measures.
Second-order federated learning can be made robust and practical: FedRCO overcomes instability issues and outperforms first-order methods in non-IID settings.
Differentiable Power-Flow unlocks scalable, gradient-based optimization for power grid management, outperforming traditional methods and enabling new applications like real-time contingency analysis.
Federated learning can overcome data sparsity and privacy concerns to improve livestock growth prediction using real-world farm data.
Agentic RL rollouts are bottlenecked by long-tail trajectory generation, but Heddle's trajectory-centric approach achieves 2.5x higher throughput.
FedDES achieves instance-level personalization in federated learning by dynamically selecting and weighting peer models with a GNN, leading to significant performance gains in heterogeneous environments.
Guaranteeing robust distributed GenAI inference at the edge requires trust-aware routing, and G-TRAC achieves this with sub-millisecond routing latency.
Quantum-proofing your 5G core doesn't have to break the bank: a sidecar proxy can add post-quantum cryptography with a predictable 50ms latency hit.
Lightweight DisCNNs offer a surprisingly efficient route to object detection by exploiting monotonic relationships between network outputs and feature presence.
RDMA failover can be made significantly more efficient and correct by selectively retransmitting only the requests that were actually lost during a link failure, avoiding redundant retransmissions and semantic violations.
Squeezing loop control down to <10% of array resources unlocks near-zero-overhead parallel loop acceleration on Tightly Coupled Processor Arrays.
Forget CPUs and GPUs: MCPT-Solver uses spintronics and Bayesian inference to create a hardware random number generator that dramatically accelerates Monte Carlo particle transport simulations.
LLMs can now automatically evolve and optimize GPU kernels to beat hand-tuned and proprietary models like Gemini and Claude.
By cleverly repurposing an unused sign bit, IF4 achieves superior quantization performance compared to NVFP4 without increasing bit-width.
Forget slow rotations: IsoQuant's quaternion-based approach outpaces RotorQuant in LLM KV cache compression, delivering up to 6x speedups on synthetic data.
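The principle behind rotation-based KV-cache quantization is easy to demonstrate: an orthogonal rotation spreads outlier energy across coordinates, shrinking the absmax scale and hence the quantization error. The sketch below uses a generic random orthogonal matrix rather than IsoQuant's quaternion construction, which is not reproduced here:

```python
import numpy as np

def quantize_int4(x):
    """Symmetric absmax INT4 quantization of a vector (dequantized back)."""
    s = np.abs(x).max() / 7.0
    return np.clip(np.round(x / s), -8, 7) * s

rng = np.random.default_rng(0)
x = rng.normal(size=128)
x[3] = 40.0                          # one outlier inflates the absmax scale

# A random orthogonal rotation (QR of a Gaussian matrix) spreads the
# outlier's energy across all coordinates, so the quantization grid is finer.
Q, _ = np.linalg.qr(rng.normal(size=(128, 128)))
err_plain   = np.linalg.norm(quantize_int4(x) - x)
err_rotated = np.linalg.norm(Q.T @ quantize_int4(Q @ x) - x)
```

Because the rotation is orthogonal, it can be undone exactly after dequantization, so the accuracy win comes purely from the better-conditioned value distribution.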
Blockchain-based federated learning can be made practical by using multi-task peer prediction to overcome the computational bottleneck of contribution measurement.
Bitcoin can be more than just digital gold: BitSov proposes a composable architecture for a censorship-resistant internet, anchored to Bitcoin's blockchain, that could reshape how we build decentralized applications.
Backdoor defenses can be baked into the pre-training phase of federated learning, achieving state-of-the-art attack mitigation with minimal impact on clean accuracy.
FedBBA slashes backdoor attack success rates to as low as 1.1% in federated learning, leaving existing defenses in the dust.
Achieve secure outsourced decision tree evaluation without any communication between servers, unlocking faster and more scalable MLaaS deployments.
Flow-matching generative models can simultaneously defend against poisoning attacks and preserve privacy in federated learning, outperforming existing methods in accuracy and robustness.
Pinpointing root causes in distributed systems just got easier: Lumos automatically exposes the computational history of bugs with low overhead, even with limited bug occurrences.
Forget hand-coding adapters: this middleware uses LLMs to automatically bridge REST APIs, GraphQL endpoints, and IoT devices with a 90% success rate.
Real-time 3D occupancy mapping for edge devices is now possible under a 6mW power budget thanks to Gleanmer, a novel SoC.
Cloud databases are leaving performance on the table: optimizing kernel-space I/O can yield up to 9x speedups without requiring kernel or database patches.
Securing and accelerating Slurm cluster access is now possible without rewriting existing tools, thanks to a lightweight proxy that adds granular permissions and caching.
LLM inference bottlenecks aren't just compute-bound: heterogeneous GPU-FPGA systems can slash memory processing overheads by up to 2x while simultaneously reducing energy consumption.
Distributed vertex coloring can now be solved in near-optimal $\tilde{O}(\log^4 \log n)$ rounds, closing the gap with the theoretical lower bound and exponentially improving performance for graphs with small maximum degree.