70 papers published across 6 labs.
Rowhammer attacks aren't just for CPUs anymore: a malicious CUDA kernel can now leverage targeted bit flips to achieve root access on a system, even bypassing IOMMU protections.
Forget heuristics: this queueing theory framework precisely predicts LLM inference stability under KV cache constraints, letting you right-size your GPU cluster.
Training MoE models just got a whole lot faster: Piper achieves up to 3.5x higher MFU by intelligently scheduling pipeline parallelism and optimizing communication.
Stop training in isolation: LNTrust lets decentralized models learn *who* to trust during training, so they can collaborate effectively at deployment, boosting accuracy and cutting communication costs.
Granular Mixture-of-Experts can now be efficient: AIR-MoE's two-stage routing slashes routing costs without sacrificing performance.
LLMs can generate GPU kernels, but they're surprisingly bad at it: 72% of fusion tasks fail across all methods, and nearly half of the "correct" kernels are actually slower than PyTorch.
Federated learning struggles when data quality varies across clients, but FedQual solves this with a novel approach that calibrates low-quality clients while preserving high-quality autonomy.
Incentivizing honest participation in federated learning is now possible without ground truth labels, even when some participants are trying to game the system.
Fine-tune optimizer precision block-by-block and slash memory use without sacrificing model quality.
Decomposing robot swarm state representations unlocks effective cooperation even with computationally limited agents.
Atomic swaps can now handle probabilistic exchanges like lotteries and randomized allocations, opening up new possibilities for trustless cross-chain interactions.
Choosing between secure multi-party computation (SMPC) and fully homomorphic encryption (FHE) for secure ML depends heavily on the model architecture: FHE excels at regressions and simple networks, while SMPC dominates for complex CNNs.
Lattice-based cryptography's reliance on injected noise for security is more akin to hiding secrets under a rug than truly erasing them, leaving them vulnerable to future quantum attacks.
Ethereum builder centralization isn't just about who has the best order flow, but also about how network effects let incumbents decouple from needing exclusive deals.
RFT's Achilles' heel? This benchmark reveals how fragile reinforcement fine-tuning is, and introduces an automated system to catch and fix training failures before they tank your LLM.
Proving semantic equivalence between LLVM IR and RISC-V code is now possible within a single framework, thanks to a new formal RISC-V semantics built on Interaction Trees.
Offloading communication to SmartNIC DPUs can speed up host-dominated workloads by 1.55x, but the lack of Direct Cache Access creates a massive DRAM bottleneck.
MARL-optimized collaboration between large and small models in LEO satellites slashes service delays by nearly a third.
Generative recommenders can slash latency by up to 38% simply by dynamically juggling GPU memory between embedding and KV caches, a feat current systems miss.
Implicit time integration on GPUs gets a 3x speed boost thanks to a novel algebraic coarsening method that avoids costly explicit remeshing.
Run billions of bitwise operations directly in your 3D NAND flash, error-free, using just standard instructions.
Exponent bits are the Achilles' heel of floating-point arithmetic, as corrupting them in RISC-V vector processors leads to the most severe silent data corruption.
Radically reduce power consumption in AI chips with a circuit-switched network-on-chip that carves out dedicated "lanes" for predictable communication flows.
RangeGuard lets you tolerate 64+ flipped bits in DNN memory using just 16 bits of parity, without sacrificing accuracy.
Save up to 2.79x on LLM serving costs by intelligently distributing models across a diverse fleet of cloud GPUs.
Forget trusted online policy enforcement points: this revocation-ready key management layer uses ciphertext key publication to enforce dynamic, multi-user authorization for releasing or using bulk-data decryption keys in blockchain-based IoT data sharing systems.
Get strong pointer integrity and confidentiality without metadata overhead: LIPPEN encrypts pointers in-place, turning every pointer into a cryptographically protected block.
Stochastic sampling from p-bit Ising models can slash the search effort of CDCL SAT solvers by over 80% on certain problem instances.
A new cryptographic system promises top-level security for IoT gadgets without sacrificing performance, a rare win for resource-constrained devices.
The transition to post-quantum cryptography isn't just about swapping algorithms; it demands a complete architectural rethink of networked systems, especially regarding key distribution and management.
Publicly available firmware for ASIC cryptocurrency miners is riddled with vulnerabilities, making the distribution mechanism itself a primary attack surface.
Achieve a 2.9x reduction in end-to-end latency in ROS 2 communication by trading off scalability for simplicity in cross-process object lifetime management.
Achieve a near order-of-magnitude reduction in tail timing error in mixed-criticality robotics by decoupling safety-critical control from user applications.
Simulating complex fluid dynamics with moving boundaries just got 20x faster thanks to a new GPU-optimized immersed boundary method.
ClusterLess slashes workflow completion times by up to 40% and nearly doubles deadline satisfaction in federated edge environments, outperforming existing methods.
AI training jobs can now shrug off network failures that used to halt progress, thanks to a new resilient networking stack deployed at OpenAI and Microsoft.
Serverless orchestration falls apart when you move it to space, but this paper proposes a new architecture to fix it.
Control heterogeneous physical neural networks, even wetware, with a single orchestration architecture, opening the door to seamless integration with edge-cloud workflows.
Get up to a 40% performance boost and 15% energy savings on scientific computing kernels by offloading OpenMP loops to AMD's AI Engines with minimal code changes.
Sub-logarithmic MPC protocols for super-linear problems are fundamentally limited: you can't cheat time complexity without paying a steep price in local computation.
Forget simplistic roofline models: these analytical models nail GPU performance prediction on Blackwell and CDNA3 with under 1.5% error.
Ditching the global MPI_COMM_WORLD communicator unlocks significant scalability gains for MPI applications on exascale systems.
Standard federated learning deployments can catastrophically fail with just 5-second latency or 50% packet loss, revealing a fundamental mismatch between FL's communication patterns and default TCP configurations.
Analyzing exascale performance bottlenecks just got hundreds of times faster, thanks to a new GPU-accelerated framework that pinpoints congestion and predicts optimization opportunities in scientific workloads.
Storage scarcity in edge caching doesn't just impact performance, it fundamentally shifts the economic landscape, amplifying inequality among content providers.
Forget running the full gauntlet: just 4-5 workloads from SPEC CPU2026 can accurately mirror the entire suite, slashing evaluation costs without sacrificing fidelity.
Forget assuming NaNs and single-bit flips are the main culprits in GPU silent data corruption; this study reveals they're surprisingly rare, demanding a rethink of fault modeling.
Formal reasoning about programmable memory hierarchies is now possible, thanks to a new ISA-level memory consistency model that tames the complexity of architectures like täkō.
Match the ONNX model's object detection results almost exactly while drastically reducing computational cost by implementing a binarized YOLOv3-tiny on a low-cost FPGA.
Guaranteeing software stability during remodularization doesn't require sacrificing performance; a multi-agent consensus protocol can match state-of-the-art optimizers while acting as a "circuit breaker" for strict stability constraints.
Slash sensor application development time from weeks to days by leveraging AI-assisted pattern reuse for intent-driven workflow design.
LLM serving can get a 34% boost in end-to-end SLO attainment by intelligently scheduling prefill and decode requests based on urgency and slack.
Rapidly prototype sensor-driven applications across diverse infrastructures without needing cross-domain expertise using AI-assisted, pattern-based workflow engineering.
RISC-V accelerators, originally for AI, can efficiently run scientific simulations, but only with the right parallelization strategy.
Quantum circuit optimization doesn't always improve distributed execution: sometimes, local optimization surprisingly beats global methods at minimizing communication costs.
Bayesian optimization can automatically tune Hyperledger Fabric configurations to achieve double-digit throughput improvements, but the impact of measurement noise on interpreting gains cannot be ignored.
Commodity GPU servers can achieve surprisingly high LLM inference throughput by cleverly orchestrating pipeline parallelism with KV cache offloading.
FedPLT achieves full-model accuracy in federated learning while training up to 82% fewer parameters per client, slashing communication costs and enabling participation from resource-constrained devices.
CAVs can now detect sensor anomalies in their measurements without relying on a central unit, even when tracking human-driven vehicles that aren't directly observable.
Hands-on experience with Raspberry Pi clusters and student-driven learning can effectively bridge the HPC skills gap in undergraduate engineering education.
HPC storage benchmarks hide a wealth of insights into filesystem-specific overheads and load imbalances, if you're willing to dig into the logs.
Agentic workflows can be sped up by 4.6x, not through faster LLMs, but by optimizing data flow and communication between components.
FedQueue tackles the Achilles' heel of federated learning on HPC clusters, unpredictable queue delays, by explicitly modeling and mitigating their impact, leading to significant speedups.
Random quantum circuits, a common proxy for real workloads, can mislead the design of distributed quantum computing compilers by distorting hypergraph partitioning performance.
Offloading geospatial data sampling to the edge slashes latency and bandwidth costs, achieving cloud-competitive accuracy with 80% less data.
Hierarchical power allocation in datacenters can achieve near-perfect satisfaction ratios, even with oversubscription, by using a novel three-phase QP/LP optimization policy.
Optimizing for runtime in multimodal training can be energy-inefficient, as data movement and overlap on Grace Hopper chips dominate energy consumption, not raw compute.
Untangling the chaotic web of microservice failures just got easier: a new model uses temporal graph neural networks to pinpoint faults by jointly learning how services evolve and interact.
Cut KV-cache transfer times by up to 32% with SplitZip, a new GPU-friendly lossless compressor that unlocks faster disaggregated LLM serving.
LLMs can now intelligently orchestrate multi-agent systems, learning to optimize both individual agent actions and inter-agent cooperation for distributed black-box problems.