Distributed training, model parallelism, AI accelerator design, and large-scale compute infrastructure.
Training MoE models just got a whole lot faster: Piper achieves up to 3.5x higher MFU by intelligently scheduling pipeline parallelism and optimizing communication.
Stop training in isolation: LNTrust lets decentralized models learn *who* to trust during training, so they can collaborate effectively at deployment, boosting accuracy and cutting communication costs.
Granular Mixture-of-Experts can now be efficient: AIR-MoE's two-stage routing slashes routing costs without sacrificing performance.
LLMs can generate GPU kernels, but they're surprisingly bad at it: 72% of fusion tasks fail across all methods, and nearly half of the "correct" kernels are actually slower than PyTorch.
Federated learning struggles when data quality varies across clients, but FedQual solves this with a novel approach that calibrates low-quality clients while preserving high-quality autonomy.
Incentivizing honest participation in federated learning is now possible without ground truth labels, even when some participants are trying to game the system.
Fine-tune optimizer precision block-by-block and slash memory use without sacrificing model quality.
Decomposing robot swarm state representations unlocks effective cooperation even with computationally limited agents.
Forget heuristics: this queueing theory framework precisely predicts LLM inference stability under KV cache constraints, letting you right-size your GPU cluster.
Atomic swaps can now handle probabilistic exchanges like lotteries and randomized allocations, opening up new possibilities for trustless cross-chain interactions.
Choosing between secure multi-party computation (SMPC) and fully homomorphic encryption (FHE) for secure ML depends heavily on the model architecture: FHE excels at regressions and simple networks, while SMPC dominates for complex CNNs.
Lattice-based cryptography's reliance on injected noise for security is more akin to hiding secrets under a rug than truly erasing them, leaving them vulnerable to future quantum attacks.
Ethereum builder centralization isn't just about who has the best order flow, but also about how network effects let incumbents decouple from needing exclusive deals.
RFT's Achilles' heel? This benchmark reveals how fragile reinforcement fine-tuning is, and introduces an automated system to catch and fix training failures before they tank your LLM.
Proving semantic equivalence between LLVM IR and RISC-V code is now possible within a single framework, thanks to a new formal RISC-V semantics built on Interaction Trees.
Offloading communication to SmartNIC DPUs can speed up host-dominated workloads by 1.55x, but the lack of Direct Cache Access creates a massive DRAM bottleneck.
MARL-optimized collaboration between large and small models in LEO satellites slashes service delays by nearly a third.
Generative recommenders can slash latency by up to 38% simply by dynamically juggling GPU memory between embedding and KV caches, a feat current systems miss.
Implicit time integration on GPUs gets a 3x speed boost thanks to a novel algebraic coarsening method that avoids costly explicit remeshing.
Run billions of bitwise operations directly in your 3D NAND flash, error-free, using just standard instructions.
Exponent bits are the Achilles' heel of floating-point arithmetic, as corrupting them in RISC-V vector processors leads to the most severe silent data corruption.
Radically reduce power consumption in AI chips with a circuit-switched network-on-chip that carves out dedicated "lanes" for predictable communication flows.
RangeGuard lets you tolerate 64+ flipped bits in DNN memory using just 16 bits of parity, without sacrificing accuracy.
Cut LLM serving costs by up to 2.79x by intelligently distributing models across a diverse fleet of cloud GPUs.
Forget trusted online policy enforcement points: this revocation-ready key management layer uses ciphertext key publication to enforce dynamic, multi-user authorization for releasing or using bulk-data decryption keys in blockchain-based IoT data sharing systems.
Get strong pointer integrity and confidentiality without metadata overhead: LIPPEN encrypts pointers in-place, turning every pointer into a cryptographically protected block.
Stochastic sampling from p-bit Ising models can slash the search effort of CDCL SAT solvers by over 80% on certain problem instances.
A new cryptographic system promises top-level security for IoT gadgets without sacrificing performance, a rare win for resource-constrained devices.
The transition to post-quantum cryptography isn't just about swapping algorithms; it demands a complete architectural rethink of networked systems, especially regarding key distribution and management.
Rowhammer attacks aren't just for CPUs anymore: a malicious CUDA kernel can now leverage targeted bit flips to achieve root access on a system, even bypassing IOMMU protections.
Publicly available firmware for ASIC cryptocurrency miners is riddled with vulnerabilities, making the distribution mechanism itself a primary attack surface.
Achieve a 2.9x reduction in end-to-end latency in ROS 2 communication by trading off scalability for simplicity in cross-process object lifetime management.
Achieve a near order-of-magnitude reduction in tail timing error in mixed-criticality robotics by decoupling safety-critical control from user applications.
Simulating complex fluid dynamics with moving boundaries just got 20x faster thanks to a new GPU-optimized immersed boundary method.
ClusterLess slashes workflow completion times by up to 40% and nearly doubles deadline satisfaction in federated edge environments, outperforming existing methods.
AI training jobs can now shrug off network failures that used to halt progress, thanks to a new resilient networking stack deployed at OpenAI and Microsoft.
Serverless orchestration falls apart when you move it to space, but this paper proposes a new architecture to fix it.
Control heterogeneous physical neural networks—even wetware—with a single orchestration architecture, opening the door to seamless integration with edge-cloud workflows.
Get up to a 40% performance boost and 15% energy savings on scientific computing kernels by offloading OpenMP loops to AMD's AI Engines with minimal code changes.
Sub-logarithmic MPC protocols for super-linear problems are fundamentally limited: you can't cheat time complexity without paying a steep price in local computation.
Forget simplistic roofline models: these analytical models nail GPU performance prediction on Blackwell and CDNA3 with under 1.5% error.
Ditching the global MPI_COMM_WORLD communicator unlocks significant scalability gains for MPI applications on exascale systems.
Standard federated learning deployments can catastrophically fail with just 5-second latency or 50% packet loss, revealing a fundamental mismatch between FL's communication patterns and default TCP configurations.
Analyzing exascale performance bottlenecks just got hundreds of times faster, thanks to a new GPU-accelerated framework that pinpoints congestion and predicts optimization opportunities in scientific workloads.
Storage scarcity in edge caching doesn't just impact performance, it fundamentally shifts the economic landscape, amplifying inequality among content providers.
Forget running the full gauntlet: just 4-5 workloads from SPEC CPU2026 can accurately mirror the entire suite, slashing evaluation costs without sacrificing fidelity.
Forget assuming NaNs and single-bit flips are the main culprits in GPU silent data corruption; this study reveals they're surprisingly rare, demanding a rethink of fault modeling.
Formal reasoning about programmable memory hierarchies is now possible, thanks to a new ISA-level memory consistency model that tames the complexity of architectures like täkō.
Match the ONNX model's object detection results almost exactly while drastically reducing computational cost by implementing a binarized YOLOv3-tiny on a low-cost FPGA.
Guaranteeing software stability during remodularization doesn't require sacrificing performance; a multi-agent consensus protocol can match state-of-the-art optimizers while acting as a "circuit breaker" for strict stability constraints.