100 papers published across 5 labs.
Distributed GPU training slashes the time needed to train deep learning models for CFD, making accurate fluid simulation predictions accessible in a fraction of the time.
Unlock HPC application malleability without the headache of process respawning thanks to this unified dynamic resource management API.
Federated learning accuracy jumps by up to 7% simply by using a multi-task autoencoder to identify and filter out noisy or uninformative samples on each client.
Finally, a formal model that treats humans as more than just external noise in distributed systems, opening the door to verifiable grassroots platforms.
LLMs can now intelligently orchestrate multi-agent systems, learning to optimize both individual agent actions and inter-agent cooperation for distributed black-box problems.
Shuffling data introduces a fundamental shift in the privacy-utility tradeoff for mean estimation, rendering locally differentially private (LDP) mechanisms suboptimal.
Forget chasing bigger GPUs – the future of AI inference could be literally baked into the hardware itself, unlocking 1000x gains in energy and speed.
By intelligently perturbing class prototypes based on their discriminative power, VPDR achieves a superior privacy-utility trade-off in federated learning compared to naive Gaussian noise.
Foundation model embeddings reveal hidden structure in federated datasets, enabling surprisingly effective client clustering without any training or communication overhead.
Managing thousands of LEO satellites just got easier: a novel graph learning approach slashes network management overhead while boosting forecasting accuracy.
Unlocking the energy-latency frontier reveals how much cheaper and greener AI inference could be if we strategically relocate computation based on latency tolerance.
Stop costly cross-chain NFT migrations before they start: a new feature-centric methodology predicts which NFT functionalities will break when moving between blockchains like Ethereum and Solana.
Volumetric videoconferencing doesn't have to freeze and stutter: ReVo recovers up to 32% of lost RGB data and slashes video freezes by 95% using a cross-layer approach.
Frustrated with clunky architecture simulators? Akita offers a breath of fresh air with its focus on developer experience, promising faster prototyping and experimentation.
NeuroRing achieves faster-than-real-time execution of a full-scale cortical microcircuit simulation on FPGAs, proving that scalable, energy-efficient SNN hardware is within reach.
Cerebras CS-3 can deliver 100x speedups over CPU for sparse matrix multiplication at 90% sparsity, but surprisingly, becomes *slower* than CPU beyond 99% sparsity.
Even approximately fair gift-giving is surprisingly hard in distributed systems: achieving any approximation for the Santa Claus problem requires $\Omega(\sqrt{n} + D)$ rounds.
Most MEV arbitrage opportunities on Polygon can be traced back to a single transaction, revealing surprising concentration in MEV creation across protocols.
Schedulers can boost throughput by 12% on chiplet-based systems simply by treating spatial locality as a first-class objective, even if it means sacrificing work-conservation.
Balancing processor utilization and Quality-of-Service in mixed-criticality systems just got easier with AnTi-MiCS and MulTi-MiCS, which automatically determine optimal low WCETs and improve QoS by up to 30%.
Order-execute blockchains can achieve 10x higher throughput in DeFi workloads by embedding flexible endorsement directly into the consensus mechanism, avoiding the high abort rates of execute-order-validate approaches.
HBM-PIM can achieve impressive matrix multiplication throughput (14.9 GFLOP/s) using a novel reduction-free outer-product dataflow, even without native reduction support.
Recovering type information from untyped GPU register files is the key to enabling effective binary analysis, unlocking reverse engineering and security analysis of proprietary GPU code.
Forget waiting – this new CIM architecture slashes LLM weight update latency by up to 87%, unlocking faster prefill and decoding.
Ternary LLMs can achieve impressive throughput and energy efficiency on edge devices, thanks to VitaLLM's co-designed hardware acceleration that overcomes workload imbalance and data dependency challenges.
Android's security-relevant IPC is now traceable on stock devices without app instrumentation, closing a critical visibility gap for security researchers and incident responders.
Juggling high-priority and low-priority ML inference requests on GPUs? Strait delivers up to 11% fewer missed deadlines for critical tasks.
Federated learning can overcome data silos, but struggles when clients have different label relationships; FedHarmony shows how to harmonize these differences, leading to better performance.
Ditch the encoder-decoder: LPWTNet's closed-form Laplacian pyramid decomposition offers efficient inference for statistical channel fingerprint construction in massive MIMO systems.
Adaptively weighting defenses in federated learning lets you robustly handle diverse attacks without needing the dataset on the server.
Achieve 100% agent recovery correctness with near-zero overhead by intelligently checkpointing only the OS state that actually matters.
LLM training bottlenecks? ZipCCL achieves up to 1.18x end-to-end speedups by losslessly compressing communication collectives, without sacrificing model quality.
LLMs trained with ScaleBox, a new high-fidelity code verification system, substantially outperform those trained with heuristic matching, suggesting current RLHF methods are bottlenecked by verification quality.
Real-time, GPU-accelerated Monte Carlo simulation makes probabilistic safety guarantees for Automatic Emergency Braking systems deployable, not just a validation afterthought.
Training LLMs on ultra-long contexts just got a whole lot easier: AutoSP automates sequence parallelism and activation checkpointing, boosting context length by up to 2.7x with negligible throughput cost.
Even with adversarial network changes and only local signals, surprisingly simple distributed algorithms can enable dynamic networks to self-organize and adapt to changing environmental goals.
Slash MoE serving costs by two-thirds with FaaSMoE, a serverless architecture that dynamically scales experts on demand.
Stop recomputing the same quantum circuits: a semantic cache slashes redundant simulations by up to 92% and speeds up real quantum hardware by 11x.
Holding your parallelism strategy fixed while tuning batch size (or vice versa) leaves performance on the table: COPUS adaptively co-tunes both for faster LLM training.
Asynchronous RL for LLMs doesn't have to sacrifice convergence for speed: DORA achieves 2-4x faster training by cleverly managing multiple policy versions during rollout.
Training complex multi-agent RL policies just got 3,500x faster thanks to a new engine that optimizes for memory access and data locality.
Dense matrix multiplication accelerators can surprisingly outperform dedicated sparse accelerators for sparse neural networks, offering better area and energy efficiency.
A holistic, industrial-grade V&V loop promises to accelerate and de-risk RISC-V chip design by integrating RTL validation, FPGA-based system-level testing, and continuous integration.
Emulating massive multi-core systems just got easier: EMiX lets you scale RISC-V emulation across multiple FPGAs without rewriting your RTL.
Forget grid search: LLMs can rapidly find energy-efficient inference parameters, outperforming traditional optimization methods with just a few human-guided prompts.
Replaying CI failures in embedded systems is now possible at scale: PhantomRun reconstructs over 90% of failing builds, opening the door to systematic debugging and failure analysis.
Stop sweating LLM migrations: this Bayesian framework lets you confidently swap models in production, even with limited human evals.
Ignoring the nuanced interplay between services and hosts in microservice architectures leaves nearly 50% of root causes undiscovered.
Overlapping validation and private-data acquisition of successive blocks with state-consistency checks and ledger updates can almost double Hyperledger Fabric's commit throughput.
Fine-tune massive LLMs like Qwen3-235B with 31K context on a single 8x RTX 4090 server, thanks to a novel pipeline schedule that eliminates the weight binding bottleneck.
Automated testing of dynamic resource management frameworks in HPC is now possible, catching faults earlier and simplifying maintenance.
Speculative decoding, typically used post-RL, can be integrated directly into RL training loops to accelerate LLM rollout generation by up to 2.5x.
Forget brute-force scaling: smarter tile and tensor mapping on 3D-stacked chips could unlock massive LLM inference gains.
Unlock 3x higher throughput in your data center by easily converting MPI applications to malleable jobs with a new library.
Edge LLM inference gets a serious speed boost: DUAL-BLADE's dual-path KV cache slashes latency by up to 42% and doubles SSD utilization.
Training a 1024-node SOM on a billion-sample dataset in just over 6 minutes shatters previous scalability limits, thanks to a novel framework that leverages multi-GPU execution, out-of-memory streaming, and flexible topologies.
MPI malleability can cut HPC workload times by over 25% in real-world conditions, but only if you account for parallel efficiency.
Squeeze more out of your hardware: TSP lets you shard both weights and activations across the same devices, unlocking memory savings for long-context training and inference.
Fine-tuning LLMs in federated settings just got easier: SplitFT lets clients adapt their cut layers and LoRA ranks, boosting performance and slashing communication costs.
Squeezing high-accuracy LLMs and VLMs onto client devices is now significantly more feasible, thanks to a new pipelined sharding technique that achieves up to 30x speedups and 10x VRAM reduction.
NVIDIA's closed-source driver secrets are out: researchers can now see the exact hardware commands triggered by CUDA code.
Tomorrow's 6G networks hinge on overcoming the design hurdles of mm-wave and sub-THz oscillators, and this review lays out the roadmap.
Naive RAPL-based energy monitoring can add nearly 50% overhead to your measurements, but optimized tools can keep it negligible.
Quantum Gatekeeper achieves near-perfect information hiding: without all four factors (password, shared secret, context string, and reference image signature), payload extraction fails silently, preventing even partial disclosure.
FlyClient, a lightweight blockchain verification protocol, gets closer to real-world deployment with a practical Zcash implementation and proof-size optimizations.
Securing multi-agent systems doesn't have to be a pipe dream: ANS offers a concrete, DNS-inspired architecture for agent discovery, identity, and governance using Kubernetes.
A modified Particle Swarm Optimization algorithm slashes computation offloading latency in vehicular networks, outperforming brute-force methods in dynamic, real-world scenarios.
Squeezing more out of 5G video calls is possible: StreamGuard boosts video conferencing quality by up to 70% by intelligently prioritizing different parts of the video stream.
Slash configuration drift by 42% and boost API propagation by 31% with this framework for governing APIs across AWS, Azure, and GCP.
Federated learning can achieve better accuracy-efficiency trade-offs under heterogeneous data by optimizing within a low-dimensional subspace and using a backfill-style update to retain residual components.
Patchwork learning gets a boost: GraphPL uses GNNs to flexibly integrate all observed modalities, achieving SOTA imputation performance even with noisy inputs.
SignSGD can beat Adam and even SGD with a few simple tweaks, proving that 1-bit quantization doesn't have to mean sacrificing accuracy.
Stop wasting bandwidth on irrelevant tokens: Fed-FSTQ uses Fisher information to selectively quantize and transmit only the most important tokens, slashing communication costs in federated LLM fine-tuning by up to 46x.
Compound AI systems can achieve nearly 4x throughput improvement and cut tail latency in half with a modular, autoscaling inference architecture.
Quantum-inspired attention networks can significantly improve task offloading performance in MEC networks, offering a practical path to more energy-efficient and sustainable edge computing.
Give undergrads supercomputer access, and they'll actually grok parallel computing.
Fresh masking between pipeline stages in NTT-based post-quantum crypto isn't just good practice, it's provably necessary to erase vulnerabilities arising from prior stages, as demonstrated with a machine-checked proof and a real-world hardware flaw.
Organizational coupling in microservices isn't just about architecture – it's heavily influenced by the "Connector" roles bridging organizational silos, suggesting targeted interventions are possible.
Automating system-level testing for distributed robotics is now more practical with a new language that handles complexity, non-determinism, and dynamic reconfiguration.
Prioritizing tiny objects on edge devices isn't just about detector accuracy; DenseScout shows that a lightweight, dense-response selector coupled with transport-aware runtime can drastically outperform traditional detectors under strict compute and latency budgets.
Federated LLM inference gets a speed boost: SpecFed's speculative decoding and compressed communication slash latency without sacrificing generation quality.
Shifting computing workloads to periods of high renewable energy availability slashes both carbon emissions and operational costs for HPC clusters.
Unlock significant speedups in depthwise convolutions (up to 3.26x) with optimized CUDA kernels, even in restricted cloud environments lacking hardware performance counters.
FusionCIM slashes LLM inference energy costs by nearly 4x while doubling processing speed, setting a new benchmark for efficiency in AI hardware.
Forget prefetching: DAK unlocks up to 3x faster LLM inference by enabling direct GPU access to remote memory, achieving near-optimal system bandwidth utilization.
Multi-agent LLM systems are leaving performance on the table by treating structured agent interactions as generic traffic; Pythia shows how to unlock substantial gains by exploiting workflow semantics at the serving layer.
Stop leaving 10-70% of your MoE kernel throughput on the table: RaMP dynamically optimizes kernel configuration based on runtime expert routing, achieving up to 1.41x end-to-end speedup in vLLM serving.
Exclusive scan algorithms, often overlooked, get a speed boost with two new approaches that minimize communication overhead in parallel message-passing systems.
Mobile LLM inference just got a whole lot faster: AHASD achieves up to 4.2x throughput and 5.6x energy efficiency gains by intelligently decoupling and managing drafting and verification tasks on a PIM-NPU architecture.
On-device cardiac monitoring is now feasible on ultra-low-power wearables, achieving 98% accuracy at just 8.55mW.
FTQC multiprogramming is not just about qubit partitioning; it's a complex puzzle of structured floorplans, resource contention, and dynamic magic-state generation, and this work provides a framework to solve it.
LUT-based hardware architectures can achieve up to 2.2x area reduction for LLM inference by challenging conventional design assumptions and optimizing for activation data types.
Forget GPUs – NVLLM's 3D NAND-centric design slashes LLM inference latency by up to 37.9x on edge devices, making on-device LLMs a real possibility.
RecFlash slashes recommendation inference latency by up to 81% and energy consumption by nearly 92% through smart data remapping in NAND flash memory.
Forget GPU-centric designs: AMMA slashes attention latency by 15x and energy consumption by 7x with a memory-centric architecture for long-context LLMs.
Current adaptive microservice management systems only scratch the surface of real-world production dynamics, and their purported gains may be overstated.
TetrisG-SDK achieves up to 1.3x faster convolutional layer processing while slashing energy consumption by over 70% in some cases.
CacheFlow slashes LLM serving latency by up to 62% by rethinking KV cache restoration as a 3D-parallel scheduling problem, not just a recompute vs. I/O tradeoff.
Finally, a practical biometric authentication system offers provable security against large-scale data breaches without sacrificing scalability or requiring auxiliary identifiers.