Search papers, labs, and topics across Lattice.
100 papers published across 6 labs.
Achieve high-precision multi-robot SLAM with minimal data transmission by selectively compressing and transmitting keyframes and non-keyframes in a cloud-edge-robot architecture.
By intelligently suppressing boundary outliers before quantization, BS-KMQ slashes quantization error by 3x and boosts energy efficiency by 24x in in-memory computing.
Exploit the surprisingly stable, yet heterogeneous, sparsity patterns across attention heads to slash LLM attention latency by 2.88x without sacrificing quality.
Forget slow FP64: this work unlocks efficient double-precision matrix multiplication on modern GPUs by adapting the Ozaki-II scheme to run on faster FP8 hardware.
Forget retraining from scratch: incremental federated learning can keep your IoT intrusion detection models sharp against evolving threats, but the right update strategy is crucial for balancing accuracy and speed.
AI electricity demand won't necessarily explode as AI scales: whether it does hinges on whether sustained efficiency improvements outpace income-driven demand growth.
Forget ZKPs: this federated learning scheme uses "self-destructing" backdoors to verify aggregation integrity, achieving 1000x speedups over traditional crypto.
Guarantee runtime safety in complex cyber-physical systems with unbounded data domains using a refinement type system for parameterized streams, even though the general verification problem is undecidable.
Training embodied intelligence models just got 40x faster thanks to a thousand-GPU cloud platform and a suite of optimizations spanning data pipelines, model architecture, and infrastructure.
Quantum-Centric Supercomputers promise to break down the barriers between quantum and classical computing, enabling seamless hybrid algorithms and accelerating discovery across applications.
SMEs can slash carbon emissions by 37% and costs by 3.6% simply by using Aceso's carbon-aware microservice placement, even with regionally limited infrastructure.
Maximize your LLM's goodput without diving into its internals: a new black-box controller uses hill climbing on end-to-end measurements to optimize performance.
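The black-box hill-climbing idea above can be sketched in a few lines: a controller that sees only end-to-end goodput measurements and nudges a single serving knob. The batch-size knob and the toy concave goodput curve below are illustrative stand-ins, not the paper's actual controller.

```python
import random

def measure_goodput(batch_size: int) -> float:
    """Stand-in for an end-to-end goodput measurement (requests/s).
    A real controller would drive live traffic through the black-box
    LLM server; this toy curve simply peaks near batch_size = 16."""
    return -(batch_size - 16) ** 2 + 256.0

def hill_climb(knob: int, lo: int, hi: int, steps: int = 50) -> int:
    """Greedy hill climbing on one integer knob using only external
    measurements -- no access to the model's internals."""
    best, best_score = knob, measure_goodput(knob)
    for _ in range(steps):
        candidate = max(lo, min(hi, best + random.choice([-2, -1, 1, 2])))
        score = measure_goodput(candidate)
        if score > best_score:  # keep only moves that improve goodput
            best, best_score = candidate, score
    return best
```

Because the loop never accepts a worsening move, the returned setting is always at least as good as the starting point under the measured objective.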
By adaptively weighting neighbor information based on uncertainty, distributed multi-object tracking can achieve significantly better performance in mobile robot networks with heterogeneous localization quality.
Multi-robot systems can slash battery consumption by 15% and boost GPU utilization by 50% for large DNN inference by using a hybrid offline-online reinforcement learning strategy to dynamically schedule and distribute DNN module execution.
Accuracy leaderboards mislead: lightweight classical anomaly detectors surprisingly outperform deep methods when deployed under the throughput constraints of in-vehicle monitoring systems.
Secure multi-tenant LLM serving without sacrificing performance is now possible: CacheSolidarity selectively isolates prefixes, boosting cache reuse by up to 70% and cutting inference latency by 30% compared to blunt-force defenses.
Quantifying the overhead of post-quantum cryptography reveals exactly where the performance bottlenecks lie in real-world TLS 1.3 transactions.
Algorithm-hardware co-design could revolutionize medical technology, but realizing its potential requires a fundamental shift in how these systems are conceived, designed, validated, and translated into practice.
Trajectory optimization just got a whole lot faster and more energy-efficient: a GPU-native solver achieves 4x speedup and halves energy consumption compared to optimized CPU baselines.
Stop neural network model theft: bind your models to specific hardware using PUFs, rendering them useless on clones.
Uncovers hidden architectural inefficiencies in serverless platforms by modeling function interactions as topological flows and identifying persistent "harmonic modes" that resist local fixes.
A pipelined FPGA architecture slashes the power consumption of JPEG XS's Intra Pattern Copy displacement vector search, enabling practical hardware deployment for low-latency image compression.
A fully open-source speech understanding model, OSUM-Pangu, proves that competitive performance is achievable on non-CUDA hardware, challenging the dominance of GPU-centric ecosystems.
CD-Raft slashes distributed consensus latency by nearly 50% in cross-domain settings, offering a significant speedup for data-intensive AI workloads.
Secure coded caching, crucial for modern content delivery, often treats security as an afterthought, resulting in fragmented solutions that this review seeks to unify and improve.
AgentServe achieves up to 2.8x improvement in time-to-first-token and 2.7x in time-per-output-token for agentic workloads on a single GPU by strategically isolating prefills and decodes.
Uncover hidden network structure and simplify management by automatically classifying hosts into meaningful roles based on their connection patterns.
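As a rough sketch of classifying hosts into roles from their connection patterns, the snippet below derives fan-in/fan-out features from flow records and applies simple thresholds. The thresholds and role names are illustrative assumptions, not the paper's method.

```python
from collections import defaultdict

def classify_hosts(flows):
    """Toy role classifier over (src, dst, dst_port) flow records.
    Hosts accepting connections from many distinct peers look like
    servers; hosts fanning out to many peers look like clients."""
    inbound = defaultdict(set)   # host -> set of peers connecting in
    outbound = defaultdict(set)  # host -> set of peers connected to
    for src, dst, _port in flows:
        outbound[src].add(dst)
        inbound[dst].add(src)
    roles = {}
    for host in set(inbound) | set(outbound):
        fan_in, fan_out = len(inbound[host]), len(outbound[host])
        if fan_in >= 3 and fan_in > fan_out:
            roles[host] = "server"
        elif fan_out >= 3:
            roles[host] = "client"
        else:
            roles[host] = "peer"
    return roles
```

A real system would cluster richer features (port diversity, timing, byte counts) rather than hard-code thresholds, but the structure is the same: connection patterns in, role labels out.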
On-device LLM inference can be sped up by an order of magnitude with a flexible TrustZone-based system that selectively protects memory and the NPU.
On-device LLM inference with PIM is now more practical: PIM-SHERPA resolves memory inconsistencies, slashing memory capacity needs by ~50% without sacrificing performance.
Ditch the latency tax of traditional scheduling: this new approach delivers data "just-in-time" for safety-critical systems, boosting performance without sacrificing reliability.
By strategically increasing hash collisions, Nemo slashes write amplification in flash caches for tiny objects, a persistent bottleneck even with advanced SSDs.
A virtualized XRootD frontend can sustain over 50 Gb/s throughput in real-world large-scale WAN transfers, challenging assumptions about virtualization overhead in high-performance data systems.
FP64 tensor cores, previously untapped for large-scale scientific computing, now unlock 2x speedups and 83% energy savings in finite element simulations on NVIDIA's latest GPUs.
Achieve fine-grained access control in searchable encryption without re-encryption or excessive interaction, enabling practical multi-client deployments in dynamic clouds.
Stream 3D Gaussian Splatting scenes with higher visual quality and lower bandwidth by predicting user viewpoints and dynamically adapting bitrate using deep reinforcement learning.
Distributing SciML models with hardware and physics awareness slashes latency and energy consumption by over 8x and 33x, respectively, while paradoxically *improving* reconstruction fidelity.
By incorporating language guidance into federated learning, SurgFed tackles the long-standing problem of tissue and task heterogeneity in surgical video understanding, leading to improved segmentation and depth estimation across diverse surgical settings.
Forget waiting hours: this MORL framework achieves 270x speedups on robotics tasks thanks to GPU-native parallelization.
Nezha shatters I/O bottlenecks in distributed key-value stores by decoupling key-value persistence within Raft, yielding up to 4.6x throughput gains.
By recombining subgraphs from sparse models without retraining, "model stitching" creates a diverse set of model variants that significantly improves the efficiency of multi-DNN inference on edge SoCs.
Finally, analog joint source-channel coding can be deployed on standard digital transceivers, unlocking the potential of semantic communication on existing infrastructure.
TMFGs can now scale to millions of data points thanks to a-TMFG, which approximates the correlation matrix on-the-fly using kNN graphs and clever memory management.
Get up to 24x faster sine/cosine calculations on ESP32 microcontrollers by dynamically switching between fixed-point and floating-point precision.
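A minimal sketch of the precision-switching idea: a Q15 quarter-wave lookup table serves as the fast fixed-point path, with a fallback to the exact floating-point routine when precision is needed. The table size and switching flag are illustrative assumptions; the paper's actual ESP32 kernels and switching policy are not reproduced here.

```python
import math

# Q15 quarter-wave sine table (illustrative fast path).
TABLE_SIZE = 256
SIN_TABLE = [round(math.sin(i * math.pi / 2 / TABLE_SIZE) * 32767)
             for i in range(TABLE_SIZE + 1)]

def fast_sin(x: float, need_precision: bool = False) -> float:
    """Return sin(x), switching between a fixed-point table lookup
    (fast, ~0.006 max error) and the float path (slow, exact)."""
    if need_precision:
        return math.sin(x)          # high-precision floating-point path
    x = math.fmod(x, 2 * math.pi)   # reduce to [0, 2*pi)
    if x < 0:
        x += 2 * math.pi
    quadrant, frac = divmod(x, math.pi / 2)
    idx = int(frac / (math.pi / 2) * TABLE_SIZE)
    s = SIN_TABLE[idx] / 32767.0              # quadrant 0
    if int(quadrant) == 1:                    # sin(x) = sin(pi - x)
        s = SIN_TABLE[TABLE_SIZE - idx] / 32767.0
    elif int(quadrant) == 2:                  # sin(x) = -sin(x - pi)
        s = -SIN_TABLE[idx] / 32767.0
    elif int(quadrant) == 3:                  # sin(x) = -sin(2*pi - x)
        s = -SIN_TABLE[TABLE_SIZE - idx] / 32767.0
    return s
```

On a microcontroller the fixed-point path would use integer arithmetic throughout and avoid the FPU entirely; Python is used here only to show the structure of the switch.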
Forget slow, iterative distributed signal estimation: dMWF achieves optimal multichannel Wiener filtering in wireless acoustic sensor networks without iteration, even when nodes observe different sources.
IoT devices struggling with weak entropy can now get a cryptographic boost from a RISC-V trusted execution environment, turning entropy provisioning into a manageable service.
Achieve higher accuracy and faster convergence in split learning by intelligently pruning communication channels based on label awareness.
Forget shaving yaks – this new protocol slashes communication costs in distributed expert learning while *improving* regret bounds.
Achieve up to two orders of magnitude reduction in semantic communication rate by strategically incorporating common randomness in a privacy-preserving distributed computation framework.
LLMs can get a 27.8% boost in mathematical reasoning by fusing a hardware-efficient optimal control layer directly into their architecture, enabling planning before prediction.
Latency's impact on VR whiteboard collaboration isn't uniform: it disproportionately degrades specific QoE dimensions, varying significantly between structured design and free-form discussion.
Traditional time-based authorization schemes are dangerously slow in multi-agent systems: a new coherence strategy slashes unauthorized API calls by over 100x, offering a velocity-agnostic safety guarantee.
Multi-prototype-guided federated learning overcomes data heterogeneity in edge computing, boosting accuracy and reducing errors compared to single-prototype methods.
Noise in photonic quantum systems severely limits the performance of quantum machine learning algorithms, demanding robust noise mitigation strategies for practical implementations.
On-device fine-tuning of Transformers is now feasible on ultra-low-power, memory-constrained edge devices thanks to TrainDeeploy, which processes up to 11 training images per second on a RISC-V SoC.
K-means, previously relegated to offline processing, gets a 17.9x speed boost on modern GPUs thanks to Flash-KMeans' clever IO and contention optimizations.
By ditching Python for optimized C++/CUDA kernels, ImprovedGS+ slashes 3D Gaussian Splatting training time by 26.8% while using 13.3% fewer Gaussians and maintaining superior visual quality.
Achieve near-perfect privacy against clustering and inversion attacks in split learning without sacrificing model accuracy by using differential privacy and secret label obfuscation.
Caching and speculative transcoding can drastically reduce the computational burden of on-the-fly point cloud transcoding, enabling scalable streaming systems.
Squeezing 11x more performance from your datacenter GPUs is now possible for compound inference tasks, thanks to JigsawServe's adaptive model selection and fine-grained spatial partitioning.
Lockbox offers a practical blueprint for enterprises to adopt cloud-based AI processing on sensitive data without compromising security, by implementing a zero-trust architecture.
Aerospace maintenance gets a trust upgrade: BladeChain uses blockchain to ensure tamper-proof, auditable AI-driven engine blade inspections.
Uncovers hidden architectural inefficiencies in serverless platforms by applying Hodge decomposition to analyze inter-function information flow.
Euclidean distance isn't the best way to measure gradient staleness in asynchronous federated learning: alternative distance metrics can significantly improve convergence and stability.
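To make the metric choice above concrete, here is a toy comparison of Euclidean and cosine distance inside a staleness-based weighting rule. The down-weighting form w = 1 / (1 + alpha * d) and the example vectors are illustrative assumptions, not the paper's scheme.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two gradient vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: sensitive to direction, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def staleness_weight(stale_grad, fresh_grad, metric, alpha=1.0):
    """Down-weight a stale client gradient by its distance from the
    current global gradient; the metric is the interchangeable knob."""
    return 1.0 / (1.0 + alpha * metric(stale_grad, fresh_grad))
```

For a stale gradient that still points the right way but has drifted in magnitude (say [10, 0] versus a fresh [1, 0]), cosine distance leaves its weight at 1.0 while Euclidean distance crushes it to 0.1: exactly the kind of behavioral difference the metric choice controls.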
Decentralized z-anonymity is now practical: deZent achieves comparable performance to centralized approaches while minimizing reliance on a trusted central entity.
Blockchain's consensus protocols face critical security, scalability, and energy consumption challenges that demand further research despite their pivotal role in decentralized systems.
Slash blockchain bloat by an order of magnitude: AR-ACE ships compact attestations, not bulky validity proofs, through mempool and relay networks.
Imagine an embedded OS where the scheduler, allocator, DMA drivers, and all peripherals are fully untrusted—this paper shows how to build it.
Slash blockchain transaction sizes by an order of magnitude with ZK-ACE, which replaces bulky post-quantum signatures with succinct, identity-based zero-knowledge proofs.
FedPrism dynamically adapts to non-IID data in federated learning by decomposing client models into global, group, and private components, outperforming traditional aggregation methods.
MoE models, despite their training efficiency, can be structurally 4.5x slower than quality-matched dense models at inference due to memory fragmentation, especially in long-context scenarios.
Democratized LLM pre-training is now a reality: Covenant-72B proves you can train a competitive 72B model with untrusted peers over the internet, opening the door to broader participation and reduced costs.
Get 3.6x faster long-context LLM inference with LycheeCluster's hierarchical KV indexing, which avoids the semantic fragmentation of naive chunking.
FPGAs aren't just for SmartNICs anymore: SafarDB shows they can directly accelerate distributed transactions with 7-12x speedups by tightly integrating with the network.
LLMs hallucinate far more than you think in document Q&A, with fabrication rates tripling as context grows from 32K to 128K tokens, and model selection matters more than hyperparameter tuning or hardware.
FedLECC slashes communication overhead in federated learning by 50% while boosting accuracy by 12%, all by cleverly picking clients based on data similarity and loss.
Overcome memory bottlenecks in drone-based Synthetic Aperture Radar (SAR) imaging with a new online reconstruction method that processes data incrementally.
SVD-powered aggregation in FedMomentum lets LoRA modules in federated learning retain crucial training momentum, leading to faster convergence and better performance.
Federated differentially private data synthesis can now achieve utility comparable to centralized approaches, even with heterogeneous data distributions, thanks to a novel framework that smartly handles noise and redundancy.
Unlock cloud-scale AI for enterprises without sacrificing data privacy: SplitAgent dynamically sanitizes sensitive data based on task context, boosting accuracy and privacy compared to static methods.
Tree speculative decoding can achieve up to 2.46x speedup on Ascend NPUs, but only if you carefully manage the branch/commit cache and eliminate undefined negative indices.
Achieve global-optimal GEMM mapping for spatial accelerators orders of magnitude faster than existing methods by analytically modeling the mapping space geometrically.
A Shapley-incentivized blockchain boosts federated learning accuracy by 14% and thwarts 90% of malicious attacks in high-speed rail data sharing.
Lattice dares to launch a cryptocurrency designed from the ground up to be post-quantum secure, ditching classical signature fallbacks entirely.
Beat the LLM inference bottleneck: SageSched's uncertainty-aware scheduling boosts efficiency by nearly 30% by predicting output length and balancing compute and memory demands.
Quantum advantage in chemistry may be further off than we thought: a new GPU-accelerated iQCC implementation simulates 100-200 qubit systems, outperforming classical methods on industrially relevant ruthenium catalysts.
Get near-peak performance for your recommender system across GPUs and TPUs without tedious platform-specific tuning, thanks to a new cross-accelerator graph optimization framework.
LLMs waste 21.8% of their context window on structural inefficiencies, but a demand paging system can slash context consumption by up to 93% without sacrificing performance.
Achieve 3% accuracy gains and 20% delay reduction in split federated learning simply by jointly optimizing model partitioning and client assignments.
Turn energy-intensive crypto mining into a data compression service with Proof-of-Encryption-Work (PoEW), a novel consensus mechanism.
Bridge the trust gap in cloud-based LLM services with AFTUNE, a practical framework that lets you audit proprietary fine-tuning and inference without prohibitive overhead.
Cloud autoscaling can be more than just reactive: MAS-H2 shows how a hierarchical multi-agent system can proactively optimize resource allocation based on high-level business policies, slashing CPU stress by 50% and enabling zero-downtime migrations.
Forget CPU bottlenecks: a fully GPU-resident architecture verifies Goldbach's conjecture up to $10^{12}$ in under 40 seconds on a single RTX 5090.
By intelligently leveraging application data characteristics and machine learning, microarchitectural designs can overcome memory bottlenecks and achieve substantial performance and energy efficiency gains.
Automating multi-service deployments in edge-cloud environments doesn't have to be a headache: CODECO slashes manual effort while keeping performance competitive.
Slash overhead and boost resilience in massive dynamic networks with Structured Gossip DNS, a passively stabilizing system that cuts message complexity by an order of magnitude.
Today's high-performance interconnects are built on shaky semantic ground, potentially sacrificing concurrency for reliability through hidden serialization.
Diffusion models can now run with 3x better energy efficiency and 5.5x higher throughput thanks to a silicon photonics accelerator.
Training trillion-parameter Mixture-of-Experts models just got a whole lot faster: Megatron Core now achieves over 1 PFLOP/s per GPU on NVIDIA's latest hardware.
Squeeze 46% more LLM inference throughput from your many-core CPUs with ArcLight, a new architecture that overcomes the cross-NUMA memory access bottleneck.
MEV has evolved from simple miner extraction to a complex cross-chain phenomenon, and this SoK provides a unified framework to understand its past, present, and future.