100 papers published across 4 labs.
Distributed GPU training slashes the time needed to train deep learning models for CFD, making accurate fluid simulation predictions far more accessible.
Unlock HPC application malleability without the headache of process respawning thanks to this unified dynamic resource management API.
Slash sensor application development time from weeks to days by leveraging AI-assisted pattern reuse for intent-driven workflow design.
LLM serving can get a 34% boost in end-to-end SLO attainment by intelligently scheduling prefill and decode requests based on urgency and slack.
Rapidly prototype sensor-driven applications across diverse infrastructures without needing cross-domain expertise using AI-assisted, pattern-based workflow engineering.
RISC-V accelerators, originally for AI, can efficiently run scientific simulations, but only with the right parallelization strategy.
Quantum circuit optimization doesn't always improve distributed execution: sometimes, local optimization surprisingly beats global methods at minimizing communication costs.
Bayesian optimization can automatically tune Hyperledger Fabric configurations to achieve double-digit throughput improvements, but measurement noise must be accounted for when interpreting the gains.
Commodity GPU servers can achieve surprisingly high LLM inference throughput by cleverly orchestrating pipeline parallelism with KV cache offloading.
FedPLT achieves full-model accuracy in federated learning while training up to 82% fewer parameters per client, slashing communication costs and enabling participation from resource-constrained devices.
CAVs can now detect sensor anomalies in their measurements without relying on a central unit, even when tracking human-driven vehicles that aren't directly observable.
Hands-on experience with Raspberry Pi clusters and student-driven learning can effectively bridge the HPC skills gap in undergraduate engineering education.
HPC storage benchmarks hide a wealth of insights into filesystem-specific overheads and load imbalances, if you're willing to dig into the logs.
Agentic workflows can be sped up by 4.6x, not through faster LLMs, but by optimizing data flow and communication between components.
FedQueue tackles the Achilles' heel of federated learning on HPC clusters, unpredictable queue delays, by explicitly modeling and mitigating their impact, leading to significant speedups.
Random quantum circuits, a common proxy for real workloads, can mislead the design of distributed quantum computing compilers by distorting hypergraph partitioning performance.
Offloading geospatial data sampling to the edge slashes latency and bandwidth costs, achieving cloud-competitive accuracy with 80% less data.
Hierarchical power allocation in datacenters can achieve near-perfect satisfaction ratios, even with oversubscription, by using a novel three-phase QP/LP optimization policy.
Optimizing for runtime in multimodal training can be energy-inefficient, because data movement and overlap, not raw compute, dominate energy consumption on Grace Hopper chips.
Untangling the chaotic web of microservice failures just got easier: a new model uses temporal graph neural networks to pinpoint faults by jointly learning how services evolve and interact.
Cut KV-cache transfer times by up to 32% with SplitZip, a new GPU-friendly lossless compressor that unlocks faster disaggregated LLM serving.
LLMs can now intelligently orchestrate multi-agent systems, learning to optimize both individual agent actions and inter-agent cooperation for distributed black-box problems.
Shuffling data introduces a fundamental shift in the privacy-utility tradeoff for mean estimation, rendering locally differentially private (LDP) mechanisms suboptimal.
Forget chasing bigger GPUs – the future of AI inference could be literally baked into the hardware itself, unlocking 1000x gains in energy and speed.
By intelligently perturbing class prototypes based on their discriminative power, VPDR achieves a superior privacy-utility trade-off in federated learning compared to naive Gaussian noise.
Foundation model embeddings reveal hidden structure in federated datasets, enabling surprisingly effective client clustering without any training or communication overhead.
Managing thousands of LEO satellites just got easier: a novel graph learning approach slashes network management overhead while boosting forecasting accuracy.
Unlocking the energy-latency frontier reveals how much cheaper and greener AI inference could be if we strategically relocate computation based on latency tolerance.
Stop costly cross-chain NFT migrations before they start: a new feature-centric methodology predicts which NFT functionalities will break when moving between blockchains like Ethereum and Solana.
Volumetric videoconferencing doesn't have to freeze and stutter: ReVo recovers up to 32% of lost RGB data and slashes video freezes by 95% using a cross-layer approach.
Frustrated with clunky architecture simulators? Akita offers a breath of fresh air with its focus on developer experience, promising faster prototyping and experimentation.
NeuroRing achieves faster-than-real-time execution of a full-scale cortical microcircuit simulation on FPGAs, proving that scalable, energy-efficient SNN hardware is within reach.
Cerebras CS-3 can deliver 100x speedups over CPU for sparse matrix multiplication at 90% sparsity, but surprisingly, becomes *slower* than CPU beyond 99% sparsity.
Even approximately fair gift-giving is surprisingly hard in distributed systems: achieving any approximation for the Santa Claus problem requires $\Omega(\sqrt{n} + D)$ rounds.
Most MEV arbitrage opportunities on Polygon can be traced back to a single transaction, revealing surprising concentration in MEV creation across protocols.
Schedulers can boost throughput by 12% on chiplet-based systems simply by treating spatial locality as a first-class objective, even if it means sacrificing work-conservation.
Balancing processor utilization and Quality-of-Service in mixed-criticality systems just got easier with AnTi-MiCS and MulTi-MiCS, which automatically determine optimal low WCETs and improve QoS by up to 30%.
Order-execute blockchains can achieve 10x higher throughput in DeFi workloads by embedding flexible endorsement directly into the consensus mechanism, avoiding the high abort rates of execute-order-validate approaches.
HBM-PIM can achieve impressive matrix multiplication throughput (14.9 GFLOP/s) using a novel reduction-free outer-product dataflow, even without native reduction support.
Recovering type information from untyped GPU register files is the key to enabling effective binary analysis, unlocking reverse engineering and security analysis of proprietary GPU code.
Forget waiting – this new CIM architecture slashes LLM weight update latency by up to 87%, unlocking faster prefill and decoding.
Ternary LLMs can achieve impressive throughput and energy efficiency on edge devices, thanks to VitaLLM's co-designed hardware acceleration that overcomes workload imbalance and data dependency challenges.
Android's security-relevant IPC is now traceable on stock devices without app instrumentation, closing a critical visibility gap for security researchers and incident responders.
Juggling high-priority and low-priority ML inference requests on GPUs? Strait delivers up to 11% fewer missed deadlines for critical tasks.
Federated learning can overcome data silos, but struggles when clients have different label relationships; FedHarmony shows how to harmonize these differences, leading to better performance.
Ditch the encoder-decoder: LPWTNet's closed-form Laplacian pyramid decomposition offers efficient inference for statistical channel fingerprint construction in massive MIMO systems.
Adaptively weighting defenses in federated learning lets you robustly handle diverse attacks without needing the dataset on the server.
Achieve 100% agent recovery correctness with near-zero overhead by intelligently checkpointing only the OS state that actually matters.
LLM training bottlenecks? ZipCCL achieves up to 1.18x end-to-end speedups by losslessly compressing communication collectives, without sacrificing model quality.
LLMs trained with ScaleBox, a new high-fidelity code verification system, substantially outperform those trained with heuristic matching, suggesting current RLHF methods are bottlenecked by verification quality.
Real-time, GPU-accelerated Monte Carlo simulation makes probabilistic safety guarantees for Automatic Emergency Braking systems deployable, not just a validation afterthought.
Training LLMs on ultra-long contexts just got a whole lot easier: AutoSP automates sequence parallelism and activation checkpointing, boosting context length by up to 2.7x with negligible throughput cost.
Even with adversarial network changes and only local signals, surprisingly simple distributed algorithms can enable dynamic networks to self-organize and adapt to changing environmental goals.
Slash MoE serving costs by two-thirds with FaaSMoE, a serverless architecture that dynamically scales experts on demand.
Stop recomputing the same quantum circuits: a semantic cache slashes redundant simulations by up to 92% and speeds up real quantum hardware by 11x.
Fixing your parallelism strategy while tuning batch size (or vice versa) leaves performance on the table: COPUS adaptively co-tunes both for faster LLM training.
Asynchronous RL for LLMs doesn't have to sacrifice convergence for speed: DORA achieves 2-4x faster training by cleverly managing multiple policy versions during rollout.
Training complex multi-agent RL policies just got 3,500x faster thanks to a new engine that optimizes for memory access and data locality.
Dense matrix multiplication accelerators can surprisingly outperform dedicated sparse accelerators for sparse neural networks, offering better area and energy efficiency.
A holistic, industrial-grade V&V loop promises to accelerate and de-risk RISC-V chip design by integrating RTL validation, FPGA-based system-level testing, and continuous integration.
Emulating massive multi-core systems just got easier: EMiX lets you scale RISC-V emulation across multiple FPGAs without rewriting your RTL.
Forget grid search: LLMs can rapidly find energy-efficient inference parameters, outperforming traditional optimization methods with just a few human-guided prompts.
Replaying CI failures in embedded systems is now possible at scale: PhantomRun reconstructs over 90% of failing builds, opening the door to systematic debugging and failure analysis.
Stop sweating LLM migrations: this Bayesian framework lets you confidently swap models in production, even with limited human evals.
Ignoring the nuanced interplay between services and hosts in microservice architectures leaves nearly 50% of root causes undiscovered.
Overlapping validation and private-data acquisition of successive blocks with state-consistency checks and ledger updates can almost double Hyperledger Fabric's commit throughput.
Fine-tune massive LLMs like Qwen3-235B with 31K context on a single 8x RTX 4090 server, thanks to a novel pipeline schedule that eliminates the weight binding bottleneck.
Automated testing of dynamic resource management frameworks in HPC is now possible, catching faults earlier and simplifying maintenance.
Speculative decoding, typically used post-RL, can be integrated directly into RL training loops to accelerate LLM rollout generation by up to 2.5x.
Forget brute-force scaling: smarter tile and tensor mapping on 3D-stacked chips could unlock massive LLM inference gains.
Unlock 3x higher throughput in your data center by easily converting MPI applications to malleable jobs with a new library.
Edge LLM inference gets a serious speed boost: DUAL-BLADE's dual-path KV cache slashes latency by up to 42% and doubles SSD utilization.
Training a 1024-node SOM on a billion-sample dataset in just over 6 minutes shatters previous scalability limits, thanks to a novel framework that leverages multi-GPU execution, out-of-memory streaming, and flexible topologies.
MPI malleability can cut HPC workload times by over 25% in real-world conditions, but only if you account for parallel efficiency.
Squeeze more out of your hardware: TSP lets you shard both weights and activations across the same devices, unlocking memory savings for long-context training and inference.
Fine-tuning LLMs in federated settings just got easier: SplitFT lets clients adapt their cut layers and LoRA ranks, boosting performance and slashing communication costs.
Squeezing high-accuracy LLMs and VLMs onto client devices is now significantly more feasible, thanks to a new pipelined sharding technique that achieves up to 30x speedups and 10x VRAM reduction.
NVIDIA's closed-source driver secrets are out: researchers can now see the exact hardware commands triggered by CUDA code.
Tomorrow's 6G networks hinge on overcoming the design hurdles of mm-wave and sub-THz oscillators, and this review lays out the roadmap.
Naive RAPL-based energy monitoring can add nearly 50% overhead to your measurements, but optimized tools can keep it negligible.
Quantum Gatekeeper achieves near-perfect information hiding: without all four factors (password, shared secret, context string, and reference image signature), payload extraction fails silently, preventing even partial disclosure.
FlyClient, a lightweight blockchain verification protocol, gets closer to real-world deployment with a practical Zcash implementation and proof-size optimizations.
Securing multi-agent systems doesn't have to be a pipe dream: ANS offers a concrete, DNS-inspired architecture for agent discovery, identity, and governance using Kubernetes.
A modified Particle Swarm Optimization algorithm slashes computation offloading latency in vehicular networks, outperforming brute-force methods in dynamic, real-world scenarios.
Squeezing more out of 5G video calls is possible: StreamGuard boosts video conferencing quality by up to 70% by intelligently prioritizing different parts of the video stream.
Federated learning can achieve better accuracy-efficiency trade-offs under heterogeneous data by optimizing within a low-dimensional subspace and using a backfill-style update to retain residual components.
Patchwork learning gets a boost: GraphPL uses GNNs to flexibly integrate all observed modalities, achieving SOTA imputation performance even with noisy inputs.
SignSGD can beat Adam and even SGD with a few simple tweaks, proving that 1-bit quantization doesn't have to mean sacrificing accuracy.
Stop wasting bandwidth on irrelevant tokens: Fed-FSTQ uses Fisher information to selectively quantize and transmit only the most important tokens, slashing communication costs in federated LLM fine-tuning by up to 46x.
Compound AI systems can achieve nearly 4x throughput improvement and cut tail latency in half with a modular, autoscaling inference architecture.
Quantum-inspired attention networks can significantly improve task offloading performance in MEC networks, offering a practical path to more energy-efficient and sustainable edge computing.
Give undergrads supercomputer access, and they'll actually grok parallel computing.
Fresh masking between pipeline stages in NTT-based post-quantum crypto isn't just good practice, it's provably necessary to erase vulnerabilities arising from prior stages, as demonstrated with a machine-checked proof and a real-world hardware flaw.
Organizational coupling in microservices isn't just about architecture – it's heavily influenced by the "Connector" roles bridging organizational silos, suggesting targeted interventions are possible.
Automating system-level testing for distributed robotics is now more practical with a new language that handles complexity, non-determinism, and dynamic reconfiguration.
Prioritizing tiny objects on edge devices isn't just about detector accuracy; DenseScout shows that a lightweight, dense-response selector coupled with transport-aware runtime can drastically outperform traditional detectors under strict compute and latency budgets.
Federated LLM inference gets a speed boost: SpecFed's speculative decoding and compressed communication slash latency without sacrificing generation quality.
Shifting computing workloads to periods of high renewable energy availability slashes both carbon emissions and operational costs for HPC clusters.
Unlock significant speedups in depthwise convolutions (up to 3.26x) with optimized CUDA kernels, even in restricted cloud environments lacking hardware performance counters.
FusionCIM slashes LLM inference energy costs by nearly 4x while doubling processing speed, setting a new benchmark for efficiency in AI hardware.