53 papers published across 2 labs.
Cut your 3D-QA model's token budget by 91% and latency by 86% with a new pruning method that intelligently balances semantic importance and geometric coverage.
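A minimal sketch of the kind of pruning this teaser describes — greedily keeping tokens that score high on a weighted mix of semantic importance and geometric (farthest-point) coverage. The scores, the `lam` trade-off weight, and the greedy loop are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, keep = 200, 16, 18                     # keep 18 of 200 tokens (~91% pruned)

feats = rng.standard_normal((n, d))          # token features (unused here, for context)
xyz = rng.uniform(0.0, 1.0, (n, 3))          # 3D position of each token
importance = rng.uniform(size=n)             # stand-in for attention-derived scores

lam = 0.5                                    # semantic vs. coverage trade-off
selected = [int(importance.argmax())]
min_dist = np.linalg.norm(xyz - xyz[selected[0]], axis=1)

while len(selected) < keep:
    # Greedy: favour important tokens that are far from everything kept so far.
    score = lam * importance + (1 - lam) * min_dist / min_dist.max()
    score[selected] = -np.inf                # never re-pick a kept token
    nxt = int(score.argmax())
    selected.append(nxt)
    min_dist = np.minimum(min_dist, np.linalg.norm(xyz - xyz[nxt], axis=1))

assert len(set(selected)) == keep
```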
Tucker Attention squeezes an order of magnitude more parameter efficiency out of attention layers, while unifying and simplifying Group Query Attention, Multi-Head Latent Attention, and standard Multi-Head Attention.
Achieve superior compression of wind turbine images without sacrificing defect detection accuracy by using a segmentation-guided, dual lossy/lossless compression scheme.
LLMs' skewed matrix shapes need not hamstring systolic array performance: SISA's partitioned architecture achieves up to 8.52x speedup and 93% EDP reduction compared to monolithic arrays.
Radically simpler train loading plans are now possible by implicitly modeling rehandle costs, slashing the complexity of optimization problems.
Run multiple LoRA-tuned GenAI models on your phone without blowing up storage or latency: just swap weights at runtime.
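The trick here is that a LoRA adapter is just a low-rank delta, so swapping tasks means unmerging one delta and merging another into a single shared base weight. A toy sketch (sizes, adapter names, and the `swap_adapter` helper are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                             # hidden size, LoRA rank (r << d)

W_base = rng.standard_normal((d, d))     # shared base weight, stored once

# Two task adapters: each costs only d*r + r*d extra parameters on disk.
adapters = {
    "summarize": (0.01 * rng.standard_normal((d, r)),
                  0.01 * rng.standard_normal((r, d))),
    "translate": (0.01 * rng.standard_normal((d, r)),
                  0.01 * rng.standard_normal((r, d))),
}

W = W_base.copy()
active = None

def swap_adapter(name):
    """Unmerge the current low-rank delta and merge the requested one."""
    global active
    if active is not None:
        A, B = adapters[active]
        W[...] -= A @ B                  # undo the previous adapter in place
    A, B = adapters[name]
    W[...] += A @ B                      # apply the new adapter in place
    active = name

swap_adapter("summarize")
swap_adapter("translate")                # only one delta is ever merged at a time

A, B = adapters["translate"]
assert np.allclose(W, W_base + A @ B)
```

Storage stays at one full weight matrix plus a few rank-r factor pairs, and a swap costs one small matmul rather than a model reload.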
Forget ensembles and retraining: estimate LLM uncertainty with just a single forward-backward pass by assuming parameter covariance isotropy.
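Under an isotropic parameter covariance Σ = σ²I, the delta-method predictive variance collapses to σ²‖∇θf(x)‖², i.e. one backward pass per input. A toy linear-model sketch (the model, `sigma2`, and both inputs are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "model": scalar output f(x) = w . x (a stand-in for an LLM logit).
w = rng.standard_normal(8)

def forward(x):
    return w @ x

def backward(x):
    # d f / d w = x for this linear model; a real network would get this
    # vector from a single backward pass.
    return x

# Isotropy assumption: cov(w) = sigma2 * I, so the predictive variance is
# just sigma2 * ||grad||^2 -- no ensemble, no retraining.
sigma2 = 0.1

def predictive_variance(x):
    g = backward(x)
    return sigma2 * float(g @ g)

x_easy = 0.1 * np.ones(8)   # small gradient -> low uncertainty
x_hard = 2.0 * np.ones(8)   # large gradient -> high uncertainty
assert predictive_variance(x_hard) > predictive_variance(x_easy)
```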
You can shrink a spacecraft anomaly detection model by 97% and still catch almost all the problems.
You can shrink a privacy expert LLM by 4500x and still get human-level privacy judgments.
Commodity CPUs can be retrofitted with hardware-backed control flow attestation using hardware performance counters, enabling runtime attack detection in TEEs.
Formalizing speculative execution vulnerabilities with compositional semantics allows for automated detection and verification, moving beyond ad-hoc countermeasures.
LLM agents actually perform *better* when you strip away the majority of the boilerplate in their skill descriptions, suggesting current context windows are overloaded with irrelevant information.
Run code LLMs 10x faster and with 6x less memory on your laptop: Ditto compiles them into lean, mean, local executables.
Video Transformers can achieve near-full attention accuracy with significantly less compute by focusing only on informative vertical vectors.
LLMs can maintain conversational stability and improve retrieval accuracy in long-running interactions by adaptively compressing context, leading to reduced token usage and faster inference.
A novel data-dependency-free palette unlocks high-throughput, low-resource mezzanine coding, outperforming JPEG-XS while cutting LUT usage in half.
Semantic scene understanding can keep your robot from crashing when running LLMs on edge devices.
Achieve HPC acceleration by emulating FP64 operations with INT8 precision on GPUs, proving that you can boost performance *and* accuracy.
Turns out, almost half of AI assistant queries in software development are unnecessary, suggesting we're over-relying on these tools for tasks better suited to simpler solutions.
Scanning every token to focus attention is now passé: HISA prunes irrelevant context blocks *before* token-level scoring, slashing compute without sacrificing selection fidelity.
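A minimal sketch of block-then-token selection of this flavor: score the query against mean-pooled block summaries, keep only the top blocks, and run token-level attention inside the survivors. Block size, pooling, and the top-4 cutoff are illustrative assumptions, not HISA's actual design.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_blocks, block = 32, 16, 64
K = rng.standard_normal((n_blocks * block, d))   # keys for the full context
V = rng.standard_normal((n_blocks * block, d))   # values
q = rng.standard_normal(d)                       # one query

# Stage 1: coarse scoring against mean-pooled block summaries.
K_blocks = K.reshape(n_blocks, block, d).mean(axis=1)
block_scores = K_blocks @ q
keep = np.argsort(block_scores)[-4:]             # only the top-4 blocks survive

# Stage 2: token-level softmax attention inside surviving blocks only.
idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
scores = K[idx] @ q / np.sqrt(d)
w = np.exp(scores - scores.max())
w /= w.sum()
out = w @ V[idx]

assert out.shape == (d,)
```

Token-level scoring now touches 4 × 64 = 256 tokens instead of all 1024, which is where the compute savings come from.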
Generate or edit 1024x1024 images on your phone in under a second with DreamLite, a unified diffusion model that rivals server-side performance despite its tiny 0.39B parameters.
Stop handcuffing student diffusion models to their teachers: framing distribution matching as a reward unlocks more stable and performant distillation via RL techniques.
Guaranteeing robust distributed GenAI inference at the edge requires trust-aware routing, and G-TRAC achieves this with sub-millisecond routing latency.
Compressing 3D Gaussian Splatting just got a whole lot better: GeoHCC maintains geometric integrity and rendering fidelity by explicitly modeling inter-anchor geometric correlations, outperforming existing anchor-based approaches.
Runaway compute costs for diffusion models on GPUs? EdgeDiT slashes parameters by 30% and latency by 40% while maintaining image quality, all on your phone.
Forget pruning or quantization: MPO decomposition lets you compress a transformer by 13x while retaining 97% accuracy.
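An MPO (matrix product operator) compresses a weight matrix by regrouping its row/column indices and factoring with truncated SVDs. A two-core sketch on a synthetic weight with exact low MPO rank — the Kronecker construction and the 8×8 index split are assumptions chosen so the decomposition is exact:

```python
import numpy as np

rng = np.random.default_rng(4)

# A 64x64 weight with Kronecker (low MPO-rank) structure: W = kron(P, Q).
P, Q = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
W = np.kron(P, Q)

# Regroup indices W[i1*8+i2, j1*8+j2] -> M[(i1,j1), (i2,j2)].
T = W.reshape(8, 8, 8, 8).transpose(0, 2, 1, 3).reshape(64, 64)

# Truncated SVD gives the two MPO cores.
U, s, Vt = np.linalg.svd(T)
r = int(np.sum(s > 1e-10 * s[0]))            # numerical MPO rank (1 here)
core1 = U[:, :r] * s[:r]                     # shape (64, r)
core2 = Vt[:r, :]                            # shape (r, 64)

# Reconstruct and undo the index regrouping.
W_hat = (core1 @ core2).reshape(8, 8, 8, 8).transpose(0, 2, 1, 3).reshape(64, 64)

compression = W.size / (core1.size + core2.size)
assert np.allclose(W, W_hat)
assert r == 1 and compression == 32.0
```

Real transformer weights are not exactly low-rank in this sense, so practical MPO compression truncates `r` and trades a little accuracy for the parameter savings.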
You can cut 7-14% of the parameters from your SLAM-ASR system by pruning the Whisper encoder and applying LoRA, even outperforming the original model in some cases.
LVLM inference is ripe for optimization, but current acceleration techniques only scratch the surface.
LLMs fix more bugs when you feed them *less* code, thanks to a new compression technique that distills context to the minimal, crucial snippets.
Skipping frames without objects boosts nano-drone object detection throughput by 24% with negligible accuracy loss.
Achieve 7.7% better compression than JPEG-XL by using a bit-depth adaptive entropy model for lossless raw image compression.
Quantization-based point cloud compression can lead to severe distortions, but this work demonstrates a new leaf node lossy compression method that significantly outperforms existing octree-based approaches for object point clouds.
Achieve FP16-level LLM accuracy at 3-bit quantization, unlocking 1.5x faster inference than 4-bit methods on consumer GPUs.
Hadamard rotations unlock near-lossless 5-bit quantization for LLMs, outperforming standard techniques without calibration data.
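Why rotations help: a Hadamard transform spreads a few outlier channels across all coordinates, so a symmetric uniform quantizer wastes far less range. A sketch comparing 5-bit quantization with and without the rotation (the outlier pattern and `quant5` are illustrative, not the paper's exact scheme):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(5)
d = 256
w = rng.standard_normal(d)
w[:4] *= 50.0                      # a few outlier channels dominate the range

def quant5(v):
    """Symmetric 5-bit uniform quantization (integer levels in [-15, 15])."""
    scale = np.abs(v).max() / 15.0
    return np.round(v / scale).clip(-15, 15) * scale

H = hadamard(d) / np.sqrt(d)       # orthonormal rotation, needs no calibration

direct = quant5(w)
rotated = H.T @ quant5(H @ w)      # quantize in the rotated basis, rotate back

err_direct = np.mean((w - direct) ** 2)
err_rotated = np.mean((w - rotated) ** 2)
assert err_rotated < err_direct
```

Because the rotation is a fixed orthonormal matrix, it preserves the quantization error's norm while shrinking the dynamic range the quantizer has to cover — no calibration data required.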
By cleverly repurposing an unused sign bit, IF4 achieves superior quantization performance compared to NVFP4 without increasing bit-width.
Automating the messy process of post-training quantization, OneComp lets you compress generative AI models with a single line of code.
Forget slow rotations: IsoQuant's quaternion-based approach outpaces RotorQuant at LLM KV cache compression, delivering up to 6x speedups on synthetic data.
Achieve secure outsourced decision tree evaluation without any communication between servers, unlocking faster and more scalable MLaaS deployments.
A 50x speedup makes VLMs fast enough to serve as a real-time semantic safety net for self-driving cars, but NF4 quantization can cause critical recall failures.
StreamingVLA achieves a remarkable 2.4x speedup and 6.5x reduction in execution halting by asynchronously parallelizing observation, action generation, and execution stages in vision-language-action models.
LLM inference bottlenecks aren't just compute-bound: heterogeneous GPU-FPGA systems can slash memory processing overheads by up to 2x while simultaneously reducing energy consumption.
Deploying transformers in real-time just got a whole lot faster: this work achieves up to 64x speedups on GPUs while maintaining accuracy through a novel hybrid precision approach.
Forget GPU-centric All-Reduce: SCIN's switch-based architecture slashes latency by up to 8.7x and boosts LLaMA-2 performance by 34% through in-network quantization.
You can boost ranking model performance in low-traffic recommendation systems by directly distilling knowledge from a large-scale, but different, domain like video recommendations.
Cutting LLM costs and ensuring zero data leakage might be two sides of the same contextual compression coin.
Inference-time hacks to boost LLM reasoning are mostly a waste of time: raw model power matters way more.
Forget selecting or merging original KV pairs – KVSculpt distills the KV cache into a smaller, optimized representation in continuous embedding space, slashing KL divergence by up to 4.1x.
Apple's own vDSP FFT library gets smoked by a new implementation that's 29% faster, thanks to a clever two-tier memory model exploiting the GPU's register file and threadgroup memory.
Ternary LLMs can run up to 62x faster on CPU and 1.9x faster on CUDA with RSR-core, a new engine that finally brings theoretically fast low-bit matrix multiplication to practical hardware.
Multi-chiplet architectures can unlock significant speedups and memory savings for low-batch MoE inference by dynamically scheduling expert computations across high-bandwidth die-to-die links.
Forget generic pre-training: Speculative decoding gets a serious speed boost when your draft model is a specialist trained on data matching the target task.
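The reason a specialist draft helps: in the standard speculative accept/reject rule, the expected acceptance rate is 1 minus the total-variation distance between draft and target, so a draft trained on matching data gets more tokens accepted per target-model call. A toy sketch with categorical distributions standing in for next-token predictions (all distributions here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(6)
V = 32  # toy vocabulary size

def speculative_accept_rate(p_target, p_draft, n=20000):
    """Fraction of draft samples accepted by the standard accept/reject rule."""
    tokens = rng.choice(V, size=n, p=p_draft)
    u = rng.uniform(size=n)
    accept = u < np.minimum(1.0, p_target[tokens] / p_draft[tokens])
    return accept.mean()

p_target = rng.dirichlet(np.ones(V))          # the big model's distribution

# Generalist draft: roughly uniform, far from the target distribution.
p_general = rng.dirichlet(100.0 * np.ones(V))
# Specialist draft: trained on matching data, so close to the target.
p_special = 0.9 * p_target + 0.1 * p_general
p_special /= p_special.sum()

assert speculative_accept_rate(p_target, p_special) > \
       speculative_accept_rate(p_target, p_general)
```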
LLMs can maintain performance while processing longer contexts, thanks to a new compression method that intelligently adjusts the compression ratio based on the information density of the input.
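One simple proxy for "information density" is token entropy: low-entropy (repetitive) segments can be compressed hard while high-entropy segments keep most of their tokens. A sketch of that mapping — `keep_fraction`, the entropy cap, and the retention bounds are illustrative assumptions, not the paper's method:

```python
import numpy as np
from collections import Counter

def entropy(tokens):
    """Empirical Shannon entropy (bits) of a token sequence."""
    counts = np.array(list(Counter(tokens).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def keep_fraction(tokens, lo=0.2, hi=0.9, max_bits=6.0):
    """Map segment entropy to a retention ratio: dense segments keep more."""
    h = min(entropy(tokens), max_bits)
    return lo + (hi - lo) * h / max_bits

boilerplate = ["the"] * 40 + ["a"] * 10          # low entropy -> compress hard
dense = [f"tok{i}" for i in range(50)]           # high entropy -> keep most

assert keep_fraction(dense) > keep_fraction(boilerplate)
```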
Forget training on long videos – PackForcing achieves state-of-the-art long-video generation by cleverly compressing the KV-cache into Sink, Mid, and Recent tokens, enabling 24x temporal extrapolation from short-video training.