Inference & Quantization Infrastructure
Model compression, quantization, pruning, distillation, and efficient inference for deployment.
Recent Papers
This paper introduces ChannelMamba, a novel end-to-end architecture for channel state information (CSI) prediction in 6G massive MIMO IoT systems, addressing the limitations of Transformers in handling high-dimensional, long-sequence channel data. ChannelMamba leverages a dual-domain input module processing both frequency-domain CSI and delay-domain CIR data, a cross-path parameter-sharing strategy for Mamba modules, and a bidirectional Mamba module with lightweight attention for cross-feature modeling. Experimental results demonstrate that ChannelMamba achieves state-of-the-art performance in channel prediction accuracy, robustness, generalization, and computational efficiency compared to existing methods.
Introduces ChannelMamba, a specialized Mamba-based architecture incorporating dual-domain input, cross-path parameter sharing, and bidirectional Mamba modules with attention, to achieve state-of-the-art performance in channel prediction for 6G MIMO-IoT.
This paper introduces a model-hardware co-design framework for CNN-based SAR ATR that jointly optimizes adversarial robustness, model compression, and FPGA accelerator design. The framework uses hardware-guided structured pruning, informed by a hardware performance model, to explore robustness-efficiency trade-offs. Experiments on MSTAR and FUSAR-Ship datasets show the framework produces models up to 18.3x smaller with 3.1x fewer MACs while preserving robustness, and the FPGA implementation achieves significant latency and energy efficiency improvements compared to CPU/GPU baselines.
Develops a model-hardware co-design framework that unifies robustness-aware model compression and FPGA accelerator design for CNN-based SAR ATR, enabling exploration of robustness-efficiency trade-offs.
This paper reviews deep learning (DL) approaches for hepatocellular carcinoma (HCC) prediction, highlighting the need for efficient architectures to overcome computational limitations hindering real-world deployment. It discusses lightweight models like MobileNet and EfficientNet, model compression techniques, and data-efficient methods, as well as hybrid approaches to reduce computational load. The review emphasizes the importance of rigorous validation, bias audits, privacy-preserving strategies, and seamless integration into clinical workflows for safe and scalable clinical translation of DL-based HCC prediction.
Synthesizes current advances in efficient deep learning for HCC prediction, identifies persistent challenges, and provides guidance for developing clinically relevant and broadly deployable systems.
The paper introduces Seq2Seq2Seq, a novel lossless compression method using a T5 language model architecture trained with reinforcement learning to compress data into discrete token sequences. This approach preserves the token-based structure of the original data, unlike autoencoders that use continuous latent spaces, leading to improved compression ratios. The model is trained using an off-policy reinforcement learning algorithm to optimize sequence length for minimal redundancy.
Introduces Seq2Seq2Seq, a lossless compression method that leverages reinforcement learning to train a T5 language model to compress data into discrete token sequences, preserving the original token structure.
The paper investigates test-time scaling strategies for web agents in multi-step tasks, finding that uniform scaling saturates quickly and LLM-based arbiters can overrule high-consensus decisions. They demonstrate that uncertainty statistics from the agent's vote distribution correlate with task success, enabling dynamic compute allocation. Based on these findings, they introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are contentious, improving performance and efficiency.
Introduces Confidence-Aware Test-Time Scaling (CATTS), a novel method for dynamically allocating compute to web agents based on vote-derived uncertainty, achieving improved performance and efficiency compared to uniform scaling.
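The allocation rule lends itself to a compact sketch. Below is a minimal illustration of vote-entropy-gated scaling, assuming a hypothetical `sample_action` callable and illustrative budget/threshold values; the paper's exact uncertainty statistic and escalation schedule may differ.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Normalized Shannon entropy of the vote distribution (0 = unanimous)."""
    counts = Counter(votes)
    total = len(votes)
    probs = [c / total for c in counts.values()]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(total) if total > 1 else 0.0

def catts_decide(sample_action, base_k=4, max_k=16, tau=0.35):
    """Sample a small committee; escalate only when the vote is contentious.

    sample_action: callable returning one candidate action (e.g. one rollout).
    tau: entropy threshold above which extra compute is allocated.
    """
    votes = [sample_action() for _ in range(base_k)]
    if vote_entropy(votes) > tau:          # contentious decision -> spend more
        votes += [sample_action() for _ in range(max_k - base_k)]
    return Counter(votes).most_common(1)[0][0]
```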
The paper introduces Moonshine v2, an ergodic streaming encoder ASR model designed for latency-critical speech applications, particularly on resource-constrained edge devices. It addresses the latency issues of full-attention Transformer encoders by employing sliding-window self-attention, enabling bounded, low-latency inference while maintaining strong local context. Experiments demonstrate that Moonshine v2 achieves state-of-the-art word error rates on standard benchmarks, matching the accuracy of models six times larger while running significantly faster.
Introduces an ergodic streaming encoder ASR model, Moonshine v2, that uses sliding-window self-attention to achieve low-latency and high accuracy for on-device speech recognition.
The paper introduces KAN-FIF, a lightweight neural network architecture leveraging Kolmogorov-Arnold Networks (KANs) with spline parameterization to estimate tropical cyclone intensity from meteorological satellite data. KAN-FIF addresses the limitations of existing physics-guided models, which incur high parameter counts and computational inefficiency because they cannot compactly capture complex feature interactions. Experiments demonstrate that KAN-FIF achieves superior accuracy with significantly reduced parameters and faster inference speed compared to baseline models like Phy-CoCo, making it suitable for deployment on resource-constrained edge devices.
Introduces KAN-FIF, a novel and lightweight neural network architecture for tropical cyclone intensity estimation that integrates spline-parameterized KAN layers to efficiently capture complex feature interactions.
The paper introduces a pedagogically-inspired knowledge distillation framework (IOA) for transferring knowledge from large language models (LLMs) to smaller student models. The framework incorporates Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to dynamically identify knowledge deficiencies, organize knowledge delivery through progressive curricula, and adapt representations. Experiments using LLaMA and Qwen models demonstrate that IOA significantly outperforms baseline distillation methods, achieving higher performance on DollyEval, MATH, and HumanEval benchmarks while using significantly fewer parameters.
Introduces a novel three-stage knowledge distillation framework (IOA) that incorporates pedagogical principles to systematically improve student model performance by identifying knowledge gaps, organizing knowledge delivery, and adapting representations.
The paper introduces MUSE, a multi-tenant model serving framework designed to address the challenge of threshold recalibration in Score-as-a-Service environments caused by model updates. MUSE decouples model scores from client decision boundaries using dynamic intent-based routing and a two-level score transformation to map model outputs to a stable reference distribution. Deployed at Feedzai, MUSE significantly reduces model lead time from weeks to minutes, processing over a thousand events per second across dozens of tenants, leading to substantial savings in fraud losses and operational costs.
Introduces a multi-tenant model serving framework, MUSE, that enables seamless model updates by decoupling model scores from client decision boundaries through dynamic intent-based routing and score transformation.
The paper addresses the computational inefficiency of evolutionary AI agents that repeatedly invoke LLMs by proposing AdaptEvolve, a framework for adaptive LLM selection during evolutionary refinement. AdaptEvolve uses intrinsic generation confidence to estimate real-time solvability and dynamically selects an LLM appropriate for the current generation step. Experiments demonstrate that confidence-driven selection achieves a better Pareto frontier, reducing inference costs by 37.9% while maintaining 97.5% of the accuracy of static large models.
Introduces AdaptEvolve, a novel adaptive LLM selection framework for evolutionary AI agents that leverages intrinsic generation confidence to dynamically choose the most efficient LLM for each generation step.
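A minimal sketch of confidence-gated model selection follows, using mean token log-probability as the intrinsic confidence proxy; the estimator, threshold, and helper names (`small_model`, `large_model`) are illustrative assumptions, not the paper's implementation.

```python
import math

def generation_confidence(token_logprobs):
    """Intrinsic confidence proxy: geometric-mean per-token probability
    of the most recent generation."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def select_model(last_logprobs, small_model, large_model, threshold=0.8):
    """Route the next refinement step to the cheap model when the previous
    generation looked confident (i.e., the step seems solvable), falling
    back to the large model otherwise."""
    if last_logprobs and generation_confidence(last_logprobs) >= threshold:
        return small_model
    return large_model
```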
This paper investigates the latency overhead introduced by enabling optional security controls on disaggregated 5G Radio Access Network (RAN) interfaces and the user plane. The authors implemented a testbed with a disaggregated RAN and standardized security mechanisms to measure the impact of cryptographic operations on latency. Results indicate that while disaggregated RANs maintain a latency advantage over monolithic designs even with security enabled, achieving sub-1ms round-trip times is difficult due to the cryptographic overhead.
Quantifies the latency overhead of optional security mechanisms in a disaggregated 5G RAN, demonstrating the trade-offs between security and ultra-low latency.
This paper introduces a continuous learning architecture for edge-based malware detection that leverages LoRA adapters to enable local adaptation and global knowledge sharing in resource-constrained environments. The approach fine-tunes lightweight transformer models (DistilBERT, DistilGPT-2, TinyT5) locally on edge devices and aggregates/redistributes only the LoRA modules, avoiding the exchange of raw data. Experiments on Edge-IIoTset and TON-IoT datasets demonstrate that this LoRA-based exchange improves accuracy by 20-25% when encountering unseen attacks, while maintaining stable performance and adding minimal overhead to model size.
Proposes a parameter-efficient continuous learning framework for edge-based malware detection that uses LoRA to facilitate knowledge sharing between edge devices without transmitting raw data.
The paper introduces Multi-Level Compression Cross Networks (MLCC) and its multi-channel extension (MC-MLCC) to efficiently model high-order feature interactions in recommender systems. MLCC uses hierarchical compression and dynamic composition to capture feature dependencies with favorable computational complexity, while MC-MLCC decomposes feature interactions into parallel subspaces for efficient horizontal scaling. Experiments on public and industrial datasets demonstrate that MLCC and MC-MLCC outperform DLRM-style baselines, achieving up to 0.52 AUC improvement and up to 26x reduction in parameters and FLOPs, and the approach has been adopted in Bilibili's advertising system.
Introduces a novel feature interaction architecture, MLCC, that uses hierarchical compression and dynamic composition to efficiently capture high-order feature interactions, along with its multi-channel extension, MC-MLCC, for improved scalability.
This paper introduces RooflineBench, a benchmarking framework for on-device LLMs based on the Roofline model, using operational intensity (OI) to unify architectural primitives and hardware constraints. They define an inference-potential region and introduce Relative Inference Potential to compare LLM efficiency on the same hardware. Empirical analysis reveals that sequence length significantly influences performance and OI, identifies OI regression with model depth, and demonstrates how structural refinements like M-LA can unlock inference potential.
Introduces RooflineBench, a novel benchmarking framework leveraging Roofline analysis and operational intensity to evaluate and optimize on-device LLM performance across diverse hardware platforms.
The paper introduces MING, an MLIR-based framework for automating the HLS design process of CNNs targeting resource-constrained edge FPGAs. MING employs a streaming architecture with optimized buffer management to address the limitations of existing frameworks in handling stringent resource constraints. Experiments demonstrate that MING achieves significant speedups (15x for multi-layer CNN kernels and up to 200x for single-layer kernels) and can generate efficient designs for larger input sizes where other frameworks fail.
Introduces an MLIR-based framework, MING, that automates HLS design for CNNs on resource-constrained edge FPGAs using a streaming architecture with optimized buffer management.
The paper introduces PASCAL, a phase-aware scheduling algorithm designed to optimize the serving of reasoning-based LLMs by explicitly differentiating and prioritizing the reasoning phase to minimize Time-To-First-Token (TTFT). PASCAL employs a hierarchical scheduler with instance-level placement, intra-instance execution management, and dynamic migration at phase boundaries to balance load and reduce interference. Experiments using DeepSeek-R1-Distill-Qwen-32B show that PASCAL reduces tail TTFT by up to 72% while preserving answering phase SLO attainment, highlighting the benefits of phase-aware scheduling.
Introduces a phase-aware scheduling algorithm, PASCAL, that optimizes LLM serving by prioritizing the reasoning phase to reduce TTFT and employing controlled preemption and token pacing during the answering phase to maintain QoE.
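To make the phase-prioritization idea concrete, here is a toy two-priority queue in which reasoning-phase requests are always dispatched before queued answering-phase work; the real system's hierarchical placement, preemption, and migration logic is far richer.

```python
import heapq
import itertools

REASONING, ANSWERING = 0, 1   # lower value = higher scheduling priority
_seq = itertools.count()      # FIFO tie-breaker within a phase

class PhaseAwareScheduler:
    """Toy phase-aware queue: reasoning-phase requests run first so their
    first token (TTFT) is not delayed behind long answering-phase decodes."""

    def __init__(self):
        self._heap = []

    def submit(self, request_id, phase):
        heapq.heappush(self._heap, (phase, next(_seq), request_id))

    def mark_phase_boundary(self, request_id):
        # At a reasoning->answering boundary the request is requeued at
        # lower priority (a real system would also consider migrating it).
        self.submit(request_id, ANSWERING)

    def next_request(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```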
The paper introduces OServe, a novel LLM serving system designed to address spatial and temporal heterogeneity in LLM workloads by enabling heterogeneous and flexible model deployments. OServe employs a workload-aware scheduling algorithm to optimize model deployment based on real-time workload characteristics and uses a workload-adaptive switching method to migrate model deployments in response to predicted workload changes. Experiments using real-world traces demonstrate that OServe achieves up to a 2x (average 1.5x) performance improvement compared to existing LLM serving systems.
Introduces a spatial-temporal workload orchestration framework, OServe, that dynamically adapts model deployment to heterogeneous and time-varying LLM workloads.
This paper introduces Trajectory Self-Distillation (T3D), a novel framework for improving the generation quality of few-step Diffusion Language Models (DLLMs) by distilling the model's own generative trajectories. T3D incorporates Direct Discriminative Optimization (DDO), a reverse-KL objective, to encourage mode-seeking behavior during distillation, focusing the student model on high-probability regions of the teacher model's output space. Experiments across various benchmarks demonstrate that T3D significantly outperforms existing few-step DLLM baselines, substantially reducing the performance gap with full-step decoding.
Introduces a trajectory self-distillation framework, T3D, that leverages direct discriminative optimization to improve the generation quality of few-step diffusion language models.
This paper presents an anatomical analysis of text prompting within vision-language segmentation models, specifically SAM3, revealing significant redundancy in text encoder utilization. Based on these findings, they propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student. Experiments demonstrate that SAM3-LiteText reduces text encoder parameters by up to 88% while maintaining segmentation performance on image and video segmentation benchmarks.
Introduces SAM3-LiteText, a distilled MobileCLIP-based text encoder, to significantly reduce the computational and memory overhead of SAM3's text encoder without sacrificing segmentation accuracy.
The authors extend the Puzzle post-training neural architecture search framework to optimize the gpt-oss-120B model, creating gpt-oss-puzzle-88B by combining heterogeneous MoE expert pruning, selective attention replacement, FP8 quantization, and post-training reinforcement learning. This optimized model achieves significant per-token throughput speedups (up to 2.82X on a single H100 GPU) while maintaining or slightly exceeding the parent model's accuracy across various benchmarks. The paper advocates for request-level efficiency metrics to account for varying token counts and demonstrates that gpt-oss-puzzle-88B improves request-level efficiency by up to 1.29X.
Introduces a pipeline combining heterogeneous MoE expert pruning, selective attention replacement, FP8 quantization, and post-training reinforcement learning within the Puzzle framework to optimize large language models for inference.
This paper introduces a lightweight framework for predicting LLM output length by reusing the main model's internal hidden states, addressing the computational waste caused by excessive padding in batched inference. The framework consists of Entropy-Guided Token Pooling (EGTP) for static prediction and Progressive Length Prediction (PLP) for dynamic estimation during stochastic generation. Experiments on the newly introduced ForeLen benchmark demonstrate that EGTP achieves state-of-the-art accuracy, reducing MAE by 29.16% compared to existing methods, and improves end-to-end throughput when integrated with a length-aware scheduler.
Proposes a novel and efficient framework for LLM output length prediction that leverages entropy-guided token pooling and progressive length prediction to improve accuracy and reduce computational overhead.
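The pooling idea can be sketched directly: weight each prompt position's hidden state by its next-token entropy before passing the pooled vector to a lightweight regression head. Shapes and epsilon constants below are assumptions for illustration.

```python
import numpy as np

def entropy_guided_pool(hidden, logits):
    """Pool prompt hidden states, weighting positions by next-token entropy.

    hidden: (seq_len, d) hidden states from the main model.
    logits: (seq_len, vocab) next-token logits at each position.
    Returns a single (d,) vector for a lightweight length regressor.
    """
    z = logits - logits.max(axis=-1, keepdims=True)          # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-9)).sum(axis=-1)   # (seq_len,)
    weights = entropy / (entropy.sum() + 1e-9)
    return weights @ hidden                                   # (d,)

# A small linear head on the pooled vector would then predict output length.
```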
The paper introduces Categorical Flow Maps, a flow-matching method designed for fast, few-step generation of categorical data using self-distillation. By defining a continuous flow map towards the simplex, the method transports probability mass to a predicted endpoint, enabling the use of distillation techniques and a novel endpoint consistency objective. Experiments demonstrate state-of-the-art few-step generation performance across images, molecular graphs, and text, even achieving strong results in single-step generation.
Introduces a continuous flow-matching formulation for categorical data generation that enables self-distillation and endpoint consistency training, leading to accelerated sampling.
This paper addresses the computational bottleneck introduced by post-quantum cryptography (PQC) in Open Radio Access Networks (O-RAN) control planes, which impacts energy efficiency. They propose an energy-aware framework with a Crypto Policy rApp and a Security Operations Scheduling (SOS) xApp to strategically manage PQC suites and optimize cryptographic enforcement timing and placement. Through discrete-event simulation, the proposed scheduling approach achieves a 60% reduction in per-handshake energy consumption without compromising slice latency targets.
Introduces an energy-aware scheduling framework for PQC handshakes in O-RAN that minimizes energy consumption while meeting slice latency requirements.
The paper introduces EqDeepRx, a deep-learning-aided MIMO receiver that combines linear processing with learned components for improved scaling and generalization. EqDeepRx employs a shared-weight DetectorNN operating on individual spatial streams to achieve near-linear complexity scaling with multiplexing order, and uses a DenoiseNN to enhance channel estimation. End-to-end simulations demonstrate that EqDeepRx achieves improved error rate and spectral efficiency compared to conventional receivers while maintaining low complexity and supporting various MIMO configurations without retraining.
Introduces a novel deep-learning-aided MIMO receiver architecture, EqDeepRx, that achieves near-linear complexity scaling with multiplexing order through a shared-weight DetectorNN and enhances generalization via a DenoiseNN.
This paper compares MAP and LMMSE estimators for blind deconvolution problems, focusing on scenarios with full knowledge of signal and kernel distributions. It finds that MAP estimators are unstable and require extensive tuning, even in controlled settings, while LMMSE provides a robust baseline. The study also demonstrates that LMMSE solutions can effectively initialize MAP methods, improving their performance and stability.
Empirically demonstrates the instability of MAP estimators compared to LMMSE in blind deconvolution and shows that LMMSE can effectively initialize MAP methods.
This paper introduces dVoting, a novel test-time technique for Diffusion Large Language Models (dLLMs) that leverages their parallel decoding capabilities to enhance reasoning. dVoting iteratively refines token predictions by sampling multiple outputs, identifying inconsistent tokens, and regenerating them through a voting mechanism until convergence. Experiments on GSM8K, MATH500, ARC-C, and MMLU demonstrate consistent performance improvements, highlighting the potential of dVoting to boost dLLM reasoning without additional training.
Introduces dVoting, a parallelizable, training-free voting technique that leverages the unique capabilities of dLLMs to iteratively refine and improve reasoning performance by focusing on uncertain tokens.
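A toy version of the voting loop, written against two hypothetical callables (`sample_fn`, `regenerate_fn`) standing in for parallel dLLM decoding; it shows the converge-by-majority structure rather than the paper's exact procedure.

```python
from collections import Counter

def dvote(sample_fn, regenerate_fn, k=5, max_rounds=4):
    """Toy dVoting loop over fixed-length token sequences.

    sample_fn(): returns one decoded token sequence (list of tokens).
    regenerate_fn(seq, positions): re-decodes only the given positions,
        conditioning on the agreed tokens (a dLLM can do this in parallel).
    """
    samples = [sample_fn() for _ in range(k)]
    consensus = [Counter(col).most_common(1)[0][0] for col in zip(*samples)]
    for _ in range(max_rounds):
        disputed = [i for i, col in enumerate(zip(*samples))
                    if len(set(col)) > 1]
        if not disputed:                     # all samples agree -> converged
            break
        samples = [regenerate_fn(consensus, disputed) for _ in range(k)]
        consensus = [Counter(col).most_common(1)[0][0]
                     for col in zip(*samples)]
    return consensus
```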
This paper introduces an energy-aware spike budgeting framework for continual learning in spiking neural networks (SNNs) to address catastrophic forgetting while optimizing for energy efficiency. The framework combines experience replay, learnable LIF neuron parameters, and an adaptive spike scheduler to enforce dataset-specific energy constraints during training. Results show that spike budgeting acts as a sparsity-inducing regularizer on frame-based datasets, improving accuracy and reducing spike rates, while controlled budget relaxation enables accuracy gains on event-based datasets.
Introduces an energy-aware spike budgeting framework that adaptively controls spike rates during continual learning in SNNs to improve both accuracy and energy efficiency across frame-based and event-based neuromorphic vision datasets.
This paper investigates the relationship between performance antipatterns and energy consumption in microservice architectures by implementing ten common antipatterns as isolated microservices and measuring their performance, CPU/DRAM power consumption, and resource utilization under controlled load. The study reveals that while all implemented antipatterns degrade performance, only a subset significantly increase power consumption, with some reaching CPU saturation and others exhibiting energy-performance coupling. The findings provide a basis for identifying performance antipatterns that also act as energy antipatterns, offering insights for energy-efficient microservice design.
Empirically demonstrates that not all performance antipatterns in microservices lead to increased power consumption, identifying specific cases where performance degradation does not correlate with higher energy usage due to CPU saturation effects.
This paper investigates the phenomenon of "token overflow" in soft compression architectures for retrieval-augmented generation (RAG), where compressed token representations lose task-relevant information. They propose a methodology to characterize and detect token overflow, evaluating it within the xRAG framework. Their key finding is that lightweight probing classifiers, leveraging both query and context xRAG representations, achieve an average AUC-ROC of 0.72 in detecting overflow across HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating the importance of query-aware detection.
Introduces a methodology using lightweight probing classifiers to detect token overflow in compressed token representations for retrieval-augmented generation by leveraging query and context information.
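The probing setup is simple enough to sketch end to end; the snippet below trains a query-aware logistic-regression probe on concatenated query and context representations. Feature construction and labels here are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_overflow_probe(q_reps, c_reps, overflow):
    """q_reps, c_reps: (n, d) query and compressed-context representations;
    overflow: (n,) binary labels marking examples where compression lost
    the task-relevant information (e.g. the answer becomes unrecoverable).
    """
    feats = np.concatenate([q_reps, c_reps], axis=1)   # query-aware features
    probe = LogisticRegression(max_iter=1000).fit(feats, overflow)
    scores = probe.predict_proba(feats)[:, 1]
    # AUC on the training split shown for brevity; use held-out data in practice.
    return probe, roc_auc_score(overflow, scores)
```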
The paper introduces DEL, a framework for differentially private and communication-efficient split inference of large language models (LLMs). DEL uses an embedding projection module and differentially private stochastic quantization to reduce communication overhead while preserving privacy. It then employs soft prompts on the server side to mitigate utility degradation caused by the privacy mechanisms, eliminating the need for local models.
Introduces a novel framework, DEL, that leverages soft prompts to improve the privacy-utility trade-off in LLM split inference, achieving differential privacy and communication efficiency.
The paper addresses the problem of excessive and unnecessary reflection in Large Reasoning Models (LRMs) that leads to increased token consumption and computational overhead without improving accuracy, especially in smaller models. To mitigate this, they propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a reinforcement learning framework that dynamically balances reasoning efficiency and solution accuracy by introducing reflection and length penalties. Experiments on mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B and 7B models demonstrate that ARLCP achieves a superior efficiency-accuracy trade-off, reducing response length by up to 53.1% while improving accuracy by up to 5.8%.
Introduces ARLCP, a novel reinforcement learning framework with adaptive reflection and length penalties, to train LRMs for efficient reasoning by curtailing unnecessary reflective steps while preserving essential reasoning.
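The shaped reward can be illustrated with a simple scalar function; the penalty coefficients, reflection counting, and length target below are placeholders rather than the paper's calibrated, adaptive schedule.

```python
def arlcp_style_reward(correct, n_reflections, length,
                       alpha=0.1, beta=1e-4, target_len=512):
    """Illustrative reward shaping in the spirit of ARLCP: reward accuracy,
    penalize reflective restarts ("wait", "let me re-check", ...) and
    excess response length beyond a target budget."""
    accuracy_term = 1.0 if correct else 0.0
    reflection_penalty = alpha * n_reflections
    length_penalty = beta * max(0, length - target_len)
    return accuracy_term - reflection_penalty - length_penalty
```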
The paper introduces PrefillShare, an algorithm for sharing the prefill stage across multiple language models in disaggregated serving environments to reduce redundant computation and KV cache storage. PrefillShare factorizes models into prefill and decode modules, freezes the prefill module, and fine-tunes only the decode module, enabling multiple models to share a prefill module and its KV cache. Experiments demonstrate that PrefillShare achieves comparable accuracy to full fine-tuning while significantly improving latency (4.5x lower p95) and throughput (3.9x higher) in multi-model agent workloads.
Introduces PrefillShare, a novel algorithm that enables efficient sharing of the prefill stage and KV cache across multiple language models in a disaggregated serving system.
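One plausible reading of the factorization, sketched below: keep a single frozen copy of the base model for the prefill pass so its KV cache is identical across variants, and fine-tune a separate decode-time copy per task. This is an interpretation for illustration, not the paper's exact module split.

```python
import copy
import torch

def make_shared_prefill_variant(base_model):
    """Freeze one shared model for prefill (so its KV cache can be reused
    across all task variants) and fine-tune a separate copy that only
    runs decode steps on top of the shared cache."""
    prefill_model = base_model                 # shared & frozen across tasks
    for p in prefill_model.parameters():
        p.requires_grad_(False)
    decode_model = copy.deepcopy(base_model)   # per-task, trainable
    opt = torch.optim.AdamW(decode_model.parameters(), lr=1e-5)
    return prefill_model, decode_model, opt

# Serving: run prefill once with prefill_model, cache KV, then let each
# task's decode_model consume the shared cache during generation.
```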
The paper introduces LASER, a full-stack optimization framework for efficient long sequence modeling in recommendation systems, addressing I/O and computational bottlenecks. LASER incorporates SeqVault, a hybrid DRAM-SSD indexing strategy, to reduce retrieval latency, and Segmented Target Attention (STA), a novel attention mechanism with a sigmoid-based gating strategy and Global Stacked Target Attention (GSTA), to reduce computational complexity. Online A/B testing showed LASER achieved significant improvements in ADVV and revenue, demonstrating its practical impact.
Introduces a full-stack optimization framework, LASER, featuring SeqVault and Segmented Target Attention (STA), to achieve efficient long sequence modeling for recommendation systems.
This paper introduces Processing Across Memory (PAM), a KV-centric LLM serving system designed to address the memory bandwidth and capacity bottlenecks in LLM serving. PAM employs a hierarchical memory architecture with heterogeneous PIM-enabled devices, distributing KV tokens based on context locality and introducing the PAMattention algorithm for parallel attention computation. The system further incorporates dynamic KV scheduling and migration to balance computational workloads across devices, leading to enhanced efficiency and scalability.
Introduces a hierarchical memory architecture and associated algorithms for LLM serving that coordinates heterogeneous PIM-enabled memory devices to balance high memory bandwidth with scalable capacity.
The paper introduces On-Policy Context Distillation (OPCD), a method for distilling in-context knowledge into language models by training a student model on its own generated trajectories. OPCD minimizes the reverse Kullback-Leibler divergence between the student's output and a context-conditioned teacher model, effectively bridging on-policy and context distillation. Experiments across mathematical reasoning, text-based games, and domain-specific tasks demonstrate that OPCD outperforms baselines in task accuracy and out-of-distribution generalization, while also enabling effective cross-size distillation.
Introduces On-Policy Context Distillation (OPCD), a novel framework for language model distillation that leverages on-policy training with reverse KL divergence to internalize in-context knowledge.
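The training signal reduces to a token-level reverse KL computed on the student's own rollouts, scored by a context-conditioned teacher. A sketch follows (tensor shapes assumed; a full on-policy treatment also handles the gradient through the sampling step, which this surrogate ignores).

```python
import torch
import torch.nn.functional as F

def opcd_loss(student_logits, teacher_logits):
    """Token-level reverse KL on a student-generated trajectory.

    student_logits: (T, V) student next-token logits along its own rollout.
    teacher_logits: (T, V) teacher logits at the same positions, with the
        in-context knowledge included in the teacher's prompt.
    Minimizing KL(student || teacher) is mode-seeking: the student commits
    to behaviors the context-conditioned teacher assigns high probability.
    """
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (log_q.exp() * (log_q - log_p)).sum(-1).mean()
```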
This paper introduces Arbitrary Ratio Feature Compression (ARFC), a novel framework for compressing features to arbitrary ratios using a single model based on next-token prediction. The core of ARFC is an auto-regressive model that controls the compression ratio by adjusting the number of generated tokens during inference. To improve compressed feature quality, the framework incorporates a Mixture of Solutions (MoS) module and an Entity Relation Graph Constraint (ERGC) during training, resulting in state-of-the-art performance across various tasks and compression ratios.
Introduces a flexible feature compression framework, ARFC, that achieves arbitrary compression ratios with a single model by framing compression as a next-token prediction task.
This paper introduces Generalized On-Policy Distillation (G-OPD), a framework extending standard on-policy distillation by incorporating a flexible reference model and a reward scaling factor to control the reward term's weight against KL regularization. The authors theoretically demonstrate that standard OPD is a specific instance of dense KL-constrained RL and empirically show that reward extrapolation (ExOPD), where the reward scaling factor is greater than 1, consistently improves performance over standard OPD, even enabling the student to surpass the teacher's performance. Furthermore, they find that reward correction using the teacher's base model before RL as the reference model in strong-to-weak distillation further enhances performance.
Proposes Generalized On-Policy Distillation (G-OPD), a novel framework that extends standard OPD with a flexible reference model and reward scaling, enabling reward extrapolation and improved distillation performance.
The paper introduces GORGO, a method for cross-region LLM load balancing that minimizes Time-to-First-Token (TTFT) by jointly optimizing for compute availability, network latency, and KV-cache reuse. GORGO models a total serving cost function and uses it to make routing decisions, addressing the limitations of existing approaches that either ignore network latency or suffer from synchronization overhead. Experiments on custom infrastructure demonstrate that GORGO reduces P99 TTFT through network-aware routing and achieves a 2.5x speedup in median TTFT compared to prior methods by using a centralized HTTP proxy.
Introduces a network-aware routing policy, GORGO, that minimizes TTFT in cross-region LLM inference by optimizing a cost function that considers compute, network latency, and KV-cache reuse.
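A stripped-down version of the routing rule: score each region by a total serving cost combining network RTT, queueing delay, and prefill work for KV-cache misses, then pick the argmin. The cost terms and constants are illustrative, not GORGO's fitted model.

```python
def serving_cost(region, req, net_rtt_ms, queue_ms, cache_hit_tokens,
                 ms_per_prefill_token=0.2):
    """Toy total-cost estimate for routing one request to one region:
    network latency + queueing + prefill time for prompt tokens whose
    KV cache is not already resident in that region."""
    uncached = max(0, req["prompt_tokens"] - cache_hit_tokens)
    return (net_rtt_ms[region] + queue_ms[region]
            + ms_per_prefill_token * uncached)

def route(req, regions, net_rtt_ms, queue_ms, cache_hits):
    """Pick the region minimizing estimated TTFT cost."""
    return min(regions, key=lambda r: serving_cost(
        r, req, net_rtt_ms, queue_ms, cache_hits[r]))
```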
This paper introduces a generative compression framework for image denoising that prioritizes perceptual realism by reconstructing images from entropy-coded latent representations. The approach uses both a conditional Wasserstein GAN (WGAN) and a conditional diffusion model, each guided by compressed latents and perceptual losses like LPIPS and Wasserstein distance, to balance rate, distortion, and perception. Empirical results on synthetic and real-noise datasets show improved perceptual quality and competitive distortion performance, and theoretical analysis provides non-asymptotic guarantees for a compression-based maximum-likelihood denoiser.
Proposes a novel image denoising framework based on generative compression, leveraging entropy-coded latent representations and perceptual losses to achieve a better trade-off between perceptual quality and distortion.
This paper introduces DeepFusionKernel, a deeply fused kernel designed to optimize the memory bandwidth bottleneck caused by large SwiGLU MLP blocks in agentic LLM inference with long contexts. By reducing HBM traffic and improving cache reuse, DeepFusionKernel significantly accelerates inference. Experiments demonstrate speedups of up to 13.2% on H100 and 9.7% on A100 GPUs compared to SGLang.
Introduces a deeply fused kernel, DeepFusionKernel, that optimizes memory bandwidth usage for SwiGLU MLP blocks in transformer models, leading to faster inference.
The paper introduces Random Access Memory Network (RAM-Net), a novel linear attention architecture that addresses the expressivity limitations of fixed-size memory by mapping inputs to high-dimensional sparse vectors that serve as explicit addresses for a large memory state. This design enables exponential state size scaling without increasing the number of parameters, thereby reducing signal interference and improving retrieval fidelity. Experiments show that RAM-Net outperforms state-of-the-art baselines in long-range retrieval tasks and achieves competitive performance in language modeling and zero-shot commonsense reasoning.
Introduces RAM-Net, a linear attention architecture that uses sparse, high-dimensional vectors as explicit memory addresses to enable exponential state scaling without increasing parameters.
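The addressing mechanism can be sketched with a k-sparse projection used as an explicit write/read address over a large memory matrix; the dimensions and normalization below are assumptions, not the paper's parameterization.

```python
import numpy as np

def sparse_address(x, W_addr, k=8):
    """Map an input to a high-dimensional, k-sparse address vector.
    W_addr: (M, d) projection with M >> d; x: (d,) input."""
    a = W_addr @ x
    idx = np.argpartition(a, -k)[-k:]    # keep the k strongest components
    s = np.zeros_like(a)
    s[idx] = a[idx]
    return s

def memory_step(M, x, v, W_addr, k=8):
    """Write value v at the address derived from x, then read it back.
    Sparse addresses keep writes nearly orthogonal, reducing interference."""
    s = sparse_address(x, W_addr, k)
    M = M + np.outer(s, v)               # write: rank-1 update at address s
    read = s @ M / (s @ s + 1e-9)        # read: project memory onto address
    return M, read
```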
This paper investigates the impact of bit allocation strategies on the performance of world model-based planning, specifically using DINO-WM on the Wall planning task. The study compares uniform, mixed, asymmetric, and layerwise quantization schemes under different planner budgets to identify critical bitwidth thresholds. Results show a sensitivity to bit allocation in the 4-bit regime, with encoder precision being particularly important for maintaining performance, suggesting the need for module-aware quantization policies.
Demonstrates that, in low-bit world model planning, performance is sensitive to bit allocation, particularly in the encoder, and identifies a critical 4-bit transition regime where module-aware quantization becomes crucial.
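A minimal harness for the kind of module-aware policy the study motivates: uniform fake-quantization applied per module, with the encoder held at higher precision. The policy dict and bitwidths are illustrative, not the paper's evaluated schemes.

```python
import numpy as np

def fake_quantize(w, bits):
    """Uniform symmetric fake-quantization of a weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def apply_policy(weights_by_module, policy, default_bits=8):
    """Apply a module-wise bitwidth policy, e.g. keep the encoder at
    8 bits while pushing other modules into the sensitive 4-bit regime."""
    return {name: fake_quantize(w, policy.get(name, default_bits))
            for name, w in weights_by_module.items()}

policy = {"encoder": 8, "dynamics": 4, "decoder": 4}   # asymmetric scheme
```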
The paper introduces Region-to-Image Distillation, a method that distills the benefits of iterative zooming into a single forward pass of an MLLM for improved fine-grained multimodal perception. This is achieved by training a student model on VQA data generated by a teacher model that has zoomed into micro-cropped regions. The approach is evaluated on a new benchmark, ZoomBench, and demonstrates improved performance on fine-grained perception tasks and general multimodal cognition, without the latency overhead of iterative zooming.
Introduces Region-to-Image Distillation to internalize the benefits of agentic zooming into a single forward pass of an MLLM, eliminating the need for iterative tool calls during inference.
This paper introduces Cachemir, a novel framework for fully homomorphic encrypted (FHE) inference of generative LLMs that addresses the inefficiency of integrating KV caches in existing FHE solutions. Cachemir achieves this by developing HE packing algorithms tailored for KV cache utilization, an interleaved replicated packing algorithm for efficient vector-matrix multiplications, and an augmented bootstrapping placement strategy to minimize bootstrapping costs. Experiments show that Cachemir significantly outperforms state-of-the-art FHE inference frameworks like MOAI and THOR, achieving up to 67x speedup and generating tokens for Llama-3-8B in under 100 seconds on GPU.
Introduces a novel fully homomorphic encryption (FHE) inference framework, Cachemir, that significantly accelerates generative LLM inference by efficiently integrating and optimizing the KV cache.
The paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that combines sparse attention (InfLLM-V2) and linear attention (Lightning Attention) to improve long-context modeling efficiency. A layer selection algorithm integrates the two attention mechanisms in a 1:3 ratio, along with a hybrid positional encoding (HyPE), to maintain performance while improving efficiency. The paper also presents a cost-effective continual training framework that transforms pre-trained Transformer models into hybrid models, reducing training costs by 75%; the resulting model achieves 3.5x faster inference at 256K sequence length and supports context lengths up to 1M tokens on a single NVIDIA A6000D GPU.
Introduces a hybrid sparse and linear attention architecture, MiniCPM-SALA, that achieves efficient long-context modeling with minimal performance degradation compared to full-attention models.
The paper introduces Consolidation-based Routing for Adaptive Memory (CRAM), a novel memory consolidation mechanism inspired by biology, designed to reduce attention computation in hybrid architectures by distilling episodic retrievals into parametric semantic memory over time. Analyzing GPT-2 models, the authors found that a significant portion of attention operations retrieve predictable information, motivating the development of a system that decreases attention utilization during training. CRAM achieves a 37.8x reduction in attention compute with 100% retrieval accuracy on a proposed SRCD benchmark, and exhibits consolidation dynamics that align with human episodic-to-semantic memory transition curves.
Introduces a biologically-inspired memory consolidation mechanism, CRAM, that adaptively reduces attention computation by distilling episodic retrievals into parametric semantic memory.
The paper introduces PACE, a dual-level framework for compressing reasoning traces in Large Reasoning Models (LRMs) by addressing overthinking and excessive token usage. PACE employs prefix-protected optimization at the sequence level using decaying mixed rollouts to preserve valid reasoning paths while encouraging conciseness, and difficulty-aware penalty at the group level to dynamically adjust length constraints based on query complexity. Experiments on DeepSeek-R1-Distill-Qwen models (1.5B/7B) demonstrate that PACE achieves up to 55.7% token reduction and up to 4.1% accuracy improvement on math benchmarks, generalizing to code, science, and general domains.
Introduces a dual-level compression framework, PACE, that combines prefix-protected optimization and difficulty-aware penalties to reduce token usage and improve accuracy in language reasoning models.
This paper presents a hardware implementation of semi-empirical electronic structure methods, specifically Extended Hückel Theory (EHT) and non-self-consistent Density Functional Tight Binding (DFTB0), on a field-programmable gate array (FPGA). By implementing Hamiltonian construction and diagonalization directly on the FPGA using a streaming dataflow architecture, the design achieves deterministic execution and eliminates host intervention. The FPGA-based DFTB0 Hamiltonian generator demonstrates a greater than fourfold throughput improvement compared to a server-class CPU on a mid-range Artix-7 FPGA, highlighting the potential for significant acceleration.
Demonstrates a hardware-native implementation of semi-empirical electronic structure theory on an FPGA, achieving superior throughput compared to a CPU.
This paper introduces MemFly, a framework for on-the-fly memory optimization in LLMs based on the information bottleneck principle. MemFly uses a gradient-free optimizer to minimize compression entropy while maximizing relevance entropy, creating a stratified memory structure. The framework incorporates a hybrid retrieval mechanism combining semantic, symbolic, and topological pathways, achieving superior performance in memory coherence, response fidelity, and accuracy compared to existing methods.
Introduces an information bottleneck-based framework, MemFly, for on-the-fly memory optimization in LLMs, enabling efficient compression and precise retrieval.
This paper introduces a real-time, low-latency Named Entity Recognition (NER) system tailored for cancer therapy-related clinical records and Traditional Chinese Medicine (TCM) using deep learning architectures. The study addresses the challenges of applying NER to complex medical terminology and the need for high accuracy in clinical contexts, particularly in cross-lingual speech-to-text applications. The authors propose a semi-supervised approach that integrates TCM-specific corpora with biomedical resources, demonstrating improved recognition accuracy for real-time clinical applications.
Introduces a semi-supervised NER approach that leverages TCM-specific corpora and biomedical resources to enhance recognition accuracy in real-time clinical applications.

