Architecture Design (Transformers, SSMs, MoE)
Novel neural network architectures including transformer variants, state space models, mixture of experts, and attention mechanisms.
Recent Papers
This paper introduces ChannelMamba, a novel end-to-end architecture for channel state information (CSI) prediction in 6G massive MIMO IoT systems, addressing the limitations of Transformers in handling high-dimensional, long-sequence channel data. ChannelMamba leverages a dual-domain input module processing both frequency-domain CSI and delay-domain CIR data, a cross-path parameter-sharing strategy for Mamba modules, and a bidirectional Mamba module with lightweight attention for cross-feature modeling. Experimental results demonstrate that ChannelMamba achieves state-of-the-art performance in channel prediction accuracy, robustness, generalization, and computational efficiency compared to existing methods.
Introduces ChannelMamba, a specialized Mamba-based architecture incorporating dual-domain input, cross-path parameter sharing, and bidirectional Mamba modules with attention, to achieve state-of-the-art performance in channel prediction for 6G MIMO-IoT.
This paper introduces a hybrid Mamba-Transformer (MT) framework for remote sensing image super-resolution, aiming to overcome the limitations of CNNs and transformers in capturing long-range dependencies and maintaining computational efficiency. MT combines a focused Mamba block (FMB) with a snake vision state-space module (SVSSM) for global feature modeling and a pixel-adaptive block (PAB) for pixel-level multiscale enhancement. Experiments on benchmark datasets demonstrate that MT outperforms state-of-the-art methods, achieving a better trade-off between performance and computational cost, specifically reducing parameters and FLOPs compared to MambaIRv2 while improving PSNR.
Introduces a novel hybrid Mamba-Transformer architecture that leverages a snake vision state-space module within a Mamba block to improve long-range dependency modeling and reduce computational redundancy for remote sensing image super-resolution.
This paper introduces a model-hardware co-design framework for CNN-based SAR ATR that jointly optimizes adversarial robustness, model compression, and FPGA accelerator design. The framework uses hardware-guided structured pruning, informed by a hardware performance model, to explore robustness-efficiency trade-offs. Experiments on MSTAR and FUSAR-Ship datasets show the framework produces models up to 18.3x smaller with 3.1x fewer MACs while preserving robustness, and the FPGA implementation achieves significant latency and energy efficiency improvements compared to CPU/GPU baselines.
Develops a model-hardware co-design framework that unifies robustness-aware model compression and FPGA accelerator design for CNN-based SAR ATR, enabling exploration of robustness-efficiency trade-offs.
This paper reviews deep learning (DL) approaches for hepatocellular carcinoma (HCC) prediction, highlighting the need for efficient architectures to overcome computational limitations hindering real-world deployment. It discusses lightweight models like MobileNet and EfficientNet, model compression techniques, and data-efficient methods, as well as hybrid approaches to reduce computational load. The review emphasizes the importance of rigorous validation, bias audits, privacy-preserving strategies, and seamless integration into clinical workflows for safe and scalable clinical translation of DL-based HCC prediction.
Synthesizes current advances in efficient deep learning for HCC prediction, identifies persistent challenges, and provides guidance for developing clinically relevant and broadly deployable systems.
This paper introduces Hadamard Linear Attention (HLA), a novel linear attention mechanism designed to more accurately approximate softmax attention. HLA applies a nonlinearity after the computation of pairwise similarities, unlike existing linear attention methods that apply nonlinear kernel functions independently to queries and keys. The authors demonstrate that this approach results in a higher-degree rational function approximation of softmax and show its effectiveness in a large diffusion transformer model for video generation.
Introduces Hadamard Linear Attention (HLA), a linear attention variant that applies a nonlinearity after pairwise similarity computation to better approximate softmax.
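To make the distinction concrete, here is a minimal numpy sketch (not the paper's actual HLA construction — in particular it ignores whatever factorization keeps the cost linear) contrasting kernel-based linear attention with a variant that applies a nonlinearity after the pairwise similarities; the particular kernel and nonlinearity here are assumptions.

```python
import numpy as np

def kernel_linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Conventional linear attention: the nonlinearity phi is applied to queries and
    # keys independently, so the n x n similarity matrix is never materialized.
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                       # (d, d_v) summary of keys and values
    norm = Qp @ Kp.sum(axis=0)          # (n,) normalizer
    return (Qp @ kv) / norm[:, None]

def post_similarity_attention(Q, K, V, g=np.square):
    # Variant in the spirit described above: the nonlinearity g acts on the pairwise
    # similarities Q K^T themselves (materialized here in O(n^2) for clarity).
    S = g(Q @ K.T)
    S = S / S.sum(axis=1, keepdims=True)
    return S @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
print(kernel_linear_attention(Q, K, V).shape, post_similarity_attention(Q, K, V).shape)
```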
The paper introduces Seq2Seq2Seq, a novel lossless compression method using a T5 language model architecture trained with reinforcement learning to compress data into discrete token sequences. This approach preserves the token-based structure of the original data, unlike autoencoders that use continuous latent spaces, leading to improved compression ratios. The model is trained using an off-policy reinforcement learning algorithm to optimize sequence length for minimal redundancy.
Introduces Seq2Seq2Seq, a lossless compression method that leverages reinforcement learning to train a T5 language model to compress data into discrete token sequences, preserving the original token structure.
The paper introduces Moonshine v2, an ergodic streaming encoder ASR model designed for latency-critical speech applications, particularly on resource-constrained edge devices. It addresses the latency issues of full-attention Transformer encoders by employing sliding-window self-attention, enabling bounded, low-latency inference while maintaining strong local context. Experiments demonstrate that Moonshine v2 achieves state-of-the-art word error rates on standard benchmarks, matching the accuracy of models six times larger while running significantly faster.
Introduces an ergodic streaming encoder ASR model, Moonshine v2, that uses sliding-window self-attention to achieve low-latency and high accuracy for on-device speech recognition.
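A minimal sketch of the sliding-window self-attention pattern described here — a generic banded causal mask, not Moonshine v2's actual encoder; window size and names are assumptions:

```python
import numpy as np

def sliding_window_attention(Q, K, V, w):
    # Each position attends only to itself and the previous w - 1 positions, so the
    # cost per emitted frame stays bounded no matter how long the stream runs.
    n, d = Q.shape
    scores = (Q @ K.T) / np.sqrt(d)
    idx = np.arange(n)
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < w)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
print(sliding_window_attention(Q, K, V, w=4).shape)  # (16, 8)
```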
This paper investigates the impact of differential privacy (DP) mechanisms, namely gradient clipping and noise injection, on firing rate statistics within federated spiking neural networks (SNNs). The study demonstrates that DP significantly perturbs firing rates, leading to rate shifts, attenuated aggregation, and unstable client selection in a speech recognition task under non-IID data. The authors further link these rate shifts to sparsity and memory usage, providing insights into the trade-offs between privacy and performance in rate-based federated neuromorphic learning.
Quantifies the sensitivity of firing rate-based federated spiking neural networks to differential privacy mechanisms, revealing specific impacts on rate statistics, aggregation, and client selection.
This paper introduces a continuous learning architecture for edge-based malware detection that leverages LoRA adapters to enable local adaptation and global knowledge sharing in resource-constrained environments. The approach fine-tunes lightweight transformer models (DistilBERT, DistilGPT-2, TinyT5) locally on edge devices and aggregates/redistributes only the LoRA modules, avoiding the exchange of raw data. Experiments on Edge-IIoTset and TON-IoT datasets demonstrate that this LoRA-based exchange improves accuracy by 20-25% when encountering unseen attacks, while maintaining stable performance and adding minimal overhead to model size.
Proposes a parameter-efficient continuous learning framework for edge-based malware detection that uses LoRA to facilitate knowledge sharing between edge devices without transmitting raw data.
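A hedged sketch of exchanging only LoRA modules between devices; the aggregation rule shown (plain averaging of the low-rank A/B factors, with hypothetical layer names) is an assumption, not necessarily the paper's protocol.

```python
import torch

def aggregate_lora_adapters(client_adapters):
    """Average per-layer LoRA factors collected from edge devices.

    client_adapters: list of dicts, layer name -> {"A": Tensor, "B": Tensor}.
    Only these small matrices travel over the network; raw traffic data never leaves a device.
    """
    merged = {}
    for name in client_adapters[0]:
        merged[name] = {
            "A": torch.stack([c[name]["A"] for c in client_adapters]).mean(dim=0),
            "B": torch.stack([c[name]["B"] for c in client_adapters]).mean(dim=0),
        }
    return merged

# Hypothetical usage: two clients sharing rank-8 adapters for one attention projection.
clients = [
    {"attn.q_proj": {"A": torch.randn(8, 768), "B": torch.randn(768, 8)}}
    for _ in range(2)
]
print(aggregate_lora_adapters(clients)["attn.q_proj"]["A"].shape)  # torch.Size([8, 768])
```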
This paper introduces an enhanced anonymity architecture based on the Loopix mix-network, tailored for the challenges of LEO satellite constellations and mixed-trust environments. The architecture incorporates a multi-path transport protocol using (n, k) erasure codes for reliability, a computationally efficient Private Information Retrieval (PIR) protocol for route discovery, and adaptive, centrality-based delay strategies to mitigate topological bias. Packet-level simulations validate the architecture, demonstrating near-zero message loss with the multi-path transport and quantifying the overhead of the PIR protocol, showing a practical anonymity-to-latency trade-off.
Introduces a novel anonymity architecture for LEO satellite constellations that integrates multi-path transport, PIR-based route discovery, and adaptive delay strategies to enhance reliability and privacy.
This paper presents a production-grade architecture for a distributed rate limiting system using Redis and Lua scripting, focusing on the trade-offs between accuracy and memory cost. It compares the Rolling Window algorithm against Token Bucket and Fixed Window algorithms, showing higher accuracy at a manageable memory overhead. The system employs a three-layer architecture for managing and updating rate-limiting rules, deployed on a Redis Cluster for availability and scalability.
Quantifies the accuracy and memory cost trade-off of the Rolling Window rate limiting algorithm compared to Token Bucket and Fixed Window algorithms within a production system.
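As a rough illustration of the Rolling Window idea, here is a common Redis sorted-set pattern written in Python with redis-py; the paper's production system implements the check atomically in Lua and layers rule management on top, which this sketch omits. It also makes the memory cost visible: one sorted-set member is kept per request inside the window.

```python
import time
import uuid

import redis  # assumes a reachable Redis instance

def allow_request(r, key, limit, window_seconds):
    """Rolling-window check: one sorted-set member per request, scored by timestamp."""
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # evict requests outside the window
    pipe.zcard(key)                                      # count what remains
    _, count = pipe.execute()
    if count >= limit:
        return False
    r.zadd(key, {uuid.uuid4().hex: now})                 # record this request
    r.expire(key, int(window_seconds) + 1)               # bound memory for idle keys
    return True

# Hypothetical usage: at most 100 requests per 60 s for one client key.
# allow_request(redis.Redis(), "rate:client42", limit=100, window_seconds=60)
```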
The paper introduces RI-Mamba, a rotation-invariant state-space model for text-to-shape retrieval that addresses the limitations of existing methods in handling objects with arbitrary orientations and diverse categories. RI-Mamba disentangles pose from geometry using global and local reference frames and Hilbert sorting to create rotation-invariant token sequences. The model incorporates orientational embeddings via feature-wise linear modulation and employs cross-modal contrastive learning with automated triplet generation for scalable training, achieving state-of-the-art results on the OmniObject3D benchmark.
Introduces a novel rotation-invariant state-space model, RI-Mamba, for robust text-to-shape retrieval by disentangling pose from geometry and incorporating orientational embeddings.
The paper introduces ULTRA, a transformer-based recommendation architecture for Urdu, a low-resource language, to improve personalized news retrieval. ULTRA employs a dual-embedding architecture with a query-length-aware routing mechanism that directs queries to either a title/headline-level or a full-content pipeline depending on query length. Experiments on a large Urdu news corpus demonstrate that ULTRA achieves over 90% precision, outperforming single-pipeline baselines and improving recommendation relevance.
Introduces a query-adaptive dual-embedding architecture for semantic content recommendation in low-resource languages, dynamically routing queries based on length to optimize retrieval relevance.
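A minimal sketch of length-aware routing in the spirit described; the token threshold and pipeline names are hypothetical.

```python
def route_query(query: str, short_threshold: int = 4) -> str:
    """Send short queries to the title/headline pipeline, longer ones to full content."""
    n_tokens = len(query.split())
    return "title_pipeline" if n_tokens <= short_threshold else "content_pipeline"

print(route_query("cricket score"))                                              # title_pipeline
print(route_query("detailed analysis of the new budget and its effect on prices"))  # content_pipeline
```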
The paper introduces Multi-Level Compression Cross Networks (MLCC) and its multi-channel extension (MC-MLCC) to efficiently model high-order feature interactions in recommender systems. MLCC uses hierarchical compression and dynamic composition to capture feature dependencies with favorable computational complexity, while MC-MLCC decomposes feature interactions into parallel subspaces for efficient horizontal scaling. Experiments on public and industrial datasets demonstrate that MLCC and MC-MLCC outperform DLRM-style baselines, achieving up to 0.52 AUC improvement and up to 26x reduction in parameters and FLOPs, and the approach has been adopted in Bilibili's advertising system.
Introduces a novel feature interaction architecture, MLCC, that uses hierarchical compression and dynamic composition to efficiently capture high-order feature interactions, along with its multi-channel extension, MC-MLCC, for improved scalability.
The paper introduces a novel parameter-efficient fine-tuning (PEFT) method called \methodname{} that adapts large pretrained models by learning per-neuron thresholds and gains in activation space, inspired by neuromodulation. This approach aims to change the mode of computation by selecting and rescaling existing computations rather than rewriting weights, offering improved interpretability. Experiments on MNIST and rotated MNIST demonstrate that \methodname{} can improve accuracy over a frozen baseline with significantly fewer trainable parameters than LoRA, while also enabling neuron-level attribution and conditional computation.
Introduces \methodname{}, a parameter-efficient fine-tuning method that learns per-neuron thresholds and gains in activation space to adapt pretrained models by changing the mode of computation.
This paper introduces Hierarchical Sparse Autoencoders (HSAEs) to explicitly model the hierarchical relationships between features extracted from LLMs, addressing the limitation of standard SAEs that treat features in isolation. HSAEs incorporate a structural constraint loss and random feature perturbation to encourage alignment between parent and child features in the learned hierarchy. Experiments across various LLMs and layers demonstrate that HSAEs recover semantically meaningful hierarchies while preserving reconstruction fidelity and interpretability.
Introduces Hierarchical Sparse Autoencoders (HSAEs) to learn and represent the hierarchical relationships between features extracted from LLMs.
This paper addresses the instability issues in Rectified Flow (RF) inversion, which arise from accumulated approximation errors during the inversion process. They introduce Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it towards a running average of past velocities within a theoretically motivated spherical Gaussian constraint. The authors further propose mimic-CFG, a velocity correction scheme for editing tasks that interpolates between the current velocity and its projection onto the historical average.
Introduces Proximal-Mean Inversion (PMI) and mimic-CFG, two novel, training-free methods to stabilize Rectified Flow inversion and improve image reconstruction and editing fidelity.
This paper extends crosscoder model diffing to cross-architecture comparisons, enabling the unsupervised discovery of behavioral differences between LLMs with different architectures. They introduce Dedicated Feature Crosscoders (DFCs), an architectural modification to improve the isolation of unique features in one model compared to another. Applying this technique, they identify features such as CCP alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B.
Introduces Dedicated Feature Crosscoders (DFCs), an architectural modification to enhance crosscoder model diffing for isolating features unique to individual models in cross-architecture comparisons.
This paper investigates the use of local vision-language models (VLMs) to improve fine-grained activity recognition in newborn resuscitation videos, comparing them to a TimeSformer baseline. The authors explored zero-shot VLM strategies and fine-tuned VLMs with LoRA on a simulated dataset of 13.26 hours of video. Fine-tuning a local VLM with LoRA achieved an F1 score of 0.91, outperforming the TimeSformer baseline (0.70), suggesting the potential of VLMs for this task.
Demonstrates that fine-tuning local vision-language models with LoRA can significantly improve activity recognition in newborn resuscitation videos compared to a TimeSformer baseline.
This paper introduces Microarchitecture Cliffs, a benchmark generation methodology to identify and attribute microarchitectural mismatches between architectural simulators and RTL implementations for model calibration. The Cliff methodology generates benchmarks that isolate individual microarchitectural features, enabling precise attribution of behavioral differences. Applying this methodology to calibrate XS-GEM5 against XS-RTL, the authors reduced performance error on Cliff benchmarks from 59.2% to 1.4% and improved performance prediction accuracy on SPEC2017 benchmarks.
Introduces a novel benchmark generation methodology, Microarchitecture Cliffs, for isolating and attributing microarchitectural discrepancies between simulators and RTL implementations, significantly improving simulator calibration accuracy.
This paper introduces the Task-Amortized Variational Autoencoder (TAVAE), a generative model of V1 activity, to investigate how task-specific priors are learned and deployed in the visual cortex. TAVAE extends the VAE framework to efficiently acquire new tasks by reusing previously learned representations, allowing for flexible adaptation of priors. By comparing TAVAE's posterior distributions with large-scale V1 recordings from mice performing a discrimination task, the study demonstrates that the visual system can rapidly learn and utilize task-specific contextual priors, reflected in bimodal response profiles when task statistics are violated.
Introduces the Task-Amortized Variational Autoencoder (TAVAE), a novel VAE architecture that enables efficient learning of task-specific priors by amortizing learning across tasks.
This paper addresses the challenge of unreliable read/write operations in Antiferromagnetic Tunnel Junction (AFMTJ) memories due to their ultrafast dynamics and low tunnel magnetoresistance (TMR). They propose a device-circuit co-design approach, specifically an asymmetric pulse driver (PD) for write operations and a self-timed sense amplifier (STSA) with dynamic trip-point tuning for read operations. Simulation results demonstrate improved read/write yield under process, voltage, and temperature (PVT) variations and 3D integration parasitics compared to standard MRAM front-ends, while preserving AFMTJ latency and energy benefits.
Introduces a device-circuit co-designed read/write interface, comprising an asymmetric pulse driver and a self-timed sense amplifier with dynamic trip-point tuning, to enhance the robustness of AFMTJ memories under realistic operating conditions.
The paper introduces WaveFormer, a transformer architecture tailored for biomedical signal classification, addressing limitations of standard transformers in capturing multi-scale frequency patterns in long sequences. WaveFormer incorporates wavelet decomposition in both the embedding construction via multi-channel DWT and positional encoding via Dynamic Wavelet Positional Encoding (DyWPE). Experiments across eight datasets for human activity recognition and brain signal analysis demonstrate WaveFormer's competitive performance by effectively integrating frequency-domain information.
Introduces a novel transformer architecture, WaveFormer, that integrates wavelet decomposition into both the embedding and positional encoding stages to improve biomedical signal classification.
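Not WaveFormer's embedding module itself, but a sketch of the multi-level DWT it builds on, using PyWavelets; the wavelet family and decomposition level are assumptions.

```python
import numpy as np
import pywt

def dwt_channel_features(signal, wavelet="db4", level=3):
    """Decompose one signal channel into multi-scale wavelet coefficients.

    Returns [cA_level, cD_level, ..., cD_1]; an embedding layer could project each
    band and combine it with (or substitute it for) raw-sample embeddings.
    """
    return pywt.wavedec(signal, wavelet=wavelet, level=level)

x = np.sin(np.linspace(0, 8 * np.pi, 512)) + 0.1 * np.random.default_rng(0).normal(size=512)
print([c.shape for c in dwt_channel_features(x)])
```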
This paper introduces a reciprocal-space generative pipeline for crystalline materials, representing crystals via a truncated Fourier transform of the species-resolved unit-cell density. This Fourier representation inherently handles periodic boundary conditions and crystallographic symmetries, while also supporting variable atomic multiplicities. The pipeline is instantiated using a transformer variational autoencoder and a latent diffusion model, demonstrating effective reconstruction and unconditional generation of crystal structures.
Introduces a novel reciprocal-space generative pipeline using Fourier transforms to represent and generate crystalline materials, inherently addressing periodicity, symmetry, and variable atomic multiplicities.
This paper investigates in-context learning in LLMs by framing it as Gaussian Process (GP) regression, using controlled experiments with function samples drawn from known GP priors. They compare LLM prediction error against empirical GP-regression (lower bound) and 1-NN (upper bound) baselines, finding that LLM learning curves approach the GP lower bound with increasing demonstrations. The authors also analyze LLM inductive biases via likelihood analysis, revealing a preference for less smooth GP kernels, and demonstrate that post-training can shift these biases to improve sample efficiency on smoother kernels.
Quantifies the extent to which LLMs behave like GP learners and provides methods for steering their inductive biases for continuous function learning tasks.
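A minimal sketch of the bracketing baselines described, on synthetic data drawn from a known RBF prior; the kernel choice and split are illustrative, and the LLM itself is of course not reproduced here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 40)[:, None]
K = RBF(length_scale=1.0)(X) + 1e-8 * np.eye(len(X))
y = rng.multivariate_normal(np.zeros(len(X)), K)         # one function sampled from the GP prior

idx = rng.permutation(len(X))
tr, te = idx[:30], idx[30:]
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X[tr], y[tr])
nn = KNeighborsRegressor(n_neighbors=1).fit(X[tr], y[tr])

gp_mse = np.mean((gp.predict(X[te]) - y[te]) ** 2)       # "lower bound" reference
nn_mse = np.mean((nn.predict(X[te]) - y[te]) ** 2)       # "upper bound" reference
print(f"GP MSE {gp_mse:.4f}  vs  1-NN MSE {nn_mse:.4f}")
```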
The paper introduces LRBTC, a modular LLM and VLM-driven architecture for quality control in pharmaceutical content, addressing the need for scalable and verifiable validation in regulated domains. LRBTC employs a Student-Teacher dual model architecture combined with a human-in-the-loop workflow and waterfall rule filtering. The approach achieves significant improvements on AIReg-Bench (83.0% F1, 97.5% recall) and CSpelling (26.7% accuracy improvement), demonstrating its effectiveness in reducing missed violations and improving content quality.
Introduces LRBTC, a novel LLM and VLM-driven quality control architecture that leverages a Student-Teacher dual model and HITL workflow for pharmaceutical content optimization.
This paper introduces a Collaborative Intrusion Detection System (CIDS) framework that dynamically optimizes the allocation of intrusion detectors across nodes in a layered network based on available resources and data types. The framework adapts to changing operational scenarios by reconfiguring detectors to maintain an optimal configuration without requiring heavy computation, making it suitable for edge device deployment. The evaluation, conducted using distributed datasets including a novel dataset based on a cyberattack targeting a ground drone, demonstrates the framework's ability to achieve adaptive and efficient intrusion detection.
Introduces a resource-aware CIDS framework that dynamically optimizes detector allocation in layered networks for efficient intrusion detection in resource-constrained environments.
The authors extend the Puzzle post-training neural architecture search framework to optimize the gpt-oss-120B model, creating gpt-oss-puzzle-88B, by combining heterogeneous MoE expert pruning, selective attention replacement, FP8 quantization, and post-training reinforcement learning. This optimized model achieves significant per-token throughput speedups (up to 2.82X on a single H100 GPU) while maintaining or slightly exceeding the parent model's accuracy across various benchmarks. The paper advocates for request-level efficiency metrics to account for varying token counts and demonstrates that gpt-oss-puzzle-88B improves request-level efficiency by up to 1.29X.
Introduces a pipeline combining heterogeneous MoE expert pruning, selective attention replacement, FP8 quantization, and post-training reinforcement learning within the Puzzle framework to optimize large language models for inference.
The paper introduces DeepGen 1.0, a 5B parameter unified multimodal model for image generation and editing, designed to be lightweight and efficient compared to larger models. To enhance semantic understanding in the compact model, they propose Stacked Channel Bridging (SCB) to extract and fuse hierarchical features from VLMs with learnable 'think tokens'. They also employ a three-stage data-centric training strategy, including alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO, achieving state-of-the-art performance on benchmarks like WISE and UniREditBench while using only 50M training samples.
Introduces Stacked Channel Bridging (SCB), a novel deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to improve the generative backbone's semantic understanding and fine-grained control.
The paper introduces SParse Expert Synchronization (SPES), a decentralized training framework for Mixture-of-Experts (MoE) LLMs that reduces memory footprint by training only a subset of experts per node and periodically synchronizing them. This approach addresses the GPU memory limitations of existing decentralized training methods, which still require training the entire model on each node. The authors demonstrate that SPES enables training of 2B, 7B, and 9B parameter MoE models on resource-constrained hardware, achieving performance comparable to centrally trained LLMs with similar computational budgets.
Introduces SParse Expert Synchronization (SPES), a memory-efficient decentralized training framework that enables pretraining large MoE language models on distributed GPUs with limited memory.
The paper introduces LUVE, a latent-cascaded framework for ultra-high-resolution (UHR) video generation that tackles challenges in motion modeling, semantic planning, and detail synthesis. LUVE uses a three-stage architecture: low-resolution motion generation, latent upsampling, and high-resolution content refinement with dual frequency experts. Experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation compared to existing methods.
Introduces a novel latent-cascaded architecture with dual-frequency experts for generating ultra-high-resolution videos, improving both photorealism and content fidelity.
The paper introduces Variance Minimisation Policy Optimisation (VMPO) for diffusion alignment, framing the process as Sequential Monte Carlo and minimizing the variance of log importance weights instead of using a KL divergence objective. This approach is motivated by the SMC interpretation of diffusion alignment where the denoising model acts as a proposal and reward guidance induces importance weights. The authors demonstrate that minimizing the variance objective leads to the reward-tilted target distribution and recovers existing KL-based alignment methods under specific conditions, while also suggesting novel alignment strategies.
Introduces Variance Minimisation Policy Optimisation (VMPO) as a novel objective for diffusion alignment, minimizing the variance of log importance weights within an SMC framework.
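Schematically, and following only the summary above (notation ours), the objective minimizes the variance of the log importance weights between the denoising proposal $q_\theta$ and the reward-tilted target:

$$ \mathcal{L}(\theta) \;=\; \mathrm{Var}_{x \sim q_\theta}\!\big[\log w_\theta(x)\big], \qquad w_\theta(x) \;\propto\; \frac{p_{\text{base}}(x)\,\exp\!\big(\beta\, r(x)\big)}{q_\theta(x)}, $$

where $r$ is the reward and $\beta$ a tilt strength; the variance vanishes exactly when $q_\theta$ matches the tilted target, since the log weights are then constant.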
The paper introduces Categorical Flow Maps, a flow-matching method designed for fast, few-step generation of categorical data using self-distillation. By defining a continuous flow map towards the simplex, the method transports probability mass to a predicted endpoint, enabling the use of distillation techniques and a novel endpoint consistency objective. Experiments demonstrate state-of-the-art few-step generation performance across images, molecular graphs, and text, even achieving strong results in single-step generation.
Introduces a continuous flow-matching formulation for categorical data generation that enables self-distillation and endpoint consistency training, leading to accelerated sampling.
The paper introduces a novel approach for irregular time series modeling by replacing Neural ODEs with a linear damped harmonic oscillator analogy that admits a closed-form solution, thereby avoiding computationally expensive numerical solvers. Keys and values are modeled as damped, driven oscillators, and the query is expanded in a sinusoidal basis, with attention modeled as a resonance phenomenon. The method is proven to maintain the universal approximation property of continuous-time attention and achieves state-of-the-art performance on irregular time series benchmarks with significant speedups.
Introduces a computationally efficient irregular time series model based on damped harmonic oscillators with closed-form solutions, demonstrating state-of-the-art performance and theoretical guarantees.
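For reference, the closed form that a damped-oscillator parameterization can exploit instead of a numerical ODE solver (this is the standard underdamped linear oscillator result, given as background rather than the paper's exact model):

$$ \ddot{x} + 2\gamma\,\dot{x} + \omega_0^2\,x = f(t), \qquad x(t) = e^{-\gamma t}\big(A\cos\omega_d t + B\sin\omega_d t\big) + x_p(t), \quad \omega_d = \sqrt{\omega_0^2 - \gamma^2}, $$

with $A, B$ fixed by the initial conditions and $x_p$ a particular solution for the drive $f$. The state at any query time can then be evaluated directly, with no step-by-step integration between irregular observation times.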
This paper introduces an ML-driven physical synthesis framework for RF circuits that addresses limitations of prior ML approaches by incorporating EM-accurate component models and routing capabilities. They trained a neural network on a large dataset of inductor geometries to predict Q-factor with high accuracy, enabling gradient-based layout optimization. The framework integrates a P-Cell optimizer and a placement/routing engine with EM spacing rules, resulting in DRC-aware GDSII layouts.
Introduces an end-to-end ML-driven framework for RF physical synthesis that generates manufacturable GDSII layouts by integrating EM-aware neural inductor modeling with intelligent placement and routing.
The paper introduces EqDeepRx, a deep-learning-aided MIMO receiver that combines linear processing with learned components for improved scaling and generalization. EqDeepRx employs a shared-weight DetectorNN operating on individual spatial streams to achieve near-linear complexity scaling with multiplexing order, and uses a DenoiseNN to enhance channel estimation. End-to-end simulations demonstrate that EqDeepRx achieves improved error rate and spectral efficiency compared to conventional receivers while maintaining low complexity and supporting various MIMO configurations without retraining.
Introduces a novel deep-learning-aided MIMO receiver architecture, EqDeepRx, that achieves near-linear complexity scaling with multiplexing order through a shared-weight DetectorNN and enhances generalization via a DenoiseNN.
The paper introduces U-Former ODE (UFO), a novel architecture for probabilistic forecasting of irregular time series data that combines U-Nets, Transformers, and Neural CDEs. UFO enables parallelizable computation and global receptive fields, addressing the scalability limitations of existing Neural CDE approaches. Experiments on five benchmarks demonstrate that UFO outperforms ten state-of-the-art baselines in predictive accuracy and achieves up to 15x faster inference, particularly on long and multivariate sequences.
Introduces a fully causal, parallelizable architecture, U-Former ODE (UFO), that integrates U-Nets, Transformers, and Neural CDEs for efficient and accurate probabilistic forecasting of irregular time series.
The paper introduces Trans-Chunk BiMamba (TC-BiMamba), a novel architecture for unified streaming and non-streaming automatic speech recognition (ASR) that addresses the limitations of existing BiMamba-based streaming methods which are restricted to fixed chunk sizes. TC-BiMamba employs a trans-chunk mechanism to train bidirectional sequences offline with dynamic chunk sizes, enabling a single model to handle both offline and streaming decoding with varying latency requirements. Experiments demonstrate that TC-BiMamba achieves a 1.3x training speedup, reduces memory consumption by 50%, and improves ASR performance compared to chunk-wise processing, while also outperforming U2++ and matching LC-BiMamba with a smaller model size.
Introduces the Trans-Chunk BiMamba (TC-BiMamba) architecture, enabling efficient dynamic chunk size training for unified streaming and non-streaming ASR.
This paper introduces a technical curriculum designed to enhance AI literacy within the language and translation (L&T) industry, covering vector embeddings, neural networks, tokenization, and transformer networks. The curriculum aims to cultivate computational thinking, algorithmic awareness, and agency among L&T professionals to improve their digital resilience. Evaluation in an MA course at TH Koeln suggests the curriculum's effectiveness, while also highlighting the need for additional lecturer support to maximize learning outcomes.
Proposes and evaluates a technical curriculum focused on language-oriented AI to improve AI literacy and digital resilience in the language and translation industry.
The paper analyzes Langevin dynamics with noise projected onto directions orthogonal to an isometric group action, a model relevant to understanding symmetry effects in stochastic gradient descent for over-parameterized models. The key finding is that when initial and target densities are group-invariant, this projected Langevin dynamics is equivalent in law to standard Langevin dynamics with isotropic diffusion but with an additional drift term related to the negative log volume of the group orbit. This equivalence is proven through a coupling argument involving a third process on the group, identifying the drift as the mean curvature of the orbits, thus revealing a novel form of implicit regularization.
Establishes an equivalence between Langevin dynamics with projected noise and standard Langevin dynamics with an additional drift term proportional to the negative log volume of the group orbit, revealing a novel form of implicit regularization.
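In symbols, schematically following the summary above (our notation and our reading of the sign conventions): the projected dynamics

$$ dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,P_{\perp}(X_t)\,dW_t, $$

with $P_{\perp}$ the projection onto directions orthogonal to the group orbit, is claimed to be equal in law, for $G$-invariant initial and target densities, to isotropic Langevin dynamics with an extra orbit-volume drift,

$$ dX_t = -\nabla\Big(U(X_t) + \log \operatorname{vol}\big(G\cdot X_t\big)\Big)\,dt + \sqrt{2}\,dW_t, $$

where the additional term $-\nabla \log \operatorname{vol}(G\cdot X_t)$ is identified with the mean curvature of the orbits.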
This paper introduces an energy-aware spike budgeting framework for continual learning in spiking neural networks (SNNs) to address catastrophic forgetting while optimizing for energy efficiency. The framework combines experience replay, learnable LIF neuron parameters, and an adaptive spike scheduler to enforce dataset-specific energy constraints during training. Results show that spike budgeting acts as a sparsity-inducing regularizer on frame-based datasets, improving accuracy and reducing spike rates, while controlled budget relaxation enables accuracy gains on event-based datasets.
Introduces an energy-aware spike budgeting framework that adaptively controls spike rates during continual learning in SNNs to improve both accuracy and energy efficiency across frame-based and event-based neuromorphic vision datasets.
The paper introduces the Prototype Transformer (ProtoT), an autoregressive language model architecture that uses prototypes (parameter vectors) instead of self-attention to improve interpretability. ProtoT establishes two-way communication between the input sequence and the prototypes, causing the prototypes to capture nameable concepts during training and creating interpretable communication channels. Experiments demonstrate that ProtoT scales linearly with sequence length, performs well on text generation and downstream tasks (GLUE), and exhibits robustness to input perturbations while providing interpretable pathways for understanding robustness and sensitivity.
Introduces the Prototype Transformer, a novel autoregressive language model architecture designed for interpretability by using prototypes to capture nameable concepts and create interpretable communication channels.
This paper explores the use of Mamba-2 hybrid operators within Tiny Recursive Models (TRM) for abstract reasoning, motivated by Mamba-2's inherent iterative refinement properties. By replacing Transformer blocks in TRM with Mamba-2 hybrids while maintaining parameter parity, the authors demonstrate improved performance on the ARC-AGI-1 benchmark. Specifically, the Mamba-2 hybrid TRM achieves a +2.0% improvement in pass@2 and a +4.75% improvement in pass@100, suggesting enhanced candidate coverage.
Demonstrates that Mamba-2 hybrid operators can effectively replace Transformer blocks within Tiny Recursive Models, leading to improved performance on abstract reasoning tasks.
This paper investigates the phenomenon of "token overflow" in soft compression architectures for retrieval-augmented generation (RAG), where compressed token representations lose task-relevant information. They propose a methodology to characterize and detect token overflow, evaluating it within the xRAG framework. Their key finding is that lightweight probing classifiers, leveraging both query and context xRAG representations, achieve an average AUC-ROC of 0.72 in detecting overflow across HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating the importance of query-aware detection.
Introduces a methodology using lightweight probing classifiers to detect token overflow in compressed token representations for retrieval-augmented generation by leveraging query and context information.
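A toy sketch of such a query-aware probe, on entirely synthetic data and with an assumed feature construction (concatenating query and compressed-context embeddings), just to show the shape of the approach:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 64
query_emb = rng.normal(size=(n, d))      # stand-in for query representations
context_emb = rng.normal(size=(n, d))    # stand-in for compressed (xRAG-style) context tokens

# Synthetic "overflow" labels that depend on both views, so a joint probe can recover them.
w_q, w_c = rng.normal(size=d), rng.normal(size=d)
overflow = ((query_emb @ w_q + context_emb @ w_c + 0.5 * rng.normal(size=n)) > 0).astype(int)

X = np.concatenate([query_emb, context_emb], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, overflow, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe AUC-ROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```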
The paper introduces Hi-SAM, a novel multi-modal recommendation framework designed to address limitations in semantic ID-based approaches, specifically suboptimal tokenization and architecture-data mismatch. Hi-SAM employs a Disentangled Semantic Tokenizer (DST) that uses geometry-aware alignment and coarse-to-fine quantization to separate shared and modality-specific semantics, and a Hierarchical Memory-Anchor Transformer (HMAT) that incorporates hierarchical positional encoding and anchor tokens to better model user-item interactions. Experiments on real-world datasets and a large-scale social platform demonstrate that Hi-SAM outperforms state-of-the-art baselines, particularly in cold-start scenarios, achieving a 6.55% improvement in a core online metric.
Introduces a hierarchical structure-aware multi-modal framework, Hi-SAM, that disentangles cross-modal semantics and modality-specific details during tokenization and incorporates hierarchical positional encoding within a transformer architecture for improved recommendation performance.
The paper introduces PrefillShare, an algorithm for sharing the prefill stage across multiple language models in disaggregated serving environments to reduce redundant computation and KV cache storage. PrefillShare factorizes models into prefill and decode modules, freezes the prefill module, and fine-tunes only the decode module, enabling multiple models to share a prefill module and its KV cache. Experiments demonstrate that PrefillShare achieves comparable accuracy to full fine-tuning while significantly improving latency (4.5x lower p95) and throughput (3.9x higher) in multi-model agent workloads.
Introduces PrefillShare, a novel algorithm that enables efficient sharing of the prefill stage and KV cache across multiple language models in a disaggregated serving system.
The paper introduces Empirical Gaussian Processes (GPs), a framework for constructing data-driven GP priors by empirically estimating the mean and covariance functions from historical observations. This approach overcomes limitations of handcrafted kernels, enabling the prior to reflect complex covariance structures present in the data. The authors derive an Expectation-Maximization algorithm with closed-form updates for learning the GP prior from independent datasets with heterogeneous observation locations, and demonstrate competitive performance on learning curve extrapolation and time series forecasting.
Introduces Empirical GPs, a novel method for learning GP priors directly from data by estimating the mean and covariance functions, thereby improving adaptability and reducing reliance on expert-defined kernels.
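A simplified sketch of the idea for the shared-grid case; the paper's EM algorithm additionally handles heterogeneous observation locations, which this omits.

```python
import numpy as np

def empirical_gp_prior(curves):
    """Estimate a GP prior's mean and covariance functions from historical series.

    curves: array of shape (n_series, n_locations), all observed on one shared grid.
    """
    mean = curves.mean(axis=0)
    centered = curves - mean
    cov = centered.T @ centered / (len(curves) - 1)
    return mean, cov

def posterior_mean(mean, cov, obs_idx, obs_vals, query_idx, noise=1e-6):
    # Standard GP conditioning, but with the empirically estimated prior.
    K_oo = cov[np.ix_(obs_idx, obs_idx)] + noise * np.eye(len(obs_idx))
    K_qo = cov[np.ix_(query_idx, obs_idx)]
    return mean[query_idx] + K_qo @ np.linalg.solve(K_oo, obs_vals - mean[obs_idx])

rng = np.random.default_rng(0)
history = rng.normal(size=(200, 50)).cumsum(axis=1)   # synthetic historical curves
mu, Sigma = empirical_gp_prior(history)
new_curve = rng.normal(size=50).cumsum()
print(posterior_mean(mu, Sigma, obs_idx=list(range(5)), obs_vals=new_curve[:5], query_idx=[49]))
```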
The paper introduces AssetFormer, an autoregressive Transformer model for generating modular 3D assets from text descriptions, addressing the need for high-quality, diverse assets in the digital industry. AssetFormer models the generation of 3D assets as a sequence of primitives with constrained design parameters, adapting module sequencing and decoding techniques from language models. Experiments using real-world modular assets demonstrate the model's effectiveness in streamlining asset creation for professional development and UGC scenarios.
Introduces an autoregressive Transformer-based architecture, AssetFormer, for generating modular 3D assets from textual descriptions by modeling the asset as a sequence of primitives.
This paper addresses the sample inefficiency of off-policy reinforcement learning by constraining the initial representations of input data to alleviate distribution shift. They introduce a novel framework, CIR, incorporating a Tanh activation function in the initial layer, normalization techniques, skip connections, and convex Q-learning. Theoretical analysis demonstrates the convergence of temporal difference learning with the Tanh function under linear function approximation, and empirical results show CIR achieves strong performance on continuous control tasks.
Introduces a Constrained Initial Representations (CIR) framework that improves off-policy RL sample efficiency by constraining initial representations using a Tanh activation, normalization, skip connections, and convex Q-learning.
This paper introduces U-DAVI, an uncertainty-aware amortized variational inference framework for image reconstruction that leverages diffusion priors. By injecting spatially adaptive perturbations to measurements during training, guided by uncertainty estimates, U-DAVI focuses learning on uncertain regions, improving reconstruction quality. Experiments on deblurring and super-resolution tasks demonstrate that U-DAVI achieves competitive or superior performance compared to existing diffusion-based methods, while maintaining computational efficiency.
Introduces an uncertainty-aware training strategy for amortized variational inference with diffusion priors, enabling improved image reconstruction by focusing learning on uncertain regions.

