100 papers published across 4 labs.
Forget imbalanced LoRA usage: ReMix leverages reinforcement learning to route effectively among LoRAs, boosting performance in parameter-efficient fine-tuning.
G-STAR tackles long-form, multi-speaker ASR by giving Speech-LLMs time-aware speaker tracking, enabling robust identity linking across chunks.
Exploit the surprisingly stable, yet heterogeneous, sparsity patterns across attention heads to slash LLM attention latency by 2.88x without sacrificing quality.
By modeling contextual relationships between DNS queries, DNS-GT significantly improves domain name embedding quality, leading to better performance in botnet detection and domain classification.
Achieve real-time photorealistic image enhancement without sacrificing visual quality or semantic consistency, thanks to a novel hybrid training strategy for GANs.
By combining differentiable indexing with isotropic geometric optimization, DGI achieves state-of-the-art generative retrieval, especially for long-tail items that are often missed by other methods.
Hyper-redundant robots get a 75% accuracy boost thanks to a neural network that adaptively blends learned behavior with kinematic priors.
Diffusion Transformers can be accelerated by up to 7x with nearly lossless performance using a training-free method that selectively computes on sparse anchor tokens, outperforming existing temporal acceleration techniques.
Explicitly aligning audio and video streams in a multimodal Transformer boosts emotion recognition, showing that ignoring frame-rate differences hurts performance.
Ditch slow, multi-step sampling for target speaker extraction: AlphaFlowTSE achieves faster, one-step generation with improved speaker similarity and real-world generalization.
Ditch the heuristic latent spaces: Geometric Autoencoders offer a principled way to inject VFM priors into diffusion models, yielding state-of-the-art image generation with better compression and semantic depth.
Quantum-Centric Supercomputers promise to break down the barriers between quantum and classical computing, enabling seamless hybrid algorithms and accelerating discovery across applications.
Get faster long-context LLM inference without sacrificing accuracy: LookaheadKV predicts KV cache importance, outperforming costly draft generation methods by 14.5x.
Quantum computers and molecular clocks just got a boost: researchers have achieved coherent control of forbidden vibrational transitions in single nitrogen molecular ions.
Representing graphs as strings with a guaranteed-valid instruction set unlocks language model-based approaches for graph similarity, generation, and conditioned modeling.
Quantifying the overhead of post-quantum cryptography reveals exactly where the performance bottlenecks lie in real-world TLS 1.3 transactions.
A GCN model trained on static analysis reports can achieve near-perfect accuracy in distinguishing true vulnerabilities from false positives, even uncovering genuine security weaknesses missed by the original SAST tools.
This new OCR model beats Gemini-3.1-Pro and Qwen3-VL-235B on key information extraction, thanks to its clever "Layout-as-Thought" process that recovers layout grounding in end-to-end OCR.
Ditch discrete visual tokens: UniCom achieves SOTA multimodal generation by compressing continuous semantic representations, unlocking better controllability and consistency in image editing.
A compact 0.9B multimodal model, GLM-OCR, achieves state-of-the-art document understanding by predicting multiple tokens at once, boosting decoding throughput without blowing up memory.
Differentiable physics enables high-resolution 3D tomography of subsurface defects by enforcing thermodynamic laws as hard constraints, outperforming traditional methods and PINNs.
A single LLM can now handle both non-streaming and streaming ASR, opening the door to more flexible and efficient speech recognition systems.
Jointly training layered Gaussian splats boosts reconstruction quality by up to 2.6 dB, proving that coordinating optimization across layers is key for progressive 2D Gaussian splatting.
A pipelined FPGA architecture slashes the power consumption of JPEG XS's Intra Pattern Copy displacement vector search, enabling practical hardware deployment for low-latency image compression.
A single system now rivals or beats specialized models across ASR, voice activity detection, language ID, and punctuation, setting a new bar for industrial-grade speech processing.
Ditch the interleaved item-action token mess: new architectures slash sequence complexity by 50% in generative recommenders, boosting performance and cutting training time.
Straighter flows, better generations: COT-FM carves up complex generative tasks into simpler, cluster-specific flows, leading to faster and more reliable sampling.
Backdoor triggers in ViTs leave a surprisingly clear signature: a linear direction in activation space that can be directly manipulated to activate or deactivate the backdoor.
A 4B-parameter model, InternVL-U, outperforms 14B-parameter models in multimodal generation and editing, proving that size isn't everything.
Generative drifting's empirical success is no longer a mystery: it's secretly score matching, but with frequency-dependent convergence bottlenecks that explain the preference for Laplacian kernels.
Make your transformers more robust to noise and improve training dynamics with a surprisingly simple, lightweight "pseudo-projector" module inspired by multigrid methods.
Row-normalized optimizers can match Muon's performance on large language models while being faster in large-token and low-loss regimes, offering a practical alternative for pre-training.
Unlock calibrated uncertainty in Mixture-of-Experts Transformers with VMoER, a Bayesian routing method that slashes calibration error by 94% while barely impacting FLOPs.
DendroNNs offer a 4x energy efficiency boost over existing neuromorphic hardware by mimicking dendritic computation and training via a gradient-free rewiring mechanism.
By injecting geological priors into the attention mechanism, GIAT achieves state-of-the-art lithology identification while also improving the interpretability of the model's predictions.
On-device LLM inference can be sped up by an order of magnitude with a flexible TrustZone-based system that selectively protects memory and the NPU.
State-of-the-art skeleton-based action recognition is now possible through a game-theoretic contrastive learning framework that maximizes action-relevant information while minimizing encoding redundancy.
ZipPIR delivers SimplePIR-level throughput without the massive client-side storage, finally making high-performance private information retrieval practical for resource-constrained devices.
On-device LLM inference with PIM is now more practical: PIM-SHERPA resolves memory inconsistencies, slashing memory capacity needs by ~50% without sacrificing performance.
Ditch the latency tax of traditional scheduling: this new approach delivers data "just-in-time" for safety-critical systems, boosting performance without sacrificing reliability.
By strategically increasing hash collisions, Nemo slashes write amplification in flash caches for tiny objects, a persistent bottleneck even with advanced SSDs.
A virtualized XRootD frontend can sustain over 50 Gb/s throughput in real-world large-scale WAN transfers, challenging assumptions about virtualization overhead in high-performance data systems.
BinaryAttention proves you can more than halve the runtime of attention in vision and diffusion transformers without sacrificing accuracy, simply by using the sign of queries and keys.
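The summary only names the trick, so here is a minimal NumPy sketch of what scoring attention with the sign of queries and keys could look like; the function name and every detail below are illustrative assumptions, not the BinaryAttention implementation.

```python
import numpy as np

def sign_attention(Q, K, V):
    """Hypothetical sketch: score each query-key pair by sign agreement
    instead of a full-precision dot product, then apply the usual
    softmax-weighted sum over the values."""
    d = Q.shape[-1]
    scores = np.sign(Q) @ np.sign(K).T / np.sqrt(d)  # integer agreement counts, rescaled
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 8 tokens with 16-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
out = sign_attention(Q, K, V)  # shape (8, 16)
```

Because sign(Q) and sign(K) are just ±1, the score matrix could in principle be formed with bitwise operations and popcounts in an optimized kernel, which is where runtime savings of this kind would come from.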
Forget manual hyperparameter tuning: OptEMA achieves near-optimal deterministic convergence in zero-noise stochastic optimization, adapting automatically.
A hierarchical graph attention network beats traditional machine learning models by 21% in predicting spectrum demand, offering a more reliable approach to spectrum management.
A complete, GPU-accelerated bimanual mobile manipulation platform can be built for under $1300, opening up robotics research and education to a wider audience.
Regularizing Lipschitz constants in MLPs within neural oscillators provably and practically enhances generalization, offering a path to more robust learning of complex dynamical systems.
Spatial audio cues and directional priors can be jointly learned end-to-end to significantly boost keyword spotting accuracy in noisy environments, outperforming traditional cascaded approaches.
Forget blurry sketch-to-image outputs: this method uses component-aware self-attention and coordinate-preserving fusion to generate photorealistic images with unprecedented fidelity and spatial accuracy.
By computing the *difference* between attention maps, DCAU-Net achieves state-of-the-art medical image segmentation while dramatically reducing computational cost compared to standard self-attention.
Ignoring CSI phase information in robotic activity recognition is a mistake: fusing it with amplitude data in a novel gated BiLSTM architecture significantly boosts accuracy and robustness.
Nezha shatters I/O bottlenecks in distributed key-value stores by decoupling key-value persistence within Raft, yielding up to 4.6x throughput gains.
Physics-informed neural operators can drastically improve the accuracy and stability of phase-field modeling, outperforming standard neural operators in complex materials simulations.
Forget interference as just noise: correlated features in neural networks can constructively superpose to form semantic clusters, especially with weight decay.
By recombining subgraphs from sparse models without retraining, "model stitching" creates a diverse set of model variants that significantly improves the efficiency of multi-DNN inference on edge SoCs.
Ditch finicky gradient descent: this paper recasts Transformer training as an optimal control problem, guaranteeing global optimality and robustness.
Forget parameter counts: the true memorization capacity of deep ReLU networks is fundamentally bounded by the product of squared width and squared depth, $W^2L^2$, which scales linearly with the amount of data that can be memorized.
ConvNets strike back: a ConvNeXt-based diffusion model matches Transformer performance at half the FLOPs and 7x faster training, all on just 4 GPUs.
TMFGs can now scale to millions of data points thanks to a-TMFG, which approximates the correlation matrix on-the-fly using kNN graphs and clever memory management.
A robot can now achieve 90% success in peg-in-hole tasks, even with only 0.1mm clearance, by intelligently fusing vision and tactile feedback when visual occlusion occurs.
Double the emotion conversion accuracy in voice conversion models with a simple prefix that jointly controls sequence modulation and acoustic realization.
Muon's "one-size-fits-all" spectral update is holding back your models: Mousse adapts to curvature and cuts training time by 12%.
Achieve RAG efficiency without sacrificing accuracy: LooComp prunes context by identifying and retaining only the most critical sentences for answering a query.
Unlock full-duplex speech-to-speech dialogue without VAD limitations using chunk-wise micro-turns and special control tokens to steer LLM behavior in a cascaded pipeline.
RiO-DETR makes real-time oriented object detection with transformers a reality by cleverly decoupling angle estimation and injecting angular diversity into dense supervision.
DRIFT achieves state-of-the-art object detection performance on 4D radar point clouds by fusing local and global contexts with a novel dual-representation transformer architecture.
Pretrained ALiBi transformers suffer from a widespread attention collapse that can be surgically repaired to yield a 25% perplexity improvement, suggesting that standard pretraining leaves performance on the table.
Tensor-based PEFT methods like LoRETTA can dramatically reduce catastrophic forgetting in sequential learning by capturing richer structural information within compact parameter budgets.
By learning visual representations from scene-level semantics down to pixel-level details, C2FMAE overcomes the limitations of both contrastive learning and masked image modeling.
By explicitly modeling mid-to-high frequency patterns often ignored by existing methods, FreqCycle unlocks state-of-the-art time series forecasting accuracy while maintaining faster inference.
Prompt engineering is dead; long live context engineering—the key to scaling multi-agent AI systems lies in carefully designing the agent's informational environment, not just individual prompts.
Quantifying uncertainty in physics-informed neural networks for medical imaging boosts accuracy and reliability, leading to better stroke assessment.
Stop CIL models from catastrophically forgetting by explicitly minimizing causal incompleteness within tasks and maximizing separability between tasks.
FrameDiT achieves state-of-the-art video generation by ditching token-level attention for a novel matrix-based attention that operates directly on entire frames.
Time series anomaly detection gets a boost from temporal-conditioned normalizing flows that capture complex temporal dynamics and uncertainty.
Gordon's comparison theorem bridges the gap between complex ML training dynamics and tractable surrogate systems, offering a path to more accurate non-asymptotic analysis.
$P^2$GNN's plug-and-play prototype approach boosts GNN performance by injecting global context and denoising local neighborhoods, achieving state-of-the-art results across diverse datasets.
Transformers get a surprising boost in language modeling performance by simply ignoring "themselves" during attention.
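The mechanism is only gestured at above, so here is a hedged NumPy sketch of one plausible reading, masking each token's own position out of the attention scores; the paper's exact formulation may differ.

```python
import numpy as np

def attention_without_self(Q, K, V):
    """Hypothetical illustration: standard scaled dot-product attention,
    except the diagonal of the score matrix is masked so that no token
    attends to its own key. Assumes at least two tokens."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    np.fill_diagonal(scores, -np.inf)             # each token ignores "itself"
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```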
Bridging the gap between deep learning and neuroscience, this work presents a biologically plausible alternative to backpropagation through time, potentially unlocking new avenues for brain-inspired AI.
Forget parameter conflicts: representational incompatibility is the real culprit behind LLM merging failures, setting fundamental limits on which tasks can be successfully combined.
YOLO architecture search can now be sped up dramatically: a new surrogate benchmark lets you evaluate designs without full training, and it's good enough to find architectures that beat YOLOv12.
State-of-the-art language models might be too sophisticated: simpler n-gram statistics better explain human reading times.
Forget confidence scores: a modality-aware early exit strategy for spoken language models slashes decoding costs without sacrificing accuracy or perceptual quality, revealing that speech tokens require specialized handling compared to text.
Forget SLAM: ReCoSplat uses a "Render-and-Compare" module to autoregressively refine Gaussian Splatting reconstructions, even from unposed video, achieving SOTA novel view synthesis.
Achieve a 277x speedup in autoregressive video generation by distilling diffusion models with a novel "diagonal distillation" approach that leverages temporal context and mitigates error propagation.
Autonomous racecars can now learn tire dynamics 71% faster and with 60% higher accuracy by "seeing" the road surface and remembering past driving behavior.
Don't fully retrain your draft model after fine-tuning your LLM: EDA restores speculative decoding performance with significantly less compute by adapting only a small, private component and regenerating training data.
Mimicking human eye movements with a Vision Transformer's attention maps yields a surprisingly effective and efficient image classification strategy.
Beat the state-of-the-art in radio signal separation by 122x using a transformer trained on cross-entropy loss, and the same architecture could work for gravitational waves.
Achieve more efficient reasoning in Transformers without increasing test-time cost by using training-only techniques that guide attention and dynamically adjust sharpness.
Noise in photonic quantum systems severely limits the performance of quantum machine learning algorithms, demanding robust noise mitigation strategies for practical implementations.
Forget gradient descent: this new method routes transformer activations through a Hopfield-inspired memory in a single forward pass to achieve state-of-the-art online continual learning.
Optimal transport provides a surprisingly tight and efficiently computable bound on transductive generalization in graph node classification, revealing how GNN depth impacts representation geometry.
Imperfect code from LLMs can still teach AI to understand circuit structure, unlocking a scalable path to netlist representation learning without expensive, clean datasets.
Forget black-box audio synthesis: this differentiable engine sound model gives you interpretable knobs to control physical parameters like valve dynamics and exhaust resonances.
LLMs suffer from a severe gradient bottleneck in the output layer, suppressing 95-99% of the gradient norm and crippling training.
Mamba-2's efficiency doesn't require custom CUDA kernels: XLA's compiler optimizations are enough to unlock near-optimal performance across diverse hardware.
Achieve better video editing without retraining by dynamically locking background features based on a "hallucination metric" that detects when the diffusion model is about to go astray.
Mixture-of-Experts models might be hiding more of their reasoning than we thought, thanks to a newly quantified "opaque serial depth" metric.
Forget wavelets: transformers with Koopman operator-derived features unlock superior ECG classification, especially in complex multi-class scenarios.