100 papers published across 4 labs.
Requirements volatility doesn't just delay projects; it directly undermines software architecture, leading to technical debt and scheduling nightmares.
Unlock geometric algebra's performance potential in neural networks and spatial computing by compiling directly from multi-way relationships, eliminating manual specialization and ensuring geometric correctness.
Multi-party function secret sharing just got a whole lot more practical: a new DDH-based scheme slashes key sizes by up to 10x.
AdaMuS overcomes the bias towards high-dimensional data in multi-view learning by adaptively pruning redundant parameters and sparsely fusing views, leading to improved performance on dimensionally unbalanced data.
LLMs aren't just better tools; they're forcing us to rethink the very nature of information, knowledge, and meaning in system design.
The field of video understanding is rapidly shifting from isolated pipelines to unified models capable of adapting to diverse downstream tasks, demanding a re-evaluation of current approaches.
Achieve controllable and scalable speech generation with MOSS-TTS, enabling zero-shot voice cloning and long-form synthesis.
Forget finetuning – Kumiho's graph-native memory lets you swap in a better LLM and instantly double your agent's reasoning accuracy on complex cognitive tasks.
Video diffusion transformers exhibit a hidden "magnitude hierarchy" in their activations that can be exploited for training-free quality improvements via a simple steering method.
Forget geometric LODs: tokenizing 3D shapes by semantic salience unlocks SOTA reconstruction and efficient autoregressive generation with 10x-1000x fewer tokens.
Forget scaling laws: dropout robustness in transformers is a lottery, with smaller models sometimes showing perfect stability while larger models crumble under stochastic inference.
Generate consistent stereo videos directly from RGB data, bypassing depth estimation and monocular-to-stereo conversion, with StereoWorld's novel camera-aware attention mechanisms.
Unlock faster, more accurate interlinear glossing for low-resource languages by treating morphemes as atomic units, outperforming existing methods and enabling user-guided lexicon expansion without retraining.
Generate realistic, atom-level molecular dynamics trajectories orders of magnitude faster with a novel State Space Model that captures long-range dependencies in biomolecular systems.
Ditch costly PIDE integration: RHYME-XT learns the flow map directly, offering a continuous-time, discretization-invariant representation that beats state-of-the-art neural operators.
LLMs can get a massive multilingual boost, especially in low-resource languages, by offloading translation to specialized models and carefully aligning their representations.
Attention sinks aren't just a forward-pass phenomenon; they actively warp the training landscape by creating "gradient sinks" that drive massive activations.
Achieve single-pass alignment of multi-talker speech – a feat previously impossible – by modeling overlaps as shuffles.
Achieve near-optimal waveform optimization with 98.8% spectral efficiency using a 5-layer, AutoML-tuned unrolled proximal gradient descent network trained on just 100 samples.
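For readers unfamiliar with unrolled optimization, here is a minimal sketch of the general idea of an unrolled proximal gradient descent network (ISTA-style, for an L1-regularized least-squares problem) — not the paper's waveform-specific architecture, and the per-layer step sizes and thresholds shown here are fixed placeholders standing in for parameters that would be learned:

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the L1 norm (shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def unrolled_pgd(A, b, n_layers=5, steps=None, thresholds=None):
    """Run a fixed number of proximal gradient steps, one per 'layer'.
    In a trained unrolled network, `steps` and `thresholds` would be
    learned per layer; here they default to safe hand-set values."""
    n = A.shape[1]
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    steps = steps if steps is not None else [1.0 / L] * n_layers
    thresholds = thresholds if thresholds is not None else [0.1 / L] * n_layers
    x = np.zeros(n)
    for eta, lam in zip(steps, thresholds):
        grad = A.T @ (A @ x - b)             # gradient of 0.5 * ||Ax - b||^2
        x = soft_threshold(x - eta * grad, lam)  # gradient step + proximal step
    return x
```

Unrolling turns each iteration into a layer with its own trainable parameters, which is why such networks can reach near-optimal solutions in very few layers and from very few training samples.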
Software architecture, a critical but underspecified domain, finally gets a unified benchmarking platform with ArchBench, enabling standardized evaluation of LLMs on complex system design tasks.
Injecting "beneficial noise" into cross-attention mechanisms can significantly improve unsupervised domain adaptation by forcing models to focus on content rather than style distractions.
Ruyi2.5 achieves comparable performance to Qwen3-VL on general multimodal benchmarks while significantly outperforming it in privacy-constrained surveillance, demonstrating the effectiveness of its edge-cloud architecture.
Synthesizing realistic 6-DOF object manipulation trajectories in complex 3D environments just got a whole lot better with GMT, a multimodal transformer that substantially outperforms existing methods.
By disentangling semantic and contextual cues in vision-language models, PCA-Seg achieves state-of-the-art open-vocabulary segmentation with only 0.35M additional parameters per block.
Achieve up to 2.4x speedup over OpenBLAS on RISC-V by using MLIR and xDSL to generate optimized RVV code, finally unlocking the potential of RISC-V vector extensions.
Training video diffusion models with pixel-wise losses just got a whole lot cheaper: ChopGrad reduces memory complexity from linear to constant in video length.
Graph transformers avoid oversmoothing in deep layers by structurally preserving community information, a theoretical advantage over GCNs revealed through Gaussian process limits.
Cycle consistency training unlocks stable and accurate inverse kinematics for wearable soft robots, even with their inherent nonlinearities and hysteresis.
Convolutional Neural Operators (CNOs) surprisingly excel at capturing translated dynamics in the FitzHugh-Nagumo model, despite other architectures achieving lower training error or faster inference.
Forget prompt engineering: this new region proposal network spots objects across diverse datasets without *any* text or image prompts.
Infinite neural nets can be sparse, and this paper proves it, showing that total variation regularization provably yields sparse solutions in infinite-width shallow ReLU networks, with sparsity bounds tied to the geometry of the data.
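As a sketch of the standard formulation behind results of this kind (not necessarily this paper's exact objective): an infinite-width shallow ReLU network is written as an integral against a signed measure \(\mu\) over neuron parameters, and training penalizes the total variation norm of \(\mu\):

```latex
\min_{\mu}\; \sum_{i=1}^{n} \ell\big(f_\mu(x_i),\, y_i\big) \;+\; \lambda\, \|\mu\|_{\mathrm{TV}},
\qquad
f_\mu(x) \;=\; \int \max\big(0,\, \langle w, x\rangle + b\big)\, d\mu(w, b).
```

"Sparse" here means the minimizing measure is supported on finitely many atoms, i.e., it is equivalent to a finite-width network; the teased result bounds the number of atoms in terms of the geometry of the data.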
Ditch the feature engineering: Baguan-TS lets you use raw time series sequences directly for in-context forecasting, outperforming traditional methods.
Ditch quadratic attention bottlenecks: this new transformer variant achieves competitive time-series forecasting with O(N log N) complexity by representing sequence states on a unit circle.
Lossless compression can actually *speed up* LLM inference on GPUs, not just shrink model size, thanks to ZipServ's hardware-aware design.
Enterprise AI can achieve 50% token reduction and zero cross-entity leakage by implementing a shared, governed memory architecture for multi-agent workflows.
Forget training behemoths: ADMs slash memory overhead to just twice the inference footprint while guaranteeing geometric correctness and continuous adaptation.
Achieve significant latency and energy savings in memory systems with an RL-based controller that also provides insights into *why* its decisions are optimal.
Ditch backprop's limitations: this synthesizable RTL implementation brings predictive coding networks to life in fully distributed hardware.
KANs get a 50x BitOps reduction without accuracy loss by quantizing their B-splines down to 2-3 bits and using lookup tables.
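To make the low-bit idea concrete, here is a minimal, illustrative sketch of uniform coefficient quantization with a dequantization lookup table — a generic scheme, not the paper's actual quantizer, and applied here to an arbitrary coefficient vector rather than real B-spline parameters:

```python
import numpy as np

def quantize_coeffs(coeffs, bits=3):
    """Uniformly quantize coefficients to 2**bits levels.
    Returns (integer codes, lookup table of reconstructed values)."""
    c = np.asarray(coeffs, dtype=float)
    levels = 2 ** bits
    lo, hi = c.min(), c.max()
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = np.round((c - lo) / scale).astype(np.int64)   # 0 .. levels-1
    lut = lo + scale * np.arange(levels)                  # dequantization table
    return codes, lut

def dequantize(codes, lut):
    """Recover approximate coefficients by table lookup."""
    return lut[codes]
```

At 2-3 bits the lookup table has only 4-8 entries, so multiplications against spline coefficients can be replaced by cheap table indexing — the source of the BitOps reduction the teaser describes.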
Acoustic and phonetic NACs encode accent in fundamentally different ways, with implications for how we interpret and manipulate these representations.
By explicitly modeling and predicting non-stationary factors in both time and frequency domains, TimeAPN significantly boosts the accuracy of long-term time series forecasting, outperforming existing normalization techniques.
LLMs can be drastically compressed without retraining because the relative ordering of weights matters far more than their exact values, opening the door to efficient, training-free compression techniques.
Forget SVD: CARE aligns low-rank attention approximations with input activations, boosting accuracy up to 1.7x and slashing perplexity by 215x when converting models to multi-head latent attention.
No training needed: ARAM dynamically adjusts retrieved context guidance in masked diffusion models based on signal quality, resolving retrieval-prior conflicts on the fly.
By explicitly modeling pollutant propagation delays with neural delay differential equations, AirDDE significantly improves air quality forecasting accuracy.
AI's current limitations in adaptability stem from its reliance on psychological learning theories, suggesting a need for representational architectures where systematic behavior is inherent, not accidental.
Generative models can fail to produce globally consistent counterfactuals when causal graphs have complex topologies, but a novel sheaf-theoretic framework with entropic regularization can overcome these limitations.
Achieve 4K image-to-video generation with diffusion models without training by cleverly fusing tiled denoising with a low-resolution latent prior, balancing detail and global coherence.
A simple adaptive normalization technique can significantly improve continual learning performance on tabular data by mitigating catastrophic forgetting in dynamic environments.
Synthesizing realistic intermediate video frames just got a whole lot better, thanks to a novel attention mechanism that anchors to keyframes and text prompts for improved consistency and semantic alignment.
Achieve SE(3) equivariance and memory scalability in point cloud analysis with coordinate-based kernels, outperforming state-of-the-art equivariant methods on diverse tasks.
Normalizing error signals, not just activations, is the key to unlocking the benefits of inhibition-mediated normalization for learning in neural networks.
Transformer LMs learn linguistic abstractions before memorizing specific lexical items, mirroring key aspects of human language acquisition.
Mamba, the darling of sequence modeling, now powers a GAN that beats StyleGAN2-ADA in image synthesis, thanks to a clever latent space routing trick.
Secure enclave updates and migrations, previously missing from RISC-V TEEs, are now practical thanks to a novel toolkit that adds minimal overhead.
LLMs struggle with code comprehension, but a simple RNN pass over their embeddings can boost accuracy by over 5%.
By mapping permutations to a continuous space of "soft ranks," this new diffusion approach makes learning permutation distributions far more tractable, especially for long sequences.
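The "soft rank" idea in general (not this paper's diffusion construction) can be sketched with pairwise sigmoids: each item's soft rank is one plus the soft count of items scoring below it, and a temperature controls how closely the relaxation approaches the hard ranking:

```python
import numpy as np

def soft_rank(scores, tau=0.01):
    """Differentiable relaxation of (1-based) ranks. As tau -> 0 this
    approaches the hard ranking; larger tau gives a smoother surface
    that is easier to optimize through."""
    s = np.asarray(scores, dtype=float)
    diff = s[:, None] - s[None, :]             # pairwise score gaps
    wins = 1.0 / (1.0 + np.exp(-diff / tau))   # sigmoid((s_i - s_j) / tau)
    np.fill_diagonal(wins, 0.0)                # an item doesn't outrank itself
    return 1.0 + wins.sum(axis=1)              # soft rank of each item
```

Because the relaxation is continuous in the scores, gradient-based models (including diffusion models over the score space) can be trained where discrete permutations would block backpropagation.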
By reorganizing 3D scenes into structurally-aware subscenes, S-VGGT offers a parallel geometric bridge for efficient processing, slashing global attention costs without compromising reconstruction fidelity.
Ditch the polar decomposition: MUD offers a surprisingly simple and efficient alternative for momentum whitening, speeding up transformer training by up to 50% compared to AdamW and Muon.
AI spots a hidden pattern in lung scans of lupus patients, revealing that specific airway dilations in the upper lobes could be a telltale sign of interstitial lung disease.
Achieve competitive video generation with Stable Diffusion using only 2.9% additional parameters by adapting temporal attention based on motion content, outperforming methods with explicit temporal consistency losses.
LLMs can maintain performance while skipping global attention for 80% of tokens, slashing compute costs and memory footprint in long-context scenarios.
Predicting permeability tensors from microstructure images just got 33% more accurate thanks to a physics-informed CNN-Transformer that learns faster and generalizes better via pretraining and differentiable constraints.
Panoramic 3D reconstruction gets a boost with PanoVGGT, a Transformer that handles spherical distortions and global-frame ambiguity to deliver state-of-the-art accuracy in a single pass.
Reproducibility in hardware reverse engineering is shockingly low, with only 4% of evaluated artifacts from 187 papers yielding reproducible results.
Federated Computing as Code lets you enforce data sovereignty in federated systems with cryptographic guarantees, moving beyond runtime policies and trust assumptions.
Multilingual transformers spontaneously learn a geometric representation of language distance, and we can extract it to improve low-resource translation.
Forget collapsing videos into text – this hierarchical grid lets you zoom into any moment with lossless visual fidelity, unlocking logarithmic compute scaling for long-form video understanding.
LLMs aren't monolithic black boxes: they contain spatially organized, functionally specialized modules that can be automatically discovered.
Forget dropout – Gaussian Chaos Noise offers provable control over representation deformation and boosts calibration in deep networks.
Instance-specific timestep schedules can significantly boost diffusion model performance, challenging the reliance on global discretization strategies.
Autoregressive neural surrogates can now simulate dynamical systems for infinitely long horizons, thanks to a novel self-refining diffusion model that avoids error compounding.
LLM serving systems can boost Time-To-First-Token (TTFT) attainment by up to 2.4x simply by prioritizing network flows based on a novel approximation of Least-Laxity-First scheduling.
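Least-Laxity-First itself is a classic real-time scheduling policy; a minimal sketch of the exact (unapproximated) priority rule, with hypothetical flow tuples rather than the paper's network-level mechanism, looks like this:

```python
def laxity(deadline, now, remaining_service):
    """Laxity = slack left before the deadline becomes unmeetable."""
    return deadline - now - remaining_service

def llf_order(flows, now):
    """Order flows by Least-Laxity-First: the flow that can least
    afford to wait is served first. Each flow is a tuple of
    (name, deadline, remaining_service_time)."""
    return sorted(flows, key=lambda f: laxity(f[1], now, f[2]))
```

Exact LLF requires continuously recomputing laxities, which is why serving systems approximate it — the teaser's contribution is such an approximation applied to network-flow prioritization.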
Forget slow, single-SSD paging: Swarm unlocks 2.7x higher bandwidth for LLM KV-cache offloading by exploiting stable co-activation patterns to parallelize I/O across multiple SSDs.
Jointly training audio watermarking and source separation unlocks robust multi-stream watermarking, enabling independent tracking of individual audio components within a mix.
Compressing images into 1D token sequences can yield state-of-the-art reconstruction fidelity, challenging the necessity of 2D spatial grids for visual tokenization.
Text-heavy fine-tuning is blinding your MLLM to crucial 3D spatial information, but GAP-MLLM's geometry-aligned pre-training can restore its sight.
By explicitly modeling tooth relationships, TCATSeg achieves state-of-the-art accuracy in 3D dental model segmentation, even in challenging pre-orthodontic cases.
Forget quadratic attention: FEAT achieves state-of-the-art performance on structured data with linear complexity and 40x faster inference.
Feature models, often treated as static configuration spaces, reveal hidden structural patterns and domain-specific deviations when viewed through the lens of network analysis.
Diffusion models can now capture nuanced semantic and material details in image stylization, moving beyond simple color-driven transformations, thanks to a Mixture of Experts architecture.
Masked diffusion language models can now achieve 21.8x better compute efficiency than autoregressive models, thanks to binary encoding and index shuffling.
Ditch the separate models: CAST-TTS uses a single cross-attention mechanism to control TTS timbre from both speech and text, rivaling specialized models in quality.
Forget one-hot encodings: conditioning timbre VAEs on continuous perceptual features unlocks more compact and controllable latent spaces.
DINOv2's powerful visual features come with a hidden flaw: strong positional biases that ALiBi positional encoding can effectively mitigate.
Autonomous vehicles can now see through the storm: a new Mixture of Experts approach boosts 3D object detection accuracy by 15% in adverse weather, without slowing things down.
RepoReviewer tackles the complexity of repository-level code review with a multi-agent architecture, breaking down the monolithic process into manageable stages for more relevant and efficient feedback.
Software energy consumption isn't just an aggregate number – it's a path-dependent journey, and this new model reveals hidden optimization opportunities that can slash energy use by up to 705x.
By combining feed-forward 3D reconstruction with a geometry-aware diffusion model, Leveling3D fills in the gaps in extrapolated novel views, leveling up both 3D reconstruction and generation.
By injecting biological heuristics into a deep learning pipeline, this method achieves state-of-the-art performance in classifying rare white blood cell subtypes, a task where standard deep learning models often fail.
You can now train graph transformers that generalize across different mesh resolutions, thanks to a new architecture that maintains gauge invariance while scaling linearly.
SympFormer achieves faster convergence in attention blocks by drawing inspiration from inertial Nesterov acceleration, offering a potential speedup without additional computational cost.
By forcing a model to reconstruct aggressively masked EEG spectrograms, SpecMoE learns intricate neural patterns across both high- and low-frequency domains, leading to state-of-the-art cross-species EEG decoding.
DynamicGate-MLP learns to selectively activate MLP units based on the input, achieving better compute efficiency without sacrificing performance.
Transformers have a hidden symmetry: depth-wise residuals are secretly doing the same thing as sequence-wise sliding window attention, unlocking new architectural insights.
Fine-tune 123B+ parameter models on a single RTX 4090 with SlideFormer, a system that fits models up to 6x larger and batch sizes up to 8x larger.
Achieve state-of-the-art performance in continuous sign language recognition with 70-80% fewer parameters by unifying spatial and temporal attention.
Achieve sub-microsecond decoding-feedback latency in a scalable, open-source QEC system, bringing fault-tolerant quantum computation closer to reality.
Unfolding the EM algorithm into a neural network yields a speaker localization method that's more robust and accurate than traditional Batch-EM, especially in challenging acoustic conditions.
Deep learning slashes design time for high-efficiency Doherty power amplifiers, enabling complex pixelated combiners that extend the back-off efficiency range.