Mila

10 papers from Mila on Architecture Design (Transformers, SSMs, MoE)

Apr 29, 2026

3w ago · also BAIR, Mila, UC Santa Cruz +1

When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks

LLMs struggle with structured 2D tasks when inputs are serialized into 1D, revealing a surprising performance gap compared to vision-augmented models that directly process the 2D layout.

Chung-Hsiang Lo, Lu Li, Diji Yang +4

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Reasoning & Chain-of-Thought

Apr 27, 2026

Mila · 3w ago · also Capital One

Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models

LLMs re-rank documents better when you learn to route each query to the specific attention heads that matter, instead of relying on static subsets or everything at once.

Yuxing Tian, Fengran Mo, Zhiqi Huang +2

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Recommendation & Information Retrieval

Apr 13, 2026

Mila · Apr 13, 2026

A Mechanistic Analysis of Looped Reasoning Language Models

Looped LLMs don't just reason better; they also internally mirror the distinct inference stages of standard feedforward models, repeating them cyclically.

Hugh Blayney, Álvaro Arroyo, Johan Obando-Ceron +4

Architecture Design (Transformers, SSMs, MoE) · Interpretability & Mechanistic Interp · Reasoning & Chain-of-Thought

Apr 6, 2026

Mila · Apr 6, 2026 · also AI Center, McGill

REAM: Merging Improves Pruning of Experts in LLMs

Merging experts in MoE LLMs can actually *improve* performance compared to pruning, offering a new path to compression that preserves capabilities.

Saurav Jha, Maryam Hashemzadeh +5

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization

Apr 1, 2026

Mila · Apr 1, 2026 · also UdeM

Self-Routing: Parameter-Free Expert Routing from Hidden States

MoEs don't always need learned routers: routing information can be embedded directly in the hidden state.

J. Mohamud, D. Wagner, M. Ravanelli

Architecture Design (Transformers, SSMs, MoE) · Training Efficiency & Optimization

Mar 5, 2026

Mila · Mar 5, 2026

WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation

Ditch the text: WavSLM shows you can train a competitive speech language model using only distilled WavLM representations, unlocking a simpler, single-stream generative pretraining paradigm for speech.

Luca Della Libera, Cem Subakan, M. Ravanelli +1

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Speech & Audio

Mar 2, 2026

Mila · Mar 2, 2026 · also AI Institute, McGill, Poly Montreal, School of Computer Science

The Expressive Limits of Diagonal SSMs for State-Tracking

Diagonal SSMs, despite their empirical success, provably fail to track states of non-Abelian groups, revealing fundamental limitations in their expressive power.

Behnoush Khavari, Sarath Chandar

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing

Feb 26, 2026

Mila · Feb 26, 2026 · also Institute of Science Tokyo, RIKEN, Supercomputing Research Center, UTokyo

Takeuchi's Information Criteria as Generalization Measures for DNNs Close to NTK Regime

Takeuchi's Information Criterion (TIC) accurately predicts DNN generalization gaps, but only when models operate near the Neural Tangent Kernel (NTK) regime.

Hiroki Naganuma, Taiji Suzuki +7

Architecture Design (Transformers, SSMs, MoE) · Training Efficiency & Optimization

Feb 23, 2026

Mila · Feb 23, 2026 · also IDEA

ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting

Attention-based re-ranking gets a boost: ReAttn's post-hoc re-weighting tames over-concentration and lexical bias, leading to more accurate and interpretable results without extra training.

Yuxing Tian, Fengran Mo, Weixu Zhang +2

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Recommendation & Information Retrieval

May 22, 2025

Mila · May 22, 2025 · also Amgen, Chandar Research Lab, Poly Montreal

Structure-Aligned Protein Language Model

Dramatically improve protein language models by simply post-training them to align with protein graphs, yielding a 59% increase in contact prediction accuracy.

Can Chen, David Heurtel-Depeiges, Robert M. Vernon +3

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Scientific Discovery & Drug Design
