10 papers from Mila on Architecture Design (Transformers, SSMs, MoE)
LLMs struggle with structured 2D tasks when inputs are serialized into 1D, and they lag surprisingly far behind vision-augmented models that process the 2D layout directly.
LLMs re-rank documents better when you learn to route each query to the specific attention heads that matter, instead of relying on static subsets or everything at once.
Looped LLMs don't just reason better; they also internally mirror the distinct inference stages of standard feedforward models, repeating them cyclically.
Merging experts in MoE LLMs can actually *improve* performance compared to pruning, offering a new path to compression that preserves capabilities.
MoEs don't always need learned routers: routing information can be embedded directly in the hidden state.
Ditch the text: WavSLM shows you can train a competitive speech language model using only distilled WavLM representations, unlocking a simpler, single-stream generative pretraining paradigm for speech.
Diagonal SSMs, despite their empirical success, provably fail to track states of non-Abelian groups, revealing fundamental limitations in their expressive power.
Takeuchi's Information Criterion (TIC) accurately predicts DNN generalization gaps, but only when models operate near the Neural Tangent Kernel (NTK) regime.
Attention-based re-ranking gets a boost: ReAttn's post-hoc re-weighting tames over-concentration and lexical bias, leading to more accurate and interpretable results without extra training.
Dramatically improve protein language models by simply post-training them to align with protein graphs, yielding a 59% increase in contact prediction accuracy.
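
A rough intuition for the diagonal SSM entry above, not the paper's proof: diagonal transition matrices always commute, whereas a non-Abelian group such as S3 contains elements whose composition depends on order. The sketch below checks both facts in plain Python. The helper names `compose` and `diag_mul` and the sample values are illustrative assumptions, not taken from the paper.

```python
def compose(p, q):
    """Return the permutation 'apply q first, then p'.

    Permutations are tuples mapping index -> image.
    """
    return tuple(p[q[i]] for i in range(len(q)))


def diag_mul(a, b):
    """Multiply two diagonal matrices stored as tuples of diagonal entries."""
    return tuple(x * y for x, y in zip(a, b))


# Two elements of S3, the smallest non-Abelian group.
swap01 = (1, 0, 2)  # swaps 0 and 1, fixes 2
cycle = (1, 2, 0)   # sends 0 -> 1, 1 -> 2, 2 -> 0

# Composition order changes the result for these permutations.
perm_order_matters = compose(swap01, cycle) != compose(cycle, swap01)

# Two arbitrary diagonal transitions (hypothetical values).
d1 = (2, -1, 3)
d2 = (5, 4, -2)

# Diagonal matrices commute, so the product ignores order.
diag_order_matters = diag_mul(d1, d2) != diag_mul(d2, d1)

print(perm_order_matters, diag_order_matters)
```

Commutativity of the transition products is only one ingredient. A full SSM recurrence also has input terms, so this sketch should be read as motivation for the result rather than a statement of it.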