Forget fancy quantization schemes: simple token-wise INT4 quantization with a Hadamard rotation is all you need to nearly match FP16 accuracy in LLM serving, without sacrificing throughput.
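The core idea is simple enough to show in a few lines. Below is a minimal NumPy sketch (not the paper's code): multiplying activations by an orthonormal Hadamard matrix spreads outlier channels across all dimensions, shrinking the per-token dynamic range so symmetric INT4 rounding loses far less precision. All names and the toy outlier setup here are illustrative assumptions.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def int4_quantize_per_token(x):
    # Symmetric token-wise INT4: one scale per row (token), range [-8, 7].
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale  # dequantized values

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(4, d))
x[:, 0] += 20.0  # inject an outlier channel, as seen in LLM activations

H = hadamard(d)
err_plain = np.abs(int4_quantize_per_token(x) - x).mean()
xr = x @ H  # rotate before quantizing, rotate back after
err_rot = np.abs(int4_quantize_per_token(xr) @ H.T - x).mean()
print(err_plain, err_rot)  # rotation shrinks the outlier-driven error
```

Because the rotation is orthonormal, it can be folded into adjacent weight matrices at no runtime cost, which is why throughput is preserved.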
Diffusion language models can now match autoregressive quality, thanks to a clever trick that forces them to agree with themselves.
Verifier-free evolution can now match or exceed the performance of verifier-based methods, while slashing API costs by 3x and boosting throughput by 10x, thanks to a clever model orchestration strategy.
Forget SVD: CARE aligns low-rank attention approximations with input activations, boosting accuracy up to 1.7x and slashing perplexity by 215x when converting models to multi-head latent attention.
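To make the contrast with plain SVD concrete, here is a hedged NumPy sketch of one standard way to align a low-rank factorization with input activations (whitening by the Cholesky factor of the activation Gram matrix); CARE's actual method may differ, and all names and shapes below are assumptions.

```python
import numpy as np

def lowrank_plain(W, k):
    # Truncated SVD: minimizes ||W - W_k||_F, ignoring the data.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vt[:k]

def lowrank_activation_aware(W, X, k, eps=1e-6):
    # Whiten by the Cholesky factor of the activation Gram matrix so the
    # truncated SVD minimizes ||(W - W_k) X||_F instead of ||W - W_k||_F.
    L = np.linalg.cholesky(X @ X.T + eps * np.eye(X.shape[0]))
    U, S, Vt = np.linalg.svd(W @ L, full_matrices=False)
    Wk_L = (U[:, :k] * S[:k]) @ Vt[:k]
    return Wk_L @ np.linalg.inv(L)

rng = np.random.default_rng(0)
d_in, d_out, n, k = 32, 32, 256, 8
W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, n))      # calibration activations (columns)
X[0] *= 10.0                        # one dominant input direction

e_svd = np.linalg.norm((W - lowrank_plain(W, k)) @ X)
e_aware = np.linalg.norm((W - lowrank_activation_aware(W, X, k)) @ X)
print(e_svd, e_aware)  # activation-aware factorization fits the data better
```

The point of the contrast: plain SVD spends its rank budget uniformly, while the activation-aware variant concentrates it on the directions the model actually sees.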
Models are substantially better at pairwise self-verification than independent scoring, unlocking a more efficient and accurate approach to test-time scaling for complex reasoning.
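The selection loop this enables can be sketched in a few lines. This is an illustrative round-robin tournament, not the paper's algorithm; `compare` stands in for a model call that judges a pair of candidate answers, and the toy length-based judge is purely a placeholder.

```python
from itertools import combinations

def pick_best(candidates, compare):
    """Select a final answer via round-robin pairwise comparisons.

    `compare(a, b)` is a placeholder for a verifier call that returns
    the preferred candidate of the pair (hypothetical interface).
    """
    wins = {i: 0 for i in range(len(candidates))}
    for i, j in combinations(range(len(candidates)), 2):
        winner = compare(candidates[i], candidates[j])
        wins[i if winner == candidates[i] else j] += 1
    return candidates[max(wins, key=wins.get)]

# Toy stand-in judge: prefers the longer derivation (real use: an LLM verifier).
answers = ["x=2", "x=2 because 2+2=4", "x=3"]
print(pick_best(answers, lambda a, b: a if len(a) >= len(b) else b))
```

The claim in the teaser is that comparisons of this pairwise form are more reliable than asking the model to score each candidate in isolation, which makes the extra O(n²) judge calls worth it for hard reasoning problems.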