TTIC | Jun 8, 2026 | arXiv:2606.09731

Tight Sample Complexity of Transformers

Chenxiao Yang, Nathan Srebro, Zhiyuan Li

AI Summary

This paper establishes nearly tight bounds on the VC dimension of depth-$L$ Transformers with $W$ parameters that map an input sequence of length $T$ to a single output: an upper bound of $O(L W \log (T W))$ and a lower bound of $\Omega(L W \log (T W / L))$. It also analyzes the sample complexity of chain-of-thought learning, showing that teacher forcing learns with $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$ examples, where $T^{\prime}$ is the number of autoregressive steps, and that any learning rule using chain-of-thought data needs at least $\Omega\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ examples.

Key Contribution

Nearly matching upper and lower bounds on the VC dimension of depth-$L$ Transformers and on the sample complexity of chain-of-thought learning; in each case the two bounds differ only by a factor of $L$ inside the logarithm.

Abstract

We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log (T W))$ and a nearly matching lower bound of $\Omega(L W \log (T W / L))$. We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e., selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$ and that any learning rule that uses chain-of-thought data requires at least $\Omega\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ examples, where $T$ is the input length and $T^{\prime}$ is the number of autoregressive steps.
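To get a feel for how close the stated upper and lower bounds are, the leading terms can be evaluated numerically. The sketch below is only illustrative: the constants hidden by $O(\cdot)$ and $\Omega(\cdot)$ are not given here, so they are set to 1 as an assumption, and the function names and the example values of $L$, $W$, $T$, and $T^{\prime}$ are hypothetical.

```python
import math


def vc_upper_term(L: int, W: int, T: int) -> float:
    """Leading term L * W * log(T * W) of the upper bound (hidden constant assumed to be 1)."""
    return L * W * math.log(T * W)


def vc_lower_term(L: int, W: int, T: int) -> float:
    """Leading term L * W * log(T * W / L) of the lower bound (hidden constant assumed to be 1)."""
    if T * W <= L:
        raise ValueError("need T * W > L so that the logarithm is positive")
    return L * W * math.log(T * W / L)


def cot_upper_term(L: int, W: int, T: int, T_prime: int) -> float:
    """Chain-of-thought upper-bound term: the input length T is replaced by T + T'."""
    return vc_upper_term(L, W, T + T_prime)


def cot_lower_term(L: int, W: int, T: int, T_prime: int) -> float:
    """Chain-of-thought lower-bound term: the input length T is replaced by T + T'."""
    return vc_lower_term(L, W, T + T_prime)


def gap_ratio(L: int, W: int, T: int) -> float:
    """Ratio of the upper to the lower leading term: log(T * W) / log(T * W / L)."""
    return vc_upper_term(L, W, T) / vc_lower_term(L, W, T)


if __name__ == "__main__":
    # Hypothetical model sizes chosen only for illustration.
    L, W, T, T_prime = 12, 10**8, 2048, 4096
    print("VC upper term:", vc_upper_term(L, W, T))
    print("VC lower term:", vc_lower_term(L, W, T))
    print("upper / lower:", gap_ratio(L, W, T))
    print("CoT upper term:", cot_upper_term(L, W, T, T_prime))
    print("CoT lower term:", cot_lower_term(L, W, T, T_prime))
```

Because the two expressions differ only by the $L$ inside the logarithm, the ratio is $\log(TW) / \log(TW/L)$, which stays close to 1 whenever $L$ is small relative to $TW$.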

Reasoning & Chain-of-Thought | Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
