This paper establishes nearly tight bounds on the VC dimension of depth-$L$ Transformers with $W$ parameters that map an input sequence of length $T$ to a single output: an upper bound of $O(L W \log (T W))$ and a lower bound of $\Omega(L W \log (T W / L))$. It also analyzes the sample complexity of chain-of-thought learning, showing that teacher forcing learns with $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$ examples, where $T^{\prime}$ is the number of autoregressive steps, and that any learning rule using chain-of-thought data needs at least $\Omega\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ examples. Together, these bounds quantify how much data Transformers need, both for single-output prediction and for chain-of-thought reasoning.
Chain-of-thought supervision does not remove the need for data: any learning rule that uses chain-of-thought examples still requires at least $\Omega\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ of them.
We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log (T W))$ and a nearly matching lower bound of $\Omega(L W \log (T W / L))$. We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e., selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$ and that any learning rule that uses chain-of-thought data requires at least $\Omega\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ examples, where $T$ is the input length and $T^{\prime}$ is the number of autoregressive steps.
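To get a feel for how close the upper and lower bounds are, the expressions above can be evaluated numerically. The sketch below is only illustrative: the asymptotic notation hides constants that the abstract does not state, so the constant `c` (default 1.0), the choice of natural logarithm, the function names, and the example values of $L$, $W$, $T$, and $T^{\prime}$ are all assumptions made here, not values from the paper.

```python
import math


def vc_upper(L, W, T, c=1.0):
    """Upper-bound expression c * L * W * log(T * W); c is an assumed constant."""
    return c * L * W * math.log(T * W)


def vc_lower(L, W, T, c=1.0):
    """Lower-bound expression c * L * W * log(T * W / L); c is an assumed constant."""
    return c * L * W * math.log(T * W / L)


def cot_upper(L, W, T, T_prime, c=1.0):
    """Teacher-forcing sample-complexity expression c * L * W * log((T + T') * W)."""
    return c * L * W * math.log((T + T_prime) * W)


def cot_lower(L, W, T, T_prime, c=1.0):
    """Lower-bound expression c * L * W * log((T + T') * W / L) for any learning rule."""
    return c * L * W * math.log((T + T_prime) * W / L)


if __name__ == "__main__":
    # Hypothetical model sizes chosen only for illustration.
    L, W, T, T_prime = 12, 10**8, 1024, 4096
    print("VC lower / upper ratio: ", vc_lower(L, W, T) / vc_upper(L, W, T))
    print("CoT lower / upper ratio:", cot_lower(L, W, T, T_prime) / cot_upper(L, W, T, T_prime))
```

With equal constants, the lower and upper expressions differ only by the $\log L$ term inside the logarithm, so their ratio is $1 - \log L / \log(TW)$ and approaches 1 as $TW$ grows relative to $L$. Likewise, the number of autoregressive steps $T^{\prime}$ enters only through $\log(T + T^{\prime})$.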