This paper introduces a token-level perplexity framework to analyze whether LLMs rely on linguistically relevant cues, contrasting benchmark performance with underlying mechanisms. The method compares perplexity distributions over minimal sentence pairs differing in pivotal tokens to test specific linguistic hypotheses. Experiments on controlled benchmarks reveal that LLMs' perplexity shifts are not fully explained by linguistically important tokens, indicating a reliance on unexpected heuristics.
LLMs ace linguistic benchmarks, but a token-level perplexity analysis reveals they're often relying on the wrong cues.
Standard evaluations of large language models (LLMs) focus on task performance, offering limited insight into whether correct behavior reflects appropriate underlying mechanisms and risking confirmation bias. We introduce a simple, principled interpretability framework based on token-level perplexity to test whether models rely on linguistically relevant cues. By comparing perplexity distributions over minimal sentence pairs differing in one or a few 'pivotal' tokens, our method enables precise, hypothesis-driven analysis without relying on unstable feature-attribution techniques. Experiments on controlled linguistic benchmarks with several open-weight LLMs show that, while linguistically important tokens influence model behavior, they never fully explain perplexity shifts, revealing that models rely on heuristics other than the expected linguistic ones.
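To make the core idea concrete, the sketch below shows how one might compute per-token surprisal and sentence perplexity for a minimal pair using an open-weight causal LM. This is an illustrative assumption, not the paper's code: the model name (`gpt2`) and the agreement example are placeholders, and the paper's framework compares full perplexity distributions over benchmark pairs rather than a single pair.

```python
# Minimal sketch (not the paper's implementation): per-token surprisal and
# sentence perplexity for a minimal pair, via Hugging Face Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any open-weight LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def token_surprisals(sentence: str):
    """Return (token, surprisal) pairs, where surprisal is the negative log-likelihood in nats."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits
    # Shift so that token t is predicted from tokens < t.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    tokens = tokenizer.convert_ids_to_tokens(targets[0])
    return list(zip(tokens, nll[0].tolist()))

# Minimal pair differing in one pivotal token (subject-verb agreement).
for sent in ["The keys to the cabinet are on the table.",
             "The keys to the cabinet is on the table."]:
    surprisals = token_surprisals(sent)
    ppl = torch.exp(torch.tensor([s for _, s in surprisals]).mean()).item()
    print(f"{sent!r}  perplexity={ppl:.2f}")
    # Inspecting per-token surprisal shows whether the pivotal token
    # (here the verb) accounts for the perplexity shift, or whether
    # other tokens contribute unexpectedly.
```

If the perplexity difference between the two sentences is driven mostly by tokens other than the pivotal one, that is the kind of evidence the paper interprets as reliance on heuristics beyond the expected linguistic cue.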