The paper introduces BEAT, a music tokenization method that uses uniform-length temporal steps (beats) as the basic unit. All events at the same pitch within a time step are encoded as a single token, and tokens are grouped explicitly by time step, much like a sparse encoding of a piano roll. This contrasts with existing methods, which tokenize music as sequences of musical events whose tokens span variable durations. Experiments on music continuation and accompaniment generation show improved musical quality and structural coherence over event-based tokenization, and additional analyses indicate higher efficiency and more effective capture of long-range patterns.
Forget complex event sequences: tokenizing music by uniform temporal beats unlocks better musical quality and structural coherence in generated music.
Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.
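To make the idea in the abstract concrete, here is a minimal Python sketch of a beat-grouped tokenization. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the vocabulary or the within-beat encoding, so the function name `beat_tokenize`, the resolution of 4 steps per beat, and the 0/1/2 state codes (silent, onset, sustain) are all hypothetical choices. Each token is a `(pitch, pattern)` pair describing everything that happens at one pitch during one beat, and tokens are grouped by beat, which amounts to storing only the non-empty cells of a piano roll.

```python
from collections import defaultdict

# Assumed (hypothetical) settings; none of these come from the paper.
STEPS_PER_BEAT = 4             # e.g. sixteenth-note resolution in 4/4
SILENT, ONSET, SUSTAIN = 0, 1, 2


def beat_tokenize(notes, steps_per_beat=STEPS_PER_BEAT):
    """Group notes into per-beat tokens.

    notes: iterable of (pitch, onset_step, duration_steps), all integers.
    Returns a list with one entry per beat; each entry is a sorted list of
    (pitch, pattern) tokens, where pattern holds one state per sub-step.
    """
    # cells[beat][pitch] -> mutable list of states for that beat and pitch
    cells = defaultdict(dict)
    for pitch, onset, duration in notes:
        for step in range(onset, onset + duration):
            beat, offset = divmod(step, steps_per_beat)
            pattern = cells[beat].setdefault(pitch, [SILENT] * steps_per_beat)
            # An onset is never overwritten by an overlapping sustain.
            if pattern[offset] != ONSET:
                pattern[offset] = ONSET if step == onset else SUSTAIN

    n_beats = max(cells) + 1 if cells else 0
    # Beats without any sounding pitch become empty groups, so every group
    # still advances time by exactly one beat.
    return [
        sorted((pitch, tuple(pattern)) for pitch, pattern in cells[b].items())
        for b in range(n_beats)
    ]


if __name__ == "__main__":
    # C4 for one beat, E4 starting mid-beat and crossing the beat boundary,
    # then G4 after a rest.
    example = [(60, 0, 4), (64, 2, 4), (67, 8, 2)]
    for index, group in enumerate(beat_tokenize(example)):
        print(index, group)
```

In this sketch every group covers exactly one beat, so the position in the sequence maps directly to musical time. In an event-based sequence, by contrast, the elapsed time has to be reconstructed by summing time-shift tokens.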