CMU ML · Apr 6, 2026 · arXiv:2604.04704

IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation

Anjali Kantharuban, Aarohi Srivastava, Fahim Faisal, Orevaoghene Ahia, Antonios Anastasopoulos, David Chiang

AI Summary

The paper introduces IDIOLEX, a framework for learning sentence representations that capture stylistic and dialectal variation decoupled from semantic content, using supervision from each sentence's provenance together with linguistic features of its content. The authors train models on dialects of Arabic and Spanish and report that the learned representations capture meaningful stylistic variation and transfer across domains for analysis and classification. They also explore using these representations as training objectives for stylistically aligning language models.

Key Contribution

Style lives in a continuous vector space: IDIOLEX represents each sentence's style and dialect as a continuous vector, decoupled from semantic content, supporting analysis, classification, and the stylistic alignment of LLMs.

Abstract

Existing sentence representations primarily encode what a sentence says, rather than how it is expressed, even though the latter is important for many applications. In contrast, we develop sentence representations that capture style and dialect, decoupled from semantic content. We call this the task of idiolectal representation learning. We introduce IDIOLEX, a framework for training models that combines supervision from a sentence's provenance with linguistic features of a sentence's content, to learn a continuous representation of each sentence's style and dialect. We evaluate the approach on dialects of both Arabic and Spanish. The learned representations capture meaningful variation and transfer across domains for analysis and classification. We further explore the use of these representations as training objectives for stylistically aligning language models. Our results suggest that jointly modeling individual and community-level variation provides a useful perspective for studying idiolect and supports downstream applications requiring sensitivity to stylistic differences, such as developing diverse and accessible LLMs.

Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
