The paper introduces an evaluation framework for LLMs based on the Interagency Language Roundtable (ILR) Skill Level Descriptions and applies it to Claude (Sonnet 4.6) across six languages using 12 semantically equivalent prompt clusters. The study combines automated quantitative metrics with expert ILR qualitative assessment to analyze 216 responses. Results reveal systematic cross-lingual variations in response length, pragmatic disambiguation, aesthetic divergence, technical terminology, cultural calibration, and institutional referral behavior, highlighting the importance of ILR-informed expert judgment for comprehensive LLM evaluation.
LLMs exhibit surprising cross-lingual inconsistencies beyond simple translation errors, revealing divergences in cultural calibration, pragmatic disambiguation, and even institutional referral behavior.
This paper introduces a systematic evaluation framework grounded in the Interagency Language Roundtable (ILR) Skill Level Descriptions and applies it to Claude (Sonnet 4.6) across six languages: English, French, Romanian, Spanish, Italian, and German. We administer a battery of 12 semantically equivalent prompt clusters spanning ILR complexity levels 1 through 3+, collect 216 responses (12 prompts, 6 languages, 3 runs), and analyze outputs through a two-layer methodology combining automated quantitative metrics with expert ILR qualitative assessment. Quantitative analysis reveals that French responses are approximately 30% longer than German responses on identical prompts, and that creative and affective clusters show the highest cross-lingual surface divergence. Qualitative analysis, conducted by a six-language professional with 12 years of ILR/OPI assessment experience, identifies five cross-lingual variation patterns: systematic differences in pragmatic disambiguation strategies, aesthetic and literary tradition divergence in creative output, language-internal technical terminology norms, cultural calibration gaps evidenced by the absence of culture-specific content in favor of culturally neutralized templates, and language-specific institutional referral behavior in emotional support responses. We argue that ILR-informed expert judgment applied to LLM outputs constitutes a novel and underreported evaluation methodology that complements purely computational benchmarks, and that cross-lingual output variation in Claude is interpretable, domain-dependent, and consequential for equitable multilingual AI deployment.
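The design arithmetic and the length comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of word counts as the length unit, and the choice of a simple ratio of per-language means are all assumptions made here for clarity.

```python
from itertools import product
from statistics import mean

# Languages and design sizes as stated in the abstract.
LANGUAGES = ["English", "French", "Romanian", "Spanish", "Italian", "German"]
N_CLUSTERS = 12  # semantically equivalent prompt clusters
N_RUNS = 3       # repeated runs per prompt and language


def design_grid():
    """Enumerate every (cluster, language, run) cell of the study design."""
    return list(product(range(1, N_CLUSTERS + 1), LANGUAGES, range(1, N_RUNS + 1)))


def mean_length_by_language(lengths):
    """Average response length per language.

    `lengths` maps (cluster, language, run) -> response length.
    The length unit (assumed here: words) is hypothetical.
    """
    by_lang = {}
    for (_cluster, lang, _run), n in lengths.items():
        by_lang.setdefault(lang, []).append(n)
    return {lang: mean(vals) for lang, vals in by_lang.items()}


def relative_length(means, lang, baseline):
    """Fractional difference of `lang` mean length versus `baseline` (0.30 = 30% longer)."""
    return means[lang] / means[baseline] - 1.0
```

Calling `len(design_grid())` confirms the 216 cells (12 × 6 × 3). Given measured lengths for each cell, `relative_length(means, "French", "German")` would yield the kind of figure reported above; whether the paper computes it from means, medians, or per-prompt ratios is not stated in the abstract.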