Georgetown · University of Tehran · Feb 26, 2026 · arXiv:2602.22827

TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models

Reihaneh Iranmanesh, Saeedeh Davoudi, Pasha Abrishamchian, Ophir Frieder, Nazli Goharian

AI Summary

The paper introduces TARAZ, a new Persian short-answer question benchmark designed to evaluate the cultural competence of large language models (LLMs). To address limitations of existing benchmarks, the authors developed a Persian-specific evaluation framework that incorporates rule-based morphological normalization and a hybrid syntactic and semantic similarity module for robust soft-match scoring. Experiments on 15 LLMs demonstrate that the hybrid evaluation improves scoring consistency by 10% compared to exact-match baselines, highlighting the importance of capturing semantic nuances in Persian.

Key Contribution

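A Persian short-answer benchmark for evaluating the cultural competence of LLMs, scored by combining rule-based morphological normalization with hybrid syntactic and semantic similarity; this soft-match scoring improves consistency by 10% over exact-match baselines.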

Abstract

This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by 10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.
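The scoring idea in the abstract (normalize first, then blend a surface similarity with a semantic one) can be illustrated with a minimal sketch. This is not the authors' implementation: the normalization rules, the suffix list, the `alpha` weight, the `threshold`, and all function names below are hypothetical, and `token_jaccard` is only a stand-in for a real semantic model such as sentence embeddings, which would be passed in through `semantic_fn`.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical rule set; the paper's actual normalization rules are not given here.
_CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
    "\u200C": " ",       # zero-width non-joiner -> plain space
})
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # harakat and tatweel
# Illustrative plural suffixes: "hay" (heh, alef, yeh) and "ha" (heh, alef).
_SUFFIXES = ("\u0647\u0627\u06CC", "\u0647\u0627")


def normalize(text: str) -> str:
    """Unify character variants, drop diacritics, strip a few plural suffixes."""
    text = unicodedata.normalize("NFC", text).translate(_CHAR_MAP)
    text = _DIACRITICS.sub("", text)
    tokens = []
    for tok in text.split():
        if tok in _SUFFIXES:  # suffix detached when the ZWNJ became a space
            continue
        for suf in _SUFFIXES:
            # Naive stripping: it will over-strip some real words.
            if tok.endswith(suf) and len(tok) > len(suf) + 1:
                tok = tok[: -len(suf)]
                break
        tokens.append(tok)
    return " ".join(tokens)


def token_jaccard(a: str, b: str) -> float:
    """Placeholder 'semantic' score; swap in an embedding similarity in practice."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def hybrid_score(pred: str, gold: str,
                 semantic_fn: Callable[[str, str], float] = token_jaccard,
                 alpha: float = 0.5) -> float:
    """Weighted blend of character-level and semantic similarity after normalization."""
    p, g = normalize(pred), normalize(gold)
    if p == g:
        return 1.0
    syntactic = SequenceMatcher(None, p, g).ratio()
    return alpha * syntactic + (1 - alpha) * semantic_fn(p, g)


def soft_match(pred: str, gold: str, threshold: float = 0.7) -> bool:
    """Accept an answer whose hybrid score reaches the (assumed) threshold."""
    return hybrid_score(pred, gold) >= threshold
```

Under these assumed rules, a plural written with a zero-width non-joiner and Persian kaf normalizes to the same string as the same word written with Arabic kaf and no joiner, so the two count as a match even though an exact string comparison would reject them.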

Eval Frameworks & Benchmarks · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 31

Year: 2026

Venue: N/A
