The paper introduces DivanBench, a new benchmark that evaluates Persian large language models (LLMs) on their ability to reason about cultural norms and superstitions, going beyond simple factual recall. The benchmark shows that current Persian LLMs exhibit significant acquiescence bias: they can identify appropriate behaviors but struggle to reject violations of cultural norms. The study also finds that continuous pretraining on Persian data exacerbates this bias, and that there is a substantial performance gap between retrieving facts and applying that knowledge in context, which highlights the limits of simply scaling monolingual data to achieve cultural competence.
Persian LLMs can parrot cultural norms but fail spectacularly at applying them, revealing a stark gap between factual recall and conceptual reasoning in cultural contexts.
While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs: arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model's ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.
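As a rough illustration of how acquiescence bias could be quantified on paired scenario verification, the sketch below compares accuracy on norm-following scenarios with accuracy on norm-violating ones. This is a minimal sketch under stated assumptions, not the paper's scoring code: the `PairedResult` record, the `acquiescence_gap` function, and the toy data are all hypothetical, and the abstract does not specify the exact metric used.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """One scenario pair (hypothetical record layout, not from the paper)."""
    accepted_appropriate: bool  # model judged the norm-following scenario acceptable
    accepted_violation: bool    # model judged the norm-violating scenario acceptable


def acquiescence_gap(results: list[PairedResult]) -> float:
    """Accuracy on appropriate scenarios minus accuracy on violations.

    A large positive value means the model tends to say "acceptable"
    regardless of content (an assumed proxy for acquiescence bias).
    """
    if not results:
        raise ValueError("need at least one scenario pair")
    n = len(results)
    # Correct on an appropriate scenario = the model accepted it.
    acc_appropriate = sum(r.accepted_appropriate for r in results) / n
    # Correct on a violation = the model rejected it.
    acc_violation = sum(not r.accepted_violation for r in results) / n
    return acc_appropriate - acc_violation


# Toy data (fabricated for illustration only, not benchmark results).
toy = [
    PairedResult(True, True),
    PairedResult(True, True),
    PairedResult(True, False),
    PairedResult(False, False),
]
gap = acquiescence_gap(toy)
print(f"acquiescence gap: {gap:.2f}")
```

On the toy data the model accepts three of four appropriate scenarios but rejects only two of four violations, so the gap is positive. A model that reasons about the norm rather than agreeing by default would score similarly on both halves of each pair.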