Microsoft Research | AI4Bharat | IIT Madras | KU | Nilekani Centre at AI | Feb 25, 2026 | arXiv:2602.22125

IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages

Thanmay Jayakumar, Mohammed Safi Ur Rahman Khan, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan

AI Summary

The paper introduces IndicIFEval, a new benchmark for evaluating instruction-following capabilities of LLMs in 14 Indic languages using automatically verifiable, rule-based instructions. The benchmark consists of two subsets: translated prompts from IFEval localized for Indic contexts and synthetically generated instructions grounded in native Indic content, with approximately 800 human-verified examples per language. Experiments with open-weight and proprietary models reveal that while models adhere to formatting constraints, they struggle with lexical and cross-lingual tasks, particularly in lower-resource Indic languages.

Key Contribution

LLMs struggle with instruction following in Indic languages despite progress in high-resource languages, as shown by a new benchmark spanning 14 languages.

Abstract

Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers. We introduce IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions. It comprises around 800 human-verified examples per language spread across two complementary subsets: IndicIFEval-Ground, translated prompts from IFEval (Zhou et al., 2023) carefully localized for Indic contexts, and IndicIFEval-Ground, synthetically generated instructions grounded in native Indic content. We conduct a comprehensive evaluation of major open-weight and proprietary models spanning both reasoning and non-reasoning models. While models maintain strong adherence to formatting constraints, they struggle significantly with lexical and cross-lingual tasks -- and despite progress in high-resource languages, instruction-following across the broader Indic family lags significantly behind English. We release IndicIFEval and its evaluation scripts to support progress on multilingual constrained generation (http://github.com/ai4bharat/IndicIFEval).
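
To make "automatically verifiable, rule-based instructions" concrete, here is a minimal sketch of what such checks could look like, one for each constraint family the abstract mentions (formatting, lexical, cross-lingual). This is an illustration only and not the benchmark's released evaluation code: the function names, the 0.9 threshold, and the choice of Devanagari as the example script are all assumptions.

```python
import re
import unicodedata


def has_n_bullets(response: str, n: int) -> bool:
    """Formatting constraint (hypothetical): exactly n markdown bullet lines."""
    bullets = [ln for ln in response.splitlines()
               if re.match(r"^\s*[-*]\s+\S", ln)]
    return len(bullets) == n


def includes_keyword(response: str, keyword: str, min_count: int = 1) -> bool:
    """Lexical constraint (hypothetical): keyword occurs at least min_count times.

    Both strings are NFC-normalized so that differently composed but
    canonically equivalent Indic sequences still match.
    """
    text = unicodedata.normalize("NFC", response)
    key = unicodedata.normalize("NFC", keyword)
    return text.count(key) >= min_count


def mostly_in_script(response: str, lo: int, hi: int,
                     threshold: float = 0.9) -> bool:
    """Language constraint (hypothetical): the share of letters whose code
    point lies in [lo, hi] is at least threshold. Combining vowel signs are
    not counted, because str.isalpha() is False for them.
    """
    letters = [ch for ch in response if ch.isalpha()]
    if not letters:
        return False
    in_range = sum(1 for ch in letters if lo <= ord(ch) <= hi)
    return in_range / len(letters) >= threshold


# Unicode Devanagari block, used here as an example target script.
DEVANAGARI = (0x0900, 0x097F)

response = "- पहला बिंदु\n- दूसरा बिंदु"
checks = [
    has_n_bullets(response, 2),
    includes_keyword(response, "बिंदु", min_count=2),
    mostly_in_script(response, *DEVANAGARI),
]
print(all(checks))
```

Because each check is a deterministic function of the response string, no human or model judge is needed to score a response, which is what makes this style of evaluation reproducible across languages.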

Data Curation & Synthetic Data | Eval Frameworks & Benchmarks | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 22

Year: 2026

Venue: N/A
