OpenAI | Information Technology University | Qatar University | Oct 28, 2025 | arXiv:2510.24438

Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content

Abdullah Mushtaq, M. Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed M. Abdallah, Ala I. Al-Fuqaha, Junaid Qadir

AI Summary

This paper evaluates the ability of GPT-4o, Ansari AI, and Fanar to generate faithful Islamic content using a novel dual-agent framework with quantitative citation verification and qualitative side-by-side comparisons. The framework assesses models across dimensions like Islamic accuracy, citation quality, and structural integrity, using prompts derived from authentic Islamic blogs. Results show that while GPT-4o achieves the highest quantitative scores, all models struggle with reliably producing accurate Islamic content and citations, highlighting the need for community-driven benchmarks.

Key Contribution

LLMs still struggle to reliably produce accurate Islamic content and citations, despite relatively strong performance, revealing a critical gap in faith-sensitive AI writing.

Abstract

Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations -- a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.
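The reported figures can be compared with a few lines of arithmetic. The following is a minimal sketch, not the authors' evaluation code: it uses only the numbers quoted in the abstract (two of the six quantitative dimensions, plus the 116/200 pairwise result), and the helper names `rank_by`, `win_rate`, and `tally_wins`, as well as the `(model_a, model_b, winner)` verdict format, are hypothetical choices made for illustration.

```python
from collections import Counter

# Scores quoted in the abstract (reported out of 5). Only two of the six
# quantitative dimensions are given there; the other four are not filled in.
REPORTED = {
    "GPT-4o":    {"Islamic Accuracy": 3.93, "Citation": 3.38},
    "Ansari AI": {"Islamic Accuracy": 3.68, "Citation": 3.32},
    "Fanar":     {"Islamic Accuracy": 2.76, "Citation": 1.82},
}


def rank_by(dimension, scores=REPORTED):
    """Return model names ordered from highest to lowest score on one dimension."""
    return sorted(scores, key=lambda model: scores[model][dimension], reverse=True)


def win_rate(wins, comparisons):
    """Fraction of pairwise comparisons that a model won."""
    if comparisons <= 0:
        raise ValueError("comparisons must be positive")
    return wins / comparisons


def tally_wins(verdicts):
    """Count wins per model from (model_a, model_b, winner) tuples.

    The tuple layout is an assumption made for this sketch; the abstract
    does not describe how the qualitative agent records its verdicts.
    """
    return Counter(winner for _, _, winner in verdicts)


if __name__ == "__main__":
    print(rank_by("Citation"))
    # Ansari AI's reported pairwise result: 116 wins out of 200.
    print(f"{win_rate(116, 200):.0%}")
```

Keeping the per-dimension ranking separate from the pairwise tally mirrors the split reported in the abstract, where GPT-4o leads the quantitative scores while Ansari AI leads the qualitative pairwise wins.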

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2025

Venue: arXiv.org
