Shawqi Al-Maliki

Hamad Bin Khalifa University

Abstract

Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short of reliably producing accurate Islamic content and citations, a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.

1 Introduction

Islamic content generation demands theological accuracy, stylistic reverence, and precise attribution, because even minor errors (misquoting Qur’anic verses, misattributing Hadiths, or using an inappropriate tone) can propagate misinformation and cause spiritual or physical harm [1]. While modern large language models (LLMs) achieve strong fluency across domains, their reliability drops in high-stakes contexts [2], and conventional metrics like BLEU or ROUGE [3] capture only surface overlap, failing to assess authenticity, citation integrity, or theological correctness [4].
Domain-specific evaluations for high-stakes domains such as medicine and law [5, 6, 7] exist, but religious pipelines remain lacking. In Islamic natural language processing (NLP), systems like Ansari AI, a GPT-4o/Claude chatbot with Qur’anic and Hadith retrieval [8], and Fanar, a Qatar-based RAG-driven LLM [9], show promise, yet evaluations are limited to general Arabic benchmarks (Arabic-SQuAD [10], MLQA [11], TyDiQA [12], Arabic MMLU [13]) that mostly test linguistic aspects rather than theological grounding. Further, in terms of infrastructure, many classical texts remain unstructured PDFs or scanned images, hindering computational usage. Agent-based LLMs that integrate retrieval [14], planning [15, 16], and multi-agent collaboration [17, 18, 19, 20] improve grounding and verifiability, yet no pipeline unifies theological verification with stylistic evaluation for Islamic content.

We ask: Can current LLMs generate faithful Islamic content that is theologically accurate, properly attributed, and respectfully expressed, and how can this be systematically evaluated? To address this, we propose “Can LLMs Write Faithfully?”, a dual-agent framework linking outputs to reference-level verifications for explainable assessment across theological and stylistic dimensions. Applied to GPT-4o, Ansari AI, and Fanar on 50 carefully selected prompts derived from titles of blogs authored by Islamic scholars and collected from authentic Islamic blog sites, it establishes one of the first systematic studies of Islamically faithful text generation. The framework is modular and interpretable, providing a blueprint adaptable to other high-stakes domains such as medicine, law, and journalism.

Figure 1: Illustration of System Design and Methodology of the proposed Dual-Agent framework for LLM-generated Islamic content verification, both quantitatively and qualitatively.

2 Literature Review

Evaluation Challenges in High-Stakes Domains.
Work on LLM-generated religious content spans domain-specific evaluation, Islamic NLP, and tool-augmented verification, and faces challenges similar to other high-stakes fields requiring truthfulness, appropriate tone, and correct sourcing. In law, the Mata v. Avianca case exposed fabricated authorities [21], and general chatbots show hallucination rates of 58–82% on legal questions [22]. RAG-backed tools improve grounding yet still make errors at notable rates (over 17% for Lexis+ AI and Thomson Reuters’ Practical Law; over 34% for Westlaw) [23]. Scholars further distinguish between factual errors and misattributions, the latter closely paralleling the misquotation or misapplication of Qur’anic verses and Hadith in Islamic writing. In medicine, SourceCheckup [24] found that 50–90% of responses were not fully supported by their own citations; even for GPT-4 with RAG, 30% of statements were unsupported and nearly half of the answers were not fully supported. Journalism has seen comparable failures: CNET corrected 41 of 77 AI-written finance articles [25], leading outlets to mandate human fact-checking and restrict AI to assistive roles [26]. Theological education reports related risks; the NEXUS (2024) study documents fabricated biblical citations and recommends supervised use, transparent citation protocols, and clear separation of canonical sources from AI-generated material [27].

Advances and Gaps in Islamic NLP.
Islamic NLP has progressed in Qur’an verse retrieval, Hadith classification, and dialect identification [28], underpinned by foundational work on Arabic morphology and orthography [29]. Pretrained models (AraBERT [30]), benchmarks (Qur’anQA [31]), and new tooling for multimodal data acquisition from authentic sources [32] have advanced Arabic understanding. Islamic chatbots such as Ansari AI and Fanar [8, 9] show pedagogical promise but prioritize conversational fluency over rigorous verification of citations and doctrinal soundness.
In parallel, Islamic AI ethics calls for moral accountability and human oversight [33, 34]. Interdisciplinary work highlights infrastructural barriers: under-digitized, unstructured, fragmented corpora that impede robust training and evaluation [35]. Platforms like Usul.ai, SHARIAsource, and CAMeL Lab [36, 37, 38] point toward machine-actionable Islamic legal data, often leveraging corpora such as Shamela and OpenITI [39, 40]. Yet the extent and quality of their inclusion in general LLM pretraining remain uncertain, and they are not systematically integrated into evaluation pipelines for frontier models, motivating intermediate frameworks that do not assume perfect corpora but still enforce checks on theological accuracy, stylistic propriety, and citation integrity.

Tool-Augmented and Multi-Agent Approaches.
Concurrently, tool-augmented agents combine retrieval-augmented generation [14], chain-of-thought prompting [15], and multi-agent coordination frameworks such as LangChain, CamelAI, OpenAI Agents, CrewAI, and Tree-of-Thought [19, 17, 20, 18, 16]. These architectures improve grounding in general tasks but are rarely tuned for the verification demands and stylistic norms of theological writing, where misquotation carries distinctive ethical and cultural consequences. Standard metrics like BLEU and ROUGE [3] capture surface overlap but miss doctrinal fidelity and respectful tone. Holistic and expert-in-the-loop evaluations offer stronger templates: composite quality metrics [4], and human feedback pipelines in medical and legal NLP that combine expert judgment with automated scoring [5, 6].

3 Methodology

3.1 Prompt and Response Collection

We collected 50 prompts from titles of blogs authored by recognized Islamic scholars across reputable platforms: The Thinking Muslim, IslamOnline, Yaqeen Institute, SeekersGuidance, and UlumalHadith.
Prompts cover five domains: Jurisprudence (Fiqh), Qur’anic Exegesis (Tafsir), Hadith Sciences (Ulum al-Hadith), Theology (Aqidah), and Spiritual Conduct (Adab), ensuring thematic diversity. Each prompt used the template: “Write a blog-style essay on the following topic: [TITLE HERE] The response should be thorough, clear, and well-organized, aimed at a general audience, including reflections, reasoning, and examples where relevant.”
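The prompt construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the template wording is quoted from the text, while the names `PROMPT_TEMPLATE`, `DOMAINS`, and `build_prompt`, the newline separating the title from the instructions, and the whitespace normalization are all assumptions made for the example.

```python
# Template wording is quoted from the paper; the newline after the title
# and all identifiers below are illustrative assumptions.
PROMPT_TEMPLATE = (
    "Write a blog-style essay on the following topic: {title}\n"
    "The response should be thorough, clear, and well-organized, "
    "aimed at a general audience, including reflections, reasoning, "
    "and examples where relevant."
)

# The five thematic domains named in the text (transliterated labels).
DOMAINS = ("Fiqh", "Tafsir", "Ulum al-Hadith", "Aqidah", "Adab")


def build_prompt(title: str) -> str:
    """Fill the shared template with one blog title."""
    # Collapse runs of whitespace so scraped titles are inserted cleanly.
    cleaned = " ".join(title.split())
    if not cleaned:
        raise ValueError("blog title must not be empty")
    return PROMPT_TEMPLATE.format(title=cleaned)


if __name__ == "__main__":
    # Placeholder title, not one of the 50 prompts used in the study.
    print(build_prompt("  Example   Blog Title "))
```

Using one fixed template for every title keeps the instructions identical across GPT-4o, Ansari AI, and Fanar, so differences in the responses can be attributed to the models rather than to prompt wording.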

OpenAI


Research focus

Constitutional AI & AI Ethics (1), Eval Frameworks & Benchmarks (1), Natural Language Processing (1)

Frequent co-authors

Abdullah Mushtaq (1), M. Naeem (1), Ezieddin Elmahjub (1), Ibrahim Ghaznavi (1)

Papers (1)

Oct 28, 2025

Information Technology University · Oct 28, 2025 · also OpenAI, Qatar University

Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content

LLMs still struggle to reliably produce accurate Islamic content and citations, despite relatively strong performance, revealing a critical gap in faith-sensitive AI writing.

Abdullah Mushtaq, M. Naeem, Ezieddin Elmahjub +5

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Natural Language Processing
