This paper benchmarks the hallucination rates of four popular LLMs (ChatGPT, Grok, Gemini, and Copilot) on four academic writing tasks: reference generation, factual explanation, abstract generation, and writing improvement. The authors introduce a "Hallucination Index" (HI) to quantify hallucination across these tasks, scoring models on factual accuracy, reference validity, coherence, style consistency, and academic tone. Results show that no single model excels across all tasks: Grok and Copilot are better at reference generation but struggle with abstract and stylistic prompts, while Gemini and ChatGPT exhibit stronger tone control but higher hallucination risk in factual tasks.
LLMs may sound convincing when writing academic content, but they can still confidently fabricate facts and references at surprisingly high rates.
Large language models (LLMs) show extraordinary abilities, but they remain prone to hallucination, especially when used to generate academic content. We investigated four popular LLMs (ChatGPT, Grok, Gemini, and Copilot) for hallucinations specifically in academic writing. We designed 80 prompts across four categories: reference generation, factual explanation, abstract generation, and writing improvement. We evaluated the models using a 0-5 rubric score covering factual accuracy, reference validity, coherence, style consistency, and academic tone. We introduce a novel weighted metric, the Hallucination Index (HI), to measure hallucination in the responses generated by the models. Some of the most widely used evaluation metrics often fail to catch errors that alter sentiment in machine-translated text. We found that Grok and Copilot perform better on reference generation tasks but often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively. Gemini and ChatGPT, in contrast, show stronger tone control but are weaker on factual tasks and carry higher hallucination risk, with HI scores of 0.53 and 0.57, respectively. Our study found that hallucination behavior depends not only on model architecture but also on the type of task and the prompting conditions. We believe our work opens new research directions for future researchers.
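To make the idea of a weighted rubric metric concrete, here is a minimal sketch of how five 0-5 rubric scores could be combined into a single index between 0 and 1. The abstract does not give the HI formula, so everything specific here is an assumption for illustration: the weights, the normalization, the function name `weighted_rubric_index`, and the direction of the scale. It should not be read as the authors' definition.

```python
# Illustrative sketch only: the paper's actual HI formula and weights are
# not stated in the abstract. All names and numbers below are hypothetical.

CRITERIA = (
    "factual_accuracy",
    "reference_validity",
    "coherence",
    "style_consistency",
    "academic_tone",
)

# Hypothetical weights (assumption); chosen only so that they sum to 1.
EXAMPLE_WEIGHTS = {
    "factual_accuracy": 0.30,
    "reference_validity": 0.30,
    "coherence": 0.15,
    "style_consistency": 0.15,
    "academic_tone": 0.10,
}


def weighted_rubric_index(scores, weights=EXAMPLE_WEIGHTS, max_score=5.0):
    """Combine per-criterion rubric scores (0..max_score) into a value in [0, 1].

    Returns the weighted mean of the scores divided by max_score.
    """
    total_weight = sum(weights[c] for c in CRITERIA)
    weighted_sum = 0.0
    for criterion in CRITERIA:
        if criterion not in scores:
            raise ValueError(f"missing rubric score for {criterion!r}")
        score = scores[criterion]
        if not 0 <= score <= max_score:
            raise ValueError(f"{criterion!r} score {score} is outside 0..{max_score}")
        weighted_sum += weights[criterion] * score
    return weighted_sum / (max_score * total_weight)


# Example: one response scored by a rater on each criterion.
example_scores = {
    "factual_accuracy": 2,
    "reference_validity": 1,
    "coherence": 4,
    "style_consistency": 4,
    "academic_tone": 5,
}
example_index = weighted_rubric_index(example_scores)
```

In this sketch a higher value means better rubric performance. Whether the paper's HI runs in that direction or in the opposite one (for example, as one minus this quantity) is not stated in the abstract, so the reported HI values should not be interpreted through this code.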