The paper introduces LiveFact, a dynamic benchmark for fake news detection that addresses two limitations of static benchmarks: vulnerability to benchmark data contamination (BDC) and the inability to assess temporal reasoning. LiveFact uses continuously updated, time-sensitive evidence sets to simulate real-world misinformation scenarios and evaluates models in both classification and inference modes. Experiments with 22 LLMs reveal a "reasoning gap" where models struggle with early, incomplete information, highlighting the importance of epistemic humility in AI verification.
Open-source Mixture-of-Experts models now rival proprietary systems in fake news detection, but LiveFact reveals a critical "reasoning gap" where all models struggle with the temporal uncertainty inherent in real-world misinformation.
The rapid development of Large Language Models (LLMs) has transformed fake news detection and fact-checking tasks from simple classification to complex reasoning. However, evaluation frameworks have not kept pace. Current benchmarks are static, making them vulnerable to benchmark data contamination (BDC) and ineffective at assessing reasoning under temporal uncertainty. To address this, we introduce LiveFact, a continuously updated benchmark that simulates the real-world "fog of war" in misinformation detection. LiveFact uses dynamic, temporal evidence sets to evaluate models on their ability to reason with evolving, incomplete information rather than on memorized knowledge. We propose a dual-mode evaluation: Classification Mode for final verification and Inference Mode for evidence-based reasoning, along with a component to monitor BDC explicitly. Tests with 22 LLMs show that open-source Mixture-of-Experts models, such as Qwen3-235B-A22B, now match or outperform proprietary state-of-the-art systems. More importantly, our analysis finds a significant "reasoning gap." Capable models exhibit epistemic humility by recognizing unverifiable claims in early data slices, an aspect that traditional static benchmarks overlook. LiveFact sets a sustainable standard for evaluating robust, temporally aware AI verification.
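The idea of a temporal evidence slice can be illustrated with a minimal sketch. This is not LiveFact's implementation: the `Evidence` type, the `evidence_slice` and `toy_verdict` functions, and the "unverifiable" label string are all hypothetical names chosen for illustration, and the rule that an empty slice yields "unverifiable" is an assumption, not something the abstract specifies.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record: a publication date and a text snippet."""
    published: date
    text: str


def evidence_slice(evidence, cutoff):
    """Return only the evidence published on or before the cutoff date."""
    return [e for e in evidence if e.published <= cutoff]


def toy_verdict(visible_evidence, final_label):
    """Toy rule (an assumption, not the paper's scoring): with no visible
    evidence, the humble answer is 'unverifiable'; otherwise fall back to
    the eventual label."""
    if not visible_evidence:
        return "unverifiable"
    return final_label


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    timeline = [
        Evidence(date(2025, 1, 3), "First official statement."),
        Evidence(date(2025, 1, 7), "Independent follow-up report."),
    ]
    for cutoff in (date(2025, 1, 1), date(2025, 1, 5), date(2025, 1, 10)):
        visible = evidence_slice(timeline, cutoff)
        print(cutoff, len(visible), toy_verdict(visible, "false"))
```

Under these assumptions, the earliest cutoff exposes no evidence at all, so the toy rule answers "unverifiable" instead of guessing. A static benchmark that shows only the final evidence set never presents the model with that situation.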