Tencent AI | Feb 26, 2026 | arXiv:2602.22584

Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA

Wenwei Li, Mingde Xu, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang

AI Summary

The paper addresses the challenge of hallucination in industrial advertising QA systems by proposing a reinforced co-adaptation framework for RAG. This framework incorporates GraphRAG, a graph-aware retrieval mechanism that leverages entity-relation structures for multi-hop evidence selection, and evidence-constrained reinforcement learning using Group Relative Policy Optimization (GRPO) with multi-dimensional rewards. Experiments on an internal dataset and a two-week A/B test demonstrate significant improvements in accuracy, completeness, safety, and a substantial reduction in URL hallucination, leading to improved user engagement.
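The multi-hop evidence selection mentioned above can be illustrated with a minimal sketch. This is not the paper's GraphRAG implementation: the toy graph, its entity and relation names, and the `multi_hop_evidence` helper are all hypothetical, and plain breadth-first expansion stands in for whatever graph-aware scoring the authors use.

```python
from collections import deque

# Hypothetical toy knowledge graph: entity -> list of (relation, neighbor).
# Entity and relation names are illustrative, not taken from the paper.
GRAPH = {
    "ad_account": [("has_policy", "review_policy"), ("uses", "billing_center")],
    "review_policy": [("documented_at", "policy_doc")],
    "billing_center": [("documented_at", "billing_doc")],
    "policy_doc": [],
    "billing_doc": [],
}


def multi_hop_evidence(graph, seeds, max_hops=2):
    """Expand breadth-first from seed entities and collect traversed edges.

    Returns a list of (head, relation, tail, hop) tuples, where hop is the
    distance of the tail from the nearest seed along the traversal.
    """
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    evidence = []
    while queue:
        node, hop = queue.popleft()
        if hop == max_hops:
            continue  # do not expand beyond the hop budget
        for relation, neighbor in graph.get(node, []):
            evidence.append((node, relation, neighbor, hop + 1))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hop + 1))
    return evidence


if __name__ == "__main__":
    for triple in multi_hop_evidence(GRAPH, ["ad_account"], max_hops=2):
        print(triple)
```

With `max_hops=1` only the seed's direct relations are returned; raising the budget to 2 also pulls in the documents linked from those neighbors, which is the kind of chained evidence a flat top-k retriever can miss.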

Key Contribution

Reduces hallucination in industrial RAG systems by jointly optimizing retrieval and generation through graph-aware retrieval and reinforcement learning, achieving a 92.7% reduction in URL hallucination in a two-week online A/B test of a production advertising QA system.

Abstract

Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.
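To make the reward design more concrete, the following is a minimal sketch of how a multi-dimensional reward could feed GRPO's group-relative advantages. It rests on several assumptions: the abstract does not say how the reward dimensions are combined, so the weighted sum and the `WEIGHTS` values are hypothetical; the `url_validity` check (fraction of URLs in the answer that also appear in the retrieved evidence) is one plausible reading of an evidence-constrained URL reward, not the authors' definition; and the population standard deviation is used for normalization as a simplification.

```python
import re
from statistics import mean, pstdev

# Simple URL matcher for illustration; real systems may need stricter parsing.
URL_PATTERN = re.compile(r"https?://[^\s<>()]+")

# Hypothetical weights: the source does not state how dimensions are combined.
WEIGHTS = {"faithfulness": 0.4, "style": 0.1, "safety": 0.2, "url_validity": 0.3}


def url_validity(answer, evidence_urls):
    """Fraction of URLs in the answer that also appear in the evidence set.

    Answers that contain no URLs score 1.0, since nothing was fabricated.
    """
    urls = [u.rstrip(".,;") for u in URL_PATTERN.findall(answer)]
    if not urls:
        return 1.0
    return sum(u in evidence_urls for u in urls) / len(urls)


def combined_reward(scores):
    """Weighted sum over per-dimension scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    evidence = {"https://a.example/help"}
    answers = [
        "See https://a.example/help for details.",
        "See https://made-up.example/page for details.",
    ]
    rewards = [
        combined_reward({
            "faithfulness": 1.0, "style": 1.0, "safety": 1.0,
            "url_validity": url_validity(a, evidence),
        })
        for a in answers
    ]
    print(group_relative_advantages(rewards))
```

In this sketch, the answer citing a URL absent from the evidence receives a lower reward than its sibling in the same sampled group, so its advantage is negative and the policy update would push probability mass away from it.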

Natural Language Processing | Recommendation & Information Retrieval | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 34

Year: 2026

Venue: N/A
