UIUC | May 28, 2026 | arXiv:2605.29648

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Shichen Fan, Shicheng Fan, Haochang Hao, Dehai Min, Weihao Liu, Philip S. Yu, Lu Cheng

AI Summary

This paper introduces CorVer, a lightweight, corpus-grounded reward function for reinforcement learning that uses Wikipedia co-occurrence statistics to provide sentence-level feedback for factual question answering. By aligning sentence-level credit to token-level advantages, CorVer addresses two problems: response-level rewards are too coarse, and neural verifiers are expensive and unreliable. Experiments across six instruction-tuned models (3B to 14B) and five QA benchmarks show that CorVer improves over the raw baseline in every (model, benchmark) cell, outperforms four neural-verifier baselines in 18 of 20 cells, and trains 4.8x to 8.4x faster.

Key Contribution

Forget slow, expensive neural verifiers: this work shows a simple corpus lookup can provide faster, better rewards for RL fine-tuning of QA models.

Abstract

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline in every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8x to 8.4x faster.
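The abstract does not give CorVer's scoring formula or its alignment rule, so the following is only a minimal sketch of the general idea: score each sentence by whether its entity pairs co-occur in a corpus, then spread that sentence-level credit onto the tokens of that sentence. Everything in it is an assumption made for illustration, including the function names, the toy corpus, the `min_count` threshold, the reward range of [-1, 1], and the additive mixing weight `beta`. It should not be read as the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def build_pair_counts(docs):
    """Count, for each unordered entity pair, how many documents contain both.

    `docs` is a list of entity lists (hypothetical stand-in for a corpus index).
    """
    pair_counts = Counter()
    for entities in docs:
        pair_counts.update(combinations(sorted(set(entities)), 2))
    return pair_counts


def sentence_reward(entities, pair_counts, min_count=1):
    """Illustrative sentence score in [-1, 1].

    Returns the fraction of entity pairs seen together at least `min_count`
    times, rescaled so that all-supported is +1 and none-supported is -1.
    A sentence with fewer than two entities gets a neutral 0.
    """
    pairs = list(combinations(sorted(set(entities)), 2))
    if not pairs:
        return 0.0
    supported = sum(pair_counts[p] >= min_count for p in pairs)
    return 2.0 * supported / len(pairs) - 1.0


def token_advantages(response_adv, sent_rewards, token_sent_ids, beta=0.5):
    """Map sentence-level credit to tokens (assumed additive alignment).

    `token_sent_ids[t]` is the index of the sentence containing token t,
    or -1 for tokens outside any scored sentence.
    """
    return [
        response_adv + beta * sent_rewards[s] if s >= 0 else response_adv
        for s in token_sent_ids
    ]


# Toy corpus: each document is reduced to the entities it mentions.
docs = [["Marie Curie", "Warsaw"], ["Marie Curie", "Nobel Prize"]]
pair_counts = build_pair_counts(docs)

rewards = [
    sentence_reward(["Marie Curie", "Warsaw"], pair_counts),  # pair is in corpus
    sentence_reward(["Marie Curie", "Paris"], pair_counts),   # pair is not
]
# Tokens 0-1 belong to sentence 0, tokens 2-3 to sentence 1, token 4 to none.
advantages = token_advantages(0.25, rewards, [0, 0, 1, 1, -1], beta=0.5)
```

In this sketch, tokens in the corpus-supported sentence end up with a higher advantage than tokens in the unsupported one, even though both sentences share the same response-level advantage. This is the distinction that a response-level reward alone cannot make.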

Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 26

Year: 2026

Venue: N/A
