DaejeonJungang Cheonggua Co. · Mar 17, 2026 · arXiv:2603.16244

More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification

AI Summary

This paper investigates whether multi-turn review, a natural extension of Cross-Context Review (CCR) for LLM verification, improves error detection. In a controlled experiment with 30 artifacts and 150 injected errors, the authors find that multi-turn Dynamic CCR (D-CCR) variants consistently underperform single-pass CCR because of increased false positives driven by "false positive pressure" and "review target drift." Independent re-review without prior context performs worst, indicating that repetition alone degrades performance.

Key Contribution

Multi-turn review actually *worsens* LLM verification compared to single-pass review, as reviewers fabricate findings and critique the conversation itself rather than the artifact.

Abstract

Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, $p < 0.001$, $d = -0.59$). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure, in which reviewers in later rounds fabricate findings once the artifact's real errors have been exhausted, and (2) Review Target Drift, in which reviewers provided with prior Q&A exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from the amount of information available: within multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.
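The trade-off described in the abstract (a small recall gain outweighed by a precision drop) can be illustrated with the standard F1 formula, the harmonic mean of precision and recall. The sketch below is not the authors' evaluation code. The precision values (0.30 and 0.20) and the recall gain (+0.08) are taken from the abstract, but the abstract does not report absolute recall, so the baseline recall of 0.50 is an assumption chosen only for illustration. The function and constant names are likewise hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract.
SINGLE_PASS_PRECISION = 0.30
MULTI_TURN_PRECISION = 0.20
RECALL_GAIN = 0.08

# ASSUMPTION: absolute recall is not given in the abstract;
# 0.50 is an illustrative placeholder, not a reported result.
ASSUMED_BASELINE_RECALL = 0.50

single_pass_f1 = f1_score(SINGLE_PASS_PRECISION, ASSUMED_BASELINE_RECALL)
multi_turn_f1 = f1_score(MULTI_TURN_PRECISION, ASSUMED_BASELINE_RECALL + RECALL_GAIN)

print(f"single-pass F1 (assumed recall): {single_pass_f1:.3f}")
print(f"multi-turn F1 (assumed recall):  {multi_turn_f1:.3f}")
```

Under this assumed recall, F1 falls from about 0.375 to about 0.297, because the harmonic mean is dominated by the smaller of its two inputs. These figures need not match the reported F1 scores exactly, since the paper may average F1 per artifact rather than compute it from aggregate precision and recall.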

Eval Frameworks & Benchmarks · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
