The paper addresses the problem of fusing heterogeneous retrieval scores from dense vector similarity and graph-based relevance signals in multi-hop question answering. The authors propose PhaseGraph, a method that uses percentile-rank normalization (PIT) to map vector and graph scores onto a common scale before fusion. Experiments on MuSiQue and 2WikiMultiHopQA show that this calibrated fusion improves held-out last-hop retrieval performance compared to uncalibrated fusion.
Stop letting mismatched score distributions sink your multi-hop QA: calibrating vector and graph retrieval scores with percentile-rank normalization yields statistically significant gains.
Graph-augmented retrieval combines dense similarity with graph-based relevance signals such as Personalized PageRank (PPR), but these scores have different distributions and are not directly comparable. We study this as a score calibration problem for heterogeneous retrieval fusion in multi-hop question answering. Our method, PhaseGraph, maps vector and graph scores to a common unit-free scale using percentile-rank normalization (PIT) before fusion, enabling stable combination without discarding magnitude information. Across MuSiQue and 2WikiMultiHopQA, calibrated fusion improves held-out last-hop retrieval on HippoRAG2-style benchmarks: LastHop@5 increases from 75.1% to 76.5% on MuSiQue (8W/1L, p=0.039) and from 51.7% to 53.6% on 2WikiMultiHopQA (11W/2L, p=0.023), both on independent held-out test splits. A theory-driven ablation shows that percentile-based calibration is directionally more robust than min-max normalization on both tune and test splits (1W/6L, p=0.125), while Boltzmann weighting performs comparably to linear fusion after calibration (0W/3L, p=0.25). These results suggest that score commensuration is a robust design choice, and the exact post-calibration operator appears to matter less on these benchmarks.
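The core idea of the abstract above (map each score list to its empirical percentile rank, then combine linearly) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the tie handling, and the mixing weight `alpha` are all assumptions for the example.

```python
def percentile_rank(scores):
    """Map raw scores to empirical percentile ranks in (0, 1].

    A probability-integral-transform-style calibration: each score is
    replaced by the fraction of scores less than or equal to it (ties
    are broken by position here; the paper's tie handling is not specified).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for pos, i in enumerate(order):
        ranks[i] = (pos + 1) / len(scores)
    return ranks

def fuse(vector_scores, graph_scores, alpha=0.5):
    """Linearly combine calibrated vector and graph scores.

    alpha is a hypothetical mixing weight; after calibration both inputs
    live on the same [0, 1] scale, so the combination is well-posed even
    though the raw distributions (cosine similarities vs. PPR mass) differ.
    """
    pv = percentile_rank(vector_scores)
    pg = percentile_rank(graph_scores)
    return [alpha * v + (1 - alpha) * g for v, g in zip(pv, pg)]
```

For example, cosine similarities near 0.8 and PPR scores near 1e-4 are incomparable raw, but both map into [0, 1] after `percentile_rank`, so neither signal dominates the sum by scale alone.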