This paper introduces Alper, a unified framework for entity resolution that iteratively refines a global entity graph by integrating graph propagation with LLM-based pairwise queries. Alper adaptively selects signals to maximize marginal gain under a query budget, addressing the limitations of traditional blocking-matching-clustering pipelines. Experiments on eight benchmark datasets demonstrate Alper's superior performance compared to state-of-the-art methods.
LLMs can drastically improve entity resolution accuracy, but only if you intelligently integrate them into a global graph refinement process that minimizes query costs.
Dirty entity resolution (ER), which identifies records referring to the same real-world entity within a single, messy dataset, is a fundamental task in data management and mining. However, the dominant blocking-matching-clustering paradigm for ER suffers from critical flaws. Its cascaded, decoupled workflow essentially produces a static, sparse graph plagued by missing edges (due to blocking failures) and noisy links (due to matching errors), causing error propagation and yielding suboptimal clusters, particularly when rigid transitivity is imposed during clustering. We contend that matching and clustering are fundamentally synergistic, both optimizing for the construction of an ideal entity graph. Building upon this insight, we propose Alper, a unified framework that integrates these steps into an iterative probabilistic label propagation process over a global, evolving graph. Rather than relying on a disjoint blocking step, Alper refines the graph structure and labels dynamically by adaptively integrating "weak but cheap" signals from graph propagation with "strong but expensive" LLM-based pairwise queries. To improve cost-effectiveness, we formulate signal selection as a constrained optimization problem that maximizes cumulative marginal gain under a query budget, solved by a greedy algorithm with provable theoretical guarantees. Extensive experiments on eight benchmark datasets demonstrate that Alper is consistently superior to state-of-the-art cascaded pipelines.
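The interplay between cheap propagation and budgeted pairwise queries can be illustrated with a toy sketch. This is not Alper's algorithm: the abstract gives neither its propagation rule nor its gain function, so the sketch assumes a two-hop product rule for propagation and uses binary entropy as a stand-in for marginal gain. The names `entropy`, `propagate`, `refine`, and `oracle` are hypothetical, and `oracle` is a stub standing in for an LLM pairwise query.

```python
import math
from itertools import combinations


def entropy(p):
    """Binary entropy in bits; zero once a label is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def propagate(probs, nodes, resolved):
    """Cheap signal (assumed rule): lift p(a, c) to the best two-hop
    product p(a, b) * p(b, c). Oracle-answered pairs are left untouched."""
    updated = dict(probs)
    for a, c in combinations(nodes, 2):
        pair = frozenset((a, c))
        if pair in resolved:
            continue
        for b in nodes:
            if b in (a, c):
                continue
            via = probs[frozenset((a, b))] * probs[frozenset((b, c))]
            updated[pair] = max(updated[pair], via)
    return updated


def refine(nodes, probs, oracle, budget):
    """Alternate propagation with one greedy oracle query per round."""
    probs = dict(probs)
    resolved = set()
    for _ in range(budget):
        probs = propagate(probs, nodes, resolved)
        open_pairs = [pair for pair in probs if pair not in resolved]
        if not open_pairs:
            break
        # Greedy choice: spend the next query on the most uncertain pair.
        target = max(open_pairs, key=lambda pair: entropy(probs[pair]))
        if entropy(probs[target]) == 0.0:
            break
        probs[target] = 1.0 if oracle(*sorted(target)) else 0.0
        resolved.add(target)
    return propagate(probs, nodes, resolved)


# Toy data: records a, b, c are one entity; d is a different one.
truth = {"a": 0, "b": 0, "c": 0, "d": 1}
calls = []


def oracle(x, y):
    """Stub for an expensive pairwise query; logs each call."""
    calls.append((x, y))
    return truth[x] == truth[y]


nodes = ["a", "b", "c", "d"]
scores = {
    ("a", "b"): 0.9, ("b", "c"): 0.6, ("a", "c"): 0.5,
    ("a", "d"): 0.1, ("b", "d"): 0.1, ("c", "d"): 0.4,
}
probs = {frozenset(pair): p for pair, p in scores.items()}

refined = refine(nodes, probs, oracle, budget=2)
matches = sorted(tuple(sorted(pair)) for pair, p in refined.items() if p > 0.5)
# matches -> [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

With a budget of two queries, the first goes to the ambiguous pair (a, c). Once that pair is confirmed, propagation alone lifts (b, c) to 0.9, so no query is spent on it. This mirrors, in miniature, the idea of reserving expensive queries for pairs that cheap signals cannot settle.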