Search papers, labs, and topics across Lattice.
This paper introduces a novel Sequential Monte Carlo (SMC) algorithm for scalable online clustering that decomposes the problem into approximately independent subproblems to reduce memory requirements. The method maintains a tractable representation of uncertainty over cluster assignments, even with complex cluster distributions like those found in text data. Experiments demonstrate the algorithm's accuracy and efficiency in knowledge base construction and other settings where traditional SMC methods fail due to memory constraints.
Overcome the memory bottleneck of Sequential Monte Carlo clustering with a decomposition that enables scaling to large datasets without sacrificing accuracy.
In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.