The paper identifies a key weakness in standard dimensionality reduction techniques like UMAP and t-SNE: their local-neighborhood objectives tend to preserve sampling noise while distorting global topology. To address this, the authors introduce a topology-faithfulness benchmark based on noisy manifolds with known homology and tune their DiRe algorithm against it. The resulting Pareto-optimal DiRe configurations match or beat GPU-accelerated UMAP on classification tasks, recover exact first Betti numbers on stress tests, and preserve 3 to 4 times more topological structure than UMAP on 723K arXiv paper embeddings at comparable wall-clock time.
Popular dimensionality reduction techniques like UMAP can *invent* topological structure not present in the original data, but DiRe avoids this pitfall while matching UMAP's speed and classification performance.
Dimensionality reduction methods such as UMAP and t-SNE are central tools for visualising high-dimensional data, but their local-neighborhood objectives can preserve sampling noise while distorting global topology. We show that standard local metrics reward this noise memorisation: top-performing embeddings invent cycles and disconnected islands absent from the data. We introduce a topology-faithfulness benchmark based on noisy manifolds with known homology, tune DiRe against it, and find Pareto-optimal configurations that match or beat GPU-accelerated UMAP on classification while recovering exact first Betti numbers on stress tests. On 723K arXiv paper embeddings, DiRe preserves 3 to 4 times more topological structure than UMAP at comparable wall-clock time.
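To make the idea of a noisy manifold with known homology concrete, here is a minimal sketch in pure Python. It is not the paper's benchmark and does not call DiRe or UMAP: the helper names (`noisy_circle`, `gf2_rank`, `betti_0_1`), the sample size, the noise level, and the scale `eps` are all assumptions chosen for illustration. It samples a jittered circle, whose true Betti numbers are b0 = 1 and b1 = 1, builds a Vietoris-Rips complex at a fixed scale, and computes b0 and b1 over GF(2). A hand-made "torn" copy of the circle stands in for a distorted embedding that splits the data into disconnected islands.

```python
import itertools
import math
import random


def noisy_circle(n=24, noise=0.02, seed=0):
    """Sample n points on the unit circle with bounded uniform jitter (illustrative)."""
    rng = random.Random(seed)
    pts = []
    for i in range(n):
        theta = 2 * math.pi * i / n
        pts.append((math.cos(theta) + rng.uniform(-noise, noise),
                    math.sin(theta) + rng.uniform(-noise, noise)))
    return pts


def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    pivots = {}  # highest set bit -> reduced row holding that pivot
    for row in rows:
        while row:
            high = row.bit_length() - 1
            if high not in pivots:
                pivots[high] = row
                break
            row ^= pivots[high]  # eliminate the leading bit and keep reducing
    return len(pivots)


def betti_0_1(points, eps):
    """Return (b0, b1) of the Vietoris-Rips complex at scale eps, over GF(2)."""
    n = len(points)
    edges = [(i, j) for i, j in itertools.combinations(range(n), 2)
             if math.dist(points[i], points[j]) <= eps]
    edge_id = {e: k for k, e in enumerate(edges)}

    # b0: count connected components with union-find.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    b0 = len({find(i) for i in range(n)})

    # Boundary of each triangle (3-clique) as a bitmask over edge indices.
    tri_rows = [
        (1 << edge_id[(i, j)]) | (1 << edge_id[(i, k)]) | (1 << edge_id[(j, k)])
        for i, j, k in itertools.combinations(range(n), 3)
        if (i, j) in edge_id and (i, k) in edge_id and (j, k) in edge_id
    ]

    rank_d1 = n - b0                # rank of the edge-to-vertex boundary map
    rank_d2 = gf2_rank(tri_rows)    # rank of the triangle-to-edge boundary map
    b1 = (len(edges) - rank_d1) - rank_d2
    return b0, b1


circle = noisy_circle()
# Hypothetical distortion: drop four points so the loop tears into two arcs.
torn = [p for i, p in enumerate(circle) if i not in (10, 11, 22, 23)]

print(betti_0_1(circle, eps=0.6))  # (1, 1): one component, one loop
print(betti_0_1(torn, eps=0.6))    # (2, 0): two islands, loop destroyed
```

A single fixed scale is the simplest possible choice and works here only because the jitter is bounded well below the gap between neighbouring chord lengths. On real embeddings one would more plausibly use persistent homology across scales, which this sketch does not attempt.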