This paper introduces ProReFF, a feature field model that learns relative distributions of features from pre-trained vision-language models to capture object co-occurrences in unlabeled data. A learning-based alignment strategy handles potentially contradictory observations by aligning them into a coherent relative distribution. An object search agent using ProReFF as a semantic prior is 20% more efficient than the strongest feature-based baseline and achieves up to 80% of human performance on 100 challenges in the Matterport3D simulator.
Forget explicit labels: this method learns object co-occurrence priors directly from unlabeled visual data and reaches up to 80% of human search efficiency.
Object co-occurrences provide a key cue for finding objects successfully and efficiently in unfamiliar environments. Typically, one looks for cups in kitchens and views fridges as evidence of being in a kitchen. Such priors have also been exploited in artificial agents, but they are typically learned from explicitly labeled data or queried from language models. It is still unclear whether these relations can be learned implicitly from unlabeled observations alone. In this work, we address this problem and propose ProReFF, a feature field model trained to predict relative distributions of features obtained from pre-trained vision-language models. In addition, we introduce a learning-based strategy that enables training from unlabeled and potentially contradictory data by aligning inconsistent observations into a coherent relative distribution. For the downstream object search task, we propose an agent that leverages predicted feature distributions as a semantic prior to guide exploration toward regions with a high likelihood of containing the object. We present extensive evaluations that demonstrate that ProReFF captures meaningful relative feature distributions in natural scenes and that provide insight into the impact of our proposed alignment step. We further evaluate the performance of our search agent on 100 challenges in the Matterport3D simulator, comparing it with feature-based baselines and human participants. The proposed agent is 20% more efficient than the strongest baseline and achieves up to 80% of human performance.
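To make the idea of a semantic prior guiding exploration concrete, here is a minimal sketch. It is not the authors' agent. It assumes that each candidate region comes with one predicted feature vector and that the target object is described by a query embedding in the same space. The function names, the softmax temperature, and the distance-discounted utility are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def semantic_prior(query, predicted, temperature=0.1):
    """Turn query-to-region similarities into a normalized prior (softmax).

    `predicted` stands in for per-region features predicted by a model
    such as ProReFF; here they are plain lists (an assumption).
    """
    sims = [cosine(query, feat) for feat in predicted]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def pick_region(query, predicted, distances, temperature=0.1):
    """Pick the region with the best prior-per-travel-cost trade-off.

    The utility prior / (1 + distance) is a hypothetical heuristic,
    not the scoring rule from the paper.
    """
    prior = semantic_prior(query, predicted, temperature)
    utilities = [p / (1.0 + d) for p, d in zip(prior, distances)]
    return max(range(len(utilities)), key=utilities.__getitem__)


# Toy example: three candidate regions with made-up 3-D features.
query = [1.0, 0.0, 0.0]              # embedding of the target object
predicted = [
    [0.9, 0.1, 0.0],                 # region 0: resembles the query
    [0.0, 1.0, 0.0],                 # region 1: unrelated
    [0.1, 0.0, 0.9],                 # region 2: weakly related
]
distances = [4.0, 1.0, 2.0]          # travel cost to each region
best = pick_region(query, predicted, distances)
```

In this toy setup the agent prefers region 0 even though it is the farthest away, because the prior concentrates almost all of its mass there. A flatter prior (higher temperature) would shift the choice toward nearer regions.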