The paper introduces a new task, Episodic Memory with Questions and Feedback (EM-QnF), which addresses the limits of one-shot episodic memory retrieval by letting users refine the model's predictions interactively through feedback. The authors collect datasets for this task and propose a lightweight training scheme that avoids expensive sequential optimization. They also introduce a plug-and-play Feedback ALignment Module (FALM) that significantly improves performance on three benchmarks and generalizes well to human-generated feedback.
Interactive feedback improves episodic memory retrieval over the state of the art on three benchmarks, matching or beating commercial large vision-language models while remaining efficient.
In episodic memory with natural language queries (EM-NLQ), a user may ask a question (e.g., "Where did I place the mug?") that requires searching a long egocentric video, captured from the user's perspective, to find the moment that answers it. However, queries can be ambiguous or incomplete, leading to incorrect responses. Current methods ignore this key aspect and address EM-NLQ in a one-shot setup, limiting their applicability in real-world scenarios. In this work, we address this gap and introduce the Episodic Memory with Questions and Feedback task (EM-QnF). Here, the user can provide feedback on the model's initial prediction or add more information (e.g., "Before this. I'm looking for the big blue mug not the white one"), helping the model refine its predictions interactively. To this end, we collect datasets for feedback-based interaction and propose a lightweight training scheme that avoids expensive sequential optimization. We also introduce a plug-and-play Feedback ALignment Module (FALM) that enables existing EM-NLQ models to incorporate user feedback effectively. Our approach significantly improves over the state of the art on three challenging benchmarks and is better than or competitive with commercial large vision-language models while remaining efficient. Evaluation with human-generated feedback shows that it generalizes well to real-world scenarios.
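The interaction pattern described above, a question followed by feedback rounds that each yield a refined prediction, can be sketched as a simple loop. This is only an illustration of the task setup, not the authors' method: the names `Turn`, `interact`, and `toy_predict` are hypothetical, predictions are assumed to be (start, end) windows in seconds, and the hand-written rule in `toy_predict` is a stand-in for a real model and has nothing to do with FALM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a prediction is a temporal window (start_sec, end_sec) in the video.
Window = Tuple[float, float]


@dataclass
class Turn:
    text: str           # the initial question or a later feedback message
    prediction: Window  # the window returned for this turn


def interact(
    question: str,
    feedback_msgs: List[str],
    predict: Callable[[str, List[Turn]], Window],
) -> List[Turn]:
    """Run one question followed by zero or more feedback rounds."""
    history: List[Turn] = []
    for text in [question, *feedback_msgs]:
        # The predictor sees the new message plus all earlier turns.
        window = predict(text, history)
        history.append(Turn(text, window))
    return history


def toy_predict(text: str, history: List[Turn]) -> Window:
    """Hypothetical stand-in for a model; NOT the method from the paper."""
    if not history:
        return (120.0, 130.0)  # arbitrary first guess for the question
    start, end = history[-1].prediction
    if "before" in text.lower():
        # Move to the 60 seconds preceding the previous guess.
        return (max(0.0, start - 60.0), start)
    return (start, end)  # otherwise keep the previous guess


history = interact(
    "Where did I place the mug?",
    ["Before this. I'm looking for the big blue mug not the white one"],
    toy_predict,
)
for turn in history:
    print(turn.text, "->", turn.prediction)
```

A one-shot EM-NLQ setup corresponds to calling `interact` with an empty feedback list. In this sketch the feedback rounds simply pass the accumulated history back to the predictor, so any model that accepts that history could be plugged in.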