The authors introduce NoTeS-Bank, a new benchmark for evaluating Neural Transcription and Search over handwritten scientific notes, which contain complex mathematical equations, diagrams, and scientific notations. NoTeS-Bank defines two tasks: Evidence-Based VQA (retrieving answers with bounding-box evidence) and Open-Domain VQA (classifying the domain, then retrieving relevant documents and answers). Benchmarking state-of-the-art vision-language models (VLMs) and retrieval frameworks on NoTeS-Bank reveals limitations in structured transcription and multimodal reasoning, highlighting the need for improved vision-language fusion techniques.
Handwritten scientific notes, a common yet challenging document type, now have a dedicated benchmark (NoTeS-Bank) exposing the limitations of current VLMs in transcription and reasoning.
Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets that rely on optical character recognition (OCR) and structured data, NoTeS-Bank demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art vision-language models (VLMs) and retrieval frameworks, exposing limitations in structured transcription and reasoning. NoTeS-Bank provides a rigorous evaluation suite with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.
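To make two of the reported metrics concrete, below is a minimal Python sketch of bounding-box IoU (used for Evidence-Based VQA localization) and ANLS (Average Normalized Levenshtein Similarity, the standard Document VQA answer metric). It assumes axis-aligned boxes given as (x1, y1, x2, y2), lowercase string normalization, and the common ANLS threshold of 0.5; the exact NoTeS-Bank evaluation protocol may differ from these conventions.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def levenshtein(a, b):
    """Edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(prediction, ground_truths, threshold=0.5):
    """Best normalized similarity over accepted answers, zeroed below the threshold."""
    best = 0.0
    for gt in ground_truths:
        denom = max(len(prediction), len(gt))
        if denom == 0:
            sim = 1.0
        else:
            sim = 1.0 - levenshtein(prediction.lower(), gt.lower()) / denom
        best = max(best, sim)
    return best if best >= threshold else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))       # ~0.143
print(anls("faraday's law", ["Faraday's Law"]))  # 1.0
```

Per-question scores such as these are typically averaged over the benchmark, while NDCG@5, MRR, and Recall@K score the document-retrieval side of the Open-Domain VQA task.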