The authors introduce NoTeS-Bank, a new benchmark for evaluating Neural Transcription and Search over handwritten scientific notes, which contain complex mathematical equations, diagrams, and scientific notations. NoTeS-Bank defines two tasks: Evidence-Based VQA (retrieving answers with bounding-box evidence) and Open-Domain VQA (classifying the domain, then retrieving relevant documents and answers). Benchmarking state-of-the-art vision-language models (VLMs) and retrieval frameworks on NoTeS-Bank reveals limitations in structured transcription and multimodal reasoning, highlighting the need for improved vision-language fusion techniques.
Handwritten scientific notes, a common yet challenging document type, now have a dedicated benchmark (NoTeS-Bank) exposing the limitations of current VLMs in transcription and reasoning.
Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets that rely on optical character recognition (OCR) and structured data, NoTeS-Bank demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art vision-language models (VLMs) and retrieval frameworks, exposing limitations in structured transcription and reasoning. NoTeS-Bank provides a rigorous evaluation suite with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.
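To make two of the reported metrics concrete, below is a minimal Python sketch of bounding-box IoU (used for Evidence-Based VQA localization) and ANLS (Average Normalized Levenshtein Similarity, the standard Document VQA answer metric). It assumes axis-aligned boxes given as (x1, y1, x2, y2), lowercase string normalization, and the common ANLS threshold of 0.5; the exact NoTeS-Bank evaluation protocol may differ from these conventions.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def levenshtein(a, b):
    """Edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(prediction, ground_truths, threshold=0.5):
    """Best normalized similarity over accepted answers, zeroed below the threshold."""
    best = 0.0
    for gt in ground_truths:
        denom = max(len(prediction), len(gt))
        if denom == 0:
            sim = 1.0
        else:
            sim = 1.0 - levenshtein(prediction.lower(), gt.lower()) / denom
        best = max(best, sim)
    return best if best >= threshold else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))       # ~0.143
print(anls("faraday's law", ["Faraday's Law"]))  # 1.0
```

Per-question scores such as these are typically averaged over the benchmark, while NDCG@5, MRR, and Recall@K score the document-retrieval side of the Open-Domain VQA task.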