The paper introduces HiMu, a training-free hierarchical multimodal frame selection framework for long video question answering that addresses the trade-off between speed and accuracy in existing methods. HiMu decomposes the query into a hierarchical logic tree, routes atomic predicates to lightweight multimodal experts (vision, audio, text), and composes the resulting signals using fuzzy logic to enforce temporal relationships. Experiments on Video-MME, LongVideoBench, and HERBench-Lite demonstrate that HiMu achieves a state-of-the-art efficiency-accuracy trade-off, outperforming existing selectors and even surpassing agentic systems with significantly fewer FLOPs.
Surpass agentic long-video QA systems at roughly 10x fewer FLOPs by pairing GPT-4o with a hierarchical, training-free frame selector that combines multimodal experts and fuzzy logic.
Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert from a pool spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench, and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
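To make the composition step concrete, here is a minimal sketch of how a satisfaction curve could be assembled from per-frame predicate scores. Everything below is an illustrative assumption rather than the paper's exact formulation: the Gödel min/max operators, the running-max "then" for sequencing, the windowed variant for adjacency, and all function names are hypothetical.

```python
import numpy as np

def fuzzy_and(a, b):
    # Gödel t-norm: a frame satisfies "A AND B" only as much as its weakest predicate.
    return np.minimum(a, b)

def fuzzy_or(a, b):
    # Gödel t-conorm: a frame satisfies "A OR B" as much as its strongest predicate.
    return np.maximum(a, b)

def then(a, b, window=None):
    # One plausible sequencing operator for "A then B": credit B at frame t
    # only to the extent A has already peaked at some earlier frame.
    a_before = np.maximum.accumulate(a)
    if window is not None:
        # Adjacency variant: only count A-evidence from the last `window` frames.
        a_before = np.array(
            [a[max(0, t - window + 1): t + 1].max() for t in range(len(a))]
        )
    return np.minimum(a_before, b)

def smooth(scores, k=5):
    # Temporal smoothing (moving average) to align experts that fire at
    # different granularities, e.g. per-frame CLIP vs. per-utterance ASR.
    kernel = np.ones(k) / k
    return np.convolve(scores, kernel, mode="same")

# Toy query: "the chef chops onions, then the smoke alarm goes off".
# Synthetic per-frame expert scores in [0, 1]; real scores would come
# from the routed experts after normalization.
rng = np.random.default_rng(0)
T = 200
clip_chopping = smooth(rng.random(T) * np.exp(-((np.arange(T) - 60) / 15.0) ** 2))
clap_alarm = smooth(rng.random(T) * np.exp(-((np.arange(T) - 140) / 10.0) ** 2))

satisfaction = then(clip_chopping, clap_alarm)  # continuous satisfaction curve
top_frames = np.argsort(satisfaction)[-16:]     # pick 16 frames for the LVLM
print(sorted(top_frames.tolist()))
```

Because each operator in this sketch is monotone and maps [0, 1] scores back into [0, 1], the root of the logic tree remains a per-frame score, so frame selection reduces to a top-k over the final curve.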