Apr 9, 2026
arXiv:2604.08050

ABMamba: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

Daichi Yashima, Shuhei Kurita, Yusuke Oda, Shuntaro Suzuki, Seitaro Otsuki, Komei Sugiura

AI Summary

This paper introduces ABMamba, a fully open multimodal LLM for video captioning that replaces quadratic attention with a linear-complexity Deep State Space Model backbone. ABMamba employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos at multiple temporal resolutions. On VATEX and MSR-VTT, ABMamba performs competitively with typical MLLMs while achieving approximately 3x higher throughput.

Key Contribution

Achieves approximately 3x higher video captioning throughput while remaining competitive in accuracy, by swapping quadratic attention for a linear-complexity Mamba backbone with hierarchical bidirectional scanning.

Abstract

In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.
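The abstract does not specify how the Aligned Hierarchical Bidirectional Scan is built, so the following is only a minimal NumPy sketch of the general idea, not the authors' module. It assumes a toy diagonal linear recurrence in place of a real Mamba layer, average pooling to form coarser temporal resolutions, and frame repetition to bring each resolution back to the original length. The function names, the strides, and the constants `a` and `b` are all hypothetical. What it illustrates is that each scan visits every frame once, so the cost grows linearly with the number of frames rather than quadratically as in attention.

```python
import numpy as np

def linear_scan(x, a, b):
    """Toy diagonal recurrence h_t = a * h_{t-1} + b * x_t (one pass, O(T))."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def bidirectional_scan(x, a, b):
    """Average a forward scan and a backward scan over the time axis."""
    fwd = linear_scan(x, a, b)
    bwd = linear_scan(x[::-1], a, b)[::-1]
    return 0.5 * (fwd + bwd)

def temporal_pool(x, stride):
    """Average-pool frames in groups of `stride`; leftover frames are dropped."""
    usable = (x.shape[0] // stride) * stride
    return x[:usable].reshape(-1, stride, x.shape[1]).mean(axis=1)

def hierarchical_bidirectional_scan(x, strides=(1, 2, 4), a=0.9, b=0.1):
    """Scan x of shape (T, D) at several temporal resolutions and average.

    Assumes T >= max(strides). The pooling, upsampling, and averaging
    choices here are illustrative assumptions, not taken from the paper.
    """
    num_frames = x.shape[0]
    outputs = []
    for stride in strides:
        coarse = temporal_pool(x, stride)
        scanned = bidirectional_scan(coarse, a, b)
        # Repeat each coarse step so it lines up with the original frames.
        up = np.repeat(scanned, stride, axis=0)[:num_frames]
        if up.shape[0] < num_frames:
            # Pad dropped tail frames with the last available state.
            pad = np.repeat(up[-1:], num_frames - up.shape[0], axis=0)
            up = np.concatenate([up, pad], axis=0)
        outputs.append(up)
    return np.mean(outputs, axis=0)

if __name__ == "__main__":
    frames = np.random.default_rng(0).normal(size=(10, 4))  # 10 frames, 4 features
    fused = hierarchical_bidirectional_scan(frames)
    print(fused.shape)
```

In this sketch, doubling the number of frames roughly doubles the work at every resolution, which is the scaling property the abstract contrasts with quadratic attention.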

Architecture Design (Transformers, SSMs, MoE); Multimodal Models; Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 76

Year: 2026

Venue: N/A
