BUPT · UCSC · Feb 23, 2026 · arXiv:2602.19424

Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

Yuxuan Yang, Zhonghao Yan, Yi Zhang, Boxiang Yun, Bo Yun, Muxi Diao, Guowei Zhao, Kongming Liang, Wenbin Li, Zhanyu Ma

AI Summary

The paper introduces Hepato-LLaVA, a multi-modal large language model tailored for hepatocellular pathology analysis using whole slide images, addressing limitations of fixed-resolution processing and inefficient feature aggregation in existing methods. A key innovation is the Sparse Topo-Pack Attention mechanism, which models 2D tissue topology to aggregate local diagnostic evidence into semantic summary tokens while maintaining global context. The model is trained and evaluated on a new dataset, HepatoPathoVQA, consisting of 33K expert-validated, hierarchically structured question-answer pairs, demonstrating state-of-the-art performance in HCC diagnosis and captioning.

Key Contribution

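Hepato-LLaVA achieves state-of-the-art hepatocellular carcinoma (HCC) diagnosis and captioning on whole slide images by combining an MLLM with a topology-aware sparse attention mechanism (Sparse Topo-Pack Attention) and a new expert-validated VQA dataset (HepatoPathoVQA).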

Abstract

Hepatocellular Carcinoma (HCC) diagnosis relies heavily on the interpretation of gigapixel Whole Slide Images (WSIs). However, current computational approaches are constrained by fixed-resolution processing mechanisms and inefficient feature aggregation, which inevitably lead to either severe information loss or high feature redundancy. To address these challenges, we propose Hepato-LLaVA, a specialized Multi-modal Large Language Model (MLLM) designed for fine-grained hepatocellular pathology analysis. We introduce a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology. This mechanism effectively aggregates local diagnostic evidence into semantic summary tokens while preserving global context. Furthermore, to overcome the lack of multi-scale data, we present HepatoPathoVQA, a clinically grounded dataset comprising 33K hierarchically structured question-answer pairs validated by expert pathologists. Our experiments demonstrate that Hepato-LLaVA achieves state-of-the-art performance on HCC diagnosis and captioning tasks, significantly outperforming existing methods. Our code and implementation details are available at https://pris-cv.github.io/Hepto-LLaVA/.
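The abstract describes Sparse Topo-Pack Attention only at a high level. The sketch below is one possible reading under stated assumptions, not the authors' implementation: it assumes the WSI patch grid is split into square 2D "packs" of neighbouring patches, that each pack owns one summary token, that patches attend only within their pack (local evidence), and that summary tokens attend to each other (global context). The token layout, the pack size, and the helpers `pack_index`, `topo_pack_mask`, `density`, and `masked_attention` are all hypothetical illustrations.

```python
import math
from typing import List, Tuple

Mask = List[List[bool]]


def pack_index(row: int, col: int, width: int, pack: int) -> int:
    """Id of the (pack x pack) 2D window containing patch (row, col)."""
    packs_per_row = -(-width // pack)  # ceil(width / pack)
    return (row // pack) * packs_per_row + (col // pack)


def topo_pack_mask(height: int, width: int, pack: int) -> Tuple[Mask, int]:
    """Boolean attention mask for a HYPOTHETICAL topo-pack token layout.

    Assumed sequence: [summary_0 .. summary_{P-1}, patches in row-major order].
    mask[q][k] is True when query token q may attend to key token k.
    """
    num_packs = (-(-width // pack)) * (-(-height // pack))
    n = num_packs + height * width

    # Pack owning each token: summary s owns pack s; patches use their 2D position.
    owner = list(range(num_packs))
    owner += [pack_index(r, c, width, pack) for r in range(height) for c in range(width)]

    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < num_packs and k < num_packs:
                mask[q][k] = True  # summary <-> summary: global context
            elif owner[q] == owner[k]:
                mask[q][k] = True  # same 2D pack: local evidence
    return mask, num_packs


def density(mask: Mask) -> float:
    """Fraction of allowed query/key pairs (1.0 would be dense attention)."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


def masked_attention(values: List[float], scores: List[List[float]], mask: Mask) -> List[float]:
    """Softmax attention over scalar values, restricted to allowed keys."""
    out = []
    for q, row in enumerate(mask):
        allowed = [k for k, ok in enumerate(row) if ok]  # never empty: q sees itself
        m = max(scores[q][k] for k in allowed)  # subtract max for numerical stability
        weights = [math.exp(scores[q][k] - m) for k in allowed]
        z = sum(weights)
        out.append(sum(w * values[k] for w, k in zip(weights, allowed)) / z)
    return out


if __name__ == "__main__":
    mask, num_packs = topo_pack_mask(height=8, width=8, pack=4)
    print(num_packs, len(mask), round(density(mask), 3))
```

For an 8×8 patch grid with 4×4 packs, this prints `4 68 0.253`: 4 summary tokens plus 64 patches, with about a quarter of the dense attention pairs allowed. With uniform scores, `masked_attention` makes each summary token the average of its own pack's patches and the other summary tokens, which is one simple way to picture "aggregating local evidence into summary tokens while keeping global context".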

Computer Vision · Multimodal Models · Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 18

Year: 2026

Venue: N/A
