[2] Advancing 3D medical image analysis with multi-modal large language models. arXiv preprint arXiv:2407.07846.
[3] V. I. Butoi, J. J. G. Ortiz, T. Ma, M. R. Sabuncu, J. Guttag, and A. V. Dalca (2023) UniverSeg: universal medical image segmentation. arXiv:2304.06131.
[4] C. Chen, J. Miao, D. Wu, A. Zhong, Z. Yan, S. Kim, J. Hu, Z. Liu, L. Sun, X. Li, T. Liu, P. Heng, and Q. Li (2024) MA-SAM: modality-agnostic SAM adaptation for 3D medical image segmentation. Medical Image Analysis 98, pp. 103310.
[5] Y. Chen et al. (2024) MIMO: a medical vision language model with visual referring multimodal input and pixel grounding multimodal output. arXiv preprint arXiv:2408.07972.
[6] Z. Chen et al. (2024) InternVL: scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, pp. 24185–24198.
[7] G. Comanici et al. (2025) Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
[8] Y. Du, F. Bai, T. Huang, and B. Zhao (2024) SegVol: universal and interactive volumetric medical image segmentation. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024).
[9] I. E. Hamamci et al. (2024) Developing generalist foundation models from a multimodal dataset for 3D computed tomography. In review.
[10] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2021) LoRA: low-rank adaptation of large language models. arXiv:2106.09685.
[11] X. Huang et al. (2024) Towards a multimodal large language model with pixel-level insight for biomedicine. arXiv preprint arXiv:2408.02755.
[12] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Lo, P. Dollár, and R. Girshick (2023) Segment anything. arXiv:2304.02643.
[13] X. Lai et al. (2024) LISA: reasoning segmentation via large language model. In CVPR, pp. 9579–9589.
[14] A. Lavie and A. Agarwal (2007) METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT '07), pp. 228–231.
[15] C. Lee, S. Park, C. Shin, W. H. Choi, H. J. Park, J. E. Lee, and J. C. Ye (2024) Read like a radiologist: efficient vision-language model for 3D medical imaging interpretation. arXiv:2412.13558.
[16] S. Lee, J. Youn, H. Kim, M. Kim, and S. H. Yoon (2024) CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images. arXiv:2310.18341.
[17] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, and C. Li (2024) LLaVA-OneVision: easy visual task transfer. arXiv:2408.03326.
[18] C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao (2023) LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In NeurIPS 2023, Track on Datasets and Benchmarks.
[19] F. Li et al. (2024) LLaVA-NeXT-Interleave: tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2409.17146.
[20] J. Li, D. Li, S. Savarese, and S. Hoi (2023) BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
[21] C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
[22] J. Lin et al. (2023) VILA: on pre-training for visual language models. arXiv preprint arXiv:2306.00989.
[23] T. Lin, W. Zhang, S. Li, Y. Yuan, B. Yu, H. Li, W. He, H. Jiang, M. Li, X. Song, S. Tang, J. Xiao, H. Lin, Y. Zhuang, and B. C. Ooi (2025) HealthGPT: a medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation. arXiv:2502.09838.
[24] H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023) Visual instruction tuning. arXiv preprint arXiv:2304.08485.
[25] L. Luo, B. Tang, X. Chen, R. Han, and T. Chen (2025) VividMed: vision language model with versatile visual grounding for medicine. arXiv:2410.12694.
[26] J. Ma, Y. He, F. Li, L. Han, C. You, and B. Wang (2024) Segment anything in medical images. Nature Communications 15 (1).
[27] J. Ma, S. Kim, F. Li, M. Baharoon, R. Asakereh, H. Lyu, and B. Wang (2024) Segment anything in medical images and videos: benchmark and deployment. arXiv:2408.03322.
[28] J. Ma, Y. Zhang, S. Gu, C. Zhu, C. Ge, Y. Zhang, X. An, C. Wang, Q. Wang, X. Liu, S. Cao, Q. Zhang, S. Liu, Y. Wang, Y. Li, J. He, and X. Yang (2022) AbdomenCT-1K: is abdominal organ segmentation a solved problem? IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (10), pp. 6695–6714.
[29] M. Moor, Q. Huang, S. Wu, M. Yasunaga, Y. Dalmia, J. Leskovec, C. Zakka, E. P. Reis, and P. Rajpurkar (2023) Med-Flamingo: a multimodal medical few-shot learner. In Proceedings of the 3rd Machine Learning for Health Symposium, Proceedings of Machine Learning Research, Vol. 225, pp. 353–367.
[30] V. Nath et al. (2024) VILA-M3: enhancing vision-language models with medical expert knowledge. arXiv preprint arXiv:2405.19665.
[31] OpenAI et al. (2024) GPT-4o system card. arXiv:2410.21276.
[32] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318.
[33] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning transferable visual models from natural language supervision. arXiv:2103.00020.
[34] H. Rasheed et al. (2024) GLaMM: pixel grounding large multimodal model. In CVPR, pp. 13009–13018.
[35] N. Ravi, V. Gabeur, Y. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer (2024) SAM 2: segment anything in images and videos. arXiv:2408.00714.
[36] B. Rister, D. Yi, K. Shivakumar, T. Nobashi, and D. L. Rubin (2020) CT-ORG, a new dataset for multiple organ segmentation in computed tomography. Scientific Data 7 (1), pp. 381.
[37] T. Shaharabany, A. Dahan, R. Giryes, and L. Wolf (2023) AutoSAM: adapting SAM to medical images by overloading the prompt encoder. arXiv:2306.06370.
[38] Y. Shi, X. Zhu, K. Wang, Y. Hu, C. Guo, M. Li, and J. Wu (2025) Med-2E3: a 2D-enhanced 3D medical multimodal large language model. arXiv:2411.12783.
[39] I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy (2021) MLP-Mixer: an all-MLP architecture for vision. arXiv:2105.01601.
[40] S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, A. Wang, R. Fergus, Y. LeCun, and S. Xie (2024) Cambrian-1: a fully open, vision-centric exploration of multimodal LLMs. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024).
[41] R. Vedantam, C. L. Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. In CVPR, pp. 4566–4575.
[42] J. Wasserthal, H. Breit, M. T. Meyer, M. Pradella, D. Hinck, A. W. Sauter, T. Heye, D. T. Boll, J. Cyriac, S. Yang, M. Bach, and M. Segeroth (2023) TotalSegmentator: robust segmentation of 104 anatomic structures in CT images. Radiology: Artificial Intelligence 5 (5), pp. e230024. doi:10.1148/ryai.230024.