MIT CSAIL | IBM Research | Feb 14, 2025 | arXiv:2502.09927

Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence

Granite Vision Team: Leonid Karlinsky, Assaf Arbelle, Abraham Daniels, Ahmed S. Nassar, Amit Alfassi, Bo Wu, Eli Schwartz, Dhiraj Joshi, Jovana Kondic, Nimrod Shabtay, Pengyuan Li, Roei Herzig, Shafiq Abedin, Shaked Perek, Sivan Harary, U. Barzelay, Adi Raz Goldfarb, Aude Oliva, Ben Wieles, Bishwaranjan Bhattacharjee, Brandon Huang, Christoph Auer, Dan Gutfreund, D. Beymer, David Wood, Hildegard Kuehne, Jacob Hansen, J. Shtok, Ken Wong, Luis Angel D. Bathen, Mayank Mishra, Maksym Lysak, Michele Dolfi, Mikhail Yurochkin, Nikolaos Livathinos, Nimrod Harel, Ophir Azulai, O. Naparstek, Rafael Teixeira de Lima, Rameswar Panda, Sivan Doveh, Shubham Gupta, Subhro Das, Syed Zawad, Yusik Kim, Zexue He, Alexander Brooks, Gabe Goodhart, A. Govindjee, Derek Leist, Ibrahim Ibrahim, A. Soffer, David Cox, Kate Soule, Luis A. Lastras, Nirmit Desai, Shila Ofek-Koifman, Sriram Raghavan, T. Syeda-Mahmood, Peter W. J. Staar, Tal Drory, Rogério Feris

AI Summary

The authors introduce Granite Vision, a 2-billion-parameter vision-language model tailored for enterprise applications, especially visual document understanding. The model is trained on an instruction-following dataset that covers document-related tasks such as table and chart extraction, along with general image tasks. Its architecture aligns the visual modality with a decoder-only Granite large language model. Granite Vision demonstrates strong performance on visual document understanding benchmarks and on the LiveXiv benchmark. It also incorporates a test-time safety classification method that uses a sparse set of attention vectors to detect potentially harmful inputs.

Key Contribution

Granite Vision, a new 2B-parameter vision-language model, rivals larger models on visual document understanding tasks and is released under a commercially friendly open-source license with full visibility into its training data.

Abstract

We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2-billion-parameter Granite large language model. Additionally, we introduce a dedicated test-time safety classification approach that leverages a sparse set of attention vectors to identify potentially harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results on standard benchmarks related to visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test set contamination by using a constantly updated corpus of recently published arXiv papers. We are releasing the model under the Apache 2.0 license, allowing for both research and commercial use, while offering complete visibility into the training data and other relevant details. See https://huggingface.co/ibm-granite/ for model weights.

Computer Vision | Multimodal Models | Open-Source Models & Weights

Citation Metrics

Citations: 16

Influential citations: 2

References: 74

Year: 2025

Venue: arXiv.org
