The authors introduce Granite Vision, a 2 billion parameter vision-language model tailored for enterprise applications, especially visual document understanding. The model is trained on an instruction-following dataset encompassing document-related tasks like table and chart extraction, along with general image tasks, using a decoder-only architecture aligned with the visual modality. Granite Vision demonstrates strong performance on visual document understanding benchmarks and the LiveXiv benchmark, while also incorporating a safety classification method based on sparse attention vectors to detect harmful inputs.
A new 2B-parameter vision-language model, Granite Vision, rivals larger models on visual document understanding tasks and is released under a commercially friendly open-source license with full visibility into its training data.
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite large language model. Additionally, we introduce a dedicated safety classification approach at test time that leverages a sparse set of attention vectors to identify potentially harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results on standard benchmarks related to visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test set contamination by using a constantly updated corpus of recently published arXiv papers. We are releasing the model under the Apache 2.0 license, allowing for both research and commercial use, while offering complete visibility into the training data and other relevant details. See https://huggingface.co/ibm-granite/ for model weights.
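The abstract does not spell out how the sparse set of attention vectors is used for safety classification, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each input has already been turned into one feature vector per attention head (the extraction step is not shown), that heads are ranked by nearest-centroid accuracy on a few labeled examples, and that the final label is a majority vote over the top `k` heads. All function names and the toy data are hypothetical.

```python
from collections import Counter
from math import dist


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_centroids(examples):
    """examples: list of (features, label); features[h] is the vector for head h.
    Returns, for each head, a dict mapping label -> class centroid."""
    num_heads = len(examples[0][0])
    per_head = []
    for h in range(num_heads):
        by_label = {}
        for feats, label in examples:
            by_label.setdefault(label, []).append(feats[h])
        per_head.append({lab: centroid(vs) for lab, vs in by_label.items()})
    return per_head


def head_predict(centroids, vector):
    """Label whose centroid is closest (Euclidean distance) to the vector."""
    return min(centroids, key=lambda lab: dist(centroids[lab], vector))


def select_heads(examples, per_head, k):
    """Keep the k heads with the highest nearest-centroid accuracy (the sparse set)."""
    def accuracy(h):
        hits = sum(head_predict(per_head[h], f[h]) == y for f, y in examples)
        return hits / len(examples)
    return sorted(range(len(per_head)), key=accuracy, reverse=True)[:k]


def classify(features, per_head, heads):
    """Majority vote over the selected heads only."""
    votes = Counter(head_predict(per_head[h], features[h]) for h in heads)
    return votes.most_common(1)[0][0]


# Toy data (made up): 3 heads, 2-dimensional vectors. Head 1 carries no signal.
examples = [
    ([[0.0, 0.1], [0.5, 0.5], [1.0, 0.0]], "safe"),
    ([[0.1, 0.0], [0.4, 0.6], [0.9, 0.1]], "safe"),
    ([[1.0, 0.9], [0.5, 0.5], [0.0, 1.0]], "harmful"),
    ([[0.9, 1.0], [0.4, 0.6], [0.1, 0.9]], "harmful"),
]
per_head = fit_centroids(examples)
heads = select_heads(examples, per_head, k=2)
query = [[0.95, 0.95], [0.5, 0.5], [0.05, 0.95]]
print(heads, classify(query, per_head, heads))
```

On this toy data the uninformative head is dropped and only the two discriminative heads vote, which is the sense in which the selected set is sparse. A real system would need to obtain the per-head vectors from the model's attention layers and validate the selection on held-out data.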