The authors introduce AstroLLaVA, a vision-language model fine-tuned from LLaVA for interacting with astronomical imagery through natural language. The model is trained in two stages on ~30k images with captions and question-answer pairs drawn from sources such as NASA's Astronomy Picture of the Day and the Hubble Space Telescope. The authors evaluate AstroLLaVA on an astronomical visual question answering benchmark and release the weights, code, and training set, framing the work as a step towards aligning diverse astronomical data with pre-trained language models.
Imagine asking an AI "What type of galaxy is shown in this Hubble image?" and getting a detailed, accurate answer – AstroLLaVA makes this a reality for astronomical data.
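As a rough illustration of what that interaction could look like in code, the sketch below queries a LLaVA-style checkpoint through the Hugging Face transformers API. It is a minimal sketch under assumptions: the repository name, image URL, and prompt template are placeholders for illustration, not details confirmed by the paper or the released weights.

```python
# A minimal sketch, assuming the released AstroLLaVA weights load like a
# standard LLaVA checkpoint via Hugging Face transformers. The model ID,
# image URL, and prompt template are placeholders, not taken from the paper.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "your-org/astrollava"  # hypothetical repository name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

# Any astronomical image works here; this URL is a stand-in.
image = Image.open(
    requests.get("https://example.org/hubble_galaxy.jpg", stream=True).raw
)

# LLaVA-1.5-style single-turn prompt with an <image> placeholder token.
prompt = "USER: <image>\nWhat type of galaxy is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```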
We present AstroLLaVA, a vision-language model for astronomy that enables interaction with astronomical imagery through natural dialogue. By fine-tuning the LLaVA model on a diverse dataset of ~30k images with captions and question-answer pairs sourced from NASA's Astronomy Picture of the Day, the European Southern Observatory, and the NASA/ESA Hubble Space Telescope, we create a model capable of answering open-ended questions about astronomical concepts depicted visually. Our two-stage fine-tuning process adapts the model to both image captioning and visual question answering in the astronomy domain. We demonstrate AstroLLaVA's performance on an astronomical visual question answering benchmark and release the model weights, code, and training set to encourage further open-source work in this space. Finally, we suggest a roadmap towards general astronomical data alignment with pre-trained language models, and offer an open collaboration space for interested researchers working towards this end.
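To make the two-stage recipe concrete, here is a rough sketch of how stage-one captioning records and stage-two question-answering records could be laid out in the conversation JSON format used by the public LLaVA codebase. The identifiers, file paths, and sample text are invented placeholders, not entries from the released training set.

```python
# Illustrative records in LLaVA's conversation JSON format (placeholders only,
# not actual AstroLLaVA training data). Stage 1 pairs each image with a
# descriptive caption; stage 2 uses multi-turn question-answer conversations.
import json

stage1_captioning = {
    "id": "apod_sample_001",                  # placeholder identifier
    "image": "apod/sample_001.jpg",           # placeholder path
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe this astronomical image."},
        {"from": "gpt", "value": "A barred spiral galaxy seen nearly face-on, ..."},
    ],
}

stage2_vqa = {
    "id": "hubble_sample_001",
    "image": "hubble/sample_001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat type of galaxy is shown here?"},
        {"from": "gpt", "value": "It is a barred spiral galaxy."},
        {"from": "human", "value": "What feature indicates that it is barred?"},
        {"from": "gpt", "value": "The elongated bar of stars crossing the nucleus."},
    ],
}

# Write a tiny sample file in the same list-of-records layout LLaVA expects.
with open("astro_llava_sample.json", "w") as f:
    json.dump([stage1_captioning, stage2_vqa], f, indent=2)
```

In this layout, the first stage teaches the model to map astronomical images to descriptive text, and the second reuses similar imagery with conversational question-answer turns, mirroring LLaVA's own pretraining-then-instruction-tuning split.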