This paper presents a unified pipeline for prompt-driven image analysis, integrating open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description into a single workflow. The pipeline is designed for transparency and repeatability, offering both an interactive UI and a scriptable CLI while retaining intermediate artifacts for debugging. The study demonstrates high detection and segmentation accuracy on single-word prompts and offers implementation-grounded guidance on performance, particularly inpainting runtime and parameter tuning.
Edit images with a single prompt: this pipeline integrates detection, segmentation, inpainting, and description, offering a transparent, reliable pattern for assembling modern vision and multimodal models.
Prompt-driven image analysis converts a single natural-language instruction into multiple steps: locate, segment, edit, and describe. We present a practical case study of a unified pipeline that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description into a single workflow. The system works end to end from a single prompt, retains intermediate artifacts for transparent debugging (detections, masks, overlays, edited images, and before-and-after composites), and provides the same functionality through an interactive UI and a scriptable CLI for consistent, repeatable runs. We highlight integration choices that reduce brittleness, including threshold adjustments, mask inspection with light morphology, and resource-aware defaults. On a small evaluation set of single-word prompts, detection and segmentation produced usable masks in over 90% of cases, with accuracy above 85% under our criteria. On a high-end GPU, inpainting accounts for 60 to 75% of total runtime under typical guidance and sampling settings, motivating careful tuning. The study offers implementation-grounded guidance on thresholds, mask tightness, and diffusion parameters, and details version pinning, artifact logging, and seed control to support replay. Our contribution is a transparent, reliable pattern for assembling modern vision and multimodal models behind a single prompt, with clear guardrails and operational practices that improve robustness in object replacement, scene augmentation, and removal.
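As a minimal sketch of the integration pattern the abstract describes, the following Python code wires the four stages together with the operational practices named above (confidence thresholding, light mask morphology, seed control, and artifact retention). The stage functions `detect`, `segment`, `inpaint`, and `describe`, the `pipeline_stages` module, and all default values (threshold, kernel size, directory layout) are hypothetical placeholders for whichever detector, segmenter, inpainting model, and captioner the paper actually uses; only the OpenCV, NumPy, and PyTorch calls are real APIs.

```python
import os
import random

import cv2
import numpy as np
import torch

# Hypothetical stage wrappers -- stand-ins for the actual open-vocabulary
# detector, promptable segmenter, inpainter, and captioner in the paper.
from pipeline_stages import detect, segment, inpaint, describe  # hypothetical

def run_pipeline(image_path, prompt, out_dir="artifacts", seed=0,
                 box_threshold=0.35, dilate_iters=2):
    """Single-prompt run: locate -> segment -> edit -> describe.

    Writes intermediate artifacts (mask, edited image) so each stage
    can be inspected and the run replayed. Defaults are illustrative
    assumptions, not the paper's settings.
    """
    # Seed control to support replay (the abstract names the practice;
    # the exact mechanism here is an assumption).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    os.makedirs(out_dir, exist_ok=True)
    image = cv2.imread(image_path)

    # 1. Open-vocabulary detection, filtered by a confidence threshold.
    boxes = [b for b in detect(image, prompt) if b.score >= box_threshold]

    # 2. Promptable segmentation, then light morphology: close small
    #    holes and dilate slightly so the inpainting mask fully covers
    #    the object boundary (the mask-tightness trade-off).
    mask = segment(image, boxes)
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    mask = cv2.dilate(mask, kernel, iterations=dilate_iters)
    cv2.imwrite(os.path.join(out_dir, "mask.png"), mask)

    # 3. Text-conditioned inpainting -- the dominant runtime cost
    #    (60 to 75% of the total in the paper's measurements).
    edited = inpaint(image, mask, prompt, seed=seed)
    cv2.imwrite(os.path.join(out_dir, "edited.png"), edited)

    # 4. Vision-language description of the edited result.
    return edited, describe(edited)
```

Dilating the mask a few pixels before inpainting is one common way to trade mask tightness for cleaner blends at object boundaries; the paper's actual morphology, thresholds, and diffusion parameters may differ from these placeholders.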