This paper introduces a pipeline for generating multi-speaker podcast dialogues from image sequences using Vision-Language Models (VLMs), addressing the limitations of current VLMs in creating engaging, long-form narratives. The authors fine-tuned a Qwen3-VL-32B model on a dataset of 4,000 image-dialogue pairs, employing a synthetic-to-real training strategy using the SPoRC and VIST datasets. Results show that the fine-tuned 32B model outperforms a 235B base model in conversational naturalness and narrative depth, as evaluated by AI-as-a-judge and novel style metrics, while maintaining visual grounding.
Forget scaling laws: a fine-tuned 32B VLM can beat a 235B behemoth at generating engaging, multi-speaker podcast dialogues from images.
Vision-Language Models (VLMs) have achieved remarkable success in descriptive tasks such as image captioning and visual question answering (VQA). However, their ability to generate engaging, long-form narratives -- specifically multi-speaker podcast dialogues -- remains under-explored and difficult to evaluate. Standard metrics like BLEU and ROUGE fail to capture the nuances of conversational naturalness, personality, and narrative flow, often rewarding safe, repetitive outputs over engaging storytelling. In this work, we present a novel pipeline for end-to-end visual podcast generation, and fine-tune a Qwen3-VL-32B model on a curated dataset of 4,000 image-dialogue pairs. Crucially, we use a synthetic-to-real training strategy: we train on high-quality podcast dialogues from the Structured Podcast Research Corpus (SPoRC) paired with synthetically generated imagery, and evaluate on real-world photo sequences from the Visual Storytelling Dataset (VIST). This rigorous setup tests the model's ability to generalize from synthetic training data to real-world visual domains. We propose a comprehensive evaluation framework that moves beyond textual overlap, using AI-as-a-judge (Gemini 3 Pro, Claude Opus 4.5, GPT 5.2) and novel style metrics (average turn length, speaker switch rate) to assess quality. Our experiments demonstrate that our fine-tuned 32B model significantly outperforms a 235B base model in conversational naturalness (>80% win rate) and narrative depth (+50% turn length), while maintaining identical visual grounding capabilities (CLIPScore: 20.39).
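The two style metrics named in the abstract, average turn length and speaker switch rate, are straightforward to compute from a turn-segmented transcript. The sketch below is a minimal illustration assuming a dialogue represented as ordered (speaker, text) pairs; the paper's exact definitions (e.g. words vs. tokens per turn) may differ.

```python
def style_metrics(dialogue):
    """Compute simple dialogue style metrics.

    dialogue: list of (speaker, text) tuples in utterance order.
    Returns average turn length (words per turn) and speaker switch
    rate (fraction of adjacent turn pairs where the speaker changes).
    """
    if not dialogue:
        return {"avg_turn_length": 0.0, "speaker_switch_rate": 0.0}
    # Average turn length: mean word count per turn.
    avg_len = sum(len(text.split()) for _, text in dialogue) / len(dialogue)
    # Speaker switch rate over consecutive turn pairs.
    if len(dialogue) < 2:
        switch_rate = 0.0
    else:
        switches = sum(a[0] != b[0] for a, b in zip(dialogue, dialogue[1:]))
        switch_rate = switches / (len(dialogue) - 1)
    return {"avg_turn_length": avg_len, "speaker_switch_rate": switch_rate}

# Hypothetical two-speaker exchange for illustration.
example = [
    ("Host", "Welcome back to the show, today we look at a photo series."),
    ("Guest", "Thanks! The first image already tells a story."),
    ("Host", "Right, let's dive in."),
]
print(style_metrics(example))  # → {'avg_turn_length': 8.0, 'speaker_switch_rate': 1.0}
```

Under this definition, a higher switch rate indicates more back-and-forth interaction, while longer average turns suggest deeper individual contributions, which is how the abstract's "+50% turn length" result would be read.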