Harvard | PKU | School of Intelligence Science and Technology | State Key Laboratory of General | Apr 20, 2026 | arXiv:2604.18258

Long-Text-to-Image Generation via Compositional Prompt Decomposition

AI Summary

This paper introduces Prompt Refraction for Intricate Scene Modeling (PRISM), a compositional method that lets pre-trained text-to-image (T2I) models handle long descriptive prompts. A lightweight module extracts constituent representations from the long prompt, the T2I model makes an independent noise prediction for each component, and the predictions are merged into a single denoising step using energy-based conjunction. Across a range of model architectures, PRISM performs comparably to models fine-tuned on the same training data, and it outperforms baseline models by 7.4% on prompts over 500 tokens in a public benchmark, which indicates better generalization to long inputs.

Key Contribution

PRISM enables pre-trained T2I models to generate images from long prompts by composing per-component noise predictions, outperforming baseline models by 7.4% on prompts over 500 tokens.

Abstract

While modern text-to-image (T2I) models excel at generating images from intricate prompts, they struggle to capture the key details when the inputs are descriptive paragraphs. This limitation stems from the prevalence of concise captions in their training distributions. Existing methods attempt to bridge this gap either by fine-tuning T2I models on long prompts, which generalizes poorly to longer lengths, or by projecting the oversized inputs into the normal-prompt space, which compromises fidelity. We propose Prompt Refraction for Intricate Scene Modeling (PRISM), a compositional approach that enables pre-trained T2I models to process long-sequence inputs. PRISM uses a lightweight module to extract constituent representations from the long prompts. The T2I model makes an independent noise prediction for each component, and the predictions are merged into a single denoising step using energy-based conjunction. We evaluate PRISM across a wide range of model architectures and show performance comparable to that of models fine-tuned on the same training data. Furthermore, PRISM demonstrates superior generalization, outperforming baseline models by 7.4% on prompts over 500 tokens on a challenging public benchmark.
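The merge step described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so this assumes the conjunction form commonly used for composing diffusion models: the merged noise is the unconditional prediction plus a weighted sum of each component's offset from it. The names `compose_noise`, `toy_model`, and the per-component `weights` are hypothetical and are not taken from the paper; real latents would be tensors, while plain lists of floats are used here to keep the example self-contained.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical denoiser signature: (noisy latent, timestep, conditioning) -> noise.
NoiseModel = Callable[[Vector, int, Optional[float]], Vector]


def compose_noise(
    model: NoiseModel,
    x_t: Vector,
    t: int,
    components: Sequence[float],
    weights: Sequence[float],
) -> Vector:
    """Merge independent per-component noise predictions into one prediction.

    Assumed conjunction form (not confirmed by the source):
        eps = eps_uncond + sum_i w_i * (eps_i - eps_uncond)
    """
    if len(components) != len(weights):
        raise ValueError("components and weights must have the same length")

    eps_uncond = model(x_t, t, None)  # unconditional prediction
    merged = list(eps_uncond)
    for cond, w in zip(components, weights):
        eps_c = model(x_t, t, cond)  # independent prediction for one component
        for k in range(len(merged)):
            merged[k] += w * (eps_c[k] - eps_uncond[k])
    return merged


def toy_model(x_t: Vector, t: int, cond: Optional[float]) -> Vector:
    """Stand-in for a pre-trained T2I denoiser: shifts the latent by `cond`."""
    shift = 0.0 if cond is None else cond
    return [v + shift for v in x_t]


if __name__ == "__main__":
    # Two components with unit weights, merged into a single prediction.
    print(compose_noise(toy_model, [0.0, 1.0], 0, [1.0, 2.0], [1.0, 1.0]))
```

In this form each component needs its own forward pass per denoising step, so the cost of a step grows with the number of components extracted from the prompt.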

Computer Vision | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
