The paper introduces UniCSG, a diffusion-based framework for style transfer that addresses content-style entanglement through staged training. The first stage combines low-frequency preprocessing with conditioning corruption to achieve semantic disentanglement in the latent space. The second stage refines details with multi-scale frequency supervision, also in the latent space. Pixel-space reward learning is added on top to align the latent objectives with perceptual quality after decoding.
Achieve high-fidelity style transfer without content leakage by disentangling semantics and frequencies in the latent space of diffusion models.
Style transfer must match a target style while preserving content semantics. DiT-based diffusion models often suffer from content-style entanglement, leading to reference-content leakage and unstable generation. We present UniCSG, a unified framework for content-constrained, style-driven generation in both text-guided and reference-guided settings. UniCSG employs staged training: (i) a latent-space semantic disentanglement stage that combines low-frequency preprocessing with conditioning corruption to encourage content-style separation, and (ii) a latent-space frequency-aware detail reconstruction stage that refines details via multi-scale frequency supervision. We further incorporate pixel-space reward learning to align latent objectives with perceptual quality after decoding. Experiments demonstrate improved content faithfulness, style alignment, and robustness in both settings.
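The abstract names the ingredients of the two stages but gives no formulas, so the following is only a minimal NumPy sketch of what such operations could look like, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the FFT mask and its `cutoff`, the drop-or-noise corruption with `drop_prob` and `noise_std`, the L1 distance on amplitude spectra, the average-pool pyramid, and all function names.

```python
import numpy as np

def low_pass(x, cutoff=0.25):
    """Keep spatial frequencies with radius <= cutoff (cycles/sample).

    Assumed form of "low-frequency preprocessing"; x has shape (..., H, W).
    """
    h, w = x.shape[-2:]
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies
    mask = (np.sqrt(fy ** 2 + fx ** 2) <= cutoff).astype(float)
    return np.fft.ifft2(np.fft.fft2(x) * mask).real

def corrupt_condition(cond, rng, drop_prob=0.1, noise_std=0.1):
    """Hypothetical conditioning corruption: drop the conditioning
    entirely with probability drop_prob, otherwise add Gaussian noise."""
    if rng.random() < drop_prob:
        return np.zeros_like(cond)
    return cond + noise_std * rng.standard_normal(cond.shape)

def avg_pool2(x):
    """Downsample the last two axes by 2 using 2x2 average pooling."""
    h, w = x.shape[-2:]
    x = x[..., : h - h % 2, : w - w % 2]  # crop to even size
    x = x.reshape(*x.shape[:-2], h // 2, 2, w // 2, 2)
    return x.mean(axis=(-3, -1))

def multiscale_freq_loss(pred, target, num_scales=3):
    """Assumed multi-scale frequency loss: mean L1 distance between
    amplitude spectra, averaged over a pyramid of num_scales levels."""
    loss = 0.0
    for _ in range(num_scales):
        amp_pred = np.abs(np.fft.fft2(pred, norm="ortho"))
        amp_target = np.abs(np.fft.fft2(target, norm="ortho"))
        loss += np.mean(np.abs(amp_pred - amp_target))
        pred, target = avg_pool2(pred), avg_pool2(target)
    return loss / num_scales
```

In a staged setup like the one described, a first stage would feed `low_pass(latent)` and a corrupted conditioning signal to the model, and a second stage would add something like `multiscale_freq_loss` to the training objective. The pixel-space reward term is omitted here because the abstract does not specify the reward.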