BUPT, Huawei, Stevens, USTC · Feb 26, 2026 · arXiv:2602.22743

Generative Data Transformation: From Mixed to Unified Data

Jiaqing Zhang, Mingjia Yin, Hao Wang, Yuxin Tian, Yuyang Ye, Yawen Li, Wei Guo, Yong Liu, Enhong Chen

AI Summary

The paper introduces Taesar, a data-centric framework that addresses data sparsity and cold-start problems in recommendation systems by generating target-aligned sequential data from mixed-domain data. Taesar employs a contrastive decoding mechanism to encode cross-domain context into target-domain sequences, mitigating negative transfer and domain gaps. Experiments demonstrate that Taesar outperforms model-centric approaches and generalizes across various sequential models by creating enriched datasets suitable for standard recommendation models.

Key Contribution

Forget complex model architectures for cross-domain recommendation: Taesar shows that cleverly transforming your data can unlock better performance with standard models.

Abstract

The performance of recommendation models is intrinsically tied to the quality, volume, and relevance of their training data. To address common challenges such as data sparsity and cold start, recent research has leveraged data from multiple auxiliary domains to enrich the information available in the target domain. However, inherent domain gaps can degrade the quality of mixed-domain data, leading to negative transfer and diminished model performance. The prevailing model-centric paradigm, which relies on complex, customized architectures, struggles to capture the subtle, non-structural sequence dependencies across domains, leading to poor generalization and high demands on computational resources. To address these shortcomings, we propose Taesar, a data-centric framework for target-aligned sequential regeneration, which employs a contrastive decoding mechanism to adaptively encode cross-domain context into target-domain sequences, enabling standard models to learn intricate dependencies without complex fusion architectures. Experiments show that Taesar outperforms model-centric solutions and generalizes to various sequential models. By generating enriched datasets, Taesar effectively combines the strengths of the data- and model-centric paradigms. The code accompanying this paper is available at https://github.com/USTC-StarTeam/Taesar.
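The abstract names contrastive decoding but does not spell out its scoring rule. As a rough illustration only, the sketch below shows the generic form of contrastive decoding: each candidate next item is scored by the log-probability gap between an "expert" and an "amateur" model, restricted to items the expert finds plausible. The function names, the `alpha` cutoff, and the expert/amateur roles are assumptions made for this example. How Taesar maps its domain models onto these roles is not stated here, so this should not be read as the paper's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(expert_logits, amateur_logits, alpha=0.1):
    """Generic contrastive decoding scores (hypothetical helper).

    score(i) = log p_expert(i) - log p_amateur(i) for items whose expert
    probability is at least alpha * max expert probability; all other
    items are masked out with -inf.
    """
    lp_expert = log_softmax(expert_logits)
    lp_amateur = log_softmax(amateur_logits)
    cutoff = math.log(alpha) + max(lp_expert)  # plausibility threshold
    return [
        e - a if e >= cutoff else float("-inf")
        for e, a in zip(lp_expert, lp_amateur)
    ]


def pick_next_item(expert_logits, amateur_logits, alpha=0.1):
    """Return the index of the item with the highest contrastive score."""
    scores = contrastive_scores(expert_logits, amateur_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    expert = [2.0, 1.0, 0.0]   # assumed logits over three candidate items
    amateur = [2.0, 0.0, 0.0]
    print(pick_next_item(expert, amateur, alpha=0.1))
```

With a small `alpha`, the item that the expert prefers much more strongly than the amateur wins even if it is not the expert's top choice. A larger `alpha` tightens the plausibility mask and pulls the choice back toward the expert's most likely item.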

Data Curation & Synthetic Data · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
