The paper introduces MixFormer, a unified Transformer architecture for recommender systems that jointly models sequential behaviors and feature interactions within a single backbone, addressing the co-scaling challenge of decoupled designs. Its unified parameterization enables effective co-scaling across dense capacity and sequence length and facilitates deep interaction between sequential and non-sequential representations. Experiments on large-scale industrial datasets show superior accuracy and efficiency, and online A/B tests on Douyin and Douyin Lite show improvements in user engagement metrics.
By unifying feature interaction and sequence modeling in a single Transformer backbone, MixFormer mitigates the co-scaling trade-off of structurally fragmented recommender models and improves user engagement in production deployments.
As industrial recommender systems enter a scaling-driven regime, Transformer architectures have become increasingly attractive for scaling models towards larger capacity and longer sequences. However, existing Transformer-based recommendation models remain structurally fragmented: sequence modeling and feature interaction are implemented as separate modules with independent parameterization. Such designs introduce a fundamental co-scaling challenge, because model capacity must be split, suboptimally, between dense feature interaction and sequence modeling under a limited computational budget. In this work, we propose MixFormer, a unified Transformer-style architecture tailored for recommender systems, which jointly models sequential behaviors and feature interactions within a single backbone. Through a unified parameterization, MixFormer enables effective co-scaling across both dense capacity and sequence length, mitigating the trade-off observed in decoupled designs. Moreover, the integrated architecture facilitates deep interaction between sequential and non-sequential representations, allowing high-order feature semantics to directly inform sequence aggregation and enhancing overall expressiveness. To ensure industrial practicality, we further introduce a user-item decoupling strategy, an efficiency optimization that significantly reduces redundant computation and inference latency. Extensive experiments on large-scale industrial datasets demonstrate that MixFormer consistently achieves superior accuracy and efficiency. Furthermore, large-scale online A/B tests on two production recommender systems, Douyin and Douyin Lite, show consistent improvements in user engagement metrics, including active days and in-app usage duration.
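The abstract does not specify how feature semantics inform sequence aggregation, so the following is only an illustrative sketch of one common way to do it: non-sequential feature tokens act as attention queries that pool a user's behavior sequence with scaled dot-product attention. The function name `aggregate_sequence`, the toy vectors, and the use of plain cross-attention are all assumptions made for illustration; they are not taken from MixFormer itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_sequence(feature_tokens, behavior_seq):
    """Hypothetical helper: each feature token (query) attention-pools
    the behavior sequence (keys and values share the same vectors here).
    Returns one pooled vector per feature token."""
    dim = len(behavior_seq[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in feature_tokens:
        # Attention weights of this feature token over every behavior.
        weights = softmax([dot(q, k) * scale for k in behavior_seq])
        # Weighted sum of behavior vectors.
        pooled.append([
            sum(w * v[i] for w, v in zip(weights, behavior_seq))
            for i in range(dim)
        ])
    return pooled


# Toy data (assumed): two feature tokens, three past behaviors, dim = 2.
feature_tokens = [[1.0, 0.0], [0.0, 1.0]]
behavior_seq = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
pooled = aggregate_sequence(feature_tokens, behavior_seq)
```

In this toy setup the first feature token places most of its weight on the first and third behaviors, while the second token focuses on the middle one, so the same sequence is summarized differently depending on which feature is asking. A unified backbone in the sense of the abstract would presumably stack such interactions with shared parameters, but that detail is not stated in the text above.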