PolyU | May 29, 2026 | arXiv:2605.31455

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu

AI Summary

The paper introduces DRIFT, a novel framework for optimizing LLMs in multi-turn interactive settings that combines the benefits of online RL and offline SFT. DRIFT decouples rollout generation from policy optimization by using a fixed reference policy to sample interaction trajectories and then applies importance-weighted SFT based on return values. Experiments show that DRIFT achieves performance comparable to multi-turn RL baselines while maintaining the efficiency of SFT.
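The pipeline summarized above (sample trajectories once from a fixed reference policy, turn returns into weights, then run weighted SFT) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the exponential weight form, the temperature `beta`, and the function names `return_weights` and `weighted_sft_loss` are all assumptions made for this example, and the log-probabilities are plain floats rather than model outputs.

```python
import math


def return_weights(returns, beta=1.0):
    """Self-normalized weights w_i proportional to exp(R_i / beta).

    Assumption: an exponential, softmax-style weighting; the paper's exact
    return-based weights may differ.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not returns:
        raise ValueError("returns must be non-empty")
    top = max(returns)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / beta) for r in returns]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sft_loss(token_logprobs, weights):
    """Weighted negative log-likelihood over offline trajectories.

    token_logprobs[i] holds the per-token log-probabilities that the policy
    being trained assigns to trajectory i (hypothetical toy values here).
    """
    if len(token_logprobs) != len(weights):
        raise ValueError("need one weight per trajectory")
    return -sum(w * sum(lps) for w, lps in zip(weights, token_logprobs))


# Toy data: three trajectories sampled once from the reference policy.
returns = [0.0, 1.0, 1.0]
token_logprobs = [[-1.0, -1.0], [-0.5], [-0.5]]

weights = return_weights(returns, beta=1.0)
loss = weighted_sft_loss(token_logprobs, weights)
```

In an actual training loop the log-probabilities would come from the model being fine-tuned and the loss would be minimized by gradient descent, while the trajectories and weights stay fixed. A smaller `beta` concentrates the weight on the highest-return trajectories.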

Key Contribution

Get RL-level multi-turn LLM performance with SFT-level efficiency by decoupling trajectory generation and optimization via importance weighting.

Abstract

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning handles multi-turn dynamics effectively but is prohibitively expensive because full correction trajectories must be generated at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To address this, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
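For readers unfamiliar with the equivalence the abstract refers to, the standard single-context form of the result is sketched below. The notation is an assumption introduced here, not taken from the paper: context $x$, response $y$, return $R(x,y)$, temperature $\beta > 0$, and reference policy $\pi_{\mathrm{ref}}$. The paper's multi-turn formulation may differ in its details.

```latex
% KL-regularized objective (for a fixed context x)
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ R(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\!\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its closed-form maximizer
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( R(x, y) / \beta \big),
\qquad
Z(x) = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\big[ \exp\!\big( R(x, y) / \beta \big) \big]

% Projecting \pi^{*} onto a parametric policy \pi_{\theta} is weighted maximum likelihood
% on samples drawn from \pi_{\mathrm{ref}}
\min_{\theta}\; \mathrm{KL}\!\big( \pi^{*}(\cdot \mid x) \,\|\, \pi_{\theta}(\cdot \mid x) \big)
\;\Longleftrightarrow\;
\max_{\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}
  \left[ \frac{\exp\!\big( R(x, y) / \beta \big)}{Z(x)}\, \log \pi_{\theta}(y \mid x) \right]
```

The last line is what makes decoupling possible: the expectation is taken under the fixed $\pi_{\mathrm{ref}}$, so samples can be collected once, and the weight $\exp(R/\beta)/Z$ depends only on returns, not on the policy being trained.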

RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
