The paper introduces a two-stage Bilevel Reinforcement Learning from Human Feedback (RLHF) framework to align ride-sharing vehicle repositioning strategies with human preferences. The first stage uses DQN-RLHF to warm-start a preference-aligned reward model and a reference policy, while the second stage fine-tunes the policy with KL-regularized PPO-RLHF using LLM-generated or rubric-based preference labels. Experiments demonstrate that the proposed framework reduces wait times and empty miles while improving service rates and maintaining platform profit, showing that human preference alignment can be incorporated effectively into large-scale ride-sharing.
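The summary above leaves the Stage-1 reward-model objective implicit. A common choice in RLHF, and one consistent with the pairwise preference labels described here, is a Bradley–Terry loss over preferred/rejected pairs. The sketch below assumes that formulation; `RewardModel`, the feature dimension, and the optimizer settings are illustrative placeholders, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Illustrative reward model: maps repositioning (state, action)
    features to a scalar preference reward."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(rm: RewardModel,
                    preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: the labeler (LLM or rubric) marked
    `preferred` over `rejected`, so push r(preferred) above r(rejected)."""
    return -F.logsigmoid(rm(preferred) - rm(rejected)).mean()

# Toy usage: batches of 32 feature vectors for the two compared options.
rm = RewardModel(feat_dim=16)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
loss = preference_loss(rm, torch.randn(32, 16), torch.randn(32, 16))
opt.zero_grad()
loss.backward()
opt.step()
```

Under this formulation, the learned reward replaces the hand-crafted proxy reward for Stage 2, and the DQN policy trained against it serves as the frozen reference.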
Human-aligned ride-sharing repositioning is now possible without sacrificing platform profit, thanks to a novel two-stage Bilevel RLHF framework.
Vehicle repositioning is essential for improving efficiency and service quality in ride-sharing platforms, yet existing approaches typically optimize proxy rewards that fail to reflect human-centered preferences such as wait time, service coverage, and unnecessary empty travel. We propose the first two-stage Bilevel Reinforcement Learning from Human Feedback (RLHF) framework for preference-aligned vehicle repositioning. In Stage 1, a value-based Deep Q-Network (DQN)-RLHF warm start learns an initial preference-aligned reward model and a stable reference policy, mitigating the reward-model drift and cold-start instability that arise when applying on-policy RLHF directly. In Stage 2, a Kullback–Leibler (KL)-regularized Proximal Policy Optimization (PPO)-RLHF algorithm, equipped with action masking, behavioral-cloning anchoring, and alternating forward–reverse KL, fine-tunes the repositioning policy using either Large Language Model (LLM)-generated or rubric-based preference labels. We develop and compare two coordination schemes, pure alternating (PPO-Alternating) and k-step alternating (PPO-k-step), demonstrating that both yield consistent improvements across all tested arrival scales. Empirically, our framework reduces wait time and empty-mile ratio while improving served rate, without introducing adverse trade-offs or reducing platform profit. These results show that human preference alignment can be stably and effectively incorporated into large-scale ride-sharing repositioning.
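The abstract names the Stage-2 ingredients (action masking, behavioral-cloning anchoring, alternating forward–reverse KL) without giving their exact form. Below is a minimal PyTorch-style sketch of one plausible way they compose into a single policy loss; the function names, the masking constant, and the coefficients `beta` and `bc_weight` are assumptions for illustration, not the paper's implementation. In particular, the BC anchor here uses the reference policy's argmax action as a stand-in for whatever anchoring signal the authors actually use.

```python
import torch
import torch.nn.functional as F

NEG_INF = -1e9  # large negative constant keeps masked KL terms finite

def masked_log_probs(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Action masking: suppress infeasible repositioning moves (e.g. zones
    a vehicle cannot reach) before normalizing into a distribution."""
    return F.log_softmax(logits.masked_fill(~mask, NEG_INF), dim=-1)

def ppo_rlhf_loss(logits, ref_logits, mask, actions, advantages,
                  old_log_probs, step, clip_eps=0.2, beta=0.05, bc_weight=0.1):
    log_p = masked_log_probs(logits, mask)        # current policy pi_theta
    log_ref = masked_log_probs(ref_logits, mask)  # frozen Stage-1 reference

    # PPO clipped surrogate on the actions actually taken.
    lp_a = log_p.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    ratio = (lp_a - old_log_probs).exp()
    surrogate = torch.min(ratio * advantages,
                          ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages)

    # Alternating KL regularization toward the reference policy:
    # reverse KL(pi || ref) on even updates (mode-seeking),
    # forward KL(ref || pi) on odd updates (mass-covering).
    p, ref = log_p.exp(), log_ref.exp()
    if step % 2 == 0:
        kl = (p * (log_p - log_ref)).sum(-1)
    else:
        kl = (ref * (log_ref - log_p)).sum(-1)

    # Behavioral-cloning anchor: keep likelihood on the reference policy's
    # preferred action from collapsing during fine-tuning.
    bc = -log_p.gather(-1, log_ref.argmax(-1, keepdim=True)).squeeze(-1)

    return (-(surrogate - beta * kl) + bc_weight * bc).mean()

# Toy call: batch of 8 states, 10 candidate repositioning zones.
B, A = 8, 10
mask = torch.rand(B, A) > 0.2
mask[:, 0] = True  # ensure at least one feasible action per state
loss = ppo_rlhf_loss(torch.randn(B, A, requires_grad=True),
                     torch.randn(B, A), mask,
                     actions=torch.zeros(B, dtype=torch.long),
                     advantages=torch.randn(B),
                     old_log_probs=-torch.rand(B), step=0)
loss.backward()
```

One design note on the sketch: masking with a large finite negative constant rather than `-inf` keeps both KL directions numerically stable, since masked actions then contribute (near-)zero probability mass instead of `0 * inf` indeterminate terms.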