TCD, University College Dublin

Feb 24, 2026

arXiv:2602.20728

Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback

Chenyang Zhao, Vinny Cahill, Ivana Dusparic

AI Summary

This paper extends Reinforcement Learning from AI Feedback (RLAIF) to multi-objective self-adaptive systems, addressing the challenge of balancing conflicting objectives in urban traffic control. The authors use LLMs to generate preference labels for traffic scenarios, enabling the RL agent to learn balanced trade-offs without explicit reward engineering. The results demonstrate that multi-objective RLAIF can produce policies reflecting different user priorities, offering a scalable approach to user-aligned policy learning.

Key Contribution

LLMs can guide multi-objective reinforcement learning to achieve balanced trade-offs in complex systems like urban traffic control, sidestepping the need for hand-engineered reward functions.

Abstract

Reward design has been one of the central challenges for real-world reinforcement learning (RL) deployment, especially in settings with multiple objectives. Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioural outcomes. More recently, RL from AI feedback (RLAIF) has demonstrated that large language models (LLMs) can generate preference labels at scale, mitigating the reliance on human annotators. However, existing RLAIF work typically focuses only on single-objective tasks, leaving open the question of how RLAIF handles systems that involve multiple objectives. In such systems, trade-offs among conflicting objectives are difficult to specify, and policies risk collapsing into optimizing for a dominant goal. In this paper, we explore the extension of the RLAIF paradigm to multi-objective self-adaptive systems. We show that multi-objective RLAIF can produce policies that yield balanced trade-offs reflecting different user priorities without laborious reward engineering. We argue that integrating RLAIF into multi-objective RL offers a scalable path toward user-aligned policy learning in domains with inherently conflicting objectives.
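The abstract does not describe the training procedure, so the following is only a minimal sketch of the general preference-based idea it refers to: fitting a reward model from pairwise preference labels. It assumes a linear reward over two per-objective cost features and a Bradley-Terry preference model. These are common choices in preference-based RL but are not confirmed as the authors' method. The feature names (vehicle wait, pedestrian wait), the `priority` vectors, and `stand_in_labeler`, which replaces the LLM judge with a simple rule, are all hypothetical.

```python
import math
import random


def segment_return(weights, features):
    """Linear reward estimate: dot product of weights and per-objective features."""
    return sum(w * f for w, f in zip(weights, features))


def preference_probability(weights, feat_a, feat_b):
    """Bradley-Terry probability that outcome A is preferred over outcome B."""
    diff = segment_return(weights, feat_a) - segment_return(weights, feat_b)
    return 1.0 / (1.0 + math.exp(-diff))


def stand_in_labeler(feat_a, feat_b, priority):
    """Hypothetical stand-in for an LLM judge (assumption, not the paper's prompt).

    Features are costs, e.g. (vehicle_wait, pedestrian_wait) scaled to [0, 1].
    Returns 1.0 if A has the lower priority-weighted cost, else 0.0.
    """
    cost_a = sum(p * f for p, f in zip(priority, feat_a))
    cost_b = sum(p * f for p, f in zip(priority, feat_b))
    return 1.0 if cost_a < cost_b else 0.0


def fit_reward_weights(pairs, labels, n_objectives, lr=0.5, epochs=300):
    """Fit linear reward weights by gradient descent on the cross-entropy loss."""
    weights = [0.0] * n_objectives
    for _ in range(epochs):
        grad = [0.0] * n_objectives
        for (feat_a, feat_b), label in zip(pairs, labels):
            p = preference_probability(weights, feat_a, feat_b)
            for i in range(n_objectives):
                # d(loss)/d(w_i) = (p - label) * (a_i - b_i)
                grad[i] += (p - label) * (feat_a[i] - feat_b[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights


def learn_weights_for_priority(priority, n_pairs=300, seed=0):
    """Sample random outcome pairs, label them, and fit a reward model."""
    rng = random.Random(seed)
    pairs = [
        ((rng.random(), rng.random()), (rng.random(), rng.random()))
        for _ in range(n_pairs)
    ]
    labels = [stand_in_labeler(a, b, priority) for a, b in pairs]
    return fit_reward_weights(pairs, labels, n_objectives=len(priority))
```

Calling `learn_weights_for_priority((0.8, 0.2))` and `learn_weights_for_priority((0.2, 0.8))` yields two reward models from the same outcome pairs, differing only in the labels. This is the sense in which different stated priorities can lead to different learned trade-offs. In an RLAIF setting the labeler would be an LLM queried with a description of each outcome.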

RLHF & Preference Learning

Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
