CMU ML | NYU | Apr 6, 2026 | arXiv:2604.04808

Selecting Decision-Relevant Concepts in Reinforcement Learning

Naveen Raman, Stephanie Milani, Fei Fang

AI Summary

This paper introduces Decision-Relevant Selection (DRS), an algorithm for automatically selecting decision-relevant concepts for concept-based reinforcement learning policies. DRS leverages a state abstraction perspective, identifying concepts that, if removed, would lead the agent to confuse states requiring different actions. Empirical results demonstrate that DRS recovers manually curated concept sets, matches or exceeds their performance, and enhances the effectiveness of test-time concept interventions in RL benchmarks and healthcare environments.

Key Contribution

DRS replaces manual concept curation for interpretable RL agents: it automatically selects the concepts that matter for decision-making and comes with bounds relating the selected concepts to the performance of the resulting policy.

Abstract

Training interpretable concept-based policies requires practitioners to manually select which human-understandable concepts an agent should reason with when making sequential decisions. This selection demands domain expertise, is time-consuming and costly, scales poorly with the number of candidates, and provides no performance guarantees. To overcome this limitation, we propose the first algorithms for principled automatic concept selection in sequential decision-making. Our key insight is that concept selection can be viewed through the lens of state abstraction: intuitively, a concept is decision-relevant if removing it would cause the agent to confuse states that require different actions. As a result, agents should rely on decision-relevant concepts; states with the same concept representation should share the same optimal action, which preserves the optimal decision structure of the original state space. This perspective leads to the Decision-Relevant Selection (DRS) algorithm, which selects a subset of concepts from a candidate set, along with performance bounds relating the selected concepts to the performance of the resulting policy. Empirically, DRS automatically recovers manually curated concept sets while matching or exceeding their performance, and improves the effectiveness of test-time concept interventions across reinforcement learning benchmarks and real-world healthcare environments.
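The state-abstraction criterion in the abstract can be illustrated with a toy example. The sketch below is not the paper's DRS algorithm; it is a simple greedy stand-in, assuming a small tabular setting where every state has a binary concept vector and a known optimal action. The names `count_conflicts` and `greedy_select`, the toy driving-style actions, and the pair-counting objective are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import product


def count_conflicts(concepts, actions, selected):
    """Count state pairs that look identical under `selected` concepts
    but have different optimal actions (i.e. states the agent would confuse)."""
    groups = defaultdict(lambda: defaultdict(int))
    for vec, act in zip(concepts, actions):
        key = tuple(vec[i] for i in selected)  # abstract state
        groups[key][act] += 1
    conflicts = 0
    for counts in groups.values():
        total = sum(counts.values())
        all_pairs = total * (total - 1) // 2
        same_action_pairs = sum(c * (c - 1) // 2 for c in counts.values())
        conflicts += all_pairs - same_action_pairs
    return conflicts


def greedy_select(concepts, actions, budget):
    """Greedy stand-in (an assumption, not DRS itself): repeatedly add the
    concept whose inclusion leaves the fewest confused state pairs."""
    n_concepts = len(concepts[0])
    selected = []
    for _ in range(budget):
        if count_conflicts(concepts, actions, selected) == 0:
            break  # every abstract state already maps to a single action
        remaining = [i for i in range(n_concepts) if i not in selected]
        best = min(
            remaining,
            key=lambda i: count_conflicts(concepts, actions, selected + [i]),
        )
        selected.append(best)
    return sorted(selected)


# Toy data (hypothetical): three binary concepts, concept 1 is a distractor.
concepts = list(product([0, 1], repeat=3))


def optimal_action(vec):
    if vec[0] == 1:
        return "brake"
    return "turn" if vec[2] == 1 else "go"


actions = [optimal_action(v) for v in concepts]
chosen = greedy_select(concepts, actions, budget=3)
print(chosen)
```

In this toy setup the optimal action depends only on concepts 0 and 2, so a selection that keeps those two leaves no pair of states with the same representation and different optimal actions. Removing either one reintroduces such pairs, which is the sense in which a concept is decision-relevant.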

Interpretability & Mechanistic Interp | RLHF & Preference Learning | Tool Use & Agents


Year: 2026

Venue: N/A
