Reward Conditioned Policies
Paper: Reward-Conditioned Policies (arXiv:1912.13465v1). Authors: Aviral Kumar, Xue Bin Peng, Sergey Levine
1. Any trajectory is an optimal trajectory when conditioned on matching its reward
Non-expert trajectories collected from suboptimal policies can be viewed as optimal supervision, not for maximizing the reward, but for matching the reward of the given trajectory. (p. 1)
By then conditioning the policy on the numerical value of the reward, we can obtain a policy that generalizes to larger returns.
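The relabel-and-condition idea can be sketched as a toy supervised loop: tag each transition with the return its trajectory actually achieved, then fit a return-conditioned policy by maximum likelihood on those (state, achieved return) → action pairs. Everything below (the 1-step bandit-style environment, the hand-picked linear features, the learning rate) is an illustrative assumption, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def feats(s, Z):
    # Hand-picked features of (state, commanded return) -- an assumption.
    return np.array([s, Z, s * Z, 1.0])

W = np.zeros((2, 4))  # linear return-conditioned policy: logits = W @ feats

# Toy 1-step environment (illustrative): state s ~ U(-1, 1), two actions;
# action 1 earns reward s, action 0 earns reward -s.
def rollout(target_return):
    s = rng.uniform(-1.0, 1.0)
    z = W @ feats(s, target_return)
    p = np.exp(z - z.max()); p /= p.sum()
    a = rng.choice(2, p=p)
    r = s if a == 1 else -s
    return s, a, r

lr = 0.5
for _ in range(200):
    # Collect behavior data with random commanded returns, then relabel:
    # each action is treated as optimal for the return it actually got.
    for s, a, r in [rollout(rng.uniform(-1.0, 1.0)) for _ in range(32)]:
        x = feats(s, r)                      # condition on the ACHIEVED return
        z = W @ x
        p = np.exp(z - z.max()); p /= p.sum()
        g = -p; g[a] += 1.0                  # grad of log-prob of taken action
        W += lr * np.outer(g, x)

def act(s, Z):
    # Greedy action when commanding return Z from state s.
    return int(np.argmax(W @ feats(s, Z)))
```

After training, commanding a high return from a state where action 1 pays off should select action 1, even though no trajectory was ever labeled "optimal" in the usual reward-maximizing sense.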
2. Exploration is a challenge for reward-conditioned policies
We expect that exploration is likely to be one of the major challenges with reward-conditioned policies: the methods we presented rely on generalization and random chance to acquire trajectories that improve in performance over those previously seen in the dataset. Sometimes the reward-conditioned policies might generalize successfully, and sometimes they might not. (p. 9)
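One way to make this reliance on generalization concrete (a hedged sketch, not necessarily the paper's exact scheme) is to sample the commanded return from a distribution fit to the best returns seen so far, shifted slightly upward, so the policy is only ever asked to generalize a small step beyond its own data. The `eps` and `top_k` knobs below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_target_return(returns, eps=0.1, top_k=10):
    # Fit a Gaussian to the top-k returns in the buffer, then shift its
    # mean up by eps: the policy is commanded to do a bit better than it
    # has before, and improvement hinges on it generalizing upward.
    top = np.sort(np.asarray(returns, dtype=float))[-top_k:]
    mu = top.mean() + eps
    sigma = max(top.std(), 1e-3)  # floor so sampling never degenerates
    return rng.normal(mu, sigma)

# Example: returns observed in earlier rollouts.
history = [0.2, 0.5, 0.4, 0.6, 0.3]
targets = [sample_target_return(history, top_k=2) for _ in range(10000)]
```

If the policy fails to generalize to these slightly-higher commanded returns, no better trajectories enter the dataset and the distribution stops moving, which is exactly the failure mode the quote above describes.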