Reward Conditioned Policies
Paper: Reward-Conditioned Policies (arXiv:1912.13465v1). Authors: Aviral Kumar, Xue Bin Peng, Sergey Levine
1. Any trajectory is an optimal trajectory when conditioned on matching its reward
Non-expert trajectories collected from suboptimal policies can be viewed as optimal supervision, not for maximizing the reward, but for matching the reward of the given trajectory. (p. 1)
By then conditioning the policy on the numerical value of the reward, we can obtain a policy that generalizes to larger returns.
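The relabel-and-condition idea can be sketched as a toy supervised loop: tag each transition with the return its trajectory actually achieved, then fit a return-conditioned policy by maximum likelihood on those (state, achieved return) → action pairs. Everything below (the 1-step bandit-style environment, the hand-picked linear features, the learning rate) is an illustrative assumption, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def feats(s, Z):
    # Hand-picked features of (state, commanded return) -- an assumption.
    return np.array([s, Z, s * Z, 1.0])

W = np.zeros((2, 4))  # linear return-conditioned policy: logits = W @ feats

# Toy 1-step environment (illustrative): state s ~ U(-1, 1), two actions;
# action 1 earns reward s, action 0 earns reward -s.
def rollout(target_return):
    s = rng.uniform(-1.0, 1.0)
    z = W @ feats(s, target_return)
    p = np.exp(z - z.max()); p /= p.sum()
    a = rng.choice(2, p=p)
    r = s if a == 1 else -s
    return s, a, r

lr = 0.5
for _ in range(200):
    # Collect behavior data with random commanded returns, then relabel:
    # each action is treated as optimal for the return it actually got.
    for s, a, r in [rollout(rng.uniform(-1.0, 1.0)) for _ in range(32)]:
        x = feats(s, r)                      # condition on the ACHIEVED return
        z = W @ x
        p = np.exp(z - z.max()); p /= p.sum()
        g = -p; g[a] += 1.0                  # grad of log-prob of taken action
        W += lr * np.outer(g, x)

def act(s, Z):
    # Greedy action when commanding return Z from state s.
    return int(np.argmax(W @ feats(s, Z)))
```

After training, commanding a high return from a state where action 1 pays off should select action 1, even though no trajectory was ever labeled "optimal" in the usual reward-maximizing sense.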
2. Exploration is a challenge for reward-conditioned policies
We expect that exploration is likely to be one of the major challenges with reward-conditioned policies: the methods we presented rely on generalization and random chance to acquire trajectories that improve in performance over those previously seen in the dataset. Sometimes the reward-conditioned policies might generalize successfully, and sometimes they might not. (p. 9)
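One way to make this reliance on generalization concrete (a hedged sketch, not necessarily the paper's exact scheme) is to sample the commanded return from a distribution fit to the best returns seen so far, shifted slightly upward, so the policy is only ever asked to generalize a small step beyond its own data. The `eps` and `top_k` knobs below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_target_return(returns, eps=0.1, top_k=10):
    # Fit a Gaussian to the top-k returns in the buffer, then shift its
    # mean up by eps: the policy is commanded to do a bit better than it
    # has before, and improvement hinges on it generalizing upward.
    top = np.sort(np.asarray(returns, dtype=float))[-top_k:]
    mu = top.mean() + eps
    sigma = max(top.std(), 1e-3)  # floor so sampling never degenerates
    return rng.normal(mu, sigma)

# Example: returns observed in earlier rollouts.
history = [0.2, 0.5, 0.4, 0.6, 0.3]
targets = [sample_target_return(history, top_k=2) for _ in range(10000)]
```

If the policy fails to generalize to these slightly-higher commanded returns, no better trajectories enter the dataset and the distribution stops moving, which is exactly the failure mode the quote above describes.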