Supervised Learning for RL
1. Reward Conditioned Policies
Paper: Reward Conditioned Policies - 1912.13465v1.pdf
Authors: Aviral Kumar, Xue Bin Peng, Sergey Levine
1.1. Any trajectory is an optimal trajectory when conditioned on matching its reward
Non-expert trajectories collected from suboptimal policies can be viewed as optimal supervision, not for maximizing the reward, but for matching the reward of the given trajectory. (pg. 1)
By then conditioning the policy on the numerical value of the reward, we can obtain a policy that generalizes to larger returns.
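As a rough illustration, here is a minimal PyTorch sketch of that idea, assuming a discrete action space; the RCPolicy class, the supervised_update helper, and the network sizes are illustrative placeholders, not the paper's exact architecture or algorithm:

```python
# Hedged sketch of reward-conditioned supervised learning (not the paper's code).
import torch
import torch.nn as nn

class RCPolicy(nn.Module):
    """Policy network conditioned on the state and a scalar target return."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, target_return):
        # Append the numerical return value to the state before the MLP.
        x = torch.cat([state, target_return.unsqueeze(-1)], dim=-1)
        return self.net(x)  # logits over discrete actions

def supervised_update(policy, optimizer, states, actions, returns):
    """Treat each logged (state, action) pair as optimal for the return its
    own trajectory actually achieved and fit it with a cross-entropy loss."""
    logits = policy(states, returns)
    loss = nn.functional.cross_entropy(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time the same network is queried with a target return somewhat above the best return seen in the dataset, relying on the conditioning to generalize, which is exactly the exploration concern in the next subsection.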
1.2. Exploration is a challenge with RCP
We expect that exploration is likely to be one of the major challenges with reward-conditioned policies: the methods we presented rely on generalization and random chance to acquire trajectories that improve in performance over those previously seen in the dataset. Sometimes the reward-conditioned policies might generalize successfully, and sometimes they might not. (pg. 9)
2. Goal-Conditioned Supervised Learning
- https://dibyaghosh.com/blog/rl/gcsl.html
- https://www.youtube.com/watch?v=-vMcPk2Uc8g
- paper: Learning to Reach Goals via Iterated Supervised Learning - 1912.06088v4.pdf
Authors: Dibya Ghosh, Benjamin Eysenbach, Sergey Levine
2.1. Any trajectory is optimal if the goal is the final state of the trajectory
Any trajectory is a successful demonstration for reaching the final state in that same trajectory. (pg. 1)
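A minimal sketch of the resulting relabel-and-imitate loop, assuming a discrete-action goal-conditioned policy(state, goal) network, a plain Python list as the buffer, and a toy env with sample_goal() and a step() that returns only the next state; all of these names and interfaces are illustrative assumptions, not the authors' code:

```python
# Hedged sketch of a GCSL-style iteration: every trajectory is treated as an
# expert demonstration for reaching its own final state.
import random
import torch
import torch.nn as nn

def relabel(trajectory, final_state):
    """Turn one trajectory of (state, action) pairs into (state, goal, action)
    tuples whose goal is the trajectory's own final state."""
    return [(s, final_state, a) for (s, a) in trajectory]

def gcsl_iteration(policy, optimizer, env, buffer, horizon=50, batch_size=256):
    # 1. Roll out the current policy toward a commanded goal.
    commanded_goal = env.sample_goal()            # assumed env helper
    state = env.reset()
    trajectory = []
    for _ in range(horizon):
        logits = policy(torch.as_tensor(state, dtype=torch.float32),
                        torch.as_tensor(commanded_goal, dtype=torch.float32))
        action = torch.distributions.Categorical(logits=logits).sample().item()
        trajectory.append((state, action))
        state = env.step(action)                  # assumed: returns next state only

    # 2. Relabel: whatever state was actually reached becomes the goal.
    buffer.extend(relabel(trajectory, final_state=state))

    # 3. Behavior-cloning (supervised) update on the relabeled data.
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    states = torch.as_tensor([s for s, g, a in batch], dtype=torch.float32)
    goals = torch.as_tensor([g for s, g, a in batch], dtype=torch.float32)
    actions = torch.as_tensor([a for s, g, a in batch], dtype=torch.long)
    loss = nn.functional.cross_entropy(policy(states, goals), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```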
2.2. Comparison with HER
GCSL is different from Hindsight Experience Replay. (See 00:10:33 Comparison with HER)
| | HER | GCSL |
|---|---|---|
| Is the Goal in the Trajectory? | NO | YES |
| Uses TD Learning? | YES | NO |
- Goal from Trajectory?
  - Given a transition, HER creates a fictitious transition by choosing an arbitrary goal and recomputing the reward according to that goal. The goal doesn't have to be in the trajectory.
  - 00:10:57 GCSL only relabels the transition goal to be the final state of the trajectory (see the sketch after this list for the two relabeling schemes side by side).
- TD Learning? 00:11:21
  - HER uses TD learning (to learn a value function), which is unstable.
  - GCSL directly learns the policy with supervised learning: imitation learning is stable.
  - So even if we relabel the goal in HER to be the terminal state of the trajectory, learning a value function is still not as stable as learning the policy directly.
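To make the table concrete, here is a hedged sketch of the two relabeling schemes applied to a single transition; the dict layout and the compute_reward argument are assumptions for illustration, not either paper's API:

```python
# Illustrative contrast between HER-style and GCSL-style relabeling.

def her_relabel(transition, sampled_goal, compute_reward):
    """HER: choose an arbitrary goal (it need not lie on this trajectory) and
    recompute the reward, because the tuple will be used for TD learning."""
    return {
        "state": transition["state"],
        "action": transition["action"],
        "next_state": transition["next_state"],
        "goal": sampled_goal,
        "reward": compute_reward(transition["next_state"], sampled_goal),
    }

def gcsl_relabel(transition, final_state):
    """GCSL: the goal is always the trajectory's own final state, and no reward
    is needed because the tuple is consumed by supervised learning, not TD."""
    return {
        "state": transition["state"],
        "action": transition["action"],
        "goal": final_state,
    }
```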