Optimizing Expectations - From Deep RL to Stochastic Computation Graphs
PhD thesis by John Schulman
- RL as a special case of optimizing over Stochastic Computation Graphs [Page 3]
- Categories of RL Algorithms
What to Learn?
- Policy: Policy Optimization
- DFO (Derivative-Free Optimization) [e.g. Evolutionary Algorithms, HyperNEAT, cross-entropy method, covariance matrix adaptation, …] [Page 4] (a minimal cross-entropy sketch follows this list)
- Policy Gradient Methods
- Value functions: Approximate Dynamic Programming
- Learn the Dynamics
- Combination of the above three
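A minimal sketch of the cross-entropy method from the DFO item above, applied to a linear policy. The environment interface (Gymnasium-style reset/step), the function names, and all hyperparameters are illustrative choices, not from the thesis.

```python
import numpy as np

def rollout_return(env, theta, horizon=200):
    """Return of one episode with the deterministic linear policy a = theta @ obs.
    Assumes a Gymnasium-style env with a continuous scalar action."""
    obs, _ = env.reset()
    total = 0.0
    for _ in range(horizon):
        obs, reward, terminated, truncated, _ = env.step(theta @ obs)
        total += reward
        if terminated or truncated:
            break
    return total

def cross_entropy_method(env, dim, iters=50, pop=64, elite_frac=0.2):
    """Derivative-free policy search: sample policy parameters from a Gaussian,
    keep the top-scoring 'elite' fraction, and refit the Gaussian to the elites."""
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        thetas = mean + std * np.random.randn(pop, dim)
        returns = np.array([rollout_return(env, th) for th in thetas])
        elites = thetas[np.argsort(returns)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean
```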
- Policy Optimization
- Score function estimator, pathwise derivative estimator [Page 13]
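The two estimators from Page 13 can be contrasted on a toy objective. This PyTorch sketch (my illustration, not thesis code) estimates the gradient of E_{x~N(mu,1)}[(x-2)^2] with respect to mu both ways.

```python
import torch

def f(x):
    return (x - 2.0) ** 2   # toy cost; E[f(x)] = (mu - 2)^2 + 1 for x ~ N(mu, 1)

mu = torch.tensor(0.0, requires_grad=True)
n = 100_000

# Score function (likelihood-ratio) estimator:
# grad_mu E[f(x)] = E[f(x) * grad_mu log p(x; mu)]; f only needs to be evaluated.
x = (mu + torch.randn(n)).detach()                  # samples, treated as constants
logp = torch.distributions.Normal(mu, 1.0).log_prob(x)
score_grad, = torch.autograd.grad((f(x) * logp).mean(), mu)

# Pathwise derivative (reparameterization) estimator:
# write x = mu + eps with eps ~ N(0, 1) and differentiate through f directly.
eps = torch.randn(n)
pathwise_grad, = torch.autograd.grad(f(mu + eps).mean(), mu)

print(score_grad.item(), pathwise_grad.item())      # both approach 2*(mu - 2) = -4
```

The pathwise estimator usually has lower variance but needs a differentiable cost and a reparameterizable sampling step; the score function estimator only needs cost evaluations, which is why policy gradient methods rely on it.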
Vanilla Policy Gradient has problems (Page 24):
- Not sample efficient
- Hard to choose a step size as training progresses
- Prematurely converges to a nearly deterministic policy with suboptimal behaviour (adding an entropy bonus usually fails to prevent this)
Two techniques to fix this (see the sketch after this list):
- Step in the natural gradient direction instead of the ordinary gradient direction
- Choose the step size in a way that guarantees monotonic improvement
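A rough sketch of how the two fixes combine into a TRPO-style update: precondition the gradient with the inverse Fisher matrix (natural gradient direction) and scale the step so the quadratic approximation of the KL divergence equals a trust-region bound. The explicit Fisher matrix and the function name are my simplifications; in practice the product F^{-1} g is computed with conjugate gradient and followed by a line search.

```python
import numpy as np

def natural_gradient_step(grad, fisher, max_kl=0.01):
    """Natural-gradient update with the step size chosen so the quadratic
    approximation of the KL divergence equals max_kl.

    grad   : policy-gradient estimate g (vector)
    fisher : Fisher information matrix F of the policy (assumed given here)
    """
    nat_grad = np.linalg.solve(fisher, grad)        # F^{-1} g: natural direction
    # 0.5 * step^2 * (g^T F^{-1} g) = max_kl  =>  solve for the scalar step size
    step_size = np.sqrt(2.0 * max_kl / (grad @ nat_grad + 1e-8))
    return step_size * nat_grad                     # parameter update delta_theta
```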
- The gradient of the expected cost of any stochastic computation graph can be computed by a simple algorithm: it is the same as differentiating an equivalent surrogate loss function defined on that graph.
- The derivative computation is itself a stochastic computation graph, so higher-order derivatives can be computed the same way.
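A minimal sketch of the surrogate-loss construction on a one-node graph (the symbols theta, x, and the quadratic cost are illustrative choices): multiply the log-probability of each stochastic node by the total downstream cost it influences, treated as a constant, and ordinary backpropagation through that sum yields the score-function gradient estimate.

```python
import torch

# Toy stochastic computation graph: theta -> x ~ N(theta, 1) -> cost f(x) = (x - 3)^2.
theta = torch.tensor(0.5, requires_grad=True)
dist = torch.distributions.Normal(theta, 1.0)
x = dist.sample((10_000,))        # stochastic node: sampling blocks pathwise gradients

cost = (x - 3.0) ** 2             # cost node downstream of the stochastic node

# Surrogate loss: log-probability of each stochastic node times the (held constant)
# sum of costs it influences, plus any deterministically reachable costs (none here).
surrogate = (dist.log_prob(x) * cost.detach()).mean()

# Backpropagating through the surrogate gives an unbiased estimate of
# d/dtheta E[f(x)]; the true value here is 2 * (theta - 3) = -5.
grad, = torch.autograd.grad(surrogate, theta)
print(grad.item())
```

Since the surrogate is built from ordinary differentiable operations, the gradient computation can itself be expressed as a stochastic computation graph, which is how the construction extends to higher-order derivatives.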