Optimizing Expectations - From Deep RL to Stochastic Computation Graphs
PhD thesis by John Schulman
- RL as a special case of optimizing over Stochastic Computation Graphs [Page 3]
- Categories of RL Algorithms
What to Learn?
- Policy: Policy Optimization
- DFO (Derivative-Free Optimization) [e.g. Evolutionary Algorithms, HyperNEAT, cross-entropy method, covariance matrix adaptation, …] [Page 4] (a minimal cross-entropy sketch follows this list)
- Policy Gradient Methods
- Value functions: Approximate Dynamic Programming
- Learn the Dynamics
- Combination of the above three
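A minimal sketch of the cross-entropy method from the DFO item above, applied to a linear policy. The environment interface (Gymnasium-style reset/step), the function names, and all hyperparameters are illustrative choices, not from the thesis.

```python
import numpy as np

def rollout_return(env, theta, horizon=200):
    """Return of one episode with the deterministic linear policy a = theta @ obs.
    Assumes a Gymnasium-style env with a continuous scalar action."""
    obs, _ = env.reset()
    total = 0.0
    for _ in range(horizon):
        obs, reward, terminated, truncated, _ = env.step(theta @ obs)
        total += reward
        if terminated or truncated:
            break
    return total

def cross_entropy_method(env, dim, iters=50, pop=64, elite_frac=0.2):
    """Derivative-free policy search: sample policy parameters from a Gaussian,
    keep the top-scoring 'elite' fraction, and refit the Gaussian to the elites."""
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        thetas = mean + std * np.random.randn(pop, dim)
        returns = np.array([rollout_return(env, th) for th in thetas])
        elites = thetas[np.argsort(returns)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean
```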
- Policy Optimization
- Score function estimator, pathwise derivative estimator [Page 13]
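The two estimators from Page 13 can be contrasted on a toy objective. This PyTorch sketch (my illustration, not thesis code) estimates the gradient of E_{x~N(mu,1)}[(x-2)^2] with respect to mu both ways.

```python
import torch

def f(x):
    return (x - 2.0) ** 2   # toy cost; E[f(x)] = (mu - 2)^2 + 1 for x ~ N(mu, 1)

mu = torch.tensor(0.0, requires_grad=True)
n = 100_000

# Score function (likelihood-ratio) estimator:
# grad_mu E[f(x)] = E[f(x) * grad_mu log p(x; mu)]; f only needs to be evaluated.
x = (mu + torch.randn(n)).detach()                  # samples, treated as constants
logp = torch.distributions.Normal(mu, 1.0).log_prob(x)
score_grad, = torch.autograd.grad((f(x) * logp).mean(), mu)

# Pathwise derivative (reparameterization) estimator:
# write x = mu + eps with eps ~ N(0, 1) and differentiate through f directly.
eps = torch.randn(n)
pathwise_grad, = torch.autograd.grad(f(mu + eps).mean(), mu)

print(score_grad.item(), pathwise_grad.item())      # both approach 2*(mu - 2) = -4
```

The pathwise estimator usually has lower variance but needs a differentiable cost and a reparameterizable sampling step; the score function estimator only needs cost evaluations, which is why policy gradient methods rely on it.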
Vanilla Policy Gradient has problems (Page 24):
- Not sample efficient
- Hard to choose a step size as training progresses
- Prematurely converges to a nearly deterministic policy with suboptimal behaviour (adding an entropy bonus usually fails to prevent this)
Two techniques to fix this (see the sketch after this list):
- Step in the natural gradient direction instead of the ordinary gradient direction
- Choose the step size in a way that guarantees monotonic improvement
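A rough sketch of how the two fixes combine into a TRPO-style update: precondition the gradient with the inverse Fisher matrix (natural gradient direction) and scale the step so the quadratic approximation of the KL divergence equals a trust-region bound. The explicit Fisher matrix and the function name are my simplifications; in practice the product F^{-1} g is computed with conjugate gradient and followed by a line search.

```python
import numpy as np

def natural_gradient_step(grad, fisher, max_kl=0.01):
    """Natural-gradient update with the step size chosen so the quadratic
    approximation of the KL divergence equals max_kl.

    grad   : policy-gradient estimate g (vector)
    fisher : Fisher information matrix F of the policy (assumed given here)
    """
    nat_grad = np.linalg.solve(fisher, grad)        # F^{-1} g: natural direction
    # 0.5 * step^2 * (g^T F^{-1} g) = max_kl  =>  solve for the scalar step size
    step_size = np.sqrt(2.0 * max_kl / (grad @ nat_grad + 1e-8))
    return step_size * nat_grad                     # parameter update delta_theta
```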
- The gradient of the expected cost of any stochastic computation graph can be computed by a simple algorithm: it is the same as differentiating an equivalent surrogate loss function defined on that graph.
- The derivative computation is itself a stochastic computation graph, so higher-order derivatives can be computed the same way.
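A minimal sketch of the surrogate-loss construction on a one-node graph (the symbols theta, x, and the quadratic cost are illustrative choices): multiply the log-probability of each stochastic node by the total downstream cost it influences, treated as a constant, and ordinary backpropagation through that sum yields the score-function gradient estimate.

```python
import torch

# Toy stochastic computation graph: theta -> x ~ N(theta, 1) -> cost f(x) = (x - 3)^2.
theta = torch.tensor(0.5, requires_grad=True)
dist = torch.distributions.Normal(theta, 1.0)
x = dist.sample((10_000,))        # stochastic node: sampling blocks pathwise gradients

cost = (x - 3.0) ** 2             # cost node downstream of the stochastic node

# Surrogate loss: log-probability of each stochastic node times the (held constant)
# sum of costs it influences, plus any deterministically reachable costs (none here).
surrogate = (dist.log_prob(x) * cost.detach()).mean()

# Backpropagating through the surrogate gives an unbiased estimate of
# d/dtheta E[f(x)]; the true value here is 2 * (theta - 3) = -5.
grad, = torch.autograd.grad(surrogate, theta)
print(grad.item())
```

Since the surrogate is built from ordinary differentiable operations, the gradient computation can itself be expressed as a stochastic computation graph, which is how the construction extends to higher-order derivatives.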