Publication record · 18.cifr/2015.schulman.trpo
Trust Region Policy Optimization (Schulman et al., 2015)

We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input.
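For context, the constrained subproblem TRPO solves at each iteration, as stated in the paper: maximize an importance-sampled surrogate objective under an average-KL trust region with step bound δ.

```latex
% TRPO trust-region subproblem: maximize the surrogate advantage
% subject to an average-KL constraint with step bound \delta.
\max_{\theta} \;
\mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}
  \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
         \, A^{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s} \left[ D_{\mathrm{KL}}\!\left(
  \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)
\right) \right] \le \delta
```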
Each TRPO update solves the trust-region subproblem with conjugate gradient, requiring repeated Fisher-vector products that grow costly for large policies and motivating more scalable methods. The authors also flag combining TRPO with compatible function approximation as future work. These limitations directly motivated PPO, which approximates the trust region via a clipped surrogate objective and needs only first-order gradients.
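A minimal sketch of PPO's clipped surrogate loss, assuming PyTorch and precomputed log-probabilities and advantage estimates; the function and argument names here are illustrative, not taken from either paper's code.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss from PPO (negated for gradient descent).

    logp_new:   log pi_theta(a|s) under the current policy
    logp_old:   log pi_theta_old(a|s) from the data-collecting policy
    advantages: advantage estimates A(s, a)
    """
    # Probability ratio r(theta) = pi_theta / pi_theta_old.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping the ratio to [1 - eps, 1 + eps] keeps the update near the old
    # policy, playing the role of TRPO's KL trust region with no second-order
    # computation.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic (elementwise minimum) bound on the surrogate objective.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Because this loss is differentiable with ordinary autograd, it can be minimized with any first-order optimizer, avoiding the conjugate-gradient and Fisher-vector-product machinery that TRPO's constrained step requires.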