Publication record · 18.cifr/2017.schulman.ppo
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a surrogate objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically).
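The "novel objective" the abstract refers to is the paper's clipped surrogate objective, L^CLIP(θ) = Ê_t[min(r_t(θ)Â_t, clip(r_t(θ), 1−ε, 1+ε)Â_t)], where r_t(θ) is the probability ratio between the new and old policies. A minimal sketch of that loss is shown below; it assumes PyTorch, and the variable names (log_probs_new, clip_eps, etc.) are illustrative placeholders rather than names from any official implementation.

```python
# Sketch of the PPO clipped surrogate objective (Schulman et al., 2017).
# Assumes log-probabilities and advantage estimates for a minibatch are given.
import torch

def ppo_clipped_loss(log_probs_new: torch.Tensor,
                     log_probs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective, averaged over the minibatch."""
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (lower) bound; negated so it can be minimized with SGD
    return -torch.min(unclipped, clipped).mean()
```

Because the ratio is clipped, the gradient gives no incentive to move the new policy far outside the [1−ε, 1+ε] band, which is what makes multiple epochs of minibatch updates on the same rollout safe in practice.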
Adaptive clip-epsilon schedules and theoretical guarantees analogous to TRPO's trust-region bounds remain open questions. Combining PPO with intrinsic motivation, auxiliary tasks, or distributed multi-actor rollouts is a natural extension flagged by the authors.
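Purely as an illustration of what an adaptive clip-epsilon schedule might look like (this is not proposed in the paper), one simple variant linearly anneals the clip parameter over training; the function below is a hypothetical example consistent with the first sketch.

```python
def linear_clip_schedule(step: int, total_steps: int,
                         eps_start: float = 0.2, eps_end: float = 0.05) -> float:
    """Hypothetical linearly annealed clip epsilon; not from the paper."""
    frac = min(step / max(total_steps, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)
```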