Publication record · 18.cifr/2017.schulman.ppo
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a surrogate objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically).
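The "novel objective" the abstract refers to is the paper's clipped surrogate objective, L^CLIP(θ) = Ê_t[min(r_t(θ)Â_t, clip(r_t(θ), 1−ε, 1+ε)Â_t)], where r_t(θ) is the probability ratio between the new and old policies. A minimal sketch of that loss is shown below; it assumes PyTorch, and the variable names (log_probs_new, clip_eps, etc.) are illustrative placeholders rather than names from any official implementation.

```python
# Sketch of the PPO clipped surrogate objective (Schulman et al., 2017).
# Assumes log-probabilities and advantage estimates for a minibatch are given.
import torch

def ppo_clipped_loss(log_probs_new: torch.Tensor,
                     log_probs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective, averaged over the minibatch."""
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (lower) bound; negated so it can be minimized with SGD
    return -torch.min(unclipped, clipped).mean()
```

Because the ratio is clipped, the gradient gives no incentive to move the new policy far outside the [1−ε, 1+ε] band, which is what makes multiple epochs of minibatch updates on the same rollout safe in practice.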
Adaptive clip-epsilon schedules and theoretical guarantees analogous to TRPO's trust-region bounds remain open questions. Combining PPO with intrinsic motivation, auxiliary tasks, or distributed multi-actor rollouts is a natural extension flagged by the authors.
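Purely as an illustration of what an adaptive clip-epsilon schedule might look like (this is not proposed in the paper), one simple variant linearly anneals the clip parameter over training; the function below is a hypothetical example consistent with the first sketch.

```python
def linear_clip_schedule(step: int, total_steps: int,
                         eps_start: float = 0.2, eps_end: float = 0.05) -> float:
    """Hypothetical linearly annealed clip epsilon; not from the paper."""
    frac = min(step / max(total_steps, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)
```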