Publication record · 18.cifr/2015.schulman.trpo
Trust Region Policy Optimization (Schulman et al., 2015)

We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input.
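For context, the constrained subproblem TRPO solves at each iteration, as stated in the paper: maximize an importance-sampled surrogate objective under an average-KL trust region with step bound δ.

```latex
% TRPO trust-region subproblem: maximize the surrogate advantage
% subject to an average-KL constraint with step bound \delta.
\max_{\theta} \;
\mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}
  \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
         \, A^{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s} \left[ D_{\mathrm{KL}}\!\left(
  \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)
\right) \right] \le \delta
```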
Each TRPO update solves the trust-region subproblem with conjugate gradient, requiring repeated Fisher-vector products that grow costly for large policies and motivating more scalable methods. The authors also flag combining TRPO with compatible function approximation as future work. These limitations directly motivated PPO, which approximates the trust region via a clipped surrogate objective and needs only first-order gradients.
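A minimal sketch of PPO's clipped surrogate loss, assuming PyTorch and precomputed log-probabilities and advantage estimates; the function and argument names here are illustrative, not taken from either paper's code.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss from PPO (negated for gradient descent).

    logp_new:   log pi_theta(a|s) under the current policy
    logp_old:   log pi_theta_old(a|s) from the data-collecting policy
    advantages: advantage estimates A(s, a)
    """
    # Probability ratio r(theta) = pi_theta / pi_theta_old.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping the ratio to [1 - eps, 1 + eps] keeps the update near the old
    # policy, playing the role of TRPO's KL trust region with no second-order
    # computation.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic (elementwise minimum) bound on the surrogate objective.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Because this loss is differentiable with ordinary autograd, it can be minimized with any first-order optimizer, avoiding the conjugate-gradient and Fisher-vector-product machinery that TRPO's constrained step requires.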