Abstract
Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use the KL divergence to bound the trust region, which results in a natural gradient policy update. We show that the natural gradient and trust-region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
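As a rough illustration of the KL-bounded update and the entropy loss mentioned in the abstract, the following minimal sketch works on a single-state softmax policy whose logits serve as natural parameters. It is not the COPOS algorithm: the function names, the bisection on the step size, the direction KL(new || old), and the toy advantage values are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Convert logits (natural parameters) into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p):
    """Shannon entropy of a categorical distribution, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl(p, q):
    """KL(p || q) for categorical distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_bounded_step(logits, advantages, epsilon, iters=60):
    """Hypothetical helper: shift the logits along the advantages using the
    largest step size whose KL(new || old) stays within epsilon."""
    old = softmax(logits)

    def shifted(step):
        return [l + step * a for l, a in zip(logits, advantages)]

    # Grow the upper bracket until the KL bound is violated (or a cap is hit).
    lo, hi = 0.0, 1.0
    while kl(softmax(shifted(hi)), old) <= epsilon and hi < 1e6:
        hi *= 2.0
    # Bisection: the KL is non-decreasing in the step size for this update.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(softmax(shifted(mid)), old) <= epsilon:
            lo = mid
        else:
            hi = mid
    return shifted(lo)


# Toy values chosen for illustration only.
old_logits = [0.0, 0.0, 0.0, 0.0]
advantages = [1.0, 0.5, -0.5, -1.0]
epsilon = 0.01

new_logits = kl_bounded_step(old_logits, advantages, epsilon)
old_policy = softmax(old_logits)
new_policy = softmax(new_logits)

print("KL(new || old):", kl(new_policy, old_policy))
print("entropy before:", entropy(old_policy))
print("entropy after: ", entropy(new_policy))
```

In this toy setting the update respects the KL bound yet still lowers the policy entropy, and nothing in the update controls how fast that happens. An explicit bound on entropy loss, as the abstract describes for COPOS, would be an additional constraint on top of this step.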
| Original language | English |
|---|---|
| Pages (from-to) | 1443–1466 |
| Journal | Machine Learning |
| Volume | 108 |
| Issue number | 8-9 |
| Publication status | Published - 2019 |
| Publication type | A1 Journal article-refereed |
Keywords
- Policy search
- Reinforcement learning
Publication forum classification
- Publication forum level 3
ASJC Scopus subject areas
- Software
- Artificial Intelligence
Fingerprint
Dive into the research topics of 'Compatible natural gradient policy search'. Together they form a unique fingerprint.