Compatible natural gradient policy search

  • Joni Pajarinen*
  • Hong Linh Thai
  • Riad Akrour
  • Jan Peters
  • Gerhard Neumann

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

20 Citations (Scopus)
30 Downloads (Pure)

Abstract

Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust, resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
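
To make the link between a KL bound and a natural gradient step concrete, the following is a minimal sketch on a toy problem and is not the COPOS algorithm from the article. It assumes a single-state (bandit) softmax policy whose logits act as natural parameters, known per-action rewards, and a step size chosen from the second-order approximation of the KL divergence. The function names (`softmax`, `natural_gradient_step`, `kl`, `entropy`) and the bound `epsilon` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(theta):
    """Categorical policy whose logits theta act as natural parameters."""
    z = theta - np.max(theta)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def natural_gradient_step(theta, rewards, epsilon=0.01):
    """One natural gradient step sized so that the quadratic
    approximation of the KL divergence equals epsilon (assumed bound)."""
    p = softmax(theta)
    # Gradient of the expected reward sum_i p_i * r_i w.r.t. the logits.
    grad = p * (rewards - p @ rewards)
    # Fisher information matrix of a softmax distribution.
    fisher = np.diag(p) - np.outer(p, p)
    # The Fisher matrix is singular (shifting all logits changes nothing),
    # so a pseudo-inverse is used instead of a plain inverse.
    nat_grad = np.linalg.pinv(fisher, rcond=1e-10) @ grad
    quad = grad @ nat_grad
    if quad <= 1e-12:
        return theta  # gradient vanished, nothing to update
    # KL ~ 0.5 * step^2 * grad^T F^+ grad = epsilon  =>  solve for step.
    step = np.sqrt(2.0 * epsilon / quad)
    return theta + step * nat_grad

def kl(p, q):
    """KL divergence between two strictly positive categorical distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def entropy(p):
    """Shannon entropy of a strictly positive categorical distribution."""
    return float(-np.sum(p * np.log(p)))

theta_old = np.zeros(3)
rewards = np.array([1.0, 0.0, 0.0])
theta_new = natural_gradient_step(theta_old, rewards, epsilon=0.01)
p_old, p_new = softmax(theta_old), softmax(theta_new)
print("KL(old || new):", kl(p_old, p_new))
print("entropy before/after:", entropy(p_old), entropy(p_new))
```

In this toy setting the realized KL divergence stays close to the chosen bound, while the policy entropy drops after the step. The sketch does nothing to control that entropy loss; bounding it is the concern the abstract attributes to COPOS.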

Original language: English
Pages (from-to): 1443–1466
Journal: Machine Learning
Volume: 108
Issue number: 8-9
Publication status: Published - 2019
Publication type: A1 Journal article-refereed

Keywords

  • Policy search
  • Reinforcement learning

Publication forum classification

  • Publication forum level 3

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
