TY - GEN
T1 - Generalized mean estimation in monte-carlo tree search
AU - Dam, Tuan
AU - Klink, Pascal
AU - D'Eramo, Carlo
AU - Peters, Jan
AU - Pajarinen, Joni
N1 - Funding Information:
This project has received funding from SKILLS4ROBOTS, project reference #640554, and, by the German Research Foundation project PA 3179/1-1 (ROBOLEAP), and from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. #713010 (GOAL-Robots).
Publisher Copyright:
© 2020 Inst. Sci. inf., Univ. Defence in Belgrade. All rights reserved.
Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.
jufoid=58173
PY - 2020
Y1 - 2020
N2 - We consider Monte-Carlo Tree Search (MCTS) applied to Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs), and the well-known Upper Confidence bound for Trees (UCT) algorithm. In UCT, a tree with nodes (states) and edges (actions) is incrementally built by the expansion of nodes, and the values of nodes are updated through a backup strategy based on the average value of child nodes. However, it has been shown that with enough samples the maximum operator yields more accurate node value estimates than averaging. Instead of settling for one of these value estimates, we go a step further proposing a novel backup strategy which uses the power mean operator, which computes a value between the average and maximum value. We call our new approach Power-UCT, and argue how the use of the power mean operator helps to speed up the learning in MCTS. We theoretically analyze our method providing guarantees of convergence to the optimum. Finally, we empirically demonstrate the effectiveness of our method in well-known MDP and POMDP benchmarks, showing significant improvement in performance and convergence speed w.r.t. state of the art algorithms.
AB - We consider Monte-Carlo Tree Search (MCTS) applied to Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs), and the well-known Upper Confidence bound for Trees (UCT) algorithm. In UCT, a tree with nodes (states) and edges (actions) is incrementally built by the expansion of nodes, and the values of nodes are updated through a backup strategy based on the average value of child nodes. However, it has been shown that with enough samples the maximum operator yields more accurate node value estimates than averaging. Instead of settling for one of these value estimates, we go a step further proposing a novel backup strategy which uses the power mean operator, which computes a value between the average and maximum value. We call our new approach Power-UCT, and argue how the use of the power mean operator helps to speed up the learning in MCTS. We theoretically analyze our method providing guarantees of convergence to the optimum. Finally, we empirically demonstrate the effectiveness of our method in well-known MDP and POMDP benchmarks, showing significant improvement in performance and convergence speed w.r.t. state of the art algorithms.
U2 - 10.24963/ijcai.2020/332
DO - 10.24963/ijcai.2020/332
M3 - Conference contribution
AN - SCOPUS:85097353204
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 2397
EP - 2404
BT - Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020
A2 - Bessiere, Christian
PB - International Joint Conferences on Artificial Intelligence
T2 - International Joint Conference on Artificial Intelligence
Y2 - 1 January 2021
ER -