
A Unified Game-Theoretic Approach to Multi-agent Reinforcement Learning


In this post, we take a detailed look at the foundational paper A Unified Game-Theoretic Approach to Multi-agent Reinforcement Learning, which played a pivotal role in shaping the development of AlphaStar. AlphaStar involves many intricate concepts; here we focus on the key principles built on the Nash equilibrium framework and on how game theory integrates with reinforcement learning techniques.

By the end of this article, you should have a working understanding of the Double Oracle algorithm, Policy-Space Response Oracles, and Deep Cognitive Hierarchies.

To follow this post, you should be familiar with a few fundamental concepts of game theory: the definition of a strategic game as a payoff matrix, Nash equilibria, and best responses. You are also encouraged to explore the conceptual implementations provided here, including a Python implementation using numpy that offers further insight into these topics.

Why a Multi-agent Context?

In AlphaStar, a multi-agent setup was used to improve strategic decision-making through self-play: the final strategy emerges from a population-based multi-agent system whose agents evolve by interacting with one another via reinforcement learning. Multi-agent RL (MARL) here relies on iterative improvement, repeatedly computing approximate best responses to mixtures of policies with deep reinforcement learning.

To discuss multi-agent RL, we also need the notion of a normal-form game from game theory: a tuple (Π, U, n), where n is the number of players, Π = (Π_1, ..., Π_n) collects the set of policies available to each player, and U: Π → R^n assigns a payoff (utility) to every joint policy.
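To make the definition concrete, here is a minimal numpy sketch of a two-player normal-form game together with a pure-strategy best-response computation. The game (Rock-Paper-Scissors) and the function name best_response are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative two-player zero-sum normal-form game: Rock-Paper-Scissors.
# A[i, j] is the payoff of player 0 when it plays i and player 1 plays j;
# player 1's payoff is -A[i, j].
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def best_response(payoff, opponent_mixture):
    """Pure-strategy best response of the row player to a mixed strategy."""
    expected = payoff @ opponent_mixture   # expected payoff of each row action
    return int(np.argmax(expected)), expected

# Best response of player 0 when player 1 mixes 50% rock / 50% paper:
# paper wins against rock and ties against paper, so it is selected.
br, values = best_response(A, np.array([0.5, 0.5, 0.0]))
print("best response:", br, "expected payoffs:", values)
```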

Double Oracle Algorithm

The Double Oracle algorithm and its generalizations form the foundation of the paper's game-theoretic analysis of reinforcement learning in multi-agent systems.

The algorithm proceeds iteratively, growing the payoff matrix of a restricted subgame step by step. At each iteration t, a Nash equilibrium σ of the current subgame G_t is computed, along with each player's best response π to σ; the best responses are added to the players' strategy sets and determine the next subgame G_{t+1}. To approximate the entries of the payoff matrix, deep neural networks are used as function approximators.


(Figure: a visual explanation of the DO algorithm, following Bošanský, Lisý, Čermák, Vítek, and Pěchouček.) An important consideration is that an approximate best response is computed rather than the exact one; this keeps the computation feasible while still yielding satisfactory outcomes.
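As a concrete illustration, here is a minimal Double Oracle sketch for a two-player zero-sum matrix game using numpy and scipy. It computes exact best responses (unlike the approximate ones discussed above) and solves each restricted subgame with a standard linear program; the function names and the LP formulation are mine, not the paper's reference implementation.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Row player's Nash mixture for the zero-sum matrix game A
    (row player maximizes x^T A y), via a standard linear program."""
    m, n = A.shape
    c = np.concatenate([np.zeros(m), [-1.0]])             # maximize the game value v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])             # v - (x^T A)_j <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]   # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=bounds)
    return res.x[:m]

def double_oracle(A, max_iters=50):
    """Minimal Double Oracle sketch on a zero-sum matrix game A.
    Starts with one action per player and adds exact best responses
    until the restricted subgame stops growing."""
    rows, cols = [0], [0]                          # restricted strategy sets
    for _ in range(max_iters):
        sub = A[np.ix_(rows, cols)]                # payoff matrix of subgame G_t
        sigma_r = solve_zero_sum(sub)              # row equilibrium of G_t
        sigma_c = solve_zero_sum(-sub.T)           # column equilibrium of G_t
        # Best responses over the FULL action sets against these mixtures.
        br_row = int(np.argmax(A[:, cols] @ sigma_c))
        br_col = int(np.argmin(sigma_r @ A[rows, :]))
        grew = False
        if br_row not in rows: rows.append(br_row); grew = True
        if br_col not in cols: cols.append(br_col); grew = True
        if not grew:                               # no new best response: done
            break
    return rows, cols, sigma_r, sigma_c
```

Running double_oracle on the Rock-Paper-Scissors matrix from the earlier snippet should add all three actions and recover the uniform equilibrium mixture.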

To address generalization and scalability, the paper introduces two variants of this idea: Policy-Space Response Oracles (PSRO) and Deep Cognitive Hierarchies (DCH).

Policy-Space Response Oracles (PSRO)

PSRO is an extension of Double Oracle in which, at each iteration, the meta-game is played over policies rather than individual actions. Starting from an initial policy for each player, the expected utilities of the selected policies are estimated, and the meta-strategy is then computed as a distribution over these policies.

The meta-game begins with a single policy and grows incrementally over epochs. It relies on sub-routines, known as oracles, which compute (approximate) best responses to the strategies employed by the other players.

However, this algorithm faces a significant scalability challenge: in every epoch, for every player, many episodes are needed to train a new policy π and to recompute the meta-strategy from each player's data. Instead of computing the exact best response, an approximate best response is therefore obtained with reinforcement learning (RL).


The normal-form game is given by the triple (policies, utility functions, number of players). The input (INPUT) is the set of policies of each player, and the output (OUTPUT) is a solution for each player, i.e. a meta-strategy. Viewed as a whole, the algorithm works within a reinforcement learning framework: it expands the players' policy sets and refines them following the Double Oracle scheme, and the final output is a mixture, a distribution over the computed policies.
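The overall loop can be summarized in a deliberately abstract sketch. The names psro, simulate, oracle, and meta_solver are placeholders: in the paper, the oracle is a deep RL procedure that trains an approximate best response, and the meta-solver can be, for instance, projected replicator dynamics.

```python
import numpy as np

def psro(initial_policies, simulate, oracle, meta_solver, epochs=10):
    """Minimal PSRO sketch (structure only, not the paper's reference code).

    initial_policies : one starting policy per player
    simulate(joint)  : expected utility vector of a joint policy (by simulation)
    oracle(k, populations, meta) : approximate best response for player k
                                   against the other players' meta-strategies
    meta_solver(M)   : meta-strategies from the empirical payoff tensor M
    """
    n = len(initial_policies)
    populations = [[p] for p in initial_policies]
    for _ in range(epochs):
        # 1. Empirical meta-game: estimate utilities of every joint policy.
        shape = tuple(len(pop) for pop in populations)
        M = np.zeros(shape + (n,))
        for idx in np.ndindex(*shape):
            joint = [populations[k][idx[k]] for k in range(n)]
            M[idx] = simulate(joint)
        # 2. Meta-strategies: mixtures over each player's current population.
        meta = meta_solver(M)
        # 3. Oracle step: add one approximate best response per player.
        for k in range(n):
            populations[k].append(oracle(k, populations, meta))
    return populations, meta

# When the populations contain only pure actions and meta_solver returns an
# exact Nash equilibrium, this loop reduces to the Double Oracle algorithm.
```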

Deep Cognitive Hierarchies (DCH)

DCH approximates PSRO, trading accuracy for scalability. The reinforcement learning step takes a long time to converge to a satisfactory result, so DCH runs a parallel form of PSRO: a number of levels k is chosen in advance, each level trains a single meta-policy in its own process, all processes run concurrently, and the policies and meta-strategies are periodically saved to disk so that the other levels can load them during their own reinforcement learning steps.
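Ignoring the parallelism and the disk synchronization, the level structure can be sketched sequentially as follows; dch_levels, train_best_response, and meta_solver are illustrative names, and a faithful implementation would run one process per level concurrently.

```python
def dch_levels(k, initial_policy, train_best_response, meta_solver):
    """Sequential sketch of the DCH level structure.

    Level 0 holds the initial policy; every higher level trains an approximate
    best response to a meta-strategy (a mixture) over the levels below it.
    The actual algorithm runs one process per level in parallel and syncs
    policies and meta-strategies through periodic saves to disk.
    """
    levels = [initial_policy]
    for _ in range(1, k):
        mixture = meta_solver(levels)      # distribution over the lower levels
        levels.append(train_best_response(levels, mixture))
    return levels
```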


To quantify overfitting during the learning phase, joint policy correlation (JPC) matrices are computed: several instances of the same experiment are trained with distinct seed values for the random number generators, and policies from different instances are then paired at evaluation time.
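A sketch of how such a matrix and the resulting overfitting measure could be computed is shown below; jpc_matrix, evaluate, and average_proportional_loss are illustrative names, with evaluate assumed to average returns over several episodes.

```python
import numpy as np

def jpc_matrix(policies_p1, policies_p2, evaluate):
    """Joint policy correlation (JPC) matrix: entry (i, j) is the mean return
    when player 1 uses the policy trained in instance i and player 2 uses the
    policy trained in instance j (instances differ only by their random seed)."""
    d = len(policies_p1)
    J = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            J[i, j] = evaluate(policies_p1[i], policies_p2[j])
    return J

def average_proportional_loss(J):
    """Reward lost when independently trained policies are mixed:
    (mean of the diagonal - mean of the off-diagonal) / mean of the diagonal."""
    d = J.shape[0]
    diag = np.trace(J) / d
    off = (J.sum() - np.trace(J)) / (d * (d - 1))
    return (diag - off) / diag
```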

Conclusions and further readings

Thanks for reading this far! By now, you should be acquainted with the Double Oracle algorithm, Policy-Space Response Oracles, and Deep Cognitive Hierarchies. Some concepts, such as meta-solvers, were left out of this article not because they are unnecessary, but because they mainly concern reducing the algorithms' computational complexity.

For further reading, A Generalised Method for Empirical Game Theoretic Analysis and work on asymmetric multi-agent reinforcement learning can provide additional insight into identifying asymmetric Nash equilibria in pure strategy spaces.
