
[Survey] A Review of Cooperative Multi-Agent Deep Reinforcement Learning


A Review of Cooperative Multi-Agent Deep Reinforcement Learning


https://arxiv.org/abs/1908.03963

Abstract

Deep Reinforcement Learning has made significant progress in multi-agent systems in recent years. In this review article, we focus on presenting recent approaches to Multi-Agent Reinforcement Learning (MARL) algorithms. In particular, we cover five common approaches to modeling and solving cooperative multi-agent reinforcement learning problems: (I) independent learners, (II) fully observable critic, (III) value function factorization, (IV) consensus, and (V) learn to communicate. First, we elaborate on each of these methods, their possible challenges, and how these challenges were mitigated in the relevant papers. Where applicable, we further draw connections among the different papers in each category. Next, we cover some newly emerging research areas in MARL along with the relevant recent papers. Due to the recent success of MARL in real-world applications, we devote a section to reviewing these applications and the corresponding articles. A list of available environments for MARL research is also provided in this survey. Finally, the paper concludes with proposals on possible research directions.

Keywords: Reinforcement Learning, Multi-agent systems, Cooperative.

1 Introduction


Multi-Agent Reinforcement Learning (MARL) algorithms deal with systems consisting of several agents (robots, machines, cars, etc.) which interact within a common environment. Each agent makes a decision in each time-step and works along with the other agent(s) to achieve a predetermined goal. The goal of MARL algorithms is to learn a policy for each agent such that all agents together achieve the goal of the system. In particular, the agents are learnable units that aim to learn an optimal policy on the fly, maximizing the long-term cumulative discounted reward through interaction with the environment. Due to the complexity of the environment or the combinatorial nature of the problem, training the agents is typically a challenging task, and several of the problems that MARL deals with are categorized as NP-hard; manufacturing scheduling (Gabel and Riedmiller 2007, Dittrich and Fohlmeister 2020), the vehicle routing problem (Silva et al. 2019, Zhang et al. 2020b), and some multi-agent games (Bard et al. 2020) are only a few examples.

Motivated on one hand by the recent success of deep reinforcement learning (RL) in areas such as super-human level control on Atari games (Mnih et al. 2015), mastering the game of Go (Silver et al. 2016), chess (Silver et al. 2017), robotics (Kober et al. 2013), health care planning (Liu et al. 2017), power grid control (Glavic et al. 2017), routing (Nazari et al. 2018), and inventory optimization (Oroojlooyjadid et al.), and on the other hand by the importance of multi-agent systems (Wang et al. 2016b, Leibo et al. 2017), a considerable body of research has focused on deep MARL. One naive approach to solving these problems is to convert them into a single-agent problem and make the decisions for all agents using a centralized controller. However, in this approach, the size of the joint action space typically grows exponentially with the number of agents, which makes the problem intractable. Besides, each agent needs to send its local information to the central controller, and as the number of agents increases, this approach becomes very expensive or even impossible. In addition to the communication cost, this approach relies entirely on the central unit and is vulnerable to any incident that results in the loss of that unit or of the network. Moreover, in multi-agent problems each agent usually accesses only some local information, and due to privacy issues the agents may not be allowed to share their information with the other agents.

Several properties of the system are important in modeling a multi-agent problem: (i) centralized or decentralized control, (ii) fully or partially observable environment, (iii) cooperative or competitive environment. With a centralized controller, a central unit makes the decision for every agent in each time step. In a decentralized system, on the other hand, each agent makes a decision for itself. Also, the agents might cooperate to achieve a common goal, e.g. a group of robots that want to identify a source, or they might compete with each other to maximize their own reward, e.g. the players on different teams of a game. In each of these cases, an agent might be able to access the whole information and the sensory observations (if any) of the other agents, or, on the other hand, each agent might be able to observe only its local information. In this paper, we focus on decentralized problems with a cooperative goal, and most of the relevant papers with either full or partial observability are reviewed. Note that Weiß (1995), Matignon et al. (2012), Buşoniu et al. (2010), and Bu et al. (2008) provide reviews on cooperative games and general MARL algorithms published until 2012. Also, Da Silva and Costa (2019) provide a survey of the utilization of transfer learning in MARL. Zhang et al. (2019b) provide a comprehensive overview of the theoretical results, convergence, and complexity analysis of MARL algorithms on Markov/stochastic games and extensive-form games in competitive, cooperative, and mixed environments. In the cooperative setting, they mostly focus on the theory of consensus and policy evaluation. In this paper, we do not limit ourselves to a given branch of cooperative MARL such as consensus, and we try to cover most of the recent works on cooperative deep MARL. In Nguyen et al. (2020), a review of MARL is provided, focusing on deep MARL from the following perspectives: non-stationarity, partial observability, continuous state and action spaces, training schemes, and transfer learning. We provide a comprehensive overview of current research directions in cooperative MARL under six categories, and we try to unify all papers through a single notation. Since the problems that MARL algorithms deal with usually include large state/action spaces, and the classical tabular RL algorithms are not efficient for solving them, we mostly focus on approximate cooperative MARL algorithms.

The rest of the paper is organized as follows: in Section 2, we discuss the taxonomy and organization of the MARL algorithms we reviewed. In Section 3.1 we briefly explain the single-agent RL problem and some of its components. Then the multi-agent formulation is presented, and some of the main challenges of multi-agent environments from the RL viewpoint are described in Section 3.2. Section 4 explains the independent Q-learner type algorithms, Section 5 reviews the papers with a fully observable critic model, Section 6 covers the value decomposition papers, Section 7 explains the consensus approach, Section 8 reviews the learn-to-communicate approach, Section 9 explains some of the emerging research directions, Section 10 provides some applications of multi-agent problems and MARL algorithms in the real world, Section 11 very briefly mentions some of the available multi-agent environments, and finally Section 13 concludes the paper.

2 Taxonomy


In this section, we provide a high-level explanation of the taxonomy and of the angle from which we look at MARL.

A simple approach to extending single-agent RL algorithms to multi-agent algorithms is to consider each agent as an independent learner. In this setting, the other agents' actions are treated as part of the environment. This idea was formalized for the first time in Tan (1993), where the Q-learning algorithm was extended to this problem, which is called independent Q-Learning (IQL). The biggest challenge for IQL is non-stationarity, as the other agents' actions, taken toward their own local interests, will impact the environment transitions.
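As a concrete illustration of the independent-learner setup, the following is a minimal sketch with one DQN-style Q-network per agent (the tabular IQL of Tan (1993) would simply replace the networks with Q-tables); the network sizes, hyperparameters, and environment interface are assumptions for illustration only.

```python
# Minimal independent-learner sketch: one Q-network per agent, each treating
# the other agents as part of the environment. Sizes/hyperparameters assumed.
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions, gamma = 3, 8, 5, 0.99

qnets = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
         for _ in range(n_agents)]
optims = [torch.optim.Adam(q.parameters(), lr=1e-3) for q in qnets]

def iql_update(i, obs, action, reward, next_obs, done):
    """One TD update for agent i. The other agents' (changing) policies are
    folded into the environment transition, which is exactly the source of
    non-stationarity discussed above."""
    q = qnets[i](obs)[action]
    with torch.no_grad():
        target = reward + gamma * (1.0 - done) * qnets[i](next_obs).max()
    loss = (q - target).pow(2)
    optims[i].zero_grad()
    loss.backward()
    optims[i].step()
```

In a full implementation each agent would also keep its own target network and replay buffer, which is precisely where the complications discussed in Section 4 arise.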

To address the non-stationarity issue, one strategy is to assume that all critics observe the same global state and the actions of all agents, which we call the fully observable critic model. In this setting, the critic model learns the true state-value and, when paired with the actor, can be used to find the optimal policy. When the reward is shared among all agents, only one critic model is required; however, in the case of private local rewards, each agent needs to train a local critic model for itself.
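A minimal sketch of this setting is given below: each actor acts from its local observation, while the critic conditions on the global state and the joint action. The layer sizes, the one-hot action encoding, and the choice of one critic per agent are illustrative assumptions rather than a specific published architecture.

```python
# Sketch of a fully observable (centralized) critic: it sees the global state
# and the joint action of all agents; each actor only sees its local
# observation. Sizes and input layout are assumptions.
import torch
import torch.nn as nn

n_agents, obs_dim, state_dim, n_actions = 3, 8, 24, 5

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, local_obs):                        # decentralized execution
        return torch.softmax(self.net(local_obs), dim=-1)

class CentralCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_agents * n_actions, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, global_state, joint_actions_onehot):
        # global state concatenated with the one-hot joint action of all agents
        x = torch.cat([global_state, joint_actions_onehot.flatten()], dim=-1)
        return self.net(x)                               # Q(s, a_1, ..., a_N)

actors = [Actor() for _ in range(n_agents)]
# One shared critic suffices for a shared reward; one critic per agent for local rewards.
critics = [CentralCritic() for _ in range(n_agents)]
```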

Consider multi-agent problem settings where the agents aim to maximize a single joint reward, or assume it is possible to reduce the multi-agent problem to a single-agent problem. A usual RL algorithm may fail to find the global optimal solution even in this simplified setting. The main reason is that in such settings the agents do not know the true share of the reward earned by their own actions and, as a result, some agents may get lazy over time (Sunehag et al. 2018). In addition, exploration by agents with poor policies degrades the team reward, thus restraining the agents with good policies from proceeding toward the optimal policy. One idea to remedy this issue is to figure out the share of each individual agent in the global reward. This solution is formalized as Value Function Factorization, where a decomposition function is learned from the global reward.

Another drawback of the fully observable critic paradigm is the communication cost. In particular, as the number of learning agents increases, it might be prohibitive to collect all state/action information in a critic due to communication bandwidth and memory limitations. The same issue occurs for the actor when observations and actions are being shared. Therefore, the question is how to address this communication issue and change the topology such that the local agents can still cooperate and communicate while learning the optimal policy. The key idea is to put the learning agents on a sparsely connected network, where each agent can communicate with only a small subset of agents. Each agent then seeks an optimal solution under the constraint that this solution is in consensus with its neighbors. Through these communications, the whole network eventually reaches a unanimous policy, which results in the optimal policy.
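The sketch below illustrates only the consensus step under stated assumptions: agents sit on a fixed ring network with a doubly stochastic mixing matrix, mix their parameter vectors with their neighbors', and then apply a local update; the local gradient is a placeholder.

```python
# Sketch of a consensus update over a sparse communication network: each agent
# averages its parameters with its neighbors (W[i, j] > 0 only for neighbors)
# and then takes its own local gradient step. Topology and step size assumed.
import numpy as np

n_agents, dim = 4, 10
W = np.zeros((n_agents, n_agents))            # ring topology, doubly stochastic
for i in range(n_agents):
    for j in (i - 1, i, i + 1):
        W[i, j % n_agents] = 1.0 / 3.0

theta = np.random.randn(n_agents, dim)        # one parameter vector per agent

def local_gradient(i, theta_i):
    """Placeholder for agent i's local policy-evaluation / policy-gradient term."""
    return np.zeros_like(theta_i)

def consensus_step(theta, lr=0.01):
    mixed = W @ theta                          # communication with neighbors only
    grads = np.stack([local_gradient(i, mixed[i]) for i in range(n_agents)])
    return mixed + lr * grads

for _ in range(100):
    theta = consensus_step(theta)
# Repeated mixing drives all agents' parameters toward a common (consensus) value.
```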

For the consensus algorithms or the fully observable critic model, it is assumed that the agents can send their observations, actions, or rewards to the other agents, and the hope is that they can learn the optimal policy by having that information from the other agents. However, one does not know what information is truly required for an agent to learn the optimal policy. In other words, an agent might be able to learn the optimal policy by sending and receiving a simple message instead of the whole observation, action, and reward information. So, another line of research, called Learn to Communicate, allows the agents to learn what to send, when to send it, and to which agents. In particular, besides the action applied to the environment, the agents learn another action called the communication action.
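A minimal sketch of such an agent is shown below: alongside the environment-action head, the network has a second head that emits a small continuous message, which the other agents receive as extra input at the next step. The message size, aggregation scheme, and shapes are assumptions, not a specific published protocol.

```python
# Sketch of an agent with a learned communication action: it outputs logits
# for the environment action plus a message vector that other agents consume
# at the next time step. Sizes and message routing are assumptions.
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions, msg_dim = 3, 8, 5, 4

class CommAgent(nn.Module):
    def __init__(self):
        super().__init__()
        in_dim = obs_dim + (n_agents - 1) * msg_dim      # local obs + others' messages
        self.body = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.action_head = nn.Linear(64, n_actions)      # action toward the environment
        self.message_head = nn.Linear(64, msg_dim)       # communication action

    def forward(self, obs, incoming_msgs):
        h = self.body(torch.cat([obs, incoming_msgs.flatten()], dim=-1))
        return self.action_head(h), torch.tanh(self.message_head(h))

agents = [CommAgent() for _ in range(n_agents)]
msgs = torch.zeros(n_agents, msg_dim)                    # messages from the previous step
obs = torch.randn(n_agents, obs_dim)                     # placeholder observations

new_msgs = []
for i, agent in enumerate(agents):
    others = torch.stack([msgs[j] for j in range(n_agents) if j != i])
    action_logits, message = agent(obs[i], others)
    new_msgs.append(message)
msgs = torch.stack(new_msgs)                             # delivered at the next step
```

If the messages are continuous, gradients can flow through the communication channel during centralized training, which is what makes the communication action learnable end to end.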

Although the above taxonomy covers a large portion of MARL, there are still some algorithms that either do not fit into any of these categories or lie at the intersection of a few of them. In this review, we discuss some of these algorithms too.

In Table 1, a summary of the different categories is presented. Notice that in the third column we provide only a few representative references. More papers will be discussed in the following sections.

Table 1: A summary of the MARL taxonomy in this paper.

3 Background, Single-Agent RL Formulation, and Multi-Agent RL Notation


In this section, we first go over some background on reinforcement learning and the common approaches to solving the single-agent problem in Section 3.1. Then, in Section 3.2, we introduce the notation and definition of the multi-agent sequential decision-making problem and the challenges that MARL algorithms need to address.

3.1 Single Agent RL




For all the above notations and descriptions, we assume full observability of the environment. However, in cases where the agent accesses only some part of the state, the problem can be categorized as a decision-making problem with partial observability. In such circumstances, the MDP can no longer be used to model the problem; instead, the partially observable MDP (POMDP) is introduced as the modeling framework. This situation arises in a lot of multi-agent systems and will be discussed throughout the paper.
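For concreteness, the POMDP referred to here is commonly written as the tuple below; this is the standard textbook notation and is an assumption in this excerpt, since the paper's own formal definitions are omitted above.

$$\langle \mathcal{S}, \mathcal{A}, P, R, \Omega, O, \gamma \rangle, \qquad P(s' \mid s, a), \quad O(o \mid s', a), \quad R(s, a), \quad \gamma \in [0, 1),$$

where the agent receives an observation $o \in \Omega$ drawn from $O(\cdot \mid s', a)$ instead of the state itself, and therefore typically conditions its policy on the observation-action history.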

3.1.1 Value Approximation




3.1.2 Policy Approximation



Following this technique, several AC-based methods were proposed. Asynchronous advantage actor-critic (A3C) contains a master node connected to a few worker nodes (Mnih et al. 2016). This algorithm runs several instances of the actor-critic model and asynchronously gathers the gradients to update the weights of the master node. Afterward, the master node broadcasts the new weights to the worker nodes, and in this way all nodes are updated asynchronously. The synchronous advantage actor-critic (A2C) algorithm uses the same framework but updates the weights synchronously.
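A minimal sketch of the synchronous (A2C-style) update is given below; the worker gradients are placeholders, and only the gradient-aggregation rule is being illustrated, under assumed sizes and learning rate.

```python
# Sketch of synchronous gradient aggregation in an A2C-style setup: each worker
# computes gradients of the shared actor-critic loss on its own rollout, and
# the master averages them before updating the shared weights.
import torch
import torch.nn as nn

shared_model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 5))
optimizer = torch.optim.Adam(shared_model.parameters(), lr=7e-4)
n_workers = 4

def worker_gradients(worker_id):
    """Placeholder: in a real implementation this would run a rollout and
    return d(loss)/d(theta) of the actor-critic loss on that rollout."""
    return [torch.zeros_like(p) for p in shared_model.parameters()]

def synchronous_update():
    grads = [worker_gradients(w) for w in range(n_workers)]
    optimizer.zero_grad()
    for p_idx, p in enumerate(shared_model.parameters()):
        p.grad = torch.stack([g[p_idx] for g in grads]).mean(dim=0)
    optimizer.step()

synchronous_update()
# A3C differs in that workers push gradients asynchronously, without waiting
# for each other, and then pull the latest master weights.
```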


Despite the issue of data efficiency, policy-based algorithms provide better convergence guarantees than value-based algorithms (Yang et al. 2018b, Zhang et al. 2020a, Agarwal et al. 2020). This remains true for policy gradient methods that utilize neural networks as function approximators (Liu et al. 2019, Wang et al. 2020b). In addition, compared to value-based algorithms, policy-based approaches can be easily applied to continuous control problems. Furthermore, for most problems we do not know the true form of the optimal policy, i.e., deterministic or stochastic. The policy gradient has the ability to learn either a stochastic or a deterministic policy; however, in value-based algorithms, one needs to know the form of the policy at the algorithm's design time, which might be unknown. This results in two benefits of the policy-gradient method over value-based methods (Sutton and Barto 2018): (i) when the optimal policy is a stochastic policy (as in the Tic-tac-toe game), the policy gradient by nature is able to learn it, whereas value-based algorithms have no way of learning the optimal stochastic policy; (ii) if the optimal policy is deterministic, by following the policy gradient algorithms there is a chance of converging to a deterministic policy, whereas with a value-based algorithm one does not know the true form of the optimal policy, so one cannot choose the optimal exploration parameter (like ε in the ε-greedy method) to be used at scoring (evaluation) time. Regarding the first benefit, note that one may apply a softmax operator over the Q-values to provide probabilities for choosing each action; but the value-based algorithms cannot learn these probabilities by themselves as a stochastic policy. Similarly, one may choose a non-zero ε for the scoring time, but one does not know the optimal value for such an ε, so this method may not result in the optimal policy. Regarding the second benefit, note that this issue is not limited to the ε-greedy algorithm. With a softmax operator added to a value-based algorithm, we get a probability of choosing each action; but even in this setting the algorithm is designed to learn the true value of each action, and there is no known mapping from true values to the optimal probabilities for choosing each action, which does not necessarily result in probabilities of 0 and 1. Similarly, other variants of the softmax operator, like the Boltzmann softmax which uses a temperature parameter, do not help either. Although the temperature parameter can push the policy toward determinism, in practice we do not know whether the optimal solution is deterministic, so we cannot set it accordingly.
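To make the exploration-parameter point concrete, here is a small sketch of the two value-based action-selection rules mentioned above; the Q-values, ε, and temperature are placeholders chosen by the designer, which is exactly the difficulty being discussed.

```python
# Epsilon-greedy and Boltzmann (softmax) action selection from Q-values.
# Neither rule learns the action probabilities themselves; epsilon and the
# temperature tau must be chosen by the designer.
import numpy as np

rng = np.random.default_rng(0)
q_values = np.array([1.0, 1.2, 0.8])       # placeholder Q(s, .)

def epsilon_greedy(q, epsilon=0.1):
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))    # explore uniformly
    return int(np.argmax(q))                # exploit

def boltzmann(q, tau=1.0):
    p = np.exp(q / tau)
    p /= p.sum()
    return int(rng.choice(len(q), p=p))     # stochastic, but p is not learned

print(epsilon_greedy(q_values), boltzmann(q_values))
```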

3.2 Multi-Agent RL Notations and Formulation



On the other hand, the policy of each agent changes during training, which results in a mix of observations from different policies in the experience replay. Thus, one cannot use the experience replay without dealing with the non-stationarity. Without experience replay, the DQN algorithm (Mnih et al. 2015) and its extensions can be hard to train due to sample inefficiency and correlation among the samples. The same issue exists in AC-based algorithms which use a DQN-like algorithm for the critic. Besides, in most MARL problems the agents are not able to observe the full state of the system; such problems are categorized as decentralized POMDPs (Dec-POMDPs). Due to the partial observability and the non-stationarity of the local observations, Dec-POMDPs are even harder problems to solve, and it can be shown that they are in the class of NEXP-complete problems (Bernstein et al. 2002). An equation similar to (16) can be obtained for the partially observable environment too.

In multi-agent RL, the noise and variance of the rewards increase, which results in instability of the training. The reason is that the reward of one agent depends on the actions of the other agents, and the reward conditioned on the action of a single agent can exhibit much more noise and variability than a single agent's reward. Therefore, naively training a policy gradient algorithm would also not be effective in general.

4 Independent Learners



The DQN algorithm (Mnih et al. 2015) utilized experience replay and a target network and was able to attain super-human level control on most of the Atari games. Classical IQL uses the tabular version of Q-learning, so one naive idea could be to use the DQN algorithm in place of each single Q-learner. Tampuu et al. (2017) implemented this idea in one of the first papers that took advantage of the neural network as a general, powerful approximator in an IQL-like setting. Specifically, this paper analyzes the performance of the DQN in a decentralized two-agent game for both competitive and cooperative settings. They assume that each agent observes the full state (the video frame of the game) and takes an action with its own policy, and the reward values are known to both agents. The paper is mainly built on the Pong game (from the Atari-2600 environment (Bellemare et al. 2013)), in which competitive and cooperative behaviors are obtained by changing the reward function. In the competitive version, the agent that drops the ball loses a reward point and the opponent wins a reward point, so that it is a zero-sum game. In the cooperative setting, once either of the agents drops the ball, both agents lose a reward point. The numerical results show that in both cases the agents are able to learn to play the game very efficiently; that is, in the cooperative setting they learn to keep the ball for long periods, and in the competitive setting each agent learns to quickly beat its competitor.

Experience replay is one of the core elements of the DQN algorithm. It helps to stabilize the training of the neural network and improves the sample efficiency by reusing the history of observations. However, due to the non-stationarity of the environment, using experience replay in a multi-agent environment is problematic. In particular, the policy that generated the data in the experience replay is different from the current policy, so the learned policy of each agent can be misleading. In order to address this issue, Foerster et al. (2016) disable the experience replay part of the algorithm, and in Leibo et al. (2017) the old transitions are discarded and the experience replay uses only the recent experiences. Even though these approaches help to reduce the effect of the non-stationarity of the environment, both limit the sample efficiency. To resolve this problem, Foerster et al. (2017) propose two algorithms to stabilize the experience replay in IQL-type algorithms. They consider a fully cooperative MARL setting with local observations and actions. In the first approach, each transition is augmented with the probability of choosing the joint action. Then, during the loss calculation, an importance sampling correction is computed using the current policy. Thus, the loss function is changed to:
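(The equation itself is omitted in this excerpt; the following is a reconstruction of the importance-weighted loss in standard DQN notation, which may differ from the symbols used in the original paper.)

$$\mathcal{L}(\theta) = \sum_{i=1}^{b} \frac{\pi^{t_r}_{-a}\left(\mathbf{u}^{i}_{-a} \mid s^{i}\right)}{\pi^{t_i}_{-a}\left(\mathbf{u}^{i}_{-a} \mid s^{i}\right)} \left( y^{\mathrm{DQN}}_{i} - Q\left(s^{i}, u^{i}_{a}; \theta\right) \right)^{2},$$

where $b$ is the mini-batch size, $t_i$ is the time at which the $i$-th sample was collected, $t_r$ is the current replay time, $\pi_{-a}$ denotes the joint policy of all agents other than $a$ (whose probability is the quantity stored with each transition), and $y^{\mathrm{DQN}}_{i}$ is the usual DQN target computed with the target network.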

Omidshafiei et al. (2017) propose another extension of experience replay for MARL. They consider multi-task cooperative games with independent, partially observable learners, such that each agent only knows its own action, with a joint reward. An algorithm called HDRQN is proposed, which is based on the DRQN algorithm (Hausknecht and Stone 2015) and Hysteretic Q-learning (Matignon et al. 2007). Also, to alleviate the non-stationarity of MARL, the idea of Concurrent Experience Replay Trajectories (CERTs) is proposed, in which the experience replay gathers the experiences of all agents for every period of an episode, and when a mini-batch is sampled, it draws the experiences of all agents for the same period together. Since they use an LSTM, the experiences in the experience replay are zero-padded (zeros are appended to the shorter sequences so that all sequences have equal length). Moreover, in the multi-task version of HDRQN, there are different tasks, each with its own transition probability, observation, and reward function. During training, each agent observes the task ID, while it is not accessible at inference time. To evaluate the model, a two-player game is utilized, in which the agents are rewarded only when all of them simultaneously capture the moving target. In order to make the game partially observable, a flickering screen is used such that with 30% chance the screen is flickering. The actions of the agents are moving north, south, west, or east, or waiting. Additionally, the actions are noisy, i.e., with 10% probability the agent might act differently than intended.
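Since HDRQN builds on Hysteretic Q-learning, a minimal tabular sketch of the hysteretic update may help: positive TD errors are applied with the full learning rate, while negative ones are damped with a smaller rate, so an agent does not unlearn a good action merely because its teammates explored poorly. The specific rates below are assumptions.

```python
# Tabular hysteretic Q-learning update: two learning rates, alpha for positive
# TD errors and a smaller beta for negative ones (values are assumptions).
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, beta, gamma = 0.1, 0.01, 0.99          # beta < alpha: optimistic updates

def hysteretic_update(s, a, r, s_next, done):
    target = r + (0.0 if done else gamma * Q[s_next].max())
    delta = target - Q[s, a]
    Q[s, a] += (alpha if delta >= 0 else beta) * delta
```

HDRQN applies the same asymmetric treatment of the TD error to the updates of a recurrent (DRQN-style) network rather than to a table.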

5 Fully Observable Critic



......................................................

6 Value Function Factorization


Consider a cooperative multi-agent problem in which we are allowed to share all information among the agents and there is no communication limitation among them. Further, let us assume that we are able to deal with the huge action space. In this scenario, a centralized RL approach can be used to solve the problem, i.e., all state observations are merged together and the problem is reduced to a single-agent problem with a combinatorial action space. However, Sunehag et al. (2018) show that naive centralized RL methods fail to find the global optimum, even if we are able to solve the problem with such a huge state and action space. The issue comes from the fact that some of the agents may get lazy and not learn and cooperate as they are supposed to. This may lead to the failure of the whole system. One possible approach to address this issue is to determine the role of each agent in the joint reward and then somehow isolate its share of it. This category of algorithms is called Value Function Factorization.

In POMDP settings, if the optimal reward shaping is available, the problem reduces to training several independent learners, which simplifies the learning. Therefore, having a reward-shaping model would be appealing for any cooperative MARL. However, in practice it is not easy to divide the received reward among the agents, since their contributions to the reward are not known or are hard to measure. Following this idea, the rest of this section discusses the corresponding algorithms.
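As a concrete example of the simplest factorization, the following is a sketch in the spirit of value decomposition networks (Sunehag et al. 2018), where the joint action-value is approximated as a sum of per-agent utilities trained only from the team reward; shapes and hyperparameters are assumptions, and more expressive mixing functions are discussed later in this section.

```python
# Sketch of additive value-function factorization (VDN-style):
# Q_tot(s, a_1..a_N) is modeled as the sum of per-agent Q_i(o_i, a_i),
# and all Q_i are trained end-to-end from the single team reward.
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions, gamma = 3, 8, 5, 0.99

qnets = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
         for _ in range(n_agents)]
params = [p for q in qnets for p in q.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)

def q_tot(observations, actions):
    """Sum of the chosen per-agent Q-values; credit flows back to each agent
    through this sum, which mitigates the lazy-agent problem."""
    return sum(qnets[i](observations[i])[actions[i]] for i in range(n_agents))

def td_update(obs, actions, team_reward, next_obs, done):
    with torch.no_grad():
        next_q = sum(qnets[i](next_obs[i]).max() for i in range(n_agents))
        target = team_reward + gamma * (1.0 - done) * next_q
    loss = (q_tot(obs, actions) - target).pow(2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```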



