Contents

A survey and critique of multiagent deep reinforcement learning

Abstract
1 Introduction
2 Single-agent learning
2.1 Reinforcement learning
2.2 Deep reinforcement learning
3 Multiagent deep reinforcement learning (MDRL)
3.1 Multiagent learning
3.2 MDRL categorization
3.3 Emergent behaviors
3.4 Learning communication
3.5 Learning cooperation
3.6 Agents modeling agents
4 Bridging RL, MAL and MDRL
4.1 Avoiding deep learning amnesia: examples in MDRL
4.2 Lessons learned
4.3 Benchmarks for MDRL
4.4 Practical challenges in MDRL
4.5 Open questions
5 Conclusions
Notes
A survey and critique of multiagent deep reinforcement learning

Published: 16 October 2019
https://link.springer.com/article/10.1007/s10458-019-09421-1

Abstract
Deep reinforcement learning (DRL) has achieved significant results over the past few years.
The primary objective of this article is to offer a comprehensive overview of the current multiagent deep reinforcement learning (MDRL) literature.
Our goal is to unite researchers by synthesizing the existing literature on reinforcement learning (RL) and multiagent learning (MAL), while encouraging innovative advancements within the broader multiagent community.
1 Introduction
Almost 20 years ago, Stone and Veloso's seminal survey laid the groundwork for defining the area of multiagent systems (MAS) and its open problems in the context of AI.
Progress in multiagent learning (MAL) has been accompanied by notable advances in artificial intelligence more broadly. Research initially focused on single-player video games [221]; more recently, it has extended to multiplayer settings, such as mastering the game of Go [291, 293], poker [50, 224], games involving teams with competitive objectives [235], and StarCraft II [339].
While various techniques and algorithms were employed in the aforementioned scenarios, they are, in general, combinations of techniques from two primary areas: reinforcement learning (RL) [367] and deep learning [367, 625].
Reinforcement learning is an area of machine learning in which an agent learns by interacting (i.e., taking actions) within a dynamic environment. However, one of the main challenges in both traditional machine learning and RL is the need to manually design high-quality features on which to learn. Deep learning enables efficient representation learning and can discover such features automatically [184, 281]. In recent years, deep learning has achieved remarkable results in areas such as computer vision and natural language processing [184, 281]. A key element at the core of deep learning is the use of neural network architectures that can achieve compact representations of high-dimensional data [13].
In deep reinforcement learning (DRL) [13, 101], deep neural networks are trained to approximate the optimal policy and/or the value function. Because the deep neural network serves as a powerful function approximator, it enables strong generalization. One of the main strengths of DRL is that it scales to problems with high-dimensional state and action spaces. However, while notable successes have been achieved in visual domains such as Atari games [359], there remains substantial potential for more realistic applications with complex dynamics that are not necessarily vision-based.
DRL is considered an important component in building general AI systems [179] and has been successfully integrated with other techniques, such as search [291], planning [320], and, more recently, multiagent systems, giving rise to the emerging area of multiagent deep reinforcement learning (MDRL) [232, 251] (see Footnote 1).
Learning in multiagent settings is fundamentally more difficult than in the single-agent case due to multiagent pathologies such as the moving target problem (non-stationarity) [55, 141, 289], the curse of dimensionality [55, 289], multiagent credit assignment [2, 355], global exploration [213], and relative overgeneralization [105, 247, 347]. Despite these challenges, leading AI conferences such as AAAI, ICML, ICLR, IJCAI and NeurIPS have published numerous successful works in multiagent deep reinforcement learning (MDRL). In light of these advances and their relevance to the existing literature, it is timely to first review recent MDRL developments and then examine how they relate to established research.
This article contributes to the field by offering a comprehensive overview of current research in multiagent deep reinforcement learning (MDRL). It is intended to complement existing surveys on multiagent learning: it reviews studies on the cooperative nature of intelligent agents from both practical and theoretical perspectives, examines approaches that leverage knowledge reuse in multiagent reinforcement learning, and covers advances in single-agent deep reinforcement learning.
First, we provide a concise overview of key algorithms in reinforcement learning, including Q-learning and REINFORCE (see Sect. 2.1). Second, we review DRL, highlighting the challenges in this setting and surveying recent advances (see Sect. 2.2). Third, we present the multiagent setting and give an overview of its key challenges and results (see Sect. 3.1). Finally, we classify recent MDRL advances into the four categories listed below.
Analysis of emergent behaviors: evaluating single-agent DRL algorithms in multiagent scenarios, for example in Atari games, social dilemmas, and competitive 3D environments.
Learning communication: agents learn communication protocols to solve cooperative tasks.
Learning cooperation: agents learn to cooperate using only their own actions and (local) observations.
Agents modeling agents: agents reason about or infer the behavior of other agents in order to accomplish a task (e.g., best-response learners).
For each category, we provide a description together with an overview of the most recent works (see Sect. 3.2 and Tables 1, 2, 3 and 4). We then take a step back and examine how these developments relate to the existing literature. In that context, we present illustrative examples of methods and algorithms originally introduced in reinforcement learning (RL) and multiagent learning (MAL) that have been successfully extended to multiagent deep reinforcement learning (MDRL) (see Sect. 4.1). We also offer practical insights gained from analyzing current MDRL research by highlighting key lessons learned (see Sect. 4.2) and pointing readers to recent benchmarks for multiagent systems (see Sect. 4.3). Furthermore, we discuss the practical challenges of MDRL, including reproducibility, hyperparameter tuning, and computational demands (see Sect. 4.4). Finally, we pose open research questions that remain unresolved in this field (see Sect. 4.5). Concluding remarks are presented in Sect. 5.
This survey aims to introduce a current and active area (MDRL) while motivating future research that can leverage the extensive body of literature on multiagent learning. It also aims to support researchers with expertise in either deep reinforcement learning (DRL) or multiagent learning (MAL) by offering a unified view of recent advances and open problems in multiagent deep reinforcement learning (MDRL). Finally, it seeks to foster collaboration among related subfields and to avoid the fragmentation that comes with isolated communities that have limited interaction [6, 8, 23, 37].
Fig. 1
[Figure 1](https://link.springer.com/article/10.1007/s10458-019-09421-1/figures/1)
Categories of different MDRL works: (a) analysis of emergent behaviors, which evaluates single-agent DRL algorithms in multiagent scenarios; (b) learning communication, where agents interact via actions and messages; (c) learning cooperation, where agents cooperate using only actions and (local) observations; (d) agents modeling agents, where agents reason about others to fulfill a task (e.g., cooperative or competitive). See Sects. 3.3–3.6 and Tables 1–4.
2 Single-agent learning
This section introduces the foundational framework of reinforcement learning and its core components before delving into deep reinforcement learning, highlighting its unique challenges and recent advances. For a more comprehensive treatment, we refer the reader to authoritative resources such as [13, 101, 164, 315, 353].
2.1 Reinforcement learning




Policy gradient algorithms differ from value-based approaches in that they optimize the policy directly rather than estimating values [175]. Instead of deriving a policy indirectly from value estimates, policy gradient methods learn a parameterized policy without requiring intermediate value computations.
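Since the equations of this subsection did not survive extraction, the following is a standard statement of the REINFORCE estimator (notation assumed: θ are the policy parameters and G_t the return from time t), given here only as a reference point for the discussion above:

```latex
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\, G_t \,\nabla_\theta \log \pi_\theta(A_t \mid S_t) \,\right]
```

The gradient is estimated from sampled trajectories, which makes the estimator unbiased but high-variance; the baseline and actor-critic ideas discussed next aim precisely at reducing that variance.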


The policy gradient update can be generalized to include a comparison to an arbitrary baseline of the state [354]. The baseline, b(s), can be any function, as long as it does not vary with the action; the baseline leaves the expected value of the update unchanged, but it can have an effect on its variance [315]. A natural choice for the baseline is a learned state-value function; this reduces the variance, and it is bias-free if learned by MC (Footnote 3). Moreover, when using the state-value function for bootstrapping (updating the value estimate for a state from the estimated values of subsequent states) it assigns credit (reducing the variance but introducing bias), i.e., criticizes the policy's action selections. Thus, in actor-critic methods [175], the actor represents the policy, i.e., the action-selection mechanism, whereas a critic is used for the value function learning. In the case when the critic learns a state-action function (Q function) and a state-value function (V function), an advantage function can be computed by subtracting state values from the state-action values [283, 315]. The advantage function indicates the relative quality of an action compared to the other available actions, computed from the baseline, i.e., the state-value function. An example of an actor-critic algorithm is Deterministic Policy Gradient (DPG) [292]. In DPG [292] the critic follows standard Q-learning and the actor is updated following the gradient of the policy's performance [128]. DPG was later extended to DRL (see Sect. 2.2) and MDRL (see Sect. 3.5). In multiagent learning settings the variance is further increased as all the agents' rewards depend on the rest of the agents, and it has been formally shown that as the number of agents increases, the probability of taking a correct gradient direction decreases exponentially [206]. Recent MDRL works have addressed this high-variance issue, e.g., COMA [97] and MADDPG [206] (see Sect. 3.5).
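As a compact summary of the baseline and advantage ideas above (a sketch in standard actor-critic notation, not equations reproduced from the original article):

```latex
\nabla_\theta J(\theta) \;=\; \mathbb{E}\!\left[\big(G_t - b(S_t)\big)\,\nabla_\theta \log \pi_\theta(A_t \mid S_t)\right],
\qquad
A(s,a) \;=\; Q(s,a) - V(s)
```

Choosing b(s) = V(s) and bootstrapping the return recovers the advantage actor-critic form referred to throughout the survey.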
Policy gradient methods exhibit a clear relationship with deep reinforcement learning, as a neural network may serve as a representation of the policy, with its input corresponding to an encoding of the state, and its output constituting action selection probabilities or estimates for continuous action values [192]. The weights within this neural network structure represent the policy parameters.
2.2 Deep reinforcement learning
Although tabular reinforcement learning (RL) methods such as Q-learning are effective in domains where the curse of dimensionality is not severe, they have several limitations: learning is slow when the state space is large, they do not generalize across states, and they often require hand-crafted state representations [3]. Function approximators have been proposed to address these issues, for example decision trees [2], tile coding [3], radial basis functions [...], and locally weighted regression [...], all of which aim to approximate the value function more compactly.
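To make the scaling limitation concrete, below is a minimal tabular Q-learning sketch in Python (the `env.reset()`/`env.step()` interface, hashable states, and hyperparameter values are illustrative assumptions, not code from the surveyed works). The table `Q` holds one entry per state-action pair, which is exactly what fails to scale or generalize:

```python
import random
from collections import defaultdict

def tabular_q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal tabular Q-learning sketch; assumes env.reset() -> state and
    env.step(action) -> (next_state, reward, done), with hashable states."""
    Q = defaultdict(float)  # one value per (state, action) pair
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # one-step bootstrapped (temporal-difference) target
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```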

However, extending deep learning to RL problems brings extra challenges such as non-i.i.d. (not independently and identically distributed) data. Many supervised learning methods assume that training data are drawn from an i.i.d. stationary distribution [36, 269, 28]. In contrast, in RL the training data consist of highly correlated sequential agent-environment interactions, which violate the independence assumption. Moreover, the RL training-data distribution is non-stationary, because the agent actively learns while exploring different parts of the state space, violating the identically-distributed condition [22].
In practice, using function approximators in RL requires making crucial representational decisions, and poor design choices can result in estimates that diverge from the optimal value function [1, 21, 46, 112, 334, 351]. In particular, function approximation, bootstrapping, and off-policy learning are considered the three main properties that, when combined, can cause learning to diverge; they are known as the deadly triad [315, 334]. Recently, some works have shown that non-linear (i.e., deep) function approximators poorly estimate the value function [104, 151, 331], and another work found problems with Q-learning using function approximation (over/under-estimation, instability and even divergence) due to the delusional bias: "delusional bias occurs whenever a backed-up value estimate is derived from action choices that are not realizable in the underlying policy class" [207]. Additionally, convergence results for reinforcement learning using function approximation are still scarce [21, 92, 207, 217, 330]; in general, stronger convergence guarantees are available for policy-gradient methods [316] than for value-based methods [315].
We now describe how existing DRL methods address these challenges, briefly revisiting value-based methods such as DQN [221], policy gradient methods such as Proximal Policy Optimization (PPO) [283], and actor-critic methods such as Asynchronous Advantage Actor-Critic (A3C) [158]. We refer the reader to recent surveys for broader coverage of single-agent DRL research.



Fig. 2

The ER buffer provides stability for learning, since random batches sampled from the buffer help alleviate the problems caused by non-i.i.d. data. However, it comes with disadvantages, such as higher memory requirements and more computation per real interaction. The ER buffer is mainly used for off-policy RL methods, as it can otherwise cause a mismatch between buffer content generated by an earlier policy and the current policy. Extending the ER buffer to the multiagent case is not trivial (see Sects. 3.5, 4.1 and 4.2). Recent works have aimed to reduce catastrophic forgetting (which occurs when a trained neural network performs poorly on previously learned tasks due to a non-stationary training distribution) in combination with the ER buffer, both in DRL and in MDRL.
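As an illustration of the mechanism (a minimal sketch, not the buffer implementation of any surveyed work), an experience replay buffer is essentially a bounded FIFO store with uniform random sampling; the capacity and batch size below are assumed values:

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform sampling breaks the temporal correlation of consecutive transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

In the multiagent case discussed in Sects. 3.5 and 4.2, the difficulty is that transitions stored by such a buffer were generated under other agents' older policies, so sampled batches can become obsolete.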
Various extensions of the original DQN have been proposed. Notable examples include Double DQN [336], which uses double estimators [130] to mitigate the overestimation bias of Q-learning (see Sect. 4.1), and the dueling architecture [345], which decomposes the Q-function by learning two streams: one estimating the state value and the other estimating the action advantages. The streams are combined in the final layer to produce Q-values, improving over the original DQN.
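A common way to write the dueling aggregation (a sketch following the notation of the dueling-architecture paper, with θ the shared parameters and α, β the parameters of the advantage and value streams; not an equation reproduced from this survey):

```latex
Q(s,a;\theta,\alpha,\beta) \;=\; V(s;\theta,\beta) \;+\; \Big(A(s,a;\theta,\alpha) \;-\; \tfrac{1}{|\mathcal{A}|}\sum_{a'} A(s,a';\theta,\alpha)\Big)
```

Subtracting the mean advantage makes the decomposition identifiable, since otherwise a constant could be shifted freely between V and A.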
In practice, DQN is trained using an input of four stacked frames, i.e., the last four frames the agent has experienced. If a game requires a memory of more than four frames, it will appear non-Markovian to DQN, because future states and rewards depend not just on the current input but also on the history [132]. Thus, DQN's performance degrades significantly when it is given incomplete state observations (e.g., a single frame), since it assumes full observability of the state.
Real-world tasks often feature incomplete and noisy state information resulting from partial observability (see Sect. 2.1). Deep Recurrent Q-Networks (DRQN) [131] proposed using recurrent neural networks, in particular Long Short-Term Memory (LSTM) cells [147], in DQN for this setting. Consider the architecture in Fig. 2 with the first dense layer after the convolution replaced by a layer of LSTM cells. With this addition, DRQN has memory capacity, so that it can even work with only one input frame rather than a stacked input of consecutive frames. This idea has been extended to MDRL, see Fig. 6 and Sect. 4.2. There are also other approaches to deal with partial observability, such as finite state controllers [218] (where action selection is performed according to the complete observation history) and using an initiation set of options conditioned on the previously employed option [302].
Fig. 3
[Figure 3](https://link.springer.com/article/10.1007/s10458-019-09421-1/figures/3)

Policy gradient methods For many tasks, particularly for physical control, the action space is continuous and high dimensional where DQN is not suitable. Deep Deterministic Policy Gradient (DDPG) [192] is a model-free off-policy actor-critic algorithm for such domains, based on the DPG algorithm [292] (see Sect. 2.1). Additionally, it proposes a new method for updating the networks, i.e., the target network parameters slowly change (this could also be applicable to DQN), in contrast to the hard reset (direct weight copy) used in DQN. Given the off-policy nature, DDPG generates exploratory behavior by adding sampled noise from some noise processes to its actor policy. The authors also used batch normalization [152] to ensure generalization across many different tasks without performing manual normalizations. However, note that other works have shown batch normalization can cause divergence in DRL [274, 335].
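The "slowly changing" target network mentioned above is typically implemented as a Polyak (soft) update of the target parameters θ′ toward the online parameters θ, with a small assumed coefficient τ ≪ 1:

```latex
\theta' \;\leftarrow\; \tau\,\theta \;+\; (1-\tau)\,\theta'
```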


Fig. 4
[Figure 4](https://link.springer.com/article/10.1007/s10458-019-09421-1/figures/4)
Asynchronous Advantage Actor-Critic (A3C) [219] uses multiple CPU workers and no ER buffer. Each worker has its own copy of the network and interacts with the environment independently to compute losses and gradients. The computed gradients are then passed to a global network that performs the optimization and synchronizes with the workers asynchronously. This distributed scheme was designed for single-agent deep RL; on a standard multi-core laptop CPU, without a GPU, A3C obtained better performance in many more Atari games while using substantially less training time [219]. We note, however, that more recent approaches use both multiple CPU cores (for more efficient generation of training data) and GPUs (for more efficient learning).

Trust Region Policy Optimization (TRPO) [283] and Proximal Policy Optimization (PPO) [284] are two prominent recent policy gradient methods. PPO is widely regarded as the state of the art due to its simplicity and strong empirical performance, although recent studies [151] have shown that both PPO and TRPO can exhibit unexpected behavior: gradient estimates are weakly correlated with the true gradient, and the value networks often fail to accurately predict the true value function. In contrast to vanilla policy gradient algorithms, PPO's loss function prevents abrupt policy changes during training, an idea inspired by early work by Kakade [166]. PPO has also been extended to a distributed variant, Distributed PPO (DPPO) [134]. Note that distributed approaches such as DPPO or A3C use parallelization only to improve the efficiency of data generation (via multiple CPU cores) in single-agent RL; they should not be considered multiagent approaches, with the exception of recent work that exploits such parallelization in multiagent environments [19].
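The mechanism that prevents abrupt policy changes is PPO's clipped surrogate objective; a standard form (notation assumed: r_t(θ) is the probability ratio, Â_t an advantage estimate, ε the clipping parameter) is:

```latex
L^{\mathrm{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```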
In the context of entropy-regularized reinforcement learning [126], there is a connection between policy gradient algorithms and Q-learning [282]: both the Q-function and the value function are slightly modified to account for the policy entropy. Building on this idea, Soft Actor-Critic (SAC) [127] is a recent algorithm that concurrently learns a stochastic policy, two Q-functions (inspired by Double Q-learning), and a value function. SAC alternates between collecting experience with the current policy and updating from batches sampled from an experience replay buffer.
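The entropy-regularized objective that this family of methods optimizes can be sketched as follows (α is the temperature weighting the entropy bonus; notation assumed rather than taken from this article):

```latex
J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\!\Big[\, r(s_t,a_t) \;+\; \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```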
We have reviewed a range of recent DRL algorithms. The list is by no means exhaustive, but it illustrates the cutting-edge techniques and approaches that will be most relevant when describing MDRL methods in the next section.
3 Multiagent deep reinforcement learning (MDRL)
We first briefly introduce the general framework of multiagent learning and then dive into the categorization of MDRL works and the research therein.
3.1 Multiagent learning
Learning in a multiagent environment is inherently more complex than in the single-agent case.



Regarding convergence, Littman [200] studied the convergence properties of reinforcement learning joint-action agents [70] in Markov games. In adversarial settings (zero-sum games), optimal play can be guaranteed against any opponent, as demonstrated by Minimax Q-learning [198]. In coordination settings (e.g., cooperative games with a shared reward function), Nash Q-learning [149] and Friend-or-Foe Q-learning [199] guarantee convergence to optimal behavior. For other types of environments, no value-based RL algorithms with proven convergence properties are currently known [200].
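For reference, the value backed up by Minimax Q-learning in a two-player zero-sum Markov game can be written as (a sketch; o ranges over the opponent's actions and π over the learner's mixed strategies at state s):

```latex
V(s) \;=\; \max_{\pi(s,\cdot)} \ \min_{o} \ \sum_{a} \pi(s,a)\, Q(s,a,o)
```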
Recent work on MDRL has addressed scalability and has focused significantly less on convergence guarantees, with a few exceptions [22, 40, 255, 297]. One notable work has shown a connection between update rules for actor-critic algorithms for multiagent partially observable settings and (counterfactual) regret minimization (Footnote 7): the advantage values are scaled counterfactual regrets. This led to new convergence properties of independent RL algorithms in zero-sum games with imperfect information [300]. The result is also used to support policy gradient optimization against worst-case opponents in a new algorithm called Exploitability Descent [204] (Footnote 8).
Seminal works in this area have surveyed progress across several domains, many of them focusing on coordination mechanisms in multiagent systems. Notably, some of these approaches were shown not to converge but rather to obtain best responses against certain classes of opponents. We refer the interested reader to the seminal works on convergence in multiagent domains.
There are many other common problems in MAL, such as action shadowing [...] and the curse of dimensionality [...]. We do not discuss each of them in depth here; instead, we refer the interested reader to more detailed resources on MAL in general and on these topics in particular.
3.2 MDRL categorization
In Sect. 2.2 we summarized recent advances in single-agent DRL, since an exhaustive compilation is beyond the scope of this article. This surge of research has made DRL increasingly sophisticated through integration with additional methodologies [...]. A promising avenue for future work is to extend these approaches and investigate their applicability in multiagent settings.
We analyzed the most recent works that have a direct connection to MDRL and are not covered by previous MAL surveys [6, 141]. We propose four categories, inspired by previous surveys [6, 55, 248, 305], which conveniently describe and represent current work. Note that some works fit into more than one category (they are not mutually exclusive), so their summaries appear in all applicable tables; in the text, however, we discuss each work within a single category. For each work we also specify its learning type, either value-based (e.g., DQN) or policy-gradient (e.g., actor-critic), and whether the setting evaluated is fully cooperative, fully competitive, or mixed (both cooperative and competitive).
Analysis of emergent behaviors. These works do not, in general, propose new learning algorithms; their main focus is to analyze and evaluate DRL algorithms, e.g., DQN [188, 264, 322], PPO [24, 264], and other methods such as A3C [187, 225, 322], in multiagent environments. They consider three primary settings: cooperative, competitive, and mixed. See Sect. 3.3 and Table 1.
Learning communication [96, 183, 225, 253, 256, 312]. These works explore a sub-area in which agents can share information with communication protocols, for example through direct messages [96] or via a shared memory [256]. This area is attracting attention and it had not been explored much in the MAL literature. See Sect. 3.4 and Table 2.
Learning cooperation. While not a new area, fostering cooperation among learning agents has a long research history and is actively studied in MDRL. The works in this category are evaluated in either cooperative or mixed settings. Some draw inspiration from MAL concepts such as leniency, hysteresis, and difference rewards and extend them to MDRL; a notable exception [99] instead takes RL's experience replay buffer and adapts it to MDRL. See Sect. 3.5 for further details.
Within the remainder of this section, we will outline all categories together with their summaries.
3.3 Emergent behaviors
These works analyze independently learning DRL agents (see Sect. 3.1) from the perspective of the types of behaviors that emerge (e.g., cooperative or competitive).
Leibo et al. [188] meanwhile studied independent DQNs in the context of sequential social dilemmas: Markov games that satisfy certain inequalities [188]. The focus of this work was to show that cooperative or competitive behaviors exist not only as discrete (atomic) actions but are temporally extended (over policies). In the related setting of social dilemmas, Lerer and Peysakhovich [189] extended the well-known Tit-for-Tat (TFT) strategy [15] to DRL and showed (theoretically and experimentally) that such agents can maintain cooperation. To construct these agents they used self-play together with two reward schemes: selfish and cooperative. Earlier, different MAL algorithms were designed to foster cooperation in social dilemmas with Q-learning agents [77, 303].
Self-play is a useful concept for learning algorithms, since it can guarantee convergence in certain classes of games [43, 291, 325] and has been a standard technique in previous RL and MAL works. Despite its widespread use, however, self-play can be brittle to forgetting past knowledge [180, 186, 275] (see Sect. 4.5 for a related discussion). To address this limitation, Leibo et al. [187] proposed Malthusian reinforcement learning, an extension of self-play to population dynamics that can avoid the local optima obtained by independent learners with intrinsic motivation [30]. A limitation of this work is that it does not position itself within the state of the art in evolutionary and genetic algorithms. Evolutionary strategies have been employed to solve RL problems [226] and to evolve function approximators [351]. Similarly, they have been applied in multiagent scenarios to compute approximate Nash equilibria [238] and as metaheuristic optimization algorithms [53, 54, 150, 248].
Bansal et al. [24] investigated competitive environments using the MuJoCo simulator [327]. Their approach involved training independent learning agents with PPO and introducing two primary modifications to address the MAL nature of the problem. First, they implemented exploration rewards [122], which provide dense reward signals enabling agents to learn basic (non-competitive) skills. Over time, these rewards gradually decreased in influence, emphasizing environmental (competitive) rewards. Exploration rewards were inspired by early work in robotics [212] and single-agent reinforcement learning [176], aiming to enhance sample efficiency through continuous feedback (Ng et al. [231] studied reward function modifications in MDPs). For multiagent scenarios, these dense rewards helped agents develop non-competitive skills early on, increasing their likelihood of producing positive rewards through random actions. The second modification introduced opponent sampling, which maintained a repository of older opponent versions for sampling, contrasting with the use of the most recent version.
Raghu et al. [264] investigated how DRL algorithms (DQN, A2C, and PPO) performed in a family of two-player zero-sum games with tunable complexity, called Erdos-Selfridge-Spencer games [91, 299]. Their reasoning is threefold: (i) these games provide a parameterized family of environments where (ii) optimal behavior can be completely characterized, and (iii) support multiagent play. Their work showed that algorithms can exhibit wide variation in performance as the algorithms are tuned to the game’s difficulty.
Lazaridou et al. [183] introduced a framework for language learning that relies on multiagent communication. The agents, implemented as feed-forward neural networks, aimed to develop an emergent language to solve the assigned task. The task was framed as a signaling game [103] in which two agents, a sender and a receiver, were presented with a pair of images. The sender was told which image was the target and was allowed to transmit a message (from a fixed vocabulary) to the receiver. The receiver obtained a positive reward only when it correctly identified the target image. The results showed that the agents could coordinate effectively in the visual domain under study. To probe the semantic properties of the learned communication protocol, the authors examined whether symbol usage reflected the semantics of the visual space. Despite some variability in symbol assignment, they observed that high-level object groups consistently corresponded to specific learned symbols, as analyzed with t-SNE [210], a visualization technique commonly used for high-dimensional data [29] that has also provided insight into the behavior of trained DRL agents [362]. A further objective of this study was to determine whether the developed language could be interpreted by humans. To this end, they grounded the learned symbols in natural language by augmenting the signaling game with a supervised image-labeling task (encouraging the sender to use conventional names, making communication easier for humans to understand). To assess interpretability, they conducted a crowdsourced survey in which human participants acted as receivers; participants correctly identified the target in 68% of cases.
Similarly, Mordatch and Abbeel [225] investigated the emergence of language with the difference that in their setting there were no explicit roles for the agents (i.e., sender or receiver). To learn, they proposed an end-to-end differentiable model of all agent and environment state dynamics over time to calculate the gradient of the return with backpropagation.
3.4 Learning communication
As discussed in the previous section, one of the desired emergent behaviors of multiagent interaction is communication. The setting typically studied is a group of cooperative agents in a partially observable environment, where the agents must learn an effective communication protocol in order to maximize their shared utility.
Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL) are two approaches that use deep networks to learn to communicate [96]. Both employ a neural network that outputs Q-values (as in standard DRL) together with a message to be sent to the other agents in the next timestep. RIAL is built on DRQN (Deep Recurrent Q-Networks) and uses parameter sharing, i.e., a single network whose weights are shared by all agents. DIAL, in contrast, passes gradients through the communication channel during training, so that communication can be optimized end-to-end.
Memory-driven (MD) communication was introduced on top of the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [206] approach. In MD-MADDPG [256], agents use a shared memory as a communication channel: before taking an action, each agent first reads the memory and then writes a response. The agent's policy thus becomes dependent on both its private observation and its interpretation of the collective memory. Experiments with two cooperating agents showed that communication patterns varied across environments: in simpler tasks memory activity decreased near task completion (since the environment became mostly static), whereas more complex tasks exhibited more frequent memory usage, linked to the multiple subtasks involved.
Dropout [301] is a technique to prevent overfitting in neural networks (in supervised learning, this occurs when the learning algorithm performs well on a specific dataset but generalizes poorly); it is based on randomly dropping units and their connections during training. Inspired by dropout, Kim et al. [173] proposed a similar approach for multiagent environments in which agents can communicate directly through messages: messages from other agents are dropped out at training time, yielding the Message-Dropout MADDPG algorithm [173]. The method is designed for fully or limited communication settings. Experiments showed that, with a properly chosen message-dropout rate, the approach substantially improves training speed and policy robustness (assessed by introducing communication errors) during execution. This property is important for transferring MDRL agents trained in simulated or controlled environments to more realistic settings.
While RIAL and DIAL used a discrete communication channel, CommNet [312] uses a continuous vector channel, through which agents receive the summed transmissions of the other agents. The authors assume full cooperation among all agents and train a single network encompassing all participants. Two features distinguish CommNet from previous approaches: it allows multiple communication cycles within each timestep, and agents can dynamically join and leave the environment at runtime.
Compared to earlier methods, the Multiagent Bidirectionally Coordinated Network (BiCNet) [253] introduces communication within the latent space, specifically within hidden layers. Notably, while it employs parameter sharing, it innovates by utilizing bidirectional recurrent neural networks [285] to simulate both actor and critic networks within its architecture. It is important to observe that in BiCNet, agents do not explicitly share messages, thereby establishing this framework as a methodology for learning cooperative behaviors.
Learning communication remains an active topic in MDRL with many open questions. In this regard, we refer the reader to the recent work of Lowe et al. [205], which discusses common pitfalls (and recommendations to avoid them) when measuring communication in multiagent environments.
3.5 Learning cooperation
Although explicit communication is a new trend in MDRL, there is a substantial body of prior work in MAL on cooperative settings without any form of explicit communication [213, 248] (Footnote 12). This domain therefore serves as a natural starting point for reviewing recent MDRL advances.
Foerster et al. [99] studied the simple scenario of cooperation with independent Q-learning agents (see Sect. 3.1), where the agents use the standard DQN architecture and an experience replay buffer (see Fig. 3). However, in the multiagent setting these assumptions no longer hold: the dynamics that generated the data in the ER no longer reflect the current dynamics, making the sampled experience obsolete [99, 194]. Their solution is to add auxiliary information to each experience tuple that helps disambiguate the age of the sampled data.
Lenient-DQN (LDQN) [247] extends the concept of leniency [37] (originally presented in MAL) to MDRL. To overcome the pathology of relative overgeneralization [249, 250, 347], lenient learners are designed to mitigate the effect of transition noise and to prevent agents from being drawn to sub-optimal but wide peaks in the reward space [246]. Like other methods aimed at relative overgeneralization (e.g., distributed Q-learning [181] and hysteretic Q-learning [213]), lenient learners are initially optimistic [99]. LDQN additionally introduces a new element in the sampling from the ER buffer: leniency values are stored with each experience tuple and used as a condition, so that samples that do not satisfy the leniency condition are discarded. This prevents outdated samples from misleading the learning process.
In a similar vein, Decentralized-Hysteretic Deep Recurrent Q-Networks (DEC-HDRQNs) [244] were proposed for fostering cooperation among independent learners. The motivation is similar to LDQN, making an optimistic value update; however, their solution is different. Here, the authors took inspiration from Hysteretic Q-learning [213], originally presented in MAL, where two learning rates are used. A difference between lenient agents and hysteretic Q-learning is that lenient agents are only initially forgiving towards teammates: over time, lenient learners apply less leniency towards updates that would lower utility values, taking into account how frequently observation-action pairs have been encountered. The idea is that the transition from optimistic to average-reward learner helps make lenient learners more robust towards misleading stochastic rewards [37]. Additionally, in DEC-HDRQNs the ER buffer is extended into concurrent experience replay trajectories, which are composed of three dimensions: agent index, episode, and timestep; when training, the sampled traces have the same starting timesteps. Moreover, to improve generalization over different tasks, i.e., multi-task learning [62], DEC-HDRQNs make use of policy distillation [146, 273] (see Sect. 4.1). In contrast to other approaches, DEC-HDRQNs are fully decentralized during learning and execution.
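A minimal sketch of the hysteretic idea referred to above (tabular form with a dict-like `Q` and illustrative learning rates; DEC-HDRQNs apply the same asymmetry with deep recurrent networks): a larger learning rate is used when the TD error raises the estimate than when it lowers it, so an independent learner stays optimistic about teammates' exploratory mistakes.

```python
def hysteretic_q_update(Q, s, a, r, s_next, actions, done,
                        alpha=0.1, beta=0.01, gamma=0.99):
    """One hysteretic Q-learning step: alpha for increases, smaller beta for decreases."""
    best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
    delta = r + gamma * best_next - Q[(s, a)]
    lr = alpha if delta >= 0 else beta  # beta < alpha dampens value-lowering updates
    Q[(s, a)] += lr * delta
    return delta
```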
Weighted Double Deep Q-Network (WDDQN) [365] is based on double estimators. This idea was originally introduced in Double Q-learning [130] and aims to remove the overestimation bias caused by using the maximum action value as an approximation of the maximum expected action value (see Sect. 4.1). WDDQN also uses a lenient reward [37] to be optimistic during the initial phase of coordination, and proposes a scheduled replay strategy in which samples closer to the terminal state are heuristically given higher priority; this strategy might not be applicable to every domain. For other works extending the ER to multiagent settings see MADDPG [206] and Sects. 4.1 and 4.2.

A schematic view of the architecture used in FTW (For the Win) [156]: two unrolled recurrent neural networks (RNNs) operate at different time scales, the idea being that the slower RNN helps with long-term temporal correlations. Observations are latent-space outputs of a convolutional neural network that learns non-linear features. Related hierarchical ideas appear in single-agent RL: Feudal Networks [338] also maintain a multi-time-scale hierarchy in which the slower network sets goals and the faster network tries to achieve them. Feudal Networks are, in turn, inspired by early RL work that proposed a hierarchy of Q-learners [82, 296].


Lowe et al. observed that employing standard policy gradient techniques (as detailed in Section 2.1) in multi-agent environments leads to high variance and suboptimal performance. This is due to the fact that variance increases further, as all agents' rewards are influenced by the actions of the other agents, a relationship that has been formally established to show that as the number of agents grows, the likelihood of correctly identifying the gradient direction decreases exponentially [206]. To address this challenge, Lowe et al. introduced the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm [206], building upon the foundation of DDPG [192] (as elaborated in Section 2.2). The MADDPG approach involves training a centralized critic for each agent during the learning process, which is provided with all agents' policies to reduce variance by eliminating non-stationarity caused by concurrent learning. In this setup, each agent operates with only local information (thereby transforming the method into a centralized training framework with decentralized execution), while an experience replay buffer records experiences from all agents. The MADDPG algorithm was evaluated in both cooperative and competitive settings, with experimental results demonstrating its superior performance compared to decentralized methods such as DQN, DDPG, and TRPO. The authors emphasize that conventional reinforcement learning methods fail to provide consistent gradient signals, a limitation particularly evident in complex competitive environments where agents continuously adapt to one another, leading to oscillating best-response policies. MADDPG has been shown to learn more robustly than DDPG in such scenarios.
Another approach based on policy gradients is Counterfactual Multi-Agent Policy Gradients (COMA) [98]. COMA was designed for the fully centralized setting and tackles the multiagent credit assignment problem [332]: determining each agent's contribution in cooperative settings where only a global reward is available. COMA proposes a counterfactual baseline, computed by marginalizing out a single agent's action while keeping the other agents' actions fixed; an advantage function is then obtained by comparing the current Q-value to this counterfactual. The counterfactual baseline has its roots in difference rewards [332], a technique for isolating an individual agent's contribution in cooperative multiagent teams. In particular, the aristocrat utility aims to measure the difference between an agent's actual action and the average action [355]; the intention is to sideline the agent by considering a reward that does not depend on the agent's action, i.e., the reward that would arise in a world without that agent (see Sect. 4.2).
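A sketch of COMA's counterfactual advantage in the notation of the COMA paper (u is the joint action, u^a agent a's action, τ^a its action-observation history; the symbols are assumed, not reproduced from this survey):

```latex
A^{a}(s,\mathbf{u}) \;=\; Q(s,\mathbf{u}) \;-\; \sum_{u'^{a}} \pi^{a}\!\left(u'^{a} \mid \tau^{a}\right)\, Q\!\left(s,\big(\mathbf{u}^{-a}, u'^{a}\big)\right)
```

Only agent a's action is marginalized out while the other agents' actions u^{-a} stay fixed, so the baseline isolates that agent's contribution without changing the expected policy gradient.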
On the one hand, fully centralized approaches such as COMA do not suffer from non-stationarity but have limited scalability. On the other hand, independent learners scale better but must cope with non-stationarity. Several hybrid methods learn a centralized but factored Q-function [119, 174]. Value Decomposition Networks (VDNs) [313] decompose the team value function into a sum of per-agent value functions. QMIX [266] goes beyond this additive decomposition and uses a mixing network that combines the local values in a non-linear (monotonic) way. While these methods have achieved notable empirical success [266], open questions remain about how well such value-function factorizations with function approximators capture complex coordination problems, and about whether the factorizations can be learned reliably [64] (see Sect. 4.4).
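In symbols (a sketch using common notation, with τ the joint action-observation history, u the joint action, and s the global state): VDN assumes an additive decomposition, whereas QMIX only requires the mixing function to be monotonic in each agent's utility:

```latex
Q_{\mathrm{VDN}}(\boldsymbol{\tau},\mathbf{u}) \;=\; \sum_{i=1}^{N} Q_i(\tau_i,u_i),
\qquad
Q_{\mathrm{QMIX}}(\boldsymbol{\tau},\mathbf{u}) \;=\; f_{\mathrm{mix}}\big(Q_1(\tau_1,u_1),\dots,Q_N(\tau_N,u_N);\,s\big),
\quad
\frac{\partial Q_{\mathrm{QMIX}}}{\partial Q_i} \;\ge\; 0
```

The monotonicity constraint is what allows each agent to act greedily on its own Q_i while remaining consistent with the argmax of the joint Q.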
3.6 Agents modeling agents
An important ability for agents is to reason about the behaviors of other agents by constructing models that make predictions about the modeled agents. An early work on modeling agents with deep neural networks was the Deep Reinforcement Opponent Network (DRON). The idea is to have two networks: one that evaluates Q-values and one that learns a representation of the opponent's policy. The authors proposed using multiple expert networks, each specializing in a particular opponent strategy, whose predictions are combined to estimate the Q-value. These ideas echo earlier work on type-based reasoning in game theory [129, 167], later used in AI [6, 26, 109]. The mixture-of-experts idea was originally presented in supervised learning, where each expert handles a subset of the data (a subtask) and a gating network decides which expert to use.
Fig. 6
[Figure 6](https://link.springer.com/article/10.1007/s10458-019-09421-1/figures/6)

DRON uses hand-crafted features to define the opponent network. In contrast, Deep Policy Inference Q-Network (DPIQN) and its recurrent version, DPIRQN [148] learn policy features directly from raw observations of the other agents. The way to learn these policy features is by means of auxiliary tasks [158, 317] (see Sects. 2.2 and 4.1) that provide additional learning goals, in this case, the auxiliary task is to learn the opponents’ policies. This auxiliary task modifies the loss function by computing an auxiliary loss: the cross entropy loss between the inferred opponent policy and the ground truth (one-hot action vector) of the opponent. Then, the Q value function of the learning agent is conditioned on the opponent’s policy features (see Fig. 6), which aims to reduce the non-stationarity of the environment. The authors used an adaptive training procedure to adjust the attention (a weight on the loss function) to either emphasize learning the policy features (of the opponent) or the respective Q values of the agent. An advantage of these approaches is that modeling the agents can work for both opponents and teammates [148].
Learning opponent models from observations is common in prior work. Self Other Modeling (SOM) [265] takes a different approach: it uses the agent's own policy to predict the opponent's behavior. SOM can be used in cooperative and competitive settings (with an arbitrary number of agents) and infers the other agents' goals. This matters because, in the evaluated domains, the reward function depends on the agents' goals. SOM uses two networks: one that computes the agent's own policy and one that infers the other agent's goal. The idea is that these networks share the same input parameters but take different values (the agent's own or the other's). Unlike previous approaches, SOM does not focus on learning the opponent's policy (i.e., a probability distribution over its next actions) but on estimating the opponent's goal. SOM works best when agents share a set of goals from which each agent selects one at the beginning of an episode, and the reward function depends on both agents' selected goals. Despite its simple structure, training takes longer, since reasoning about the other agent adds computation at each step.
There is a long-standing history of combining game theory and MAL [43, 233, 289]. From that context, some approaches were inspired by influential game-theoretic methods. Neural Fictitious Self-Play (NFSP) [136] builds on fictitious (self-) play [49, 135], together with two deep networks, to find approximate Nash equilibria (Footnote 14) in two-player imperfect-information games [341] (for example, in Poker, when it is an agent's turn to move it does not have access to all information about the world). One network learns an approximate best response (ε-greedy over Q-values) to the historical behavior of other agents, and the second (called the average network) learns to imitate its own past best-response behavior using supervised classification. The agent behaves using a mixture of the average and best-response networks, depending on the probability of an anticipatory parameter [287]. Comparisons with DQN in Leduc Hold'em Poker revealed that DQN's deterministic strategy is highly exploitable. Such strategies are sufficient to behave optimally in single-agent domains, i.e., the MDPs for which DQN was designed, but imperfect-information games generally require stochastic strategies to achieve optimal behavior [136]. Moreover, DQN's learning experiences are both highly correlated over time and highly focused on a narrow state distribution, in contrast to NFSP agents, whose experience varies more smoothly, resulting in a more stable data distribution, more stable neural networks, and better performance.

The (N)FSP concept was further generalized in Policy-Space Response Oracles (PSRO) [180], where it was shown that fictitious play is one specific meta-strategy distribution over a set of previous (approximate) best responses (summarized by a meta-game obtained by empirical game-theoretic analysis [342]), but there is a wide variety to choose from. One reason to use mixed meta-strategies is that they prevent overfitting (Footnote 15) the responses to one specific policy, and hence provide a form of opponent/teammate regularization. An approximate scalable version of the algorithm leads to a graph of agents best-responding independently, called Deep Cognitive Hierarchies (DCHs) [180] due to its similarity to behavioral game-theoretic models [59, 72].
In game theory, minimax is a foundational concept, roughly described as minimizing the worst-case scenario (maximum loss) [341]. Li et al. [190] used minimax to develop robust learning in multiagent environments, so that learned policies remain effective even against opponents with unforeseen behaviors. They extended MADDPG to M3DDPG by incorporating a minimax objective into the learning problem. The resulting minimax learning objective is computationally intractable to optimize directly; to address this, they took inspiration from robust reinforcement learning [227], which implicitly adopts the minimax idea by considering worst-case perturbations [257]. In MAL, different approaches were proposed to assess the robustness of algorithms, e.g., guarantees of safety [66, 259], security [73], or exploitability [80, 161, 215].
Previous approaches usually learn a model of the other agents in order to predict their behavior. However, they do not explicitly account for the anticipated learning of the other agents, which is the aim of Learning with Opponent-Learning Awareness (LOLA) [97]. LOLA optimizes the expected return after the opponent updates its policy one step [97]. Concretely, a LOLA agent takes into account a one-step policy update of the opponent and computes its own response based on that anticipated update [363].
Theory of mind is part of a category of recursive reasoning approaches [60, 61, 109, 110] in which agents hold explicit beliefs about the mental states of other agents; these mental states may themselves contain beliefs and mental states about others, leading to a recursive structure [6]. The Theory of Mind Network (ToMnet) [263] starts from the premise that, when encountering a novel opponent, the agent should already have a strong and well-founded prior about how that opponent will behave. ToMnet's architecture comprises three networks: (i) a character network that learns from historical information, (ii) a mental state network that takes the character output together with the recent trajectory, and (iii) a prediction network that takes the current state and the outputs of the other networks as input. The output is designed for different tasks, but its main goal is to predict the opponent's next action. A strength of ToMnet is that it can predict general behavior over all agents, or specific behavior for a particular agent.
Deep Bayesian Theory of Mind Policy (Bayes-ToMoP) [358] is another algorithm that draws inspiration from theory of mind [76]. The algorithm assumes the opponent uses different stationary strategies and switches among them over time [140]. Earlier MAL work targeted this setting, e.g., BPR+ [143], which extends Bayesian policy reuse to multiagent settings (BPR assumes a single-agent environment; BPR+ aims to best respond to the opponent). A notable limitation of BPR+ is that it performs poorly in self-play, so Deep Bayes-ToMoP adds a higher level of reasoning based on theory of mind to improve performance against BPR+ agents.
Deep BPR+ [366] is also inspired by BPR+ and uses neural networks as value-function approximators. In addition to the environment reward, it uses an online-learned opponent model [139, 144] to construct a rectified belief over the opponent's strategy. It also borrows ideas from policy distillation [146, 273], extending them to the multiagent setting to create a distilled policy network: when a new acting policy is learned, distillation is applied to consolidate the updated library, which improves storage efficiency and generalization over existing strategies.
4 Bridging RL, MAL and MDRL
This section aims to provide directions toward fruitful cooperation between sub-communities. First, we address the pitfall of deep learning amnesia, roughly described as missing citations to the original works and failing to exploit advances made in the past; we provide concrete examples of how early ideas from reinforcement learning (RL) and multiagent learning (MAL) have been successfully extended to multiagent deep reinforcement learning (MDRL) (see Sect. 4.1). Second, we summarize the lessons learned from the works analyzed in this survey (see Sect. 4.2). We then point to recent MDRL benchmarks (see Sect. 4.3) and discuss practical challenges in the field, such as high computational demands and reproducibility (see Sect. 4.4). Finally, we pose some open research questions and relate them to previously unsolved questions in MAL [6] (see Sect. 4.5).
4.1 Avoiding deep learning amnesia: examples in MDRL
This survey focuses on recent deep works; when describing recent algorithms we also point to the original works that inspired them. Schmidhuber noted that "[machine learning] is the science of credit assignment; the machine learning community itself profits from proper credit assignment to its members" [28]. In this context, we want to avoid the pitfall of not giving credit to original ideas proposed earlier, a.k.a. deep learning amnesia. Our intent is to highlight that the existing literature contains pertinent ideas and algorithms that should not be overlooked; on the contrary, they should be examined and cited [58, 79] in order to understand recent developments [343].
Dealing with non-stationarity in independent learners When multiple agents learn concurrently, the environment becomes non-stationary from the perspective of any individual agent. Remedies proposed in MAL, such as hysteresis and leniency, have since been extended to MDRL (see Sect. 3.5).
Multiagent credit assignment In cooperative multiagent scenarios, it is common to use either local rewards, unique to each agent, or global rewards, which represent the performance of the entire group [3]. However, local rewards are usually harder to obtain, so it is common to rely only on global rewards. This raises the problem of credit assignment: how do a single agent's actions contribute to a system that involves the actions of many agents [2]? A solution from MAL research that has proven successful in many scenarios is difference rewards [3, 86, 332], which aim to capture an agent's contribution to the system's global performance. In particular, the aristocrat utility measures the difference between an agent's actual action and the average action [355]; however, it has a self-consistency problem, and in practice it is more common to compute the wonderful life utility [355, 356], which uses a clamping operation equivalent to removing that player from the team. COMA [98] builds on these concepts to propose an advantage function based on the agent's contribution, which can be efficiently computed with deep neural networks (see Sect. 3.5).
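To make the idea concrete (schematic, our notation), the difference reward for agent i compares the global reward with the global reward obtained when agent i's contribution is removed or replaced by a default action:

```latex
% Difference reward (schematic): G is the global reward, z the joint
% state-action, and z_{-i} \cup c_i the same joint state-action with
% agent i's contribution replaced by a default action c_i
% (the "wonderful life utility" uses such a clamping operation).
D_i(z) \;=\; G(z) \;-\; G\big(z_{-i} \cup c_i\big)
```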
Multi-task learning In the context of RL, multi-task learning [324] is an area concerned with developing agents that can solve multiple related tasks rather than a single one. Distillation, roughly defined as transferring knowledge from a large model to a small model, was originally introduced for supervised learning and model compression [52]. These ideas have also been leveraged in MDRL, for example, Deep BPR+ (discussed above) applies distillation to consolidate its policy library.
Auxiliary tasks Jaderberg et al. [158] introduced the term auxiliary task, noting that (single-agent) environments contain a multitude of possible training signals (e.g., pixel changes). These tasks are naturally implemented in DRL by splitting the last layer into multiple parts (heads), each working on a different task. All heads propagate errors into the same shared preceding part of the network, which then tries to form representations, in its next-to-last layer, that support all the heads [315]. However, the idea of making multiple predictions about arbitrary signals was originally proposed for RL in the context of general value functions [315, 317], and open problems remain, such as a deeper theoretical understanding [31, 88]. Earlier work on neural networks also proposed ways to improve performance and training time [142, 225]: Suddarth and Kergosien [311] presented a minimal example of a small neural network in which adding an auxiliary task effectively removed local minima. One could further extend such auxiliary tasks to model the behavior of other agents.
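Schematically (our notation), training with auxiliary tasks adds weighted auxiliary losses, each computed from its own head, to the main RL loss over the shared representation:

```latex
% Auxiliary tasks (schematic): the shared parameters \theta are trained
% on the main RL loss plus weighted auxiliary losses, each auxiliary
% task k having its own output head and weight \lambda_k.
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{RL}}(\theta)
\;+\; \sum_{k} \lambda_k \,\mathcal{L}_{\mathrm{aux}}^{(k)}(\theta)
```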
Experience replay Experience replay buffers were introduced by Lin to make learning more efficient by reusing past experiences, and they later became a central component of DQN (see Sect. 2.2). In MDRL, naively using a replay buffer is problematic: because the other agents keep learning, stored experiences quickly become obsolete and no longer reflect the current dynamics of the environment. Remedies proposed in MDRL include importance-sampling corrections, conditioning on a low-dimensional "fingerprint" of the other agents' policies, and replay schemes adapted to concurrently learning agents (e.g., DEC-HDRQNs, see Sect. 3.5).
Double estimators Double Q-learning [130] was proposed to reduce the overestimation of action values in Q-learning, which stems from using the maximum action value as an approximation of the maximum expected action value. Double Q-learning keeps two Q-functions and was proven to converge to the optimal policy [130]. This idea was later generalized to arbitrary function approximators, including deep neural networks, i.e., Double DQN [336]; the extension is natural because DQN already maintains two networks (see Sect. 2.2). These ideas have recently been applied to MDRL as well [365].
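A minimal tabular sketch of the double-estimator idea (ours, for illustration, not the exact pseudocode of the cited works): two tables Q_a and Q_b are updated on randomly chosen steps, and the estimator that selects the greedy action is evaluated by the other, which reduces the overestimation bias of a single maximizing estimator.

```python
import random
from collections import defaultdict

# Minimal tabular Double Q-learning update (illustrative sketch):
# two estimators Q_a and Q_b; on each update one estimator selects
# the argmax action and the *other* evaluates it.

ALPHA, GAMMA = 0.1, 0.99
ACTIONS = [0, 1, 2]
Q_a = defaultdict(float)   # keyed by (state, action)
Q_b = defaultdict(float)

def greedy_action(Q, state):
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def double_q_update(state, action, reward, next_state, done):
    # Randomly choose which estimator to update (alternating also works).
    if random.random() < 0.5:
        Q_sel, Q_eval = Q_a, Q_b
    else:
        Q_sel, Q_eval = Q_b, Q_a
    a_star = greedy_action(Q_sel, next_state)             # selection by one table
    target = reward if done else reward + GAMMA * Q_eval[(next_state, a_star)]
    Q_sel[(state, action)] += ALPHA * (target - Q_sel[(state, action)])

# Example update with dummy transition data:
double_q_update(state="s0", action=1, reward=0.5, next_state="s1", done=False)
print(Q_a[("s0", 1)], Q_b[("s0", 1)])
```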
4.2 Lessons learned
4.2 经验教训
We have shown how ideas from RL and MAL can be extended to MDRL settings. Here, we summarize the best practices learned from the works analyzed throughout this survey.

Centralized learning with decentralized execution Many MAL works were either fully centralized or fully decentralized approaches. However, inspired by decentralized partially observable Markov decision processes (DEC-POMDPs) [34, 237],Footnote17 this mixed paradigm has become common in MDRL [98, 99, 180, 206, 247, 266] (a notable exception is DEC-HDRQNs [244], which perform both learning and execution in a decentralized manner, see Sect. 3.5). Note that not all real-world problems fit this paradigm; it is better suited to robotics or games, where a simulator is usually available [96]. Its main benefit is that additional information (e.g., global state, actions, or rewards) can be used during learning, while this information is removed at execution time.
Parameter sharing is a common component in many MDRL approaches: a single network whose weights are shared among multiple agents. Even though the agents feed different observations into the network (e.g., in partially observable settings), they can still exhibit distinct behaviors. This idea was proposed concurrently in different works [96, 124] and has since been widely adopted [99, 253], among others.
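A minimal sketch of the idea (ours, not taken from any of the cited works): one policy network is shared by all agents, and an agent-id one-hot appended to the observation, an illustrative assumption, lets the shared weights still produce agent-specific behavior.

```python
import torch
import torch.nn as nn

# Parameter sharing (illustrative sketch): a single policy network is used
# by every agent; appending a one-hot agent id to the observation allows
# agents with shared weights to behave differently.

N_AGENTS, OBS_DIM, N_ACTIONS = 3, 8, 4

shared_policy = nn.Sequential(
    nn.Linear(OBS_DIM + N_AGENTS, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)

def act(agent_id: int, obs: torch.Tensor) -> int:
    one_hot = torch.zeros(N_AGENTS)
    one_hot[agent_id] = 1.0
    logits = shared_policy(torch.cat([obs, one_hot]))
    return int(torch.distributions.Categorical(logits=logits).sample())

# All agents query the same network (and, during training, gradients from
# every agent's experience would update the same shared weights).
observations = [torch.randn(OBS_DIM) for _ in range(N_AGENTS)]
actions = [act(i, o) for i, o in enumerate(observations)]
print(actions)
```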
Recurrent networks Recurrent neural networks (RNNs) enhance neural networks with a memory capability; however, they suffer from the vanishing gradient problem, which makes them inefficient at modeling long-term dependencies [252]. RNN variants such as LSTMs [114, 147] and GRUs (Gated Recurrent Units) [67] address this challenge. In single-agent DRL, DRQN [131] first proposed using recurrent networks in partially observable environments. Later, Feudal Networks [338] proposed a hierarchical approach [82] using multiple LSTM networks with different time-scales, i.e., the observation input schedule differs for each LSTM network, creating a temporal hierarchy that better addresses the long-term credit assignment challenge in RL. Recently, the use of recurrent networks has been extended to MDRL to address the challenge of partial observability [24, 96, 148, 244, 253, 263, 265, 266, 313], for example in FTW [156], depicted in Fig. 5, and in DRPIQN [148], depicted in Fig. 6. See Sect. 4.4 for practical challenges (e.g., training issues) of recurrent networks in MDRL.
Overfitting in MAL In single-agent RL, agents can overfit to the environment [352]. A similar problem arises in multiagent settings [160]: an agent's policy can easily get stuck in a local optimum, so that the learned policy is only locally optimal with respect to the other agents' current policies [190]. This limits the generalization of the learned policies [180]. To reduce this problem, one option is to maintain an ensemble of policies and learn from them, or to best respond to a mixture of them [133, 180, 206]. Another option is to make the algorithms more robust: a robust policy should perform well even against strategies different from those it was trained with [190].
4.3 Benchmarks for MDRL
4.3 MDRL的基准
The Arcade Learning Environment (ALE) [32, 211] and OpenAI Gym [48] have established standardized environments that have enabled single-agent reinforcement learning to move beyond toy domains. For deep reinforcement learning (DRL), there are open-source frameworks that provide compact and reliable implementations of state-of-the-art DRL algorithms [65]. Although multi-agent deep reinforcement learning (MDRL) is a relatively new research area, there are now numerous open-source simulators and benchmark platforms available, each with distinct characteristics that we will elaborate on below.
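Most of these platforms expose a Gym-like reset/step interface extended to multiple agents. The sketch below (ours, using a hypothetical MultiAgentEnv class rather than the API of any specific platform listed in this section) illustrates the typical interaction loop with per-agent observations, actions, and rewards.

```python
import random
from typing import Dict, Tuple

# Hypothetical multiagent environment interface (not the API of any of the
# platforms described below) illustrating the typical Gym-like interaction
# loop with per-agent observations, actions, and rewards.

class MultiAgentEnv:
    def __init__(self, n_agents: int = 2):
        self.n_agents = n_agents
        self.t = 0

    def reset(self) -> Dict[int, float]:
        self.t = 0
        return {i: random.random() for i in range(self.n_agents)}   # obs per agent

    def step(self, actions: Dict[int, int]) -> Tuple[Dict[int, float], Dict[int, float], bool]:
        self.t += 1
        obs = {i: random.random() for i in range(self.n_agents)}
        rewards = {i: float(actions[i] == 0) for i in range(self.n_agents)}  # dummy reward
        done = self.t >= 10
        return obs, rewards, done

env = MultiAgentEnv()
obs = env.reset()
done = False
while not done:
    actions = {i: random.choice([0, 1]) for i in obs}   # random policies
    obs, rewards, done = env.step(actions)
print("episode finished after", env.t, "steps")
```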
Fully Cooperative Multiagent Object Transportation Problems (CMOTPs) Footnote18 were originally presented by Busoniu et al. [56] as a simple two-agent coordination problem in MAL. Palmer et al. [247] proposed two pixel-based extensions of the original setting, which include narrow passages that test the agents' ability to master fully-cooperative sub-tasks, stochastic rewards, and noisy observations, see Fig. 7a.
The Apprentice Firemen Game (inspired by the classic climbing game [70]) is another two-agent pixel-based environment that simultaneously confronts learners with four MAL-related pathologies: relative overgeneralization, stochasticity, the moving-target problem, and alter-exploration [246].
Pommerman [267] is a multiagent benchmark that supports cooperative, competitive, and mixed (cooperative and competitive) scenarios. It features partial observability and communication among agents (see Fig. 7b). From an exploration perspective it is also a challenging domain, because rewards are very sparse and delayed [107]. A competition was recently held at NeurIPS 2018, Footnote20 and the top agents from that competition are available for research purposes.
The StarCraft Multi-Agent Challenge (SMAC) [276] is derived from the real-time strategy game StarCraft II and focuses on micromanagement challenges, where fine-grained control of individual units is required. Each unit is controlled by an independent agent that must act based on local observations. The challenge is accompanied by a multiagent deep reinforcement learning framework that includes implementations of state-of-the-art algorithms such as QMIX and COMA.
The Multi-Agent Reinforcement Learning in Malmö (MARLÖ) competition is another multiagent challenge based on multiple cooperative 3D games. The scenarios are built with the open-source Malmö platform and provide examples of the cooperative, competitive, and mixed multiagent scenarios that can be defined in Minecraft.
Hanabi is a cooperative multiplayer card game for two to five players. Its defining characteristic is that players cannot observe their own cards, but the other players can reveal information about them. This makes it an interesting challenge for learning algorithms, in particular in the contexts of self-play learning and ad hoc teamwork [5, 44, 304]. The Hanabi Learning Environment [25] was recently released, Footnote24 together with a baseline (deep RL) agent that serves as a starting point for subsequent research [145].
Arena [298] is a platform for multiagent research based on the Unity engine. It provides 35 multiagent games (e.g., social dilemmas) and supports communication between agents. It also includes baseline implementations of recent DRL algorithms, such as independent PPO learners.
MuJoCo Multiagent Soccer [203] uses the MuJoCo physics engine [327]. The environment simulates a 2 vs. 2 soccer game with agents having a 3-dimensional action space.Footnote26
Neural MMO [308] is a research platform Footnote27 inspired by the human game genre of Massively Multiplayer Online Role-Playing Games (MMORPGs). These games involve a large and variable number of players competing to survive.
[Fig. 7](https://link.springer.com/article/10.1007/s10458-019-09421-1/figures/7) Examples of MDRL benchmarks referenced above: a the pixel-based CMOTP extension [247]; b Pommerman [267]
4.4 Practical challenges in MDRL
4.4 MDRL的实际挑战
In this section, we take a more critical view of MDRL and describe practical challenges that already arise in DRL and are likely to occur in MDRL as well, such as reproducibility, the need for hyperparameter tuning, computational demands, and conflated results. We also point to ways in which some of these challenges can be (partially) addressed.
Reproducibility, troubling trends and negative results Reproducibility is already a challenge in (D)RL, and MDRL inherits and likely compounds it; below we discuss implementation and hyperparameter issues, computational demands, and the role of negative results.
Implementation challenges and hyperparameter tuning Implementations of DRL algorithms are often far from straightforward; they usually include non-trivial optimizations that are sometimes necessary to achieve good performance [151]. Tucker et al. [331] found that several published works on action-dependent baselines contained bugs and errors, and that these, rather than the proposed methods, were responsible for the high reported performance. Melis et al. [216] compared a series of works with increasing innovations in network architectures against the original LSTM [147] (proposed in 1997), which underlies many sequence-processing tasks, and found that, when properly tuned, LSTMs outperformed the more recent models. In this context, Lipton and Steinhardt noted that the community might have benefited more from learning the details of the hyperparameter tuning [197]. A partial explanation for this surprising result may be that such networks are notoriously difficult to train [252], and recent DRL works report problems when using recurrent networks [78, 95, 106, 123]. Another related complication is catastrophic forgetting (see Sect. 2.2), which has resurfaced in recent DRL work [264, 336]; we expect similar issues to appear in MDRL. Henderson et al. [137] analyzed the effect of hyperparameter tuning in DRL in depth and showed that hyperparameters can have drastically different effects across algorithms (e.g., TRPO, DDPG, PPO and ACKTR) and environments, because of the intricate interplay among them [137]. The authors urge reporting all parameters used in experimental comparisons, a practice we encourage not only for MDRL but as a broader research standard. Note that hyperparameter tuning is also related to the troubling trend of cherry picking, since it can yield a carefully selected set of parameters that makes an algorithm work (see the previous challenges). Lastly, because hyperparameter tuning requires substantial compute, it leads naturally to the next challenge: computational demands.
Computational resources In DRL, millions of interactions are typically needed for an agent to learn [9], i.e., sample efficiency is low [361], which highlights the need for large computational infrastructure. The original A3C implementation [219] used 16 CPU workers for 4 days to train an agent to play an Atari game, for a total of 200 million training frames (results for 57 Atari games are documented in Footnote29). Distributed PPO [134] used 64 workers (presumably one CPU per worker, although this is not stated explicitly in the paper) for roughly 100 hours (more than 4 days) to train locomotion tasks. In MDRL, for example on the Atari Pong game [322], agents were trained for 50 epochs of 250,000 time steps each, for a total of 12.5 million training frames. The FTW agent [156] used 30 agents (processes) in parallel, with every training game lasting about five minutes; FTW agents were trained for roughly 450,000 games, equivalent to about 4.2 years of gameplay. These examples illustrate the substantial computational demands that often arise in DRL and MDRL.
Recent works have reduced the time needed to learn an Atari game to minutes (Stooke and Abbeel [306] trained DRL agents in under an hour using hardware with 8 GPUs and 40 cores). However, this remains the exception rather than the norm, and computational infrastructure is a major bottleneck for advancing DRL and MDRL, especially for those without access to large compute resources (e.g., most companies and most academic research groups) [29, 286]. In this context, we suggest two ways to address this limitation. First, raise awareness: for new algorithms, we encourage authors to report computational details such as CPU/GPU usage, memory demands, and wall-clock time. Second, value algorithmic contributions in MDRL over results obtained primarily through large-scale computation; for this to succeed, it must be supported by the reviewing process.
While we have emphasized raising awareness of computational demands and reporting them, an open question remains: what exactly should be measured and reported? Several dimensions matter: sample efficiency can be measured by the number of state-action pairs used in training, and computational efficiency by the number of CPUs/GPUs and the training time. How do we measure the impact of other resources, such as external data sources or annotations? Footnote31 Similarly, should the computational demands of the algorithm itself be separated from those of the environment or simulator? We do not have the answers, but we note that current standard metrics may not be entirely comprehensive.
Finally, we believe that high-compute approaches serve as a frontier for showcasing benchmarks [235, 339]; that is, they demonstrate what results are possible as data and computation are scaled up (e.g., generating 180 years of gameplay data per day using 128,000 CPU cores and 256 GPUs [235]; AlphaStar using 200 years of StarCraft II gameplay [339]). At the same time, algorithmic approaches based on lightweight compute can also contribute significantly to tackling real-world problems more effectively.
Finding the simplest context that exposes an innovative research idea remains challenging, and ignoring this leads to a conflation of fundamental research (working principles in the most abstract setting) with applied research (working systems that are as complete as possible).
4.5 Open questions
4.5 开放性问题
Here, we present a series of open questions related to MDRL and suggest approaches to address them. We believe that the earlier literature contains valuable ideas and refer the reader to Section 4.1 for further insights to avoid deep learning amnesia.
On the challenge of sparse and delayed rewards.
Recent MDRL competitions and environments have complex scenarios where many actions are taken before a reward signal is available (see Sect. 4.3). This sparseness is already a challenge for RL [89, 315], where approaches such as count-based exploration/intrinsic motivation [27, 30, 47, 279, 307] and hierarchical learning [87, 178, 278] have been proposed to address it. In MDRL this is even more problematic, since the agents not only need to learn basic behaviors (as in DRL) but also the strategic element (e.g., competitive/collaborative) embedded in the multiagent setting. To address this issue, recent MDRL approaches apply dense rewards [176, 212, 231] (a concept originating in RL) at each step to allow the agents to learn basic motor skills, and then decrease these dense rewards over time in favor of the environmental reward [24], see Sect. 3.3. Recent works like OpenAI Five [235] use hand-crafted intermediate rewards to accelerate learning, while FTW [156] lets the agents learn their internal rewards through a hierarchical two-tier optimization. In single-agent domains, RUDDER [12] has recently been proposed for such delayed, sparse reward problems: it generates a new MDP with more intermediate rewards whose optimal solution is still an optimal solution to the original MDP. This is achieved by using LSTM networks to redistribute the original sparse reward to earlier state-action pairs, automatically providing reward shaping. How best to extend RUDDER to multiagent domains is an open avenue of research.
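One simple way to write the annealed dense-reward schedule described above (our notation, a schematic rather than any specific cited formulation): the training reward mixes a dense shaping term with the environmental reward, with the shaping weight decaying toward zero over training.

```latex
% Annealed dense reward (schematic): r^{dense} helps learn basic skills
% early on, and its weight \lambda_t decays toward 0 so that the
% environmental reward r^{env} dominates later in training.
r_t \;=\; r^{\mathrm{env}}_t \;+\; \lambda_t\, r^{\mathrm{dense}}_t,
\qquad \lambda_t \downarrow 0 \ \text{as}\ t \to \infty
```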
On the role of self-play.
Self-play is a cornerstone of MAL, with impressive results documented in multiple studies [42, 45, 71, 113, 149]. Although significant results have also been achieved in MDRL [43, 136], recent research highlights that plain self-play does not yield the best performance; adding diversity, i.e., through evolutionary methods [20, 85, 185, 271] or sampling-based methods [24, 156, 187], has shown better results. One drawback of these techniques is their increased computational demand, since they require either parallel training (more computation) or substantial memory. An open question is whether the computational efficiency of these established methods can be improved: can similar training stability be achieved with smaller populations and fewer CPU workers in both MAL and MDRL settings? These questions relate to the practical challenges discussed in Sect. 4.4 and to open problems raised previously, e.g., in Albrecht et al. [6], Sect. 5.5.
On the challenges of the combinatorial nature of MDRL.
Monte Carlo tree search (MCTS) [51] has been the backbone of the major breakthroughs behind AlphaGo [291] and AlphaGo Zero [293] that combined search and DRL. A recent work [340] has outlined how search and RL can be better combined for potentially new methods. However, for multiagent scenarios, there is an additional challenge of the exponential growth of all the agents’ action spaces for centralized methods [169]. One way to tackle this challenge within multiagent scenarios is the use of search parallelization [35, 171]. Given more scalable planners, there is room for research in combining these techniques in MDRL settings.
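For reference, the tree policy in standard MCTS variants such as UCT selects child nodes with an upper-confidence rule of the following form (standard formulation, not specific to any work cited above):

```latex
% UCT selection rule used in standard MCTS: \bar{X}_j is the mean return
% of child j, n_j its visit count, n the parent's visit count, and c an
% exploration constant balancing exploitation and exploration.
a^{*} \;=\; \arg\max_{j} \left( \bar{X}_{j} \;+\; c \sqrt{\frac{\ln n}{n_{j}}} \right)
```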
Learning complex multiagent interactions often requires some type of abstraction [84], for example, the exploitation of factored value functions [8], among others (see VDN and QMIX in Sect. 3.5 for recent MDRL developments). Open challenges remain in this area, including a better understanding of their representational power [64] (e.g., the accuracy of the learned Q-function approximations) and how to learn such factorizations, where techniques from transfer planning could be useful. In transfer planning, a simpler source problem is defined (e.g., one with fewer agents), and agents plan or learn in that simpler setting; because the source problem is less complex than the original multiagent problem, it also alleviates issues such as environmental non-stationarity. Another direction is influence abstraction, which aims to build simplified models based on the extent to which agents can influence each other. Although this approach is not yet well explored in complex multiagent settings, there is evidence that such local abstractions can yield better inductive biases and make DRL methods more effective [309].
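As a concrete example of value-function factorization (standard formulations of the two methods named above): VDN represents the joint action value as a sum of per-agent utilities, while QMIX mixes them monotonically so that a per-agent argmax remains consistent with the joint argmax.

```latex
% VDN: additive decomposition of the joint action value over per-agent
% utilities Q_i computed from each agent's local history \tau_i.
Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{a}) \;=\; \sum_{i=1}^{N} Q_i(\tau_i, a_i)

% QMIX: monotonic mixing of per-agent utilities conditioned on the global
% state s, enforcing \partial Q_{tot} / \partial Q_i \geq 0 for all i so
% that the argmax over Q_{tot} decomposes into per-agent argmaxes.
Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{a}) \;=\;
f_{\mathrm{mix}}\big(Q_1(\tau_1, a_1), \ldots, Q_N(\tau_N, a_N); \, s\big),
\qquad \frac{\partial Q_{\mathrm{tot}}}{\partial Q_i} \;\geq\; 0
```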
5 Conclusions
5 结论
In recent years, remarkable progress has been made on many fronts (e.g., [6, 7, 8]).
This survey provides broad coverage of recent work in the emerging area of multiagent deep reinforcement learning (MDRL). First, we categorized recent work into four main areas: emergent behaviors, learning communication, learning cooperation, and agents modeling agents. Then we illustrated, with examples, how key components originating in RL and MAL (e.g., experience replay and difference rewards) have been adapted to MDRL. We summarized lessons learned that are applicable to MDRL, pointed to recent multiagent benchmarks, and highlighted open research questions. Finally, we also reflected on practical challenges, such as computational demands and reproducibility.
Our conclusion is that, while both DRL and MDRL have achieved notable successes and represent important milestones in AI, open questions remain in both areas. We also identified challenges that hinder the scientific progress of MDRL: high computational demands, complicated reproducibility (e.g., hyperparameter tuning), and the lack of incentives for publishing negative results. Nevertheless, we remain highly optimistic about multiagent learning, and we hope that this work, by proposing directions and pointing to existing literature and resources, helps move the field in the right direction.
Notes 笔记
We have observed discrepancies in the use of abbreviations, including D-MARL, MADRL, deep-multiagent RL and MA-DRL.
A Partially Observable Markov Decision Process (POMDP) [14, 63] formally models environments where the agent lacks visibility into the true system state and instead acquires observations (generated from the underlying system state).
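For completeness, a POMDP is usually written as a tuple of the following form (standard definition, our notation):

```latex
% Standard POMDP tuple: states S, actions A, transition function T,
% reward function R, observations \Omega, observation function O,
% and discount factor \gamma.
\langle S, A, T, R, \Omega, O, \gamma \rangle, \qquad
T(s' \mid s, a), \quad O(o \mid s', a), \quad R(s, a) \in \mathbb{R}
```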
Action-dependent baselines have been proposed [117, 202]; however, a recent study by Tucker et al. [331] revealed that in many works the superior performance was due to subtle implementation issues rather than the proposed method itself.
Prior to the introduction of DQN, a variety of methods employed neural networks to represent the Q-value function [74], such as Neural Fitted Q-learning [268] and NEAT+Q [351].
Double Q-learning originally proposed keeping two Q-functions (estimators) to reduce the overestimation bias in RL while preserving convergence guarantees; it was later extended to DRL as Double DQN (see Sect. 4.1).
In this case, each agent independently executes its own policy; this does not hold in some settings, for example, when agents use a coordinated exploration strategy.
Based on the principle of regret minimization, counterfactual regret minimization is a technique for solving large games; owing to the well-established link between regret and Nash equilibrium [39], it has become a cornerstone for efficiently solving extensive-form games.
This algorithm is related to CFR-BR.
TFT originated in an iterated prisoner's dilemma tournament and later served as a foundation for other strategies in MAL [258]; one of its generalizations is the Godfather strategy [201].
The average strategy profile of fictitious players converges to a Nash equilibrium in certain classes of games, for example, two-player zero-sum games and potential games.
The vocabulary used by the agents was arbitrary and had no initial semantic meaning; to understand its emerging semantics, the authors examined the relationship between symbols and the sets of images they referred to.
There is extensive research on coordinating multiagent teams by specifying communication protocols: these require the agents to know both the team's goal and the tasks needed to accomplish it.
Each player's skill level is modeled as a normal distribution; after every game, both players' skill distributions are updated based on a measure of "surprise". For example, if a player predicted to be weaker defeats a player predicted to be stronger, the weaker-rated player's ranking increases substantially.
A Nash equilibrium [229] is a solution concept in game theory in which no agent would choose to deviate from its strategy, since that strategy is a best response to the other agents' strategies. It was explored in seminal MAL algorithms such as Nash-Q learning [149] and Minimax-Q learning [198, 199].
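Formally (standard definition, our notation), a strategy profile (π₁*, ..., π_N*) is a Nash equilibrium if no agent can improve its expected utility by unilaterally deviating:

```latex
% Nash equilibrium (standard definition): for every agent i and every
% alternative strategy \pi_i, deviating unilaterally does not help.
u_i\big(\pi_i^{*}, \pi_{-i}^{*}\big) \;\geq\; u_i\big(\pi_i, \pi_{-i}^{*}\big)
\qquad \forall\, i, \;\forall\, \pi_i
```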
Johanson et al. also observed overfitting when solving large extensive-form games (e.g., poker): performance improved in the abstracted game but was worse in the real game.
Bayesian policy reuse assumes an agent has prior experience in the form of a library of stored policies. When a new task instance arises, the goal is to select the best stored policy based on observed signals that correlate with policy performance [272].
Centralized planning with decentralized execution is also a standard paradigm in multiagent planning [239].
GitHub - gjp1203/nui_in_madrl: Negative Update Intervals in Multi-Agent Deep Reinforcement Learning.
GitHub - oxwhirl/smac: The StarCraft Multi-Agent Challenge (SMAC).
GitHub - oxwhirl/pymarl: Python Multi-Agent Reinforcement Learning framework.
GitHub - crowdAI/maro...: initial release (v1.0) for the MARLÖ competition.
GitHub - google-deepmind/hanabi-learning-environment: hanabi_learning_environment is a research platform for Hanabi experiments.
GitHub - YuhangSong/Arena-BuildingToolkit: a building toolkit and evaluation platform for single- and multi-agent intelligence (AAAI 2020).
dm_control/dm_control/locomotion/soccer at main · google-deepmind/dm_control · GitHub.
GitHub - openai/neural-mmo: Repository accompanying 'Neural MMO: A Massively Multiagent Game Setting for Training and Evaluation of Intelligent Agents'
This idea originated in a NeurIPS 2018 workshop, "Critiquing and Correcting Trends in Machine Learning", which accepted papers with negative results: "Papers which show failure modes of existing algorithms or suggest new approaches which one might expect to perform well but which do not. The aim is to provide a venue for work which might otherwise go unpublished but which is still of interest to the community." https://ml-critique-correct.github.io/
The meaning of "frame" is sometimes unclear in the literature because of the "frame skip" technique; it is therefore suggested to distinguish between "game frames" and "training frames" [310].
One recent effort by Beeching et al. [29] advocates using mid-range hardware (eight CPUs and a single GPU) to train deep reinforcement learning agents.
NeurIPS 2019 organizes the "MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors", whose primary objective is to encourage the creation of algorithms that make effective use of human demonstrations in order to significantly decrease the number of samples needed to address complex hierarchical environments characterized by sparsity [125]. The competition is organized by NeurIPS 2019. The main goal of this competition is to promote the development of algorithms that can create effective utilization of human demonstrations in order to significantly reduce the number of samples required for solving complex hierarchical environments with sparse characteristics [ 125].
Cuccu, Togelius, and Cudré-Mauroux achieved state-of-the-art policy learning in Atari games with as few as 6 to 18 neurons [75]. The main idea was to decouple image processing from decision-making.
