About communication in Multi-Agent Reinforcement Learning
Communication plays a key role in Multi-Agent Reinforcement Learning (MARL) and is an active field of research. It can significantly affect the final performance of agents, since it shapes their ability to coordinate and negotiate directly. Effective communication is therefore crucial for successful interaction, addressing challenges such as cooperation, coordination, and negotiation among multiple agents.
A substantial portion of research on multi-agent systems has focused on an agent's communication requirements: what information to transmit, when to send it, and to whom it should be directed. This has led to strategies tailored to particular applications. Some protocols, such as "cheap talk", can be characterized as "talking before acting", where speech precedes any subsequent action. Other approaches adopt a "talking after acting" framework: for instance, an agent with incomplete information may convey that lack of information through its actions alone, on the premise that silence or omission can speak louder than words.
This post investigates how different approaches to learning communication protocols with deep neural networks can advance our understanding, by analyzing three research papers: one serving as a foundational benchmark, and two presented at the ICML 2019 conference.
1 ) [Learning to Communicate with Deep Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/1605.06676.pdf)
Messages are characterized as communication protocols and integrated into a Q-learning framework, where they are trained to influence action selection. Through a discretise/regularise unit (DRU), communication can be shared and trained effectively across agents, enriching the training signal.
2 ) Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning
It introduces innovations in decentralized learning, including new reward functions, metrics, and topologies, which are used to address the communication challenge within a decentralized system.
3 ) TarMAC: Targeted Multi-Agent Communication
The paper investigates the advantages of targeted and multi-stage communication. It proposes a way to establish effective collaboration in complex settings using a customized attention mechanism.
A central concept across these papers is the balance between centralized and decentralized approaches to learning and execution. The common pattern is to combine centralized learning, characterized by parameter sharing, with decentralized execution, where each agent acts independently on its own outputs, although the exact combination varies with each paper and experimental setup.
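As a rough, hypothetical illustration of this pattern (not code from any of the three papers), the sketch below shares a single policy network across agents during learning, while each agent executes on its own local observation; all sizes and names are invented for the example.

```python
import torch
import torch.nn as nn

# A single network whose parameters are shared by every agent (centralized learning).
shared_policy = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),   # 16-dim observation (illustrative size)
    nn.Linear(64, 5),               # 5 environment actions (illustrative size)
)

def act(observation: torch.Tensor) -> int:
    # Decentralized execution: each agent conditions only on its own observation,
    # even though the weights were learned jointly.
    with torch.no_grad():
        return shared_policy(observation).argmax().item()

# Each agent calls `act` on its own local observation.
observations = [torch.randn(16) for _ in range(3)]   # 3 agents
actions = [act(obs) for obs in observations]
```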
The 2016 paper "Learning to Communicate with Deep Multi-Agent RL" provides foundational insights into the application of neural networks to multi-agent reinforcement learning systems, so we start there.
1 ) Learning to Communicate with Deep Multi-Agent Reinforcement Learning
This paper is a significant step toward letting agents use machine learning to automatically discover communication protocols in cooperative setups. It explores what deep learning can offer for this purpose, with a thorough examination of deep neural networks in multi-agent systems under partial observability. Two approaches are investigated: Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL), which differ mainly in how gradients flow during learning and thereby advance differentiable communication. RIAL is end-to-end trainable within each individual agent, whereas DIAL is end-to-end trainable across agent boundaries.

RIAL vs DIAL: trainable per agent vs across all agents, within a multi-agent framework that incorporates the DRU (Discretise/Regularise Unit).
RIAL: at time step t, an agent receives an observation o_t from its environment together with m_{t-1}, the communication message received in the previous time step. These inputs are processed by a Q-Net and then fed into an action-selection mechanism that determines both the environment action and the message sent to the other agent. The selected message reaches Agent 2 at time step t+1, along with Agent 2's new observation from the environment. This sequence of events is repeated at every subsequent time step.
- RIAL. The scheme covers the communication and action selection of multiple agents over the consecutive time steps t-1 and t. RIAL is a communication approach based on deep recurrent reinforcement learning that adds, on top of the environment actions, a message-passing mechanism that is independent of the environment. Its core design splits the action-value function into two parts: Q_u for environment actions and Q_m for communication (message) actions.
Each learning loop spans two consecutive time steps, and gradients are propagated only within a single agent's Q-network. RIAL relies on parameter sharing to minimize the number of trainable parameters. Nevertheless, in RIAL agents receive no feedback from one another about their communication actions.
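To make the RIAL flow above more concrete, here is a minimal sketch of a RIAL-style agent step, with the Q-network split into Q_u and Q_m. The class name, layer choices, and sizes are my own illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RIALAgent(nn.Module):
    """Minimal RIAL-style agent: one recurrent Q-network whose output is split
    between environment actions (Q_u) and communication actions (Q_m).
    Sizes are illustrative, not those used in the paper."""

    def __init__(self, obs_dim=16, n_env_actions=5, n_messages=2, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim + n_messages, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.q_env = nn.Linear(hidden, n_env_actions)   # Q_u: environment actions
        self.q_msg = nn.Linear(hidden, n_messages)      # Q_m: communication actions

    def forward(self, obs, prev_msg, h):
        # The observation o_t and the previous message m_{t-1} are processed together.
        x = torch.relu(self.encoder(torch.cat([obs, prev_msg], dim=-1)))
        h = self.rnn(x, h)
        return self.q_env(h), self.q_msg(h), h

agent = RIALAgent()
h = torch.zeros(1, 64)
obs, prev_msg = torch.randn(1, 16), torch.zeros(1, 2)
q_u, q_m, h = agent(obs, prev_msg, h)
env_action = q_u.argmax(dim=-1)   # acts on the environment
message = q_m.argmax(dim=-1)      # discrete message sent to the other agent
```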
DIAL originates from RIAL but differs in how gradients flow: they are propagated through the communication channel itself.

At time step t, the C-Net of Agent 1 outputs both Q-values for the environment action and a message m. Here lies the distinction: instead of being fed to the action-selection mechanism, the message is passed to a discretise/regularise unit, DRU(m), which combines centralized learning (with regularization) with decentralized execution. The result is a setting where multiple agents can learn collaboratively while independently executing their own actions.
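Below is a minimal sketch of a DRU, following the description in the DIAL paper: during centralized learning the message is regularized with Gaussian noise and a sigmoid so it stays differentiable, while during decentralized execution it is discretized to a hard bit. The function name and the noise level `sigma` are illustrative choices.

```python
import torch

def dru(message_logit: torch.Tensor, sigma: float = 2.0, training: bool = True):
    """Sketch of a Discretise/Regularise Unit (DRU).

    Training (centralized learning): the message stays continuous; Gaussian noise
    plus a sigmoid keeps the channel differentiable, so gradients can flow from the
    receiver back into the sender.
    Execution (decentralized): the message is discretized before being sent.
    """
    if training:
        return torch.sigmoid(message_logit + sigma * torch.randn_like(message_logit))
    return (message_logit > 0).float()

m = torch.tensor([0.3, -1.2])
print(dru(m, training=True))    # noisy, continuous, differentiable message
print(dru(m, training=False))   # discrete message actually sent at execution time
```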
2 ) Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning
The paper presents an innovation in decentralized learning, in contrast with the decentralized execution discussed earlier. The approach gives agents the ability to earn an intrinsic reward for influencing the actions of other agents, and it considers the counterfactual alternatives once an event has already taken place: actions that could have been taken are rewarded when they would have produced better outcomes. In this setting, the impact of communication on multi-agent Markov decision processes (MA-MDPs) is direct and significant. Finally, at a higher level of abstraction, the paper explores how influence plays a role in coordination and communication.
The approach goes beyond the conventional "doing by talking" and "talking by doing" frameworks. It aims at guessing, based on observing others' current or potential actions, and, in its ultimate form, at inferring outcomes from the analysis of observed behavior.
In this scenario, each agent is equipped with a trained neural network that serves as a model of other agents (MOA), used in either competitive or collaborative contexts. The actions of all agents are aggregated, and each agent computes its individual reward based on those collective actions.
The study is organized into three categories of experiments: basic influence, influential communication, and modeling the behavior of other agents. These give rise to various experimental setups in two environments: Cleanup and Harvest.
2A ) Basic Social Influence
In an initial baseline influence experiment, an A3C agent is benchmarked against a version of the same agent augmented with the influence setup. The experiments show significant improvements when an enhanced reward function that combines the influence reward with the environmental reward is used. Specifically, a new probability distribution over the other agent's possible actions is derived by exploring counterfactual actions. This approach uses centralized training and assumes that influence flows in only one direction.
Purple influencer/speaker vs. yellow influenced/listener agents: behavior is conditioned on the social influence reward only when apples are present, and the yellow agents show a strong behavioral response.
KEY : Basic influence combines extrinsic/environmental rewards with causally mediated (influence) rewards.
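As a rough illustration of how such a causally mediated reward could be computed from counterfactuals, the sketch below estimates the divergence between the listener's policy conditioned on the influencer's actual action and its marginal policy obtained by averaging over the influencer's counterfactual actions. The function name, the toy probabilities, and the weighting of the total reward are assumptions for illustration, not the paper's code.

```python
import numpy as np

def influence_reward(p_j_given_ak: np.ndarray, p_ak: np.ndarray, taken_ak: int) -> float:
    """Illustrative causal influence reward of influencer k on listener j.

    p_j_given_ak[a_k, a_j] : j's action distribution conditioned on each possible
                             (counterfactual) action of k; assumed strictly positive.
    p_ak[a_k]              : k's own action distribution.
    taken_ak               : the action k actually took.
    """
    conditional = p_j_given_ak[taken_ak]                   # p(a_j | a_k, s)
    marginal = (p_ak[:, None] * p_j_given_ak).sum(axis=0)  # p(a_j | s), via counterfactuals
    # KL divergence between the conditional and marginal policies of the listener.
    return float(np.sum(conditional * np.log(conditional / marginal)))

# Toy example: 2 counterfactual actions for k, 3 possible actions for j.
p_j_given_ak = np.array([[0.7, 0.2, 0.1],
                         [0.2, 0.3, 0.5]])
p_ak = np.array([0.5, 0.5])
r_influence = influence_reward(p_j_given_ak, p_ak, taken_ak=0)
# The immediate reward could then be r_env + alpha * r_influence (alpha is a weight).
```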
2B ) Influential Communication
Building on the baseline social influence experiments, a message, a discrete communication symbol, is introduced so that the relevant policies can be learned. This influential communication protocol operates at several hierarchical levels.
On one hand, two distinct heads are trained with separate policy-value pairs: one is the main head used to act in the environment, while the second head provides a separate policy used to generate the communication symbols.

In the influential communication setting, the topology is designed so that the environment observation and the discrete message vectors of all agents are processed in sequence.
In influential communication, the state is fed into a convolutional layer followed by two fully connected layers.
A final LSTM cell completes the model by integrating the communication information together with the information from the previous time step.
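A rough sketch of such a topology is shown below: a convolutional layer, two fully connected layers, an LSTM that also consumes the previous messages, and two policy/value heads. All layer sizes, the class name, and the input shapes are my own illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class InfluenceCommAgent(nn.Module):
    """Sketch of the influential-communication topology described above:
    conv -> two fully connected layers -> LSTM (also fed the previous messages)
    -> two policy/value heads (environment and communication)."""

    def __init__(self, n_env_actions=8, n_symbols=4, hidden=128):
        super().__init__()
        self.conv = nn.Conv2d(3, 6, kernel_size=3, padding=1)
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(6 * 15 * 15, hidden),
                                nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU())
        self.lstm = nn.LSTMCell(hidden + n_symbols, hidden)
        self.pi_env, self.v_env = nn.Linear(hidden, n_env_actions), nn.Linear(hidden, 1)
        self.pi_msg, self.v_msg = nn.Linear(hidden, n_symbols), nn.Linear(hidden, 1)

    def forward(self, frame, prev_messages, state):
        x = self.fc(torch.relu(self.conv(frame)))
        h, c = self.lstm(torch.cat([x, prev_messages], dim=-1), state)
        return (self.pi_env(h), self.v_env(h)), (self.pi_msg(h), self.v_msg(h)), (h, c)

agent = InfluenceCommAgent()
frame = torch.randn(1, 3, 15, 15)            # 15x15 RGB observation (illustrative)
prev_messages = torch.zeros(1, 4)            # summary of the other agents' symbols
state = (torch.zeros(1, 128), torch.zeros(1, 128))
env_head, comm_head, state = agent(frame, prev_messages, state)
```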
The communication value and policy, **V_m** and π_m, are trained with an immediate reward that sums the environmental reward and the causal influence reward.
KEY : In influential communication, two distinct policies are optimized: one for acting in the environment and another for refining the communication protocol.
An obvious way to assess effective communication would be to measure improved performance through task reward, and at a high level that holds. However, the paper also introduces new metrics for influential communication, aimed at analyzing communication behavior and evaluating its quality:
- Speaker consistency [0, 1]: the probability, or degree of confidence, that the speaker emits a particular symbol when it takes a particular action. It measures how close the correspondence between the speaker's actions and its messages is to 1:1, by evaluating the entropy of the action given the message and of the message given the action.
- Instantaneous coordination (IC): measures the degree of coordination between agents achieved through communication. It is computed at two levels (a toy estimator is sketched after this list):
- Symbol/action IC measures the mutual information between the speaker's message and the listener's next action; it peaks when the listener adjusts its behavior based on the speaker's message.
- Action/action IC measures the mutual information between the speaker's own action and the listener's next action.
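As a toy illustration of these mutual-information style metrics, the sketch below estimates the empirical mutual information between two discrete sequences, for example the speaker's symbols and the listener's next actions. The trajectories and the function name are invented for the example, not taken from the paper.

```python
import numpy as np

def mutual_information(xs: np.ndarray, ys: np.ndarray) -> float:
    """Empirical mutual information between two discrete sequences.
    xs could be the speaker's symbols (or actions), ys the listener's next actions."""
    joint = np.zeros((xs.max() + 1, ys.max() + 1))
    for x, y in zip(xs, ys):
        joint[x, y] += 1                      # count co-occurrences
    joint /= joint.sum()                      # joint distribution
    px = joint.sum(axis=1, keepdims=True)     # marginal over xs
    py = joint.sum(axis=0, keepdims=True)     # marginal over ys
    nonzero = joint > 0
    return float((joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])).sum())

# Toy trajectories: symbols emitted by the speaker and the listener's next actions.
symbols = np.array([0, 1, 0, 1, 1, 0])
next_actions = np.array([2, 0, 2, 0, 0, 2])
print(mutual_information(symbols, next_actions))   # high value -> strong coordination
```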
Here are some bullet points and lessons learned:
The impact of influence is limited in time: it matters only at specific moments.
Listeners attend to speakers only when doing so is beneficial to them.
The most responsive (most influenced) agents obtain higher individual environmental rewards.
Measurements across several experiments show that listeners tune their attention to speakers only when it benefits them, that the agents most susceptible to influence tend to obtain higher individual environmental rewards, and that any communication mechanism must therefore carry information that helps listeners maximize their own environmental rewards.
2C ) Influential Communication vs Model of Other Agents
The MOA introduces a novel topology designed to make multi-agent interaction more efficient. The innovation is to equip each agent with its own internal model of the other agents' behavior, removing the dependence on centralized learning. A series of layers positioned after the convolutional stage predicts the other agents' next actions from the previous state and actions.
Once trained, it can be used to compute the social influence reward.

The MOA approach learns two components: a policy, and a supervised model that predicts the next actions of the other agents.
As environments become more complex and communication messages vary, the neural network topology might need to change; in this case, however, it remains unchanged.
KEY : Two neural networks are used: one computes the environmental policy, and the other a probabilistic model of the other agents' actions.
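To ground this, here is a hypothetical sketch of an MOA-style head: a small network that, given the agent's encoded observation and the other agents' previous actions, outputs a distribution over each other agent's next action. The class name, sizes, and inputs are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ModelOfOtherAgents(nn.Module):
    """Sketch of a Model of Other Agents (MOA) head: given the agent's encoded
    observation and the other agents' previous actions, predict their next actions.
    Trained with a supervised cross-entropy loss; layer sizes are illustrative."""

    def __init__(self, feat_dim=128, n_agents=4, n_actions=8, hidden=64):
        super().__init__()
        self.n_agents, self.n_actions = n_agents, n_actions
        self.net = nn.Sequential(
            nn.Linear(feat_dim + n_agents * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_agents * n_actions),
        )

    def forward(self, features, prev_actions_onehot):
        logits = self.net(torch.cat([features, prev_actions_onehot], dim=-1))
        return logits.view(-1, self.n_agents, self.n_actions)  # one distribution per agent

moa = ModelOfOtherAgents()
features = torch.randn(1, 128)                       # output of the conv encoder
prev_actions = torch.zeros(1, 4 * 8)                 # one-hot previous actions of others
predicted = moa(features, prev_actions).softmax(-1)  # p(next action) per other agent
# A supervised loss compares `predicted` with the actions actually taken; once trained,
# the model can be used to evaluate counterfactuals for the social influence reward.
```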
This paper stands out, offering not only innovative ideas but also significant contributions through its communication framework and the evolution of its experiments, which give deeper insight into the research methodology. The experimental results show that this communication protocol achieves better performance.
3 ) TarMAC : Targeted Multi-Agent Communication
In this paper, a cooperative multi-agent environment is established in which a highly efficient communication protocol is essential. By learning targeted communication through deep reinforcement learning, the agents develop strategies for precise interaction, deciding which messages to send and to whom, thereby improving their ability to collaborate flexibly in complex environments.
With targeted communication, specific messages are sent to particular recipients: agents are trained both to compose messages and to address them appropriately. This behavior is learned implicitly, end to end, from task-specific rewards. The key difference from the previous works is that agents communicate with continuous vectors instead of discrete symbols.

At each timestep, each agent receives as input its observation w_t and an aggregated continuous message c_t, from which it predicts both an environment action and a targeted communication message m_t. The messages produced by the different agents are then combined into a single aggregated message.
Targeted, Multi-Stage Communication

The multi-stage communication protocol introduces an attention mechanism. Each agent produces a message composed of two parts: a signature k, which encodes agent-specific information, and a value v, which contains the actual message content. In addition, a query vector q is predicted from the hidden state.
The query is compared with each signature to produce an attention weight for the corresponding value vector, and the resulting aggregated message is then processed by the receiver.
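The sketch below illustrates this targeted aggregation from the receiver's point of view, with a scaled dot product between the query and the signatures followed by a softmax over senders; the dimensions and the function name are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def aggregate_messages(query, signatures, values):
    """Sketch of targeted aggregation from the receiver's point of view.

    query      : (d_k,)            predicted from the receiver's hidden state
    signatures : (n_agents, d_k)   keys broadcast by the senders
    values     : (n_agents, d_v)   message contents broadcast by the senders
    """
    scores = signatures @ query / signatures.shape[-1] ** 0.5  # similarity per sender
    attention = F.softmax(scores, dim=0)                       # soft targeting weights
    return attention @ values                                  # aggregated input message

d_k, d_v, n_agents = 16, 32, 3
query = torch.randn(d_k)
signatures = torch.randn(n_agents, d_k)
values = torch.randn(n_agents, d_v)
c_t = aggregate_messages(query, signatures, values)   # fed to the receiver's policy
```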
Thanks for reaching this point!
This Medium article was created as part of my ICML 2019 papers list. If you would like me to add more content, or if you plan to attend the conference and want to discuss MARL-related topics, DM me on Twitter: @SoyGema.
