Paper Summary (Reinforcement Learning; Merging; Multi-agent)
1. Cooperation-Aware Reinforcement Learning for Merging in Dense Traffic
_Original link: https://arxiv.org/pdf/1906.11021.pdf_
Scenario: merging into dense traffic, using cooperation-aware reinforcement learning.
Ⅰ Background
Method: POMDP formulation solved with deep Q-learning.
Bellman equation: the Bellman optimality equation for the action-value function (see the sketch below).
Loss function design: squared temporal-difference error, as in standard DQN (see the sketch below).
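Since the paper follows standard DQN (see Section D), these are the generic forms, written out here as a sketch rather than reproduced from the paper's own numbering:

```latex
% Bellman optimality equation for the action-value function
Q^*(s, a) = \mathbb{E}\left[\, r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \right]

% DQN loss: squared TD error against a target network with frozen parameters \theta^-
L(\theta) = \mathbb{E}_{(s, a, r, s')}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]
```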


Ⅱ PROPOSED APPROACH
A. Merging Scenario POMDP
State:
Behavior characteristic (cooperation level): c
Vehicle state:
s^i_t = (x^i_t,v^i_t,a^i_t,c^i_t)
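A minimal sketch of this per-vehicle state as a dataclass (field names and units are my reading of the summary, not the paper's code):

```python
from dataclasses import dataclass


@dataclass
class VehicleState:
    """Per-vehicle state s^i_t = (x^i_t, v^i_t, a^i_t, c^i_t)."""
    x: float  # longitudinal position [m]
    v: float  # velocity [m/s]
    a: float  # acceleration [m/s^2]
    c: float  # cooperation level in [0, 1]; latent for other drivers
```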
Observations:
The ego vehicle has limited sensing capabilities. It can only sense the vehicles within a certain range, and cannot measure internal states of other vehicles.
Observed quantities: the longitudinal positions and velocities of four neighboring vehicles (the ego vehicle's front neighbor, the vehicle immediately behind the merge point, and the rear and front neighbors of the ego vehicle's projection onto the main lane).

The total dimension of the observation space is 15 when the internal states are observed and 12 otherwise.
Actions:
The ego vehicle controls its motion by applying a change in acceleration ∆a, chosen by the agent from the set {−1, −0.5, 0, 0.5, 1} m/s².
In addition there are a hard-braking action (−4 m/s²) and a zero-acceleration action (0 m/s²), so the action space contains seven discrete choices.
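A small sketch of how this seven-action space could be applied to the ego acceleration (my reading of the summary: five incremental changes plus two overrides; names are illustrative, not from the paper's code):

```python
DELTA_A = [-1.0, -0.5, 0.0, 0.5, 1.0]        # changes in acceleration [m/s^2]
HARD_BRAKE = "hard_brake"                     # set acceleration to -4 m/s^2
NO_ACCEL = "no_accel"                         # set acceleration to 0 m/s^2

ACTIONS = DELTA_A + [HARD_BRAKE, NO_ACCEL]    # 7 discrete actions in total


def apply_action(current_accel: float, action) -> float:
    """Return the ego acceleration after applying one discrete action."""
    if action == HARD_BRAKE:
        return -4.0
    if action == NO_ACCEL:
        return 0.0
    return current_accel + action  # incremental change in acceleration
```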
Reward:
−1: a collision occurs
+1: the goal point is reached (50 m past the merging point)
The time minimization is incentivized by the discount factor: the sooner the goal is reached, the less the bonus is discounted.
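A minimal sketch of this sparse reward (time is penalized only implicitly, through discounting):

```python
def reward(collision: bool, reached_goal: bool) -> float:
    """Sparse reward: -1 on collision, +1 on reaching 50 m past the merge point."""
    if collision:
        return -1.0
    if reached_goal:
        return 1.0
    return 0.0
```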
Transition:
The state evolves according to the vehicles' equations of motion.
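A sketch of what "equations of motion" presumably means here: constant-acceleration point-mass kinematics over one time step (the step length dt is an assumption, not stated in the summary):

```python
def step_kinematics(x: float, v: float, a: float, dt: float = 0.1):
    """Propagate longitudinal position and velocity over one time step dt."""
    x_next = x + v * dt + 0.5 * a * dt ** 2
    v_next = max(0.0, v + a * dt)  # vehicles do not drive backwards
    return x_next, v_next
```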

B. Driver Model
Each driver is characterized by a cooperation level c ∈ [0, 1]:
- c = 1 represents a driver who slows down to yield to the merging vehicle if she predicts that the merging vehicle will reach the merge point first.
- c = 0 represents a driver who completely ignores the merging vehicle until it has traversed the merge point, simply following the standard IDM.
C-IDM: the model decides whether the merging vehicle should be taken into account based on the estimated times to reach the merge point (TTM) of the main-lane vehicle (TTMa) and of the merge-lane vehicle (TTMb). Once both TTMs have been estimated, three cases are considered (see the sketch after this list):
- TTMb < c × TTMa: the merge-lane vehicle is predicted to arrive first; the main-lane vehicle follows IDM, treating the merging vehicle's projection onto the main lane as its leading vehicle.
- TTMb ≥ c × TTMa: the main-lane vehicle follows standard IDM.
- When there is no merging vehicle, or the merging vehicle is far away, the C-IDM driver follows standard IDM.
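A sketch of this C-IDM leader-selection rule (function and argument names are illustrative; the actual IDM car-following computation is assumed to exist elsewhere):

```python
def cidm_leader(ttm_main, ttm_merge, c, merging_projection, default_leader):
    """Pick the leading vehicle that the main-lane driver feeds into standard IDM.

    ttm_main  -- TTMa: time for the main-lane vehicle to reach the merge point
    ttm_merge -- TTMb: time for the merge-lane vehicle to reach the merge point
    c         -- cooperation level in [0, 1]
    """
    if merging_projection is None:
        return default_leader       # no merging vehicle nearby: standard IDM
    if ttm_merge < c * ttm_main:
        return merging_projection   # yield: treat the projection as the leader
    return default_leader           # ignore the merging vehicle
```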

C. Inferring the Cooperation Level
Since the cooperation level cannot be directly measured, the ego car maintains a belief on the cooperation level of the observed drivers.
At each time step, the belief is updated using a recursive Bayesian filter given the current observation of the environment.
The physical state is observable while the cooperation level is not; this mixed-observability assumption reduces the computational cost of the belief update.
The agent only maintains a distribution over the cooperation level of observed drivers instead of estimating the full state of the environment.
θ^i_t represents the probability that vehicle i has a cooperation level of 1.

Given the cooperation level, the transition probability from o_t to o_{t+1} is computed by propagating the state forward with the proposed transition model and assuming a Gaussian distribution centered on the predicted value, with a standard deviation of 1 m for position and 1 m/s for velocity. If no noise were considered, the belief would converge to θ^i = 0 or θ^i = 1 after a single observation, since the two models would be perfectly distinguishable.
The focus of this work is not on developing an efficient driver-state predictor, but on how the RL agent can use this information.
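A sketch of one recursive Bayesian update of θ^i_t under these assumptions (Gaussian observation likelihoods with σ = 1 m for position and 1 m/s for velocity; the overall structure is my reading of the description, not the paper's code):

```python
from scipy.stats import norm


def update_belief(theta, obs_pos, obs_vel, pred_c1, pred_c0):
    """Update theta = P(c = 1) for one driver from a (position, velocity) observation.

    pred_c1 / pred_c0 are (position, velocity) predictions obtained by
    propagating the driver one step forward with C-IDM under c = 1 and c = 0.
    """
    def likelihood(pred):
        return (norm.pdf(obs_pos, loc=pred[0], scale=1.0) *   # sigma = 1 m
                norm.pdf(obs_vel, loc=pred[1], scale=1.0))    # sigma = 1 m/s

    num = theta * likelihood(pred_c1)
    den = num + (1.0 - theta) * likelihood(pred_c0)
    return num / den if den > 0.0 else theta
```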
D. Belief State Reinforcement Learning
Although memoryless policies can be efficient, reasoning about latent states can often lead to a significant improvement.
It is expected that the ego vehicle will only try to merge in front of cooperative drivers.
In order to learn such a behavior through reinforcement learning, we propose to use the belief state as input to the reinforcement learning policy.
A transition in the belief state MDP can be described as follows:
- At time step t, the agent has a belief b_{t−1} and receives an observation o_t.
- The new belief b_t is computed using eq. (7).
- The agent takes action a_t = arg max_a Q(b_t, a).
The rest of the algorithm is identical to standard DQN.
The input to the network is a vector of dimension 15: b_t = [o_t, θ^1_t, . . . , θ^n_t].
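A small sketch of this belief-state step (the q_network callable and the exact observation layout are assumptions; only the concatenation and the greedy action selection follow the description above):

```python
import numpy as np


def belief_state_step(q_network, obs, thetas):
    """Build b_t = [o_t, theta^1_t, ..., theta^n_t] and pick a_t = argmax_a Q(b_t, a)."""
    b_t = np.concatenate([np.asarray(obs), np.asarray(thetas)])  # network input
    q_values = q_network(b_t)           # Q(b_t, a) for every discrete action
    a_t = int(np.argmax(q_values))      # greedy action selection
    return b_t, a_t
```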
