
A Casual Overview of Reinforcement Learning


[update 20200712]

The "best reference" resource, from OpenAI's official platform: [spinningup](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html)


Plan

  1. Finish Hung-yi Lee's (李宏毅) videos on reinforcement learning.
  2. Implement things step by step from scratch, following OpenAI's recommendations.
  3. Build a solid deep learning foundation while getting comfortable with both PyTorch and TensorFlow.
  4. Keep an eye on frontier research areas whenever possible.

Reinforcement Learning Overview

This overview is largely derived from the article: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc.

On-Policy vs Off-Policy

After watching Hung-yi Lee's deep RL videos, my understanding is that under TD-based learning, the relationship between the replay buffer and on/off-policy shows up in the distribution of experiences. The replay buffer stores tuples of experience rather than trajectories, so its contents do not correspond directly to any particular policy. However, the distribution of experiences stored in the buffer differs from the distribution of data the current policy would generate, and experiences are usually sampled uniformly, so the sampled data cannot be attributed to the current policy. For MC, i.e. methods based on complete trajectories, using trajectories generated by a policy π to train a new policy π' is more a matter of off-policy evaluation than of on-policy updating.
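To make the "experience tuple" point concrete, here is a minimal replay-buffer sketch (illustrative names, not from any specific library): it stores transitions rather than trajectories and samples them uniformly, which is exactly why the sampled batch does not follow the current policy's data distribution.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal sketch: stores individual transitions, not trajectories."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Each entry is an experience tuple, independent of which policy produced it.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling: old and new experiences are treated alike,
        # so the batch does not match the current policy's distribution.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```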

Why is Q-Learning considered a form of off-policy learning?

The core question: when updating Q, check whether the policy that Q is being trained to evaluate is the same as the behavior policy that actually interacts with the environment. In SARSA it is; in Q-learning, the update targets the greedy policy π* rather than the current π. Concretely: is the a' in Q(s', a') produced by the current actor at s', or is it an approximation like the max in Q-learning? When a replay buffer is used, or when a' comes from a target actor, the method is off-policy; otherwise it is on-policy. That was my understanding after browsing a lot of material.

[update 0413] See the figure below; I now have a new understanding: remember that the Q-function is TD-based, so successive actions have an order. In other words, when training Q we need to know which Q(?) differs from the current Q by exactly r. If that '?' matches the action the current policy would output, we are training Q to follow the current policy, so it is on-policy. Otherwise, as in Q-learning, the current policy picks an epsilon-greedy action, but the Q update assumes the next step is fully greedy; Q then no longer matches the current policy, so it is off-policy.

When a replay buffer is involved, the a' it provides was produced by some past policy and may differ from what the current actor would output at s'. In that case Q is not being trained to be the evaluation function of the current policy, and it cannot become one.

Summary: what happens when we train Q determines whether the method is on-policy or off-policy. In short, check whether the a' in Q(s', a') matches the action the current actor would suggest. In other words: we train the Q network to be a value function, and the question is which policy it evaluates.
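Written out, the two TD targets behind this distinction are the standard textbook update rules (added here for reference, not from the original notes):

$$\text{SARSA (on-policy):}\qquad Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma\, Q(s',a') - Q(s,a)\right],\quad a' \sim \text{current }\varepsilon\text{-greedy policy}$$

$$\text{Q-learning (off-policy):}\qquad Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$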

The next question: how do we choose between the two? See the link below, in particular the discussion of when the agent actually has to "take actions". In short, Q-learning, being off-policy, can learn the optimal policy directly, but it may suffer from instability and convergence problems. SARSA is more conservative, which makes it preferable when mistakes during training are costly.

https://stats.stackexchange.com/questions/326788/when-to-choose-sarsa-vs-q-learning

That last question made me realize that TD methods have an inherent temporal ordering: for a given state, taking a then a' then a'' is not supposed to yield the same Q(s, a) as taking a then a'' then a'. The essence of this idea, its theoretical justification and its limitations still need further study.

On-policy vs. Off-policy

An on-policy learning agent learns the value function using its current action a, derived from the current policy. An off-policy learning agent learns it using an action a* obtained from another policy.

Q-learning is off-policy because it updates its Q-values using the Q-value of the next state s′ and the greedy action a′. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy is followed, even though it is not actually following one.

Why is SARSA considered on-policy? Because it updates its Q-values using the Q-value of the next state s′ and the action chosen by the current policy. It estimates the return for each state-action pair assuming the current policy continues to be followed.

The distinction vanishes when the current policy is greedy. However, such an agent would not be effective, because it never explores.

Policy Optimization (Policy Iteration)

f(state) -> action.

What action to take now?

  1. Policy Gradient. Think of playing a game 10000 times and measuring the expected reward of f(s); then you can do gradient ascent on it (a minimal sketch follows this list).
  2. Trust Region Policy Optimization (TRPO). An on-policy method that takes a big PG step, but not too big: the step stays within the trust region. This is ensured by a constraint on the difference between the current behavior (not the parameters) and the behavior of the target network. Both continuous and discrete action spaces are supported.
  3. Proximal Policy Optimization (PPO). Like TRPO it is an on-policy method, but it treats that constraint as a penalty (regularization). Popularized by OpenAI. It does PG while adjusting the gradients smartly to avoid performance issues and instability. It is easier to implement and solve than TRPO with similar performance, so it is usually preferred over TRPO.
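For item 1, here is a minimal REINFORCE-style sketch in PyTorch (network sizes, hyperparameters and function names are illustrative assumptions, not from the original notes): play episodes, then do gradient ascent on the log-probability of the taken actions weighted by their returns.

```python
import torch
import torch.nn as nn

# Illustrative policy network, e.g. CartPole-sized (4 observations, 2 actions).
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update(states, actions, returns):
    """states: [T, 4] float, actions: [T] long, returns: [T] discounted returns."""
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Gradient ascent on E[log pi(a|s) * R] == gradient descent on the negative.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```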

Remarks

  1. Why TRPO/PPO were invented: regenerating all samples for each policy update is prohibitively expensive. TRPO/PPO allow old experiences to be reused, which moves training from purely on-policy towards off-policy.
  2. Rewards should be centered at zero. Since PG relies on sampling, if all rewards are positive, the probability of actions that happen not to be sampled keeps decreasing even if they are good; subtracting a baseline fixes this.
  3. When encountering a state-action pair (s,a), only the discounted reward received after this pair should be counted (see the sketch after this list).
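Remarks 2 and 3 can be illustrated with a small sketch (illustrative, assuming a single finished episode): compute the discounted reward-to-go at every step, then subtract a baseline so the weights are centered around zero.

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: each step only counts rewards received after it (remark 3)."""
    rtg = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rtg = rewards_to_go(np.array([0.0, 0.0, 1.0, 0.0, 1.0]))
# Subtract a baseline (here simply the mean) so the weights are centered at zero (remark 2).
advantages = rtg - rtg.mean()
```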

[Update 20200719]

Policy gradient is only one member of the policy search family. PG uses gradients to optimize the policy, while other methods search and optimize directly in policy space. A notable resource: the ICML 2015 tutorial. For further exploration of policy search, refer to additional materials.

Q-Learning (Value Iteration)

f(state, action)->expected action value

Action-value function: how good is it if a particular action is taken?

DQN tends to over-estimate Q-values, because the greedy target uses max_a Q(s,a). Several variants and suggestions address this and related issues: staged learning, adjusting the discount factor γ in the target, and experience replay, plus the variants below (a sketch of the Double/Dueling updates follows this list):

  1. Double-DQN: an improved DQN built on the target network; the online network picks the action and the target network evaluates it.
  2. Dueling-DQN: decomposes the Q-function into a state value V(s) plus an action advantage A(s,a). The benefit is that an update to V affects the Q-value of every action, even actions that were never sampled. In practice A is normalized so that the advantages sum to zero (by subtracting their mean); this constraint prevents the network from simply setting V to zero and pushing everything into A.
  3. Prioritized Replay: samples transitions with larger TD error more often for training.
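A rough PyTorch-style sketch of the Double-DQN target and the Dueling aggregation described above (function and variable names are illustrative assumptions):

```python
import torch

def double_dqn_target(q_net, target_net, reward, next_state, done, gamma=0.99):
    # Double DQN: the online network chooses the action, the target network evaluates it,
    # which reduces the over-estimation caused by max_a Q(s', a).
    # reward, done: float tensors of shape [B]; done is 1.0 at episode end.
    next_action = q_net(next_state).argmax(dim=1, keepdim=True)
    next_q = target_net(next_state).gather(1, next_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q

def dueling_aggregate(value, advantage):
    # Dueling DQN: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).
    # Subtracting the mean forces the advantages to sum to zero, so the network
    # cannot simply push everything into A and set V to zero.
    # value: [B, 1], advantage: [B, num_actions].
    return value + advantage - advantage.mean(dim=1, keepdim=True)
```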

Multi-step: a balance between MC (Monte Carlo) and TD (temporal difference), using a target that accumulates rewards over several successive transitions.

Noisy Exploration: add noise on the action (as in epsilon-greedy), or add noise on the parameters at the beginning of each episode (noise on θ), which gives state-dependent exploration.

Distributional Q: Q(s,a) is no longer a scalar expected value but a distribution over several bins. QR-DQN (Distributional Reinforcement Learning with Quantile Regression) models the possible outcomes directly as a distribution represented by quantiles rather than by an expected value; optimal actions can then be chosen from those quantiles. C51 uses the distributional Bellman equation, working with the distribution of future rewards rather than its expectation.

Rainbow: a DQN-based method that combines the tricks above, i.e. double and dueling networks, noisy exploration, prioritized replay, multi-step targets and distributional Q.

Hindsight Experience Replay: DQN with the goal incorporated into the input. It is particularly useful when rewards are sparse (most attempts get zero reward), and it can be combined with Deep Deterministic Policy Gradient (DDPG).

DQN for Continuous actions

  1. Sample a set of actions and pick the best one.
  2. Use gradient ascent to solve the argmax (DDPG); see the sketch below.
  3. Design the network architecture so that the argmax is easy to solve. (As in the referenced paper; essentially, are we using deep learning to solve this optimization problem?)
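A minimal sketch of the second option, the DDPG-style "gradient ascent to solve the argmax" (sizes and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2  # illustrative sizes
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def actor_update(states):
    # Instead of computing max_a Q(s,a) (intractable for continuous a), train the actor
    # so that Q(s, actor(s)) is maximized: gradient ascent on Q through the actor's parameters.
    actions = actor(states)
    q_values = critic(torch.cat([states, actions], dim=1))
    loss = -q_values.mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```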

Hybrid

Examine both the Actor-Critic architecture and the DDPG algorithm, as well as their integration with GANs.

  1. DDPG
  2. A3C: asynchronous training, with multiple agents learning in parallel. The actor-critic architecture combines policy gradient with Q-learning-style value estimation; Soft Actor-Critic is a related method (see the sketch after this list).
  3. TD3
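A rough sketch of the actor-critic idea behind item 2, written here as a synchronous one-step advantage actor-critic loss rather than the asynchronous A3C setup (names and coefficients are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    # Shared trunk with a policy head (actor) and a value head (critic).
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.policy_head = nn.Linear(64, n_actions)
        self.value_head = nn.Linear(64, 1)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a2c_loss(model, obs, actions, returns):
    logits, values = model(obs)
    dist = torch.distributions.Categorical(logits=logits)
    advantages = returns - values.detach()          # critic used as a baseline
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()   # critic regression target
    # Entropy bonus encourages exploration; 0.5 and 0.01 are illustrative coefficients.
    return policy_loss + 0.5 * value_loss - 0.01 * dist.entropy().mean()
```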

Model-based vs Model-free

  • Model: a model of the world. Structured information about the environment is used for planning; the state-transition probabilities are known to some extent.
  • Model-free: the environment is treated as a black box that only returns states and rewards as numbers; no additional information is available.

For detailed information on model-based approaches, see Model-based Reinforcement Learning.

Other topics

Some other research directions:

Sparse Reward

Many practical scenarios have extremely sparse rewards. For instance, consider a robotic task of inserting a cylindrical object into an opening: most attempts fail and yield zero reward. To address this challenge:

  • Reward shaping: manually add extra rewards to guide the agent towards the task (using domain knowledge).
  • Curriculum learning: learn progressively, from easy to hard. Start training on easy examples whose tuples carry rewards, then gradually move to harder data with sparse rewards.
  • Reverse curriculum generation: sample states close to the goal first, then explore states progressively farther away from the goal.

Hierarchical RL

Split the goal into sub-goals, which may not be directly related to the final goal; these sub-goals can then be split further into lower-level goals.

Imitation Learning

  • Behavior cloning: the same as supervised learning (see the sketch after this list). Problems:
    1. Experience is limited. Try data aggregation: expert -> pi_1 -> trajectories -> expert labels -> pi_2 ... Not a great solution.
    2. Some parts of the demonstration should be cloned and others should not, but the learner does not know which.
    3. Data mismatch. TO BE CLARIFIED: the distributions of the training data and the test data are not the same. Even if the learner has learnt 99% of the expert's behavior, the resulting reward can be very different due to the nature of RL/MDPs.
  • Inverse RL: more interesting than behavior cloning: instead of cloning expert actions, first infer the reward function from the expert's actions, then optimize against that function.
    1. How to learn the reward: again, a GAN-like scheme. Update the reward function so that the teacher's actions always score better; update the actor to get a better reward.
    2. Different ways to demonstrate: first-person or third-person. In the third-person case, add feature-extraction layers to the network so that third-person experiences look like first-person ones.
    3. Advantage: usually not many demonstrations are needed.
    4. CHECK THE LINK WITH STRUCTURED LEARNING
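A behavior-cloning sketch for the first bullet (plain supervised learning on expert state-action pairs; names and sizes are illustrative assumptions, and the problems listed above are not solved by this loop):

```python
import torch
import torch.nn as nn

# Behavior cloning: fit a policy to expert (state, action) pairs with a standard
# supervised classification loss.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def bc_step(expert_states, expert_actions):
    """expert_states: [B, 4] float, expert_actions: [B] long (expert labels)."""
    logits = policy(expert_states)
    loss = loss_fn(logits, expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```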

Meta-learning/Transfer-learning

I would say it's the same as in the context of DL

Multi-agent

When I get here, I will pick up some game theory along the way!
