
A Casual Overview of Reinforcement Learning


[update 20200712]

The "best reference" resource, from OpenAI's official platform: [spinningup](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html)


Plan

  1. Finish Hung-yi Lee's (李宏毅) videos on reinforcement learning.
  2. Implement things step by step from scratch, following OpenAI's recommendations.
  3. Build a solid deep learning foundation while getting comfortable with both PyTorch and TensorFlow.
  4. Keep an eye on frontier research areas whenever possible.

Reinforcement Learning Overview

This overview is largely derived from the article: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc.

On-Policy vs Off-Policy

After watching Hung-yi Lee's deep RL videos, my understanding is that under TD-based learning, the relationship between the replay buffer and on/off-policy shows up in the distribution of experiences. The replay buffer stores tuples of experience rather than trajectories, so its contents do not correspond directly to any particular policy. However, the distribution of experiences stored in the buffer differs from the distribution of data the current policy would generate, and experiences are usually sampled uniformly, so the sampled data cannot be attributed to the current policy. For MC, i.e. methods based on complete trajectories, using trajectories generated by a policy π to train a new policy π' is more a matter of off-policy evaluation than of on-policy updating.
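To make the "experience tuple" point concrete, here is a minimal replay-buffer sketch (illustrative names, not from any specific library): it stores transitions rather than trajectories and samples them uniformly, which is exactly why the sampled batch does not follow the current policy's data distribution.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal sketch: stores individual transitions, not trajectories."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Each entry is an experience tuple, independent of which policy produced it.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling: old and new experiences are treated alike,
        # so the batch does not match the current policy's distribution.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```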

Why is Q-Learning considered a form of off-policy learning?

The core question: when updating Q, check whether the policy that Q is being trained to evaluate is the same as the behavior policy that actually interacts with the environment. In SARSA it is; in Q-learning, the update targets the greedy policy π* rather than the current π. Concretely: is the a' in Q(s', a') produced by the current actor at s', or is it an approximation like the max in Q-learning? When a replay buffer is used, or when a' comes from a target actor, the method is off-policy; otherwise it is on-policy. That was my understanding after browsing a lot of material.

[update 0413] See the figure below; I now have a new understanding: remember that the Q-function is TD-based, so successive actions have an order. In other words, when training Q we need to know which Q(?) differs from the current Q by exactly r. If that '?' matches the action the current policy would output, we are training Q to follow the current policy, so it is on-policy. Otherwise, as in Q-learning, the current policy picks an epsilon-greedy action, but the Q update assumes the next step is fully greedy; Q then no longer matches the current policy, so it is off-policy.

When a replay buffer is involved, the a' it provides was produced by some past policy and may differ from what the current actor would output at s'. In that case Q is not being trained to be the evaluation function of the current policy, and it cannot become one.

Summary: what happens when we train Q determines whether the method is on-policy or off-policy. In short, check whether the a' in Q(s', a') matches the action the current actor would suggest. In other words: we train the Q network to be a value function, and the question is which policy it evaluates.
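Written out, the two TD targets behind this distinction are the standard textbook update rules (added here for reference, not from the original notes):

$$\text{SARSA (on-policy):}\qquad Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma\, Q(s',a') - Q(s,a)\right],\quad a' \sim \text{current }\varepsilon\text{-greedy policy}$$

$$\text{Q-learning (off-policy):}\qquad Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$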

The next question: how do we choose between the two? See the link below, in particular the discussion of when the agent actually has to "take actions". In short, Q-learning, being off-policy, can learn the optimal policy directly, but it may suffer from instability and convergence problems. SARSA is more conservative, which makes it preferable when mistakes during training are costly.

https://stats.stackexchange.com/questions/326788/when-to-choose-sarsa-vs-q-learning

That last question made me realize that TD methods have an inherent temporal ordering: for a given state, taking a then a' then a'' is not supposed to yield the same Q(s, a) as taking a then a'' then a'. The essence of this idea, its theoretical justification and its limitations still need further study.

On-policy vs. Off-policy

An on-policy learning agent learns the value function using its current action a, derived from the current policy. An off-policy learning agent learns it using an action a* obtained from another policy.

Q-learning is off-policy because it updates its Q-values using the Q-value of the next state s′ and the greedy action a′. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy is followed, even though it is not actually following one.

Why is SARSA considered on-policy? Because it updates its Q-values using the Q-value of the next state s′ and the action chosen by the current policy. It estimates the return for each state-action pair assuming the current policy continues to be followed.

The distinction vanishes when the current policy is greedy. However, such an agent would not be effective, because it never explores.

Policy Optimization (Policy Iteration)

f(state) -> action.

What action to take now?

  1. Policy Gradient. Think of playing a game 10000 times and measuring the expected reward of f(s); then you can do gradient ascent on it (a minimal sketch follows this list).
  2. Trust Region Policy Optimization (TRPO). An on-policy method that takes a big PG step, but not too big: the step stays within the trust region. This is ensured by a constraint on the difference between the current behavior (not the parameters) and the behavior of the target network. Both continuous and discrete action spaces are supported.
  3. Proximal Policy Optimization (PPO). Like TRPO it is an on-policy method, but it treats that constraint as a penalty (regularization). Popularized by OpenAI. It does PG while adjusting the gradients smartly to avoid performance issues and instability. It is easier to implement and solve than TRPO with similar performance, so it is usually preferred over TRPO.
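For item 1, here is a minimal REINFORCE-style sketch in PyTorch (network sizes, hyperparameters and function names are illustrative assumptions, not from the original notes): play episodes, then do gradient ascent on the log-probability of the taken actions weighted by their returns.

```python
import torch
import torch.nn as nn

# Illustrative policy network, e.g. CartPole-sized (4 observations, 2 actions).
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update(states, actions, returns):
    """states: [T, 4] float, actions: [T] long, returns: [T] discounted returns."""
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Gradient ascent on E[log pi(a|s) * R] == gradient descent on the negative.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```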

Remarks

  1. Why TRPO/PPO were invented: regenerating all samples for each policy update is prohibitively expensive. TRPO/PPO allow old experiences to be reused, which moves training from purely on-policy towards off-policy.
  2. Rewards should be centered at zero. Since PG relies on sampling, if all rewards are positive, the probability of actions that happen not to be sampled keeps decreasing even if they are good; subtracting a baseline fixes this.
  3. When encountering a state-action pair (s,a), only the discounted reward received after this pair should be counted (see the sketch after this list).
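Remarks 2 and 3 can be illustrated with a small sketch (illustrative, assuming a single finished episode): compute the discounted reward-to-go at every step, then subtract a baseline so the weights are centered around zero.

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: each step only counts rewards received after it (remark 3)."""
    rtg = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rtg = rewards_to_go(np.array([0.0, 0.0, 1.0, 0.0, 1.0]))
# Subtract a baseline (here simply the mean) so the weights are centered at zero (remark 2).
advantages = rtg - rtg.mean()
```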

[Update 20200719]

Policy gradient is only one member of the policy search family. PG uses gradients to optimize the policy, while other methods search and optimize directly in policy space. A notable resource: the ICML 2015 tutorial. For further exploration of policy search, refer to additional materials.

Q-Learning (Value Iteration)

f(state, action)->expected action value

Action-value function: how good is it if a particular action is taken?

DQN tends to over-estimate Q-values, because the greedy target uses max_a Q(s,a). Several variants and suggestions address this and related issues: staged learning, adjusting the discount factor γ in the target, and experience replay, plus the variants below (a sketch of the Double/Dueling updates follows this list):

  1. Double-DQN: an improved DQN built on the target network; the online network picks the action and the target network evaluates it.
  2. Dueling-DQN: decomposes the Q-function into a state value V(s) plus an action advantage A(s,a). The benefit is that an update to V affects the Q-value of every action, even actions that were never sampled. In practice A is normalized so that the advantages sum to zero (by subtracting their mean); this constraint prevents the network from simply setting V to zero and pushing everything into A.
  3. Prioritized Replay: samples transitions with larger TD error more often for training.
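A rough PyTorch-style sketch of the Double-DQN target and the Dueling aggregation described above (function and variable names are illustrative assumptions):

```python
import torch

def double_dqn_target(q_net, target_net, reward, next_state, done, gamma=0.99):
    # Double DQN: the online network chooses the action, the target network evaluates it,
    # which reduces the over-estimation caused by max_a Q(s', a).
    # reward, done: float tensors of shape [B]; done is 1.0 at episode end.
    next_action = q_net(next_state).argmax(dim=1, keepdim=True)
    next_q = target_net(next_state).gather(1, next_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q

def dueling_aggregate(value, advantage):
    # Dueling DQN: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).
    # Subtracting the mean forces the advantages to sum to zero, so the network
    # cannot simply push everything into A and set V to zero.
    # value: [B, 1], advantage: [B, num_actions].
    return value + advantage - advantage.mean(dim=1, keepdim=True)
```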

Multi-step: a balance between MC (Monte Carlo) and TD (temporal difference), using a target that accumulates rewards over several successive transitions.

Noisy Exploration: add noise on the action (as in epsilon-greedy), or add noise on the parameters at the beginning of each episode (noise on θ), which gives state-dependent exploration.

Distributional Q: Q(s,a) is no longer a scalar expected value but a distribution over several bins. QR-DQN (Distributional Reinforcement Learning with Quantile Regression) models the possible outcomes directly as a distribution represented by quantiles rather than by an expected value; optimal actions can then be chosen from those quantiles. C51 uses the distributional Bellman equation, working with the distribution of future rewards rather than its expectation.

Rainbow: a DQN-based method that combines the tricks above, i.e. double and dueling networks, noisy exploration, prioritized replay, multi-step targets and distributional Q.

Hindsight Experience Replay: DQN with the goal incorporated into the input. It is particularly useful when rewards are sparse (most attempts get zero reward), and it can be combined with Deep Deterministic Policy Gradient (DDPG).

DQN for Continuous actions

  1. Sample a set of actions and pick the best one.
  2. Use gradient ascent to solve the argmax (DDPG); see the sketch below.
  3. Design the network architecture so that the argmax is easy to solve. (As in the referenced paper; essentially, are we using deep learning to solve this optimization problem?)
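A minimal sketch of the second option, the DDPG-style "gradient ascent to solve the argmax" (sizes and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2  # illustrative sizes
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def actor_update(states):
    # Instead of computing max_a Q(s,a) (intractable for continuous a), train the actor
    # so that Q(s, actor(s)) is maximized: gradient ascent on Q through the actor's parameters.
    actions = actor(states)
    q_values = critic(torch.cat([states, actions], dim=1))
    loss = -q_values.mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```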

Hybrid

Examine both the Actor-Critic architecture and the DDPG algorithm, as well as their integration with GANs.

  1. DDPG
  2. A3C: asynchronous training, with multiple agents learning in parallel. The actor-critic architecture combines policy gradient with Q-learning-style value estimation; Soft Actor-Critic is a related method (see the sketch after this list).
  3. TD3
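A rough sketch of the actor-critic idea behind item 2, written here as a synchronous one-step advantage actor-critic loss rather than the asynchronous A3C setup (names and coefficients are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    # Shared trunk with a policy head (actor) and a value head (critic).
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.policy_head = nn.Linear(64, n_actions)
        self.value_head = nn.Linear(64, 1)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a2c_loss(model, obs, actions, returns):
    logits, values = model(obs)
    dist = torch.distributions.Categorical(logits=logits)
    advantages = returns - values.detach()          # critic used as a baseline
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()   # critic regression target
    # Entropy bonus encourages exploration; 0.5 and 0.01 are illustrative coefficients.
    return policy_loss + 0.5 * value_loss - 0.01 * dist.entropy().mean()
```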

Model-based vs Model-free

  • Model: a model of the world. Structured information about the environment is used for planning; the state-transition probabilities are known to some extent.
  • Model-free: the environment is treated as a black box that only returns states and rewards as numbers; no additional information is available.

For detailed information on model-based approaches, see Model-based Reinforcement Learning.

Other topics

Some other research directions:

Sparse Reward

Many practical scenarios have extremely sparse rewards. For instance, consider a robotic task of inserting a cylindrical object into an opening: most attempts fail and yield zero reward. To address this challenge:

  • Reward shaping: manually add extra rewards to guide the agent towards the task (using domain knowledge).
  • Curriculum learning: learn progressively, from easy to hard. Start training on easy examples whose tuples carry rewards, then gradually move to harder data with sparse rewards.
  • Reverse curriculum generation: sample states close to the goal first, then explore states progressively farther away from the goal.

Hierarchical RL

Split the goal into sub-goals, which may not be directly related to the final goal; these sub-goals can then be split further into lower-level goals.

Imitation Learning

  • Behavior cloning: the same as supervised learning (see the sketch after this list). Problems:
    1. Experience is limited. Try data aggregation: expert -> pi_1 -> trajectories -> expert labels -> pi_2 ... Not a great solution.
    2. Some parts of the demonstration should be cloned and others should not, but the learner does not know which.
    3. Data mismatch. TO BE CLARIFIED: the distributions of the training data and the test data are not the same. Even if the learner has learnt 99% of the expert's behavior, the resulting reward can be very different due to the nature of RL/MDPs.
  • Inverse RL: more interesting than behavior cloning: instead of cloning expert actions, first infer the reward function from the expert's actions, then optimize against that function.
    1. How to learn the reward: again, a GAN-like scheme. Update the reward function so that the teacher's actions always score better; update the actor to get a better reward.
    2. Different ways to demonstrate: first-person or third-person. In the third-person case, add feature-extraction layers to the network so that third-person experiences look like first-person ones.
    3. Advantage: usually not many demonstrations are needed.
    4. CHECK THE LINK WITH STRUCTURED LEARNING
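A behavior-cloning sketch for the first bullet (plain supervised learning on expert state-action pairs; names and sizes are illustrative assumptions, and the problems listed above are not solved by this loop):

```python
import torch
import torch.nn as nn

# Behavior cloning: fit a policy to expert (state, action) pairs with a standard
# supervised classification loss.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def bc_step(expert_states, expert_actions):
    """expert_states: [B, 4] float, expert_actions: [B] long (expert labels)."""
    logits = policy(expert_states)
    loss = loss_fn(logits, expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```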

Meta-learning/Transfer-learning

I would say it's the same as in the context of DL

Multi-agent

When I get here, I will pick up some game theory along the way!
