Model-Based Reinforcement Learning
We consider the problem of optimal control of an MDP with a known reward function R and unknown deterministic transition dynamics s_{t+1} = f(s_t, a_t).
In model-based reinforcement learning, this problem is addressed in two steps:
- Model learning:
Through regression on interaction data, we fit a dynamics model f_\theta \approx f.
- Planning:
Leveraging the learned dynamics model f_\theta, we compute the optimal trajectory under \hat{s}_{t+1} = f_\theta(\hat{s}_t, a_t).
(We could easily extend this to unknown rewards and stochastic dynamics, but we stick to this simpler setting in this notebook for the sake of illustration.)
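To make the two steps concrete, here is a minimal sketch, not the notebook's actual implementation: a small neural network f_\theta is fitted by regression on transition tuples, then used for planning by random shooting, i.e. sampling candidate action sequences, rolling them out with f_\theta, and executing the first action of the best sequence. The network architecture, the `reward_fn` callable, and the action range in [-1, 1] are assumptions made for illustration.

```python
# Sketch only: dynamics model f_theta fitted by regression, then planning by
# random shooting. The reward function and dimensions are assumed to be given.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state from the current state and action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def fit_dynamics(model, states, actions, next_states, epochs=100, lr=1e-3):
    """Step 1 (model learning): regress s_{t+1} on (s_t, a_t)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(states, actions)
        loss = nn.functional.mse_loss(pred, next_states)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model

def plan_random_shooting(model, reward_fn, s0, action_dim, horizon=10, n_candidates=256):
    """Step 2 (planning): sample action sequences, roll them out with f_theta,
    and return the first action of the sequence with the highest predicted return."""
    with torch.no_grad():
        actions = torch.rand(n_candidates, horizon, action_dim) * 2 - 1  # assumed range [-1, 1]
        states = s0.expand(n_candidates, -1)
        returns = torch.zeros(n_candidates)
        for t in range(horizon):
            states = model(states, actions[:, t])        # \hat{s}_{t+1} = f_theta(\hat{s}_t, a_t)
            returns += reward_fn(states, actions[:, t])  # known reward R(s, a)
        best = torch.argmax(returns)
    return actions[best, 0]
```

In practice, a cross-entropy method or gradient-based planner is often substituted for plain random shooting, but the structure (learn f_\theta, then optimize actions against it) stays the same.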
Motivation
Sparse rewards
- In model-free reinforcement learning, a reward signal is only received when a reward is actually obtained. In environments with sparse rewards, random exploration reaches rewarding states with very low probability, which hinders effective learning.
- In contrast, we always receive a stream of state transition data, whether or not rewards are observed. We can exploit this data to learn about the underlying task.
Complexity of the policy/value vs dynamics:
Is it easier to decide which action is best, or to predict what is going to happen?
- Some problems can have complex dynamics but a simple optimal policy or value function. For instance, consider the problem of learning to swim: predicting the resulting motion requires understanding fluid dynamics and vortices, while the optimal policy simply consists of moving the limbs in sync.
- Conversely, other problems can have simple dynamics but complex policies/value functions. Think of the game of Go: its rules are simple (placing a stone merely changes the board state at that location), but the corresponding optimal policy is very complicated.
Intuitively, model-free reinforcement learning is better suited to the first kind of problem, and model-based reinforcement learning to the second.
Typically, the dynamics possess specific structural characteristics: they may be smooth, exhibit translational invariance, etc. This underlying knowledge can be integrated into machine learning models. Conversely, policies and value functions can be discontinuous: consider the difference between collision and near-collision states.
It is generally acknowledged that model-based approaches learn faster than model-free ones (see e.g. [Sutton, 1990]).
We may want to know how a policy will behave before actually executing it, e.g. for safety checks.
Model-free reinforcement learning can only recommend an action for the current state; it cannot predict that action's consequences.
To obtain the resulting trajectory, one has no choice but to execute the policy.
In contrast, model-based methods are more interpretable, in that we can probe a policy for its intended (and predicted) trajectory.
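As an illustration of this point, here is a hedged sketch of how one might preview a policy's predicted trajectory with a learned dynamics model (such as the one sketched above) before acting in the real environment. The `policy` and `dynamics_model` callables are assumptions, not part of the notebook's code.

```python
# Sketch: preview the trajectory predicted by the learned model f_theta for a
# given policy, without acting in the real environment (e.g. for safety checks).
import torch

def predicted_trajectory(dynamics_model, policy, s0, horizon=20):
    """Roll the policy out inside the learned model and return the imagined states."""
    states = [s0]
    with torch.no_grad():
        s = s0
        for _ in range(horizon):
            a = policy(s)               # the action the policy intends to take
            s = dynamics_model(s, a)    # predicted next state, \hat{s}_{t+1} = f_theta(\hat{s}_t, a_t)
            states.append(s)
    return torch.stack(states)

# The imagined trajectory can then be inspected, e.g. checked against safety
# constraints, before the first action is actually executed.
```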
