Model-Based Reinforcement Learning
We consider the problem of optimal control of an MDP with a known reward function R and unknown deterministic transition dynamics s_{t+1} = f(s_t, a_t).
In model-based reinforcement learning, this problem is addressed in two steps:
- Model learning:
Through regression on interaction data, we fit a dynamics model f_\theta \approx f.
- Planning:
Leveraging the learned dynamics model f_\theta, we compute the optimal trajectory under \hat{s}_{t+1} = f_\theta(\hat{s}_t, a_t).
(We could easily extend this to unknown rewards and stochastic dynamics, but we stick to this simpler setting in this notebook for the sake of illustration.)
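To make the two steps concrete, here is a minimal sketch, not the notebook's actual implementation: a small neural network f_\theta is fitted by regression on transition tuples, then used for planning by random shooting, i.e. sampling candidate action sequences, rolling them out with f_\theta, and executing the first action of the best sequence. The network architecture, the `reward_fn` callable, and the action range in [-1, 1] are assumptions made for illustration.

```python
# Sketch only: dynamics model f_theta fitted by regression, then planning by
# random shooting. The reward function and dimensions are assumed to be given.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state from the current state and action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def fit_dynamics(model, states, actions, next_states, epochs=100, lr=1e-3):
    """Step 1 (model learning): regress s_{t+1} on (s_t, a_t)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(states, actions)
        loss = nn.functional.mse_loss(pred, next_states)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model

def plan_random_shooting(model, reward_fn, s0, action_dim, horizon=10, n_candidates=256):
    """Step 2 (planning): sample action sequences, roll them out with f_theta,
    and return the first action of the sequence with the highest predicted return."""
    with torch.no_grad():
        actions = torch.rand(n_candidates, horizon, action_dim) * 2 - 1  # assumed range [-1, 1]
        states = s0.expand(n_candidates, -1)
        returns = torch.zeros(n_candidates)
        for t in range(horizon):
            states = model(states, actions[:, t])        # \hat{s}_{t+1} = f_theta(\hat{s}_t, a_t)
            returns += reward_fn(states, actions[:, t])  # known reward R(s, a)
        best = torch.argmax(returns)
    return actions[best, 0]
```

In practice, a cross-entropy method or gradient-based planner is often substituted for plain random shooting, but the structure (learn f_\theta, then optimize actions against it) stays the same.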
Motivation
Sparse rewards
- In model-free reinforcement learning, a reward signal is only received when a reward is actually obtained. In environments with sparse rewards, random exploration reaches rewarding states with very low probability, which hinders effective learning.
- In contrast, we always receive a stream of state transition data, whether or not rewards are observed. We can exploit this data to learn about the underlying task.
Complexity of the policy/value vs dynamics:
Is it easier to decide which action is best, or to predict what is going to happen?
- Some problems can have complex dynamics but a simple optimal policy or value function. For instance, consider the problem of learning to swim: predicting the resulting motion requires understanding fluid dynamics and vortices, while the optimal policy simply consists of moving the limbs in sync.
- Conversely, other problems can have simple dynamics but complex policies/value functions. Think of the game of Go: its rules are simple (placing a stone merely changes the board state at that location), but the corresponding optimal policy is very complicated.
Intuitively, model-free reinforcement learning is better suited to the first kind of problem, and model-based reinforcement learning to the second.
Typically, the dynamics possess specific structural characteristics: they may be smooth, exhibit translational invariance, etc. This underlying knowledge can be integrated into machine learning models. Conversely, policies and value functions can be discontinuous: consider the difference between collision and near-collision states.
It is generally acknowledged that model-based approaches learn faster than model-free ones (see e.g. [Sutton, 1990]).
We may want to know how a policy will behave before actually executing it, e.g. for safety checks.
Model-free reinforcement learning can only recommend an action for the current state; it cannot predict that action's consequences.
To obtain the resulting trajectory, one has no choice but to execute the policy.
In contrast, model-based methods are more interpretable, in that we can probe a policy for its intended (and predicted) trajectory.
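As an illustration of this point, here is a hedged sketch of how one might preview a policy's predicted trajectory with a learned dynamics model (such as the one sketched above) before acting in the real environment. The `policy` and `dynamics_model` callables are assumptions, not part of the notebook's code.

```python
# Sketch: preview the trajectory predicted by the learned model f_theta for a
# given policy, without acting in the real environment (e.g. for safety checks).
import torch

def predicted_trajectory(dynamics_model, policy, s0, horizon=20):
    """Roll the policy out inside the learned model and return the imagined states."""
    states = [s0]
    with torch.no_grad():
        s = s0
        for _ in range(horizon):
            a = policy(s)               # the action the policy intends to take
            s = dynamics_model(s, a)    # predicted next state, \hat{s}_{t+1} = f_theta(\hat{s}_t, a_t)
            states.append(s)
    return torch.stack(states)

# The imagined trajectory can then be inspected, e.g. checked against safety
# constraints, before the first action is actually executed.
```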
