
Artificial Intelligence - Chapter 7 Reinforcement Learning


Why reinforcement learning?

When we do dynamic programming, we assume that we know the transition function (the model) and the reward function. But in the real world, we often don't have access to the true transition model or reward function. Even if we can define them, they can be very difficult to compute. Just imagine having thousands of states, each with its own transition probabilities: it would be hard to keep track of them all.

So we introduce reinforcement learning to solve these decision problems **not by exhaustive sweeps**, which is what dynamic programming does, but by **using samples**: interacting with the environment, keeping track of what we see, and using those samples to formulate policies and estimate value functions.

The agent-environment diagram comes into play in reinforcement learning because we don't know the underlying model. Our agent now explicitly goes into the real world and interacts with it via the feedback loop in the diagram: the agent takes actions, which may be optimal or sub-optimal; it simply tries different things, and in response the environment tells the agent, step by step, what state results from each action and what reward it receives.

By doing this many times and collecting this information, we can eventually formulate an optimal policy or value function. Again, all of this is done without knowing what the underlying transition model or reward function is.

One could ask: if we don't know the underlying model, why don't we just approximate it? Once we have an estimated model, we can go back to dynamic programming and use the methods from the last chapter to solve for the policy and value functions. This is one major approach, the "model-based method".

(When we say learning the underlying model, we literally mean learning the transition and reward functions.)

But here we will mostly focus on the model-free method.

The point is, we acknowledge that there is an underlying model, but we're not going to worry about learning it explicitly. A model-free method learns the value function or the policy directly from samples and observations alone. The advantage is that we don't have to worry about learning the model first.

Even if you're able to learn an approximation of the model, you still have to worry about whether it's a good approximation. And even with a valid model, dynamic programming may still be too expensive to perform because we have to sweep through all states and actions.

Model-free methods are not really limited by how much we know about the model or by the model's complexity.

There are two main model-free methods we're going to talk about today: the Monte Carlo method and the temporal-difference method. Let's start with Monte Carlo first.

In the Monte Carlo method, if we want to estimate the value of a certain state, we drop the agent into that state and let it run, following a given policy, many times. Then we look at what it observes and what rewards it gets. That gives us an estimate of the state's value.

For example, suppose we want to estimate the value of state s. In the figure, the triangles are states and the green circles are chance nodes. The agent either starts from state s or simply passes through it at some point, and the episode ends when it hits a terminal state.

From episode 1, we get an estimate of the utility, call it G_1(s), using the transitions and rewards observed after starting from state s. Then we do this again in a second episode and perhaps get a different estimate G_2(s). That's totally fine: we're in a stochastic world, so we can't expect exactly the same result each time.

Then in episode 3, we don't actually have to start from state s; we could start in some other state and eventually make our way to s. In this situation we follow the definition: when estimating the value of a state, we only use the information after we first see that state. So in this episode we use only a portion of the whole trajectory to get the estimate G_3(s).

So if we have many episodes and see state s in all of them (not necessarily starting from s), we can average all the G_i, which gives a rough approximation of the true value of state s under the given policy.

The key point of the prediction problem is that we are already given a policy and want to predict values under it. (The policy is not necessarily the optimal one.)

The method used in this example is called the first-visit Monte Carlo method: to estimate the value of a state, we only use the rewards observed after the first visit to that state.

We generate sequences: the agent starts from some initial state s_0, takes action a_0, and observes the reward and the next state s_1. It keeps recording state, action, reward (SAR) triples until the episode ends. An episode ends either when the agent hits a terminal state or after some fixed amount of time. This ensures that we are dealing with finite problems.

Within a sequence, if we're interested in a specific state s_t (t can be any index from 0 to T-1), the value estimate for s_t is the discounted sum of rewards starting from the first reward after encountering s_t and going all the way to the end of the episode:

G_t = \sum_{j=t+1}^{T} \gamma^{j-t-1} r_j

The above is all about one episode. We run the agent multiple times, collecting sequences that contain the state s_t; then the value function for the given fixed policy can be computed for each state by averaging the returns observed across episodes:
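(The slide's exact formula isn't reproduced here; in standard notation, with N(s) the number of episodes in which state s appears, the first-visit estimate is simply the sample mean of the recorded returns:)

V^{\pi}(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)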

In this case, we are given a policy that tells the agent to go left no matter what state it's in. Also, we don't know the transition function (if we intend to go left, we don't know with what probability we will actually do so) or the reward function (it's shown in the figure, but the agent doesn't know it).

Let's put our agent in this grid world, start it off at random states, and collect 3 sequences.

Let's walk through episode 1. The agent starts at state A and follows the policy to go left; it gets a reward of +3 and observes that it is still in state A. In the next step it again follows the policy and tries to go left, but this time it actually moves right, getting a reward of -2 and ending up in state B. It keeps following the policy and making observations until it has gone through 5 transitions.

All these episodes give us information about the transition and reward functions, and Monte Carlo turns this information into an estimate of the value function.

If we are interested in state A, we look for it in these episodes. In episode 1 we encounter A at the very beginning, so we start counting rewards from that point forward and take the discounted sum from then on. In episode 2 we also see A at the beginning. In episode 3 we don't see A until almost the end of the episode, but that's okay; it still gives a valid estimate for state A. The return only involves a single term, because there is just one step after A appears before the episode ends. Again, this is why it's called first-visit Monte Carlo.

For the other states it's the same idea: we don't start tabulating rewards until the first visit to the state we're interested in.

One question worth noting: what if we don't see a state at all in a given episode?

We just treat that episode as if it didn't exist. For example, if episode 2 were:

Episode 2: (A,-2,B,+3,A,-2,B,+3,A,-2)

Then if we want to estimate the value of state C, we simply ignore episode 2, which means we only have to average two returns:

V^{\pi}(C) = \frac{1}{2}\big(G_1(C) + G_3(C)\big)

while the estimates for states A and B are still averaged over all three episodes.
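To make this concrete, here is a minimal Python sketch of first-visit Monte Carlo prediction (not from the original slides). It assumes each episode is given as a list of (state, reward) pairs, where the reward is the one received on the transition out of that state, and it uses gamma = 1 for simplicity.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V^pi(s) by averaging the returns observed after the
    first visit to s in every episode that contains s."""
    returns = defaultdict(list)  # state -> list of first-visit returns G_i(s)

    for episode in episodes:
        # Compute the return G_t at every time step with a backward scan:
        # G_t = r_{t+1} + gamma * G_{t+1}, where r_{t+1} is the reward
        # received on the transition out of s_t.
        G = 0.0
        G_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            G_at[t] = G

        # First-visit rule: record only the return from the first time
        # each state appears in this episode.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(G_at[t])

    # Average the recorded returns to get the value estimate per state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Using the modified episode 2 from the text, (A,-2,B,+3,A,-2,B,+3,A,-2):
episode2 = [("A", -2), ("B", +3), ("A", -2), ("B", +3), ("A", -2)]
print(first_visit_mc([episode2]))  # {'A': 0.0, 'B': 2.0} with gamma = 1
```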

One good thing about the Monte Carlo method is that the estimates of different state values are independent of each other. This is in contrast to dynamic programming. If you're trying to estimate the value of a particular state, you don't need to know the values of the neighboring states.

In dynamic programming, the relationship between a state and its successors is literally baked into the Bellman equation, so the accuracy of a state's value depends heavily on the accuracy of its successors' values. In that setting, all the states are interconnected.

In Monte Carlo, each state is completely independent: if we want to know V^{\pi}(s), we don't need to care about V^{\pi}(s') at all. So if we're only interested in certain states, not the entire state space, the computation for those states does not depend on the size or complexity of the state space.

So this is an example of where, even if we know the underlying transition model and reward function, dynamic programming may still be infeasible when the state or action space is huge. In that case we might still opt for Monte Carlo, a model-free, sample-based approach, because the accuracy of a state's value estimate doesn't depend on knowing the entire problem (the entire model); we don't have to solve the entire MDP!

Another variation of Monte Carlo is to compute the estimates online:

In the previous example, we first generated all the sequences and then came back to compute the estimates. In general, though, we want to continuously update our predictions as data comes in, so doing the prediction/learning online is particularly helpful when we are continuously acquiring data.

The formula on the slide is straightforward: returns G_t keep coming in, and we keep updating V^{\pi}(s).

Now we make a small transformation to this formula and get the expression on the right: it has the form of the old value plus a weighted error.
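(The slide's equation isn't reproduced here; in standard notation, the running average of the first N+1 returns can be rewritten incrementally as)

V_{N+1}(s) = \frac{1}{N+1}\sum_{i=1}^{N+1} G_i(s) = V_N(s) + \frac{1}{N+1}\big(G_{N+1}(s) - V_N(s)\big)

(so the "error" is the gap between the newest return and the current estimate, weighted by 1/(N+1).)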

As we accumulate more episodes over time, the updates become smaller since the weight shrinks. This makes sense: with more and more data, things start to smooth out; each new observation is added to an ever-growing pool, so its effect on the estimate shrinks over time.

This is another way of doing Monte Carlo prediction, called constant-alpha Monte Carlo. It basically entails taking this equation and replacing the term 1/(N+1) with a tunable parameter alpha.

If we simply use \alpha = \frac{1}{N+1}, it starts out high and decreases toward 0 over time. The constant-alpha form lets us define alpha in other ways, but typically we still want alpha to start out large and decay toward 0 so that the updates become smaller and smaller and the estimates converge to a fixed set of values.

One thing about Monte Carlo is that the problem must divide into chunks of episodes. In the previous grid-world example, and in many real problems, there may be no terminal states, and imposing an episodic structure may not make sense. In the example we defined each episode to last five actions, but there was no real reason to do so, and the values we get from Monte Carlo would have that arbitrary assumption baked in, which is not what we're interested in.

So now we want to extend the Monte Carlo method to handle problems without episodic structure. This is the temporal-difference method.

In this method, we will go back to what we saw from dynamic programming: the state values and their accuracy depend on neighboring state values.

We slightly modify the expression in the error term, replacing the full return G_t with a one-step bootstrapped target:

G_t \longrightarrow r_{t+1} + \gamma V^{\pi}(s_{t+1})

In this new expression, the successor state is involved (the red term comes directly from the Bellman equation). It means that as soon as we reach the successor state, i.e. as soon as we make one transition, we can update the value of the state we just came from. We are no longer required to wait until the end of an episode to update a state value, so we don't need an episodic structure here.
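(Written out in full, with notation that may differ slightly from the slide, the TD(0) update after one observed transition (s_t, r_{t+1}, s_{t+1}) is)

V^{\pi}(s_t) \leftarrow V^{\pi}(s_t) + \alpha\big[r_{t+1} + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)\big]

(where the bracketed quantity is the TD error.)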

What happens if we jump to a successor state and we have no estimates of that successor state?

At the very beginning, all state values are randomly initialized and they're simply wrong; we are using wrong values to update other values. The only useful information in the early phase comes from the immediate rewards r_{t+1}. But just as in dynamic programming, the state values will gradually converge toward the correct ones, and as each state converges, its neighbors' values start to converge as well.

So we would say temporal difference essentially simulates dynamic programming, but using samples rather than a complete, exhaustive sweep over the entire state space.

We also see a tree structure in TD(0), but again, there is just one branch because we're not able to consider all successors of a node. We just follow the policy down one random path; we don't really know where the environment will take us. To estimate a state's value, we only use information from its successor one layer down, e.g. s' uses the estimated value of s'' to update its own value.

(We actually generate an infinitely long sequence, but only a portion is shown here.)

For the first transition (A, left, +3, A), we would have the update:
V^{\pi}(A) \gets V^{\pi}(A) + 0.5\,[\,r + \gamma V^{\pi}(A) - V^{\pi}(A)\,], with V^{\pi}(A)=0 initially (the successor state is still A),

so the updated V^{\pi}(A) = 0 + 0.5\,(3 + 0 - 0) = 1.5.

For the second transition, we move from A to B with reward -2:
V^{\pi}(A) \gets V^{\pi}(A) + 0.5\,[\,r + \gamma V^{\pi}(B) - V^{\pi}(A)\,], with V^{\pi}(A)=1.5, V^{\pi}(B)=0,

so the updated V^{\pi}(A) = 1.5 + 0.5\,(-2 + 0 - 1.5) = -0.25.

For the third transition, we update B and keep A and C untouched:
V^{\pi}(B) \gets V^{\pi}(B) + 0.5\,[\,r + \gamma V^{\pi}(s'') - V^{\pi}(B)\,], with V^{\pi}(B)=0 and the successor's value also 0,

so with the observed reward of +1, the updated V^{\pi}(B) = 0 + 0.5\,(1 + 0 - 0) = 0.5.

The key is: each transition entails only one value update, and the state whose value is updated is the state we just came from. All other state values remain the same.

We only update the most recent state we came from, not the ones before it. You might ask: shouldn't we update all the earlier states too, since the reward we see now should also affect their values? (Equivalent question: why is it unnecessary to update all state values on one transition?)

That is true in principle, but we don't have to do it. First, it can be computationally prohibitive when the episode is very long and there are too many states to update. Second, with enough samples it's unnecessary: if we want to update state A, we will eventually see A again. In the sequence below, we don't see state A in the middle of the sequence, which does mean A is not updated during that stretch; but B and C have their values updated in the meantime, and those changes are then reflected into A once we encounter A again later.

So we don't actually have to update every state in the history, because that history is being recorded into some state's value at each transition, and those values will eventually share their information with one another.



Here the white circles are max nodes, black circles are chance nodes. And greens are terminals.

In this diagram, we see that dynamic programming is a sort of exhaustive search: every state value is computed using all of its successors' values, and we maximize over all actions.

For Monte Carlo, we assume some episodic structure, so it uses the rewards seen along a complete path from start to finish. The value estimate for state s_t depends on all the rewards and observations along that entire path, and we repeat this over many episodes to get a good estimate.

In Temporal difference, we don't have to worry about having terminal states at all. We can compute and update the value of s_t as soon as we get to s_{t+1}.

These are three different approaches, and again, in the limit of many iterations for dynamic programming, or many samples for MC or TD, the value estimates eventually converge to the right ones. Once we have good value estimates, we can either judge how good the given policy is or go ahead and start improving the policy, which is what we'll talk about next.


When we are not given a specific fixed policy, we follow an epsilon-greedy policy.
Here |A| refers to the number of actions available in state s.
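As a sketch (the slide's pseudocode isn't reproduced here), epsilon-greedy selection over a Q table might look like the following; the dict-based Q table keyed by (state, action) is an assumption for the example.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Pick an action for state s: with probability epsilon explore
    (uniformly random), otherwise exploit (an action maximizing Q(s, a)).
    Sampling uniformly on exploration gives every action an extra
    epsilon/|A| probability, which is where the |A| in the formula comes in."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit

# Hypothetical example usage:
Q = {("A", "L"): 1.2, ("A", "R"): -0.5}
print(epsilon_greedy(Q, "A", ["L", "R"], epsilon=0.1))  # usually "L"
```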

What does Q refer to here?

Remember that in reinforcement learning, we don't have (and don't use) the transition and reward functions!

In this case, we cannot extract an improved policy from the Bellman equation over state values.

So what we consider now are not the state values (the values on the max nodes) but the values on the chance nodes (which we can compute with the same MC or TD machinery we just reviewed). We define a function that assigns values to these chance nodes; it takes into account both a state and an action, and we call it the Q function. Now we need to learn a Q function in order to extract the policy.

Once we have learned Q, extracting a policy is easy: we just take the argmax of the Q function to get the policy. We don't have to worry about the transition model anymore, because we only go one layer down (from state s to a chance node). In contrast, the first expression requires going two layers down (from s to s'), which we can't do because, again, we don't have the transition model.
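(In symbols, reconstructed here since the slide's expressions are not shown: extracting a greedy policy from state values needs the model, while extracting it from Q values does not.)

\pi(s) = \arg\max_a \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V(s')] \quad \text{(needs T and R)}

\pi(s) = \arg\max_a Q(s,a) \quad \text{(model-free)}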

Q values also satisfy a Bellman relationship (shown above), but we are not learning through it. We don't know the transition model, so we never actually evaluate that relationship; still, once we have learned the optimal Q values, they do satisfy this equation.
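(For reference, the Bellman optimality relationship for Q values has the form below; again, we never evaluate it directly because T and R are unknown.)

Q^*(s,a) = \sum_{s'} T(s,a,s')\,\big[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\big]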

The Q(s,a) function breaks the value of a state down by action, so it gives a more fine-grained picture of a state and its potential actions. (Here a is the action we took as a result of the epsilon-greedy policy.)

State values are constantly changing, so how can we use past information to judge the future?

Again, the idea is that everything gets updated eventually. While you don't know the future, you take the action, move on to other states, and update their values. When you come back to this state later, you will have updated information: the other states will have been updated enough that you can incorporate them into the current state's value.

Compared with the V function, the Q function breaks the utility down by both state and action: we keep track not only of the value of a state but also of the value of starting from that state and taking a particular action.

In this grid world, the V function has 3 values: V(A), V(B), V(C). The Q function has more: Q(A,L), Q(A,R), Q(B,L), Q(B,R), Q(C,L), Q(C,R).

Each state has one Q value per action, and the V value for a state is the highest Q value for that state.


Here the behavior policy is not given; we may generate it with epsilon-greedy, so such a policy is not fixed, and it changes as the Q values are updated over time.

We generate the sequence by starting from state s, taking an action a, and observing the reward and the next state s'. In the SARSA method we also generate the next action a' using the behavior policy (e.g. epsilon-greedy).

That gives us all the information we need to perform the update. Then we move to the next state, take that specific action a', and keep moving forward.
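As a minimal sketch (not the slide's pseudocode), one SARSA update could look like this, assuming a dict-based Q table and illustrative values for alpha and gamma:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """One SARSA update. The target uses a_next, the action that will
    actually be taken in s_next by the behavior policy (e.g. epsilon-greedy)."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Hypothetical example: observed (s=A, a=L, r=+3, s'=A) and the policy
# has already chosen a'=L for the next step.
Q = {("A", "L"): 0.0, ("A", "R"): 0.0}
sarsa_update(Q, "A", "L", +3, "A", "L")
print(Q[("A", "L")])  # 1.5 with alpha=0.5
```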

Q-learning is pretty much the same as SARSA; the only difference is which action a' we use in the update.

In this method, a' is taken to be the best action available in the successor state (the greedy one): we take the max over all actions of the Q values for s'.

Note that this greedy action may not be the action actually taken in state s': when we continue and loop back to the first line, we still choose the next action with the behavior policy, so it could be an exploring (random) action or an exploiting (greedy) action. The greedy action here is used only for updating the Q values in Q-learning.
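For comparison with the SARSA sketch above, here is the corresponding Q-learning update (again a sketch with an assumed dict-based Q table); the only change is the max over the successor state's Q values:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning update. The target uses the greedy (max) Q value at
    s_next, regardless of which action the behavior policy actually takes there."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Hypothetical example: same observed transition as the SARSA sketch.
Q = {("A", "L"): 0.0, ("A", "R"): 0.0}
q_learning_update(Q, "A", "L", +3, "A", ["L", "R"])
print(Q[("A", "L")])  # 1.5 with alpha=0.5
```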

So Q-learning is somewhat optimistic, because it assumes the best action is taken even when that's not the case, whereas SARSA is more realistic, because it uses the action that will actually be taken in s'.


Here \max_a Q(A,a) = Q(A,L).

If the action space is huge, SARSA is more efficient than Q-learning because it never has to enumerate all the actions to find the one with the maximum Q value.

SARSA produces the safer path and higher rewards on average because it is realistic: the Q value it uses at the successor state corresponds to the action actually taken, which could include falling off the cliff. The agent therefore learns that state-action pairs (s,a) that lead it to or near the cliff are bad, so it avoids the cliff as much as possible, resulting in a safer path away from the cliff.

Q-learning produces the shorter path but lower rewards on average because it is optimistic. Q-learning assumes it will never take a bad action: once the agent knows the cliff is bad, then even when it is right next to the cliff, Q-learning assumes its next action won't be falling off (even though the cliff is there, it assumes it would never fall). But that greedy action is only used for updating the Q values; the agent can still take a step that ends in a fall because of the exploration probability. Q-learning accepts that we might sometimes fall off the cliff, and that's fine from its perspective because the path it learns is optimal (the shortest path).

That brings us to the question: which method should we choose for training a real system?

Let's say we're training a robot.

If the robot is expensive, we don't want it to fall, so SARSA would be the better approach. We stay conservative with the policy; we don't necessarily learn the best policy, but a good one is enough. The point is to minimize losses during exploration.

If we're training the robot in simulation, Q-learning could be a good approach: even though the average reward during training is lower, we're not acting in the real world, so the losses don't really matter in real life. Q-learning is good when punishments and costs don't affect you all that much.
