Model Free Prediction and Control
Doing things without knowing the model.
Prediction in this context simply means predicting the state or state-action value function. It is the same as policy evaluation, where policy evaluation simply means "evaluating" the policy, i.e. finding the value function under the policy.
In Monte Carlo (MC) learning, to find the values of states, we run episodes under the policy and average the returns observed from those episodes for each state. The return from time t is G_t = R_{t+1} + γR_{t+2} + ... + γ^(T−1)R_T. Hence V(s) = average of all the G_t generated from episodes in which s was visited.
Monte Carlo learning is unbiased with high variance (opposite to TD learning, which is biased with low variance)
MC policy evaluation uses empirical mean return instead of expected return
Two types: First-Visit and Every-Visit
To evaluate state s
The first time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + G_t
Value is estimated by mean return V(s) = S(s)/N(s)
By law of large numbers, V(s) → v_π(s) as N(s) → ∞
To evaluate state s
Every time-step t that state s is visited in an episode,
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + G_t
Value is estimated by mean return V(s) = S(s)/N(s)
Again, V(s) → v_π(s) as N(s) → ∞
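A minimal sketch of every-visit MC evaluation under assumed placeholder interfaces: `env.reset()` returns a state, `env.step(action)` returns `(next_state, reward, done)`, and `policy(state)` returns an action; none of these names come from a specific library.

```python
from collections import defaultdict

def mc_evaluate(env, policy, num_episodes=1000, gamma=1.0):
    """Every-visit Monte Carlo policy evaluation (sketch)."""
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s)

    for _ in range(num_episodes):
        # Generate one full episode under the policy.
        episode = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk backwards to compute the return G_t for every visited state.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            # Every-visit: update on every occurrence of the state.
            # (First-visit MC would only update the first occurrence per episode.)
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]   # V(s) = S(s) / N(s)
    return V
```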
Update V(s) incrementally after each episode S_1, A_1, R_2, ..., S_T, e.g. with a running mean: V(S_t) ← V(S_t) + α (G_t − V(S_t))
Simplest TD learning algorithm, TD(0): update V(S_t) towards the estimated return R_{t+1} + γV(S_{t+1}) instead of the actual return G_t, i.e. V(S_t) ← V(S_t) + α (R_{t+1} + γV(S_{t+1}) − V(S_t))
This approximation is essentially the one-step application of the Bellman operator on the estimate of V. This idea is called bootstrapping and is a cornerstone of temporal difference learning.
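A comparable sketch of TD(0), under the same assumed `env`/`policy` interface; note that the update happens after every step rather than at the end of the episode.

```python
from collections import defaultdict

def td0_evaluate(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """TD(0) policy evaluation (sketch): bootstrap from V(S_{t+1}) after every step."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # The TD target bootstraps from the current estimate of the next state's value.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])   # move V(s) towards the TD target
            state = next_state
    return V
```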
TD can learn online after every step. MC must wait for the episode to complete before return is known.
TD works in non-terminating (continuing) environments. MC works only in terminating (episodic) environments.
The return depends on many stochastic actions, transitions, and rewards.
The TD target depends on only one stochastic action, transition, and reward.
MC and TD converge: V(s) → v_π(s) as experience → ∞
Batch mode: we only have a fixed batch of k episodes.
Repeatedly sample an episode from the batch and apply MC or TD(0) to it.
MC converges to the solution with minimum mean-squared error, i.e. the best fit to the observed returns.
TD(0) converges to the solution of the maximum-likelihood Markov model, i.e. the MDP that best fits the data.
TD exploits Markov property
MC does not exploit the Markov property
Bootstrapping: the update involves an estimate
MC does not bootstrap
DP bootstraps
TD bootstraps
Sampling: the update samples an expectation
MC samples
DP does not sample
TD samples
The TD target can also use n-step returns: n = 1 is TD(0), n = 2 uses two rewards before bootstrapping, and so on, up to n = ∞, which recovers MC.
Update the value function towards the λ-return.
Like MC, this (the forward view) also needs complete episodes, since calculating the λ-return requires the full return, just like an MC update.
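For reference, the standard definitions of the n-step return and the λ-return (forward view) being referred to here are:

```latex
% n-step return: take n real rewards, then bootstrap from the value estimate
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})

% lambda-return: geometrically weighted average of all n-step returns
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
```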
Control in this context means how to control the agent, i.e. deciding what actions to take in the environment (finding a good policy).
In off-policy methods, the agent learns the value function of a target policy (typically the optimal policy) while acting according to a different behaviour policy. In on-policy methods, the agent learns the value function of the same policy that dictates its behaviour.
This allows:
Learn from observing humans or other agents
Learn about optimal policy while following exploratory policy
Learn about multiple policies while following one policy
Estimate the expectation of a different distribution (importance sampling, sketched below)
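That last point is the idea behind importance sampling: an expectation under the target distribution P can be estimated from samples drawn from the behaviour distribution Q by reweighting each sample with the ratio of the two distributions.

```latex
\mathbb{E}_{X \sim P}\!\left[ f(X) \right]
  = \sum_{x} P(x) f(x)
  = \sum_{x} Q(x) \frac{P(x)}{Q(x)} f(x)
  = \mathbb{E}_{X \sim Q}\!\left[ \frac{P(X)}{Q(X)} f(X) \right]
```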
Q-learning is a model-free algorithm, i.e. it focuses on learning the action-value function and uses these estimates to derive an optimal policy. Its update rule is Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a)), where:
- Q(s, a) is the estimate for the action-value function of the pair (s, a),
- α ∈ [0, 1] is the learning rate that determines how "quickly" the new information overrides the old information,
- r + γ max_{a'} Q(s', a') is the learned value, composed of the immediate one-step reward and the estimate of the optimal future value.
The idea behind this algorithm is to transform the formula for Q∗ given by the Bellman optimality equation into an iterative approximation procedure. At each step, we update the current estimate of Q∗ using estimates of the optimal future value.
The update above comes from Bellman's action-value optimality equation: Q∗(s, a) = E[r + γ max_{a'} Q∗(s', a')].
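A minimal tabular sketch of this update rule, under the same assumed `env` interface as the earlier sketches; `behaviour_policy` is a placeholder for any exploratory action-selection rule (such as the ε-greedy rule discussed next).

```python
from collections import defaultdict

def q_learning(env, behaviour_policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular Q-learning (sketch). `behaviour_policy(Q, state)` picks the action taken."""
    Q = defaultdict(lambda: defaultdict(float))   # Q[s][a], default 0
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = behaviour_policy(Q, state)
            next_state, reward, done = env.step(action)
            # Bootstrap from max_a' Q(s', a'), regardless of which action is taken
            # next -- this is what makes Q-learning off-policy.
            best_next = 0.0 if done else max(Q[next_state].values(), default=0.0)
            # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
    return Q
```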
On the one hand, exploitation denotes the behaviour of an agent that selects an action because it is the "best" one, i.e. the one supposed to yield the highest reward. This action is dictated by the optimal, or greedy, policy. On the other hand, exploration is when all actions have the same probability of being chosen by the agent: it literally explores all its possibilities. It is easy to see that using exclusively one of these behaviours is not efficient.
Only exploration means that, at some point, the agent has correct estimates for all state-action pairs but never uses this knowledge to maximize its rewards. It kind of defeats the whole point of learning. Without exploration, the agent can achieve an optimal reward only if it already has correct state-action estimates. However, such estimates can only be achieved through exploration. For example, if we initialize all the estimates to 0, a pure-exploitation agent will choose one action and stick to this action if the reward is positive. If the reward is negative, it proceeds with another action and eventually ends up in the same situation. Imagine a system where all the rewards are positive. The first (randomly) chosen action gives the smallest possible reward; when in this state, the agent always chooses this action because its estimate is greater than the others (0). In the end, the agent is very far from the maximum reward it could have obtained if its estimates were more accurate.
In Watkins and Dayan (1992), it is proven that the Q-values converge to the optimal Q∗ if two conditions are met. One of them states that every state-action pair has to be visited infinitely often. Consequently, the agent's policy needs to respect this condition. For this, we could simply use a policy where the actions are always chosen randomly with a uniform distribution over the action space. However, the global performance of the agent during the learning phase would then be poor; we want it to maximize its return. These two opposing behaviours are called exploration and exploitation. Hence, to learn a good policy using Q-learning, we have to maintain a balance between exploitation and exploration.
The optimal action a∗ is selected with a probability 1 − ϵ and a random action with a probability ϵ.
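A sketch of that ε-greedy rule, written so it could be plugged into the Q-learning sketch above (the nested `Q[s][a]` dictionary layout and the `actions` list are assumptions of these sketches, not part of any library):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon choose a random action, otherwise the greedy one."""
    if random.random() < epsilon or not Q[state]:
        return random.choice(actions)          # explore
    return max(Q[state], key=Q[state].get)     # exploit: argmax_a Q(s, a)
```

It could be passed to the earlier sketch as, e.g., `behaviour_policy=lambda Q, s: epsilon_greedy(Q, s, actions=[0, 1, 2, 3])`.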
DQNs are deep learning models that combine deep convolutional neural networks with the Q-learning algorithm. An important contribution made by DQNs is the use of experience replay and a target network to stabilize the training of the Q action-value function approximation with deep neural networks. Q-network thus refers to a neural network function approximator of the Q-function: Q(s, a; θ) ≈ Q∗(s, a).
Catastrophic forgetting or catastrophic interference (McCloskey and Cohen (1989), Ratcliff (1990)) denotes the tendency of neural networks to suddenly forget information that they previously learned.
Experience replay (Lin, 1992) consists of maintaining a buffer of old experiences and training the network against them.
During the training, we use a buffer where previous experiences are stored. An experience consists of an observed state and action pair, the immediate reward obtained and the next state observed. When the buffer is full, new experiences usually replace the oldest ones in the buffer.
To train the network, we sample a batch of experiences from the buffer and use them to apply the usual backpropagation algorithm.
The advantage of this approach is that, by sampling these batches of experiences, we break the correlation between data that we get when the network is trained in the usual online fashion.
The deep Q-learning algorithm uses experience replay. An agent's experience at a time step t is denoted by e_t and is a tuple (s_t, a_t, r_t, s_{t+1}) consisting of the current state s_t, the chosen action a_t, the reward r_t and the next state s_{t+1}. The experiences for all time steps are stored in a replay memory, over many episodes. We then apply minibatch updates to samples of experiences.
Since the network serves as an approximator for the Q-function, each output neuron of this network corresponds to one valid action, and every action is mapped to an output neuron. Thus, after a feedforward pass of that network, the outputs are the estimated Q-values of the state-action pair defined by the input and the output’s corresponding action.
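A minimal sketch of such a replay buffer; the capacity, the tuple layout, and the class/method names are arbitrary choices for illustration, not a specific library's API.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s_t, a_t, r_t, s_{t+1}, done) experience tuples."""

    def __init__(self, capacity=100_000):
        # A deque with maxlen automatically drops the oldest experience when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of purely online data.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Each training step then samples a minibatch from the buffer, computes targets of the form r + γ max_{a'} Q(s', a'; θ⁻) using the target network's weights θ⁻, and backpropagates the resulting loss through the online Q-network.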
Incremental MC update: for each state S_t with return G_t, increment N(S_t) ← N(S_t) + 1 and update V(S_t) ← V(S_t) + (1/N(S_t)) (G_t − V(S_t)).
In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes: V(S_t) ← V(S_t) + α (G_t − V(S_t)).
In TD(0), update the value V(S_t) towards the estimated return R_{t+1} + γV(S_{t+1}) instead of the actual return G_t as in Monte Carlo: V(S_t) ← V(S_t) + α (R_{t+1} + γV(S_{t+1}) − V(S_t)).
R_{t+1} + γV(S_{t+1}) is called the TD target.
And δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t) is called the TD error.
The return G_t is an unbiased estimate of v_π(S_t).
The TD target is a biased estimate of v_π(S_t), because the return G_t is empirical while the TD target is just a guess based on the current value function.
The TD target has lower variance compared to the return G_t, since it depends on only one random action, transition, and reward.
Repeatedly sample an episode from the batch and apply MC or TD(0) to it.
MC finds the best fit to the observed returns (minimum mean-squared error).
TD(0) finds the solution to the MDP that best fits the data (the maximum-likelihood Markov model).
n = 1: G_t^(1) = R_{t+1} + γV(S_{t+1}) (TD(0))
n = 2: G_t^(2) = R_{t+1} + γR_{t+2} + γ²V(S_{t+2})
n = ∞: G_t^(∞) = R_{t+1} + γR_{t+2} + ... + γ^(T−1)R_T (MC)
The λ-return G_t^λ combines all the n-step returns G_t^(n), weighting each by (1 − λ)λ^(n−1).
Here λ = 0 gives TD(0), which is 1-step TD, and λ = 1 gives MC.
Policy evaluation: MC policy evaluation of the action-value function, Q ≈ q_π. Policy improvement: ε-greedy policy improvement.
Let there be m actions. We want to assign non-zero probability to every action, but the greedy action should be given more probability. With probability 1 − ε we choose the greedy action, and with probability ε we choose an action at random (uniformly over the m actions).
Every time-step:
- Policy evaluation: Sarsa, Q(S, A) ← Q(S, A) + α (R + γQ(S', A') − Q(S, A)) (see the sketch below)
- Policy improvement: ε-greedy policy improvement
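A tabular Sarsa sketch under the same assumed interfaces as before; unlike Q-learning, it bootstraps from Q(S', A') for the action A' actually chosen by the (ε-greedy) behaviour, which is what makes it on-policy.

```python
from collections import defaultdict

def sarsa(env, select_action, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular Sarsa (sketch). `select_action(Q, state)` is e.g. the epsilon-greedy rule."""
    Q = defaultdict(lambda: defaultdict(float))   # Q[s][a], default 0
    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = select_action(Q, state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = None if done else select_action(Q, next_state)
            # Bootstrap from the action we will actually take next (on-policy).
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```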
n = 1: q_t^(1) = R_{t+1} + γQ(S_{t+1}, A_{t+1}) (Sarsa)
n = 2: q_t^(2) = R_{t+1} + γR_{t+2} + γ²Q(S_{t+2}, A_{t+2})
n = ∞: q_t^(∞) = R_{t+1} + γR_{t+2} + ... + γ^(T−1)R_T (MC)
In Sarsa(λ), Q(s, a) is updated for every state s and action a, in proportion to the TD error δ_t and the eligibility trace E_t(s, a): Q(s, a) ← Q(s, a) + α δ_t E_t(s, a).
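A sketch of one episode of Sarsa(λ) with accumulating eligibility traces, following the description above; `Q` is assumed to be the nested dictionary from the previous sketches and `select_action` the ε-greedy rule.

```python
from collections import defaultdict

def sarsa_lambda_episode(env, Q, select_action, alpha=0.1, gamma=0.99, lam=0.9):
    """Run one episode of Sarsa(lambda) with accumulating traces, updating Q in place."""
    E = defaultdict(float)                     # eligibility traces E(s, a)
    state, done = env.reset(), False
    action = select_action(Q, state)
    while not done:
        next_state, reward, done = env.step(action)
        next_action = None if done else select_action(Q, next_state)
        # TD error for the transition just experienced.
        next_q = 0.0 if done else Q[next_state][next_action]
        delta = reward + gamma * next_q - Q[state][action]
        E[(state, action)] += 1.0              # accumulate the trace for (s, a)
        # Every (s, a) pair is updated in proportion to delta and its trace.
        for (s, a), e in E.items():
            Q[s][a] += alpha * delta * e
            E[(s, a)] = gamma * lam * e        # decay all traces
        state, action = next_state, next_action
    return Q
```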
Evaluate the target policy π (generally the optimal policy) to compute v_π(s) or q_π(s, a), while following a behaviour policy μ to take actions in the environment when interacting.
In the Q-network Q(s, a; θ), θ denotes the weights of the network.