Value Function Approximation
So far we have represented the value function using a look-up table:
- Every state $s$ has an entry $V(s)$
- Every state-action pair $(s, a)$ has an entry $Q(s, a)$
With a large number of states it is not feasible to store a separate $V(s)$ entry for every state. To solve this we use a value function approximator: a function that maps a state to its value, i.e.

$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$$
Here $v_\pi(s)$ is the true target value and $\hat{v}(s, \mathbf{w})$ is the value approximation function with parameter vector $\mathbf{w}$. We update $\mathbf{w}$ by gradient descent in order to minimize the mean-squared error

$$J(\mathbf{w}) = \mathbb{E}_\pi\!\left[\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)^2\right]$$

Hence the stochastic gradient update is

$$\Delta \mathbf{w} = \alpha \big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)\, \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})$$
The problem is that in reinforcement learning we do not know the target value function $v_\pi$ beforehand. We will see below how to handle this problem.
Feature Vector: We represent a state using a feature vector:

$$\mathbf{x}(S) = \begin{pmatrix} x_1(S) \\ \vdots \\ x_n(S) \end{pmatrix}$$
Using this state feature vector and the weights, the linear approximation is:

$$\hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^{\top} \mathbf{w} = \sum_{j=1}^{n} x_j(S)\, w_j$$
Hence in this case $\nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w}) = \mathbf{x}(S)$ and the update becomes:

$$\Delta \mathbf{w} = \alpha \big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)\, \mathbf{x}(S)$$
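As a concrete illustration, here is a minimal sketch of the linear approximator and its gradient step in NumPy; the feature vector `x` and the target value are placeholders, since in practice the target has to come from one of the substitutes discussed below.

```python
import numpy as np

def v_hat(x, w):
    """Linear value estimate: v_hat(S, w) = x(S)^T w."""
    return x @ w

def sgd_update(w, x, target, alpha=0.01):
    """One gradient step toward a given target (v_pi(S) if it were known).

    For a linear approximator the gradient of v_hat w.r.t. w is just x,
    so delta_w = alpha * (target - v_hat) * x.
    """
    return w + alpha * (target - v_hat(x, w)) * x

# Example with 4 features and a made-up feature vector / target
w = np.zeros(4)
x = np.array([1.0, 0.5, 0.0, 2.0])
w = sgd_update(w, x, target=3.0)
```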
Table Lookup Features: Table lookup is a special case of linear function approximation. Here the feature vector has one indicator feature per state:

$$\mathbf{x}^{\text{table}}(S) = \begin{pmatrix} \mathbf{1}(S = s_1) \\ \vdots \\ \mathbf{1}(S = s_n) \end{pmatrix}$$

so that $\hat{v}(S, \mathbf{w}) = \mathbf{x}^{\text{table}}(S)^{\top} \mathbf{w}$ reduces to one weight per state, i.e. the usual table entry.
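A minimal sketch of such indicator features, assuming states are simply indexed $0, \dots, n-1$ (a hypothetical encoding):

```python
import numpy as np

def table_lookup_features(state_index, n_states):
    """One indicator feature per state: x_i(S) = 1 if S is state i, else 0."""
    x = np.zeros(n_states)
    x[state_index] = 1.0
    return x

# With these features, x(S)^T w simply reads out w[state_index],
# so each weight plays the role of a table entry V(s).
```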
Since $v_\pi(S_t)$ is unknown, Monte-Carlo uses the return $G_t$ as the target. Hence linear Monte-Carlo policy evaluation updates:

$$\Delta \mathbf{w} = \alpha \big(G_t - \hat{v}(S_t, \mathbf{w})\big)\, \mathbf{x}(S_t)$$
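A minimal sketch of this update, assuming a hypothetical `run_episode(policy)` helper that returns a list of `(state_features, reward)` pairs for one episode:

```python
import numpy as np

def mc_evaluate(run_episode, policy, n_features, n_episodes=1000,
                alpha=0.01, gamma=0.99):
    """Linear MC policy evaluation: use the return G_t as the target for v_pi(S_t)."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        episode = run_episode(policy)        # [(x_t, r_{t+1}), ...]
        G = 0.0
        # Walk backwards so G accumulates the discounted future rewards.
        for x, r in reversed(episode):
            G = r + gamma * G
            w += alpha * (G - x @ w) * x     # delta_w = alpha (G_t - v_hat) x(S_t)
    return w
```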
Monte-Carlo evaluation converges to a local optimum, even with non-linear value function approximation.
Linear TD(0), which uses the TD target $R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w})$ in place of the return, converges (close) to the global optimum.
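A minimal sketch of a single linear TD(0) step under the same assumptions (feature vectors computed elsewhere; `done` flags a terminal transition):

```python
import numpy as np

def td0_update(w, x, r, x_next, alpha=0.01, gamma=0.99, done=False):
    """One linear TD(0) step: target is R_{t+1} + gamma * v_hat(S_{t+1}, w)."""
    v = x @ w
    v_next = 0.0 if done else x_next @ w     # bootstrap from the next state
    td_error = r + gamma * v_next - v
    return w + alpha * td_error * x          # delta_w = alpha * delta_t * x(S_t)
```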
Forward view linear TD($\lambda$) uses the $\lambda$-return $G_t^{\lambda}$ as the target:

$$\Delta \mathbf{w} = \alpha \big(G_t^{\lambda} - \hat{v}(S_t, \mathbf{w})\big)\, \mathbf{x}(S_t)$$
Backward view linear TD($\lambda$) uses eligibility traces over the features:

$$\delta_t = R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$$

$$E_t = \gamma \lambda E_{t-1} + \mathbf{x}(S_t)$$

$$\Delta \mathbf{w} = \alpha\, \delta_t\, E_t$$
Forward and backward view linear TD($\lambda$) are equivalent.
Here $\delta_t$ is the TD-error and $E_t$ is the eligibility trace accumulated over the feature vectors.
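A minimal sketch of one backward-view step; the trace `e` is assumed to be reset to zeros at the start of each episode:

```python
import numpy as np

def td_lambda_update(w, e, x, r, x_next, alpha=0.01, gamma=0.99, lam=0.9, done=False):
    """One backward-view linear TD(lambda) step.

    delta_t = R + gamma * v_hat(S') - v_hat(S)
    E_t     = gamma * lambda * E_{t-1} + x(S)
    w      += alpha * delta_t * E_t
    """
    v_next = 0.0 if done else x_next @ w
    delta = r + gamma * v_next - x @ w
    e = gamma * lam * e + x
    w = w + alpha * delta * e
    return w, e
```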
The same target substitution applies to both the forward and backward view of TD($\lambda$).
This converges to the least-squares solution.
Approximate action-value function: For control we approximate the action-value function,

$$\hat{q}(S, A, \mathbf{w}) \approx q_\pi(S, A)$$
Here we talk about the ways in which we can substitute the unknown value $q_\pi(S, A)$ with a target computed from experience and the current weights.
For Monte-Carlo, use the return $G_t$ calculated from the episodes in place of $q_\pi(S_t, A_t)$. We will have

$$\langle (S_1, A_1), G_1 \rangle, \langle (S_2, A_2), G_2 \rangle, \ldots, \langle (S_T, A_T), G_T \rangle$$

which will be our training data.
The TD-target $R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$ is a biased sample of the true value $q_\pi(S_t, A_t)$.
Supervised learning can still be applied using the training data:

$$\langle (S_1, A_1),\; R_2 + \gamma\, \hat{q}(S_2, A_2, \mathbf{w}) \rangle, \ldots, \langle (S_{T-1}, A_{T-1}),\; R_T \rangle$$
The $\lambda$-return $q_t^{\lambda}$ is also a biased sample of the true value $q_\pi(S_t, A_t)$.
Apply supervised learning with the following training data:

$$\langle (S_1, A_1), q_1^{\lambda} \rangle, \langle (S_2, A_2), q_2^{\lambda} \rangle, \ldots, \langle (S_{T-1}, A_{T-1}), q_{T-1}^{\lambda} \rangle$$
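A minimal sketch of how such $\lambda$-return targets could be computed from one episode, using the standard recursion $q_t^{\lambda} = R_{t+1} + \gamma\big[(1-\lambda)\,\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) + \lambda\, q_{t+1}^{\lambda}\big]$; the episode format `(state_action_features, reward)` is an assumption of this sketch:

```python
import numpy as np

def lambda_return_targets(episode, w, gamma=0.99, lam=0.9):
    """Compute lambda-return targets for each (state, action) in an episode.

    episode: list of (x_sa, reward) pairs, where x_sa are the features of (S_t, A_t)
             and reward is R_{t+1}; the episode is assumed to end in a terminal state.
    Returns a list of (x_sa, q_lambda) training pairs.
    """
    targets = []
    g_lambda = 0.0                  # value after the terminal state is 0
    q_next = 0.0
    for x_sa, r in reversed(episode):
        g_lambda = r + gamma * ((1 - lam) * q_next + lam * g_lambda)
        targets.append((x_sa, g_lambda))
        q_next = x_sa @ w           # bootstrap estimate q_hat(S_t, A_t, w) for the previous step
    targets.reverse()
    return targets
```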
Control then alternates between:
- Policy evaluation: approximate policy evaluation, $\hat{q}(\cdot, \cdot, \mathbf{w}) \approx q_\pi$
- Policy improvement: $\epsilon$-greedy policy improvement
As before, we have to substitute a target for $q_\pi(S, A)$:
MC: the target is the return $G_t$.
TD(0): the target is the TD target $R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$.
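Putting the pieces together, a minimal sketch of this control loop (essentially SARSA with a linear approximator and $\epsilon$-greedy improvement); the environment interface (`env_reset`, `env_step`) and the `features(state, action)` function are hypothetical placeholders:

```python
import numpy as np

def epsilon_greedy(state, actions, w, features, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy action under q_hat."""
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    q_values = [features(state, a) @ w for a in actions]
    return actions[int(np.argmax(q_values))]

def sarsa_linear(env_reset, env_step, features, actions, n_features,
                 n_episodes=500, alpha=0.01, gamma=0.99, eps=0.1):
    """Approximate policy evaluation with the TD(0) target + epsilon-greedy improvement."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env_reset()
        a = epsilon_greedy(s, actions, w, features, eps)
        done = False
        while not done:
            s_next, r, done = env_step(a)
            x = features(s, a)
            if done:
                target = r                                   # terminal state has value 0
            else:
                a_next = epsilon_greedy(s_next, actions, w, features, eps)
                target = r + gamma * (features(s_next, a_next) @ w)
            w += alpha * (target - x @ w) * x                # linear gradient step on q_hat
            if not done:
                s, a = s_next, a_next
    return w
```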
Batch Methods: Given a value function approximation $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$ and experience $\mathcal{D}$ consisting of $\langle \text{state}, \text{value} \rangle$ pairs

$$\mathcal{D} = \{\langle s_1, v_1^{\pi} \rangle, \langle s_2, v_2^{\pi} \rangle, \ldots, \langle s_T, v_T^{\pi} \rangle\}$$

we want to learn the parameters $\mathbf{w}$ that give the best fitting value function.
Least-squares algorithms: find $\mathbf{w}$ minimizing the least-squares error between the approximate function and the target values:

$$LS(\mathbf{w}) = \sum_{t=1}^{T} \big(v_t^{\pi} - \hat{v}(s_t, \mathbf{w})\big)^2$$
SGD with experience replay: repeat:
- Sample a $\langle \text{state}, \text{value} \rangle$ pair from the experience: $\langle s, v^{\pi} \rangle \sim \mathcal{D}$
- Apply the SGD update: $\Delta \mathbf{w} = \alpha \big(v^{\pi} - \hat{v}(s, \mathbf{w})\big)\, \nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w})$
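A minimal sketch of SGD with experience replay, assuming the experience $\mathcal{D}$ is stored as a list of `(state_features, value_target)` pairs:

```python
import random
import numpy as np

def replay_sgd(experience, n_features, n_updates=10000, alpha=0.01):
    """Fit w by repeatedly sampling <state, value> pairs and taking SGD steps.

    experience: list of (x, v_target) pairs, where x are state features and
                v_target is the observed value (e.g. a Monte-Carlo return).
    """
    w = np.zeros(n_features)
    for _ in range(n_updates):
        x, v_target = random.choice(experience)      # <s, v_pi> ~ D
        w += alpha * (v_target - x @ w) * x          # delta_w = alpha (v - v_hat) x
    return w
```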
Least-squares control: minimize the least-squares error between $\hat{q}(s, a, \mathbf{w})$ and $q_\pi(s, a)$ from experience generated using policy $\pi$, consisting of $\langle (\text{state}, \text{action}), \text{value} \rangle$ pairs:

$$\mathcal{D} = \{\langle (s_1, a_1), v_1^{\pi} \rangle, \langle (s_2, a_2), v_2^{\pi} \rangle, \ldots, \langle (s_T, a_T), v_T^{\pi} \rangle\}$$
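For a linear $\hat{q}$ this fit can also be computed in closed form by ordinary least squares over the state-action features; a minimal sketch, assuming the experience has been stacked into a feature matrix and a target vector:

```python
import numpy as np

def least_squares_q(X, q_targets, reg=1e-6):
    """Closed-form linear least-squares fit of q_hat(s, a, w) = x(s, a)^T w.

    X:         (T, n_features) matrix whose rows are state-action features x(s_t, a_t)
    q_targets: (T,) vector of value targets for those pairs
    reg:       small ridge term to keep the normal equations well conditioned
    """
    A = X.T @ X + reg * np.eye(X.shape[1])
    b = X.T @ q_targets
    return np.linalg.solve(A, b)
```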