Value Function Approximation
So far we have represented the value function with a lookup table:
- Every state s has an entry V(s)
- Every state-action pair (s, a) has an entry Q(s, a)
With a large number of states it is not feasible to store a V(s) entry for every state. To solve this we use a value function approximator: a parameterized function that maps a state to its value, i.e. $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$ (and similarly $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$ for action values).
Stochastic Gradient Descent
Here $v_\pi(s)$ is the true target and $\hat{v}(s, \mathbf{w})$ is the value approximation function with parameters $\mathbf{w}$. We update $\mathbf{w}$ by gradient descent in order to minimise the mean-squared cost $J(\mathbf{w}) = \mathbb{E}_\pi\big[(v_\pi(S) - \hat{v}(S, \mathbf{w}))^2\big]$. Hence the stochastic gradient descent update is $\Delta\mathbf{w} = \alpha\,(v_\pi(S) - \hat{v}(S, \mathbf{w}))\,\nabla_\mathbf{w}\hat{v}(S, \mathbf{w})$.
The problem is that in reinforcement learning we do not know the true value $v_\pi(s)$ beforehand, so we cannot use it directly as the target. We will see below how to handle this.
Different Value Function Approximations
Linear Function Approximation
Feature Vector: We represent a state by a feature vector $x(S) = \big(x_1(S), \dots, x_n(S)\big)^\top$.
Using this feature vector and a weight vector $\mathbf{w}$, the linear approximation is $\hat{v}(S, \mathbf{w}) = x(S)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(S)\, w_j$.
Hence in this case the update becomes $\Delta\mathbf{w} = \alpha\,(v_\pi(S) - \hat{v}(S, \mathbf{w}))\, x(S)$, i.e. step-size × prediction error × feature vector.
Table Lookup Features: a special case of linear function approximation where the feature vector is a one-hot indicator over all states, $x^{table}(S) = \big(\mathbf{1}(S = s_1), \dots, \mathbf{1}(S = s_n)\big)^\top$, so that $\hat{v}(S, \mathbf{w})$ reduces to the individual table entry $w_S$. A sketch of this linear setup is given below.
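To make the linear case concrete, here is a minimal NumPy sketch of table-lookup (one-hot) features and a single SGD step of the update rule above. The state space size, step size and the target value used here are made-up illustrative choices, not anything prescribed by these notes.

```python
import numpy as np

n_states = 5  # small illustrative state space (assumption)

def table_lookup_features(s):
    """One-hot indicator features: x(S) has a 1 in the entry for state S."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def v_hat(s, w):
    """Linear value approximation: v_hat(S, w) = x(S)^T w."""
    return table_lookup_features(s) @ w

w = np.zeros(n_states)   # weight vector, one weight per feature
alpha = 0.1              # step size

# One SGD step towards a known target value (a made-up target of 1.0 for
# state 2, purely to illustrate the update rule).
s, target = 2, 1.0
w += alpha * (target - v_hat(s, w)) * table_lookup_features(s)
print(w)  # with one-hot features this just nudges the table entry for state 2
```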
Incremental Prediction Algorithms
Here we discuss the targets we can substitute for the unknown $v_\pi(S)$ in the gradient update, so that learning can proceed directly from experience.
Monte-Carlo with Value Function Approximation
Use the return $G_t$ calculated from complete episodes in place of $v_\pi(S_t)$. This gives pairs $\langle S_1, G_1\rangle, \langle S_2, G_2\rangle, \dots, \langle S_T, G_T\rangle$, which serve as our training data.
Hence the linear Monte-Carlo policy evaluation update is $\Delta\mathbf{w} = \alpha\,(G_t - \hat{v}(S_t, \mathbf{w}))\, x(S_t)$.
Monte-Carlo evaluation converges to a local optimum, even with non-linear value function approximation. A linear sketch is given below.
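As an illustration, below is a minimal NumPy sketch of linear Monte-Carlo policy evaluation on a small 5-state random walk. The environment, one-hot features, step size and episode count are assumptions made purely for this example.

```python
import numpy as np

n_states, gamma, alpha = 5, 1.0, 0.05

def features(s):
    return np.eye(n_states)[s]          # one-hot (table lookup) features

def sample_episode(rng):
    """Uniform random walk on states 0..4; +1 reward only when exiting right."""
    s, traj = 2, []                      # always start in the middle state
    while True:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0
        traj.append((s, r))
        if s_next < 0 or s_next == n_states:
            return traj
        s = s_next

rng = np.random.default_rng(0)
w = np.zeros(n_states)
for _ in range(2000):
    episode = sample_episode(rng)
    G = 0.0
    # accumulate the return backwards, then move each visited state towards G_t
    for s, r in reversed(episode):
        G = r + gamma * G
        w += alpha * (G - features(s) @ w) * features(s)

print(np.round(w, 2))  # approximates v_pi for the uniform random policy
```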
TD Learning with Value Function Approximation
The TD target $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$.
Supervised learning can still be applied using the training data $\langle S_1, R_2 + \gamma\,\hat{v}(S_2, \mathbf{w})\rangle, \langle S_2, R_3 + \gamma\,\hat{v}(S_3, \mathbf{w})\rangle, \dots, \langle S_{T-1}, R_T\rangle$, giving the linear TD(0) update $\Delta\mathbf{w} = \alpha\,(R + \gamma\,\hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w}))\, x(S)$.
Linear TD(0) converges (close) to the global optimum; see the sketch below.
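For comparison, here is the same illustrative random walk evaluated with linear TD(0), updating online after every step using the bootstrapped TD target. Again, the environment and hyperparameters are assumptions chosen for the example.

```python
import numpy as np

n_states, gamma, alpha = 5, 1.0, 0.05

def features(s):
    return np.eye(n_states)[s]          # one-hot (table lookup) features

rng = np.random.default_rng(0)
w = np.zeros(n_states)
for _ in range(2000):
    s = 2                                # start in the middle state
    while True:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0
        terminal = s_next < 0 or s_next == n_states
        # bootstrap: the target uses the current estimate of the next state
        target = r if terminal else r + gamma * features(s_next) @ w
        w += alpha * (target - features(s) @ w) * features(s)
        if terminal:
            break
        s = s_next

print(np.round(w, 2))  # approximates the same v_pi as the MC example
```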
TD(λ) with Value Function Approximation
The λ-return $G_t^\lambda$ is also a biased sample of the true value $v_\pi(S_t)$.
Apply supervised learning to the training data $\langle S_1, G_1^\lambda\rangle, \langle S_2, G_2^\lambda\rangle, \dots, \langle S_{T-1}, G_{T-1}^\lambda\rangle$.
Forward view linear TD(λ): $\Delta\mathbf{w} = \alpha\,(G_t^\lambda - \hat{v}(S_t, \mathbf{w}))\, x(S_t)$.
Backward view linear TD(λ): keep an eligibility trace $E_t = \gamma\lambda E_{t-1} + x(S_t)$, compute the TD error $\delta_t = R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$, and update $\Delta\mathbf{w} = \alpha\,\delta_t\, E_t$.
Forward and backward view linear TD(λ) are equivalent; a one-step sketch of the backward view follows.
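Below is a minimal sketch of one backward-view TD(λ) update. The helper name `td_lambda_step` and its arguments (`features`, `gamma`, `lam`, `alpha`) are hypothetical choices for this example, assuming the linear setup from the previous sketches.

```python
import numpy as np

def td_lambda_step(w, E, s, r, s_next, terminal, features, gamma, lam, alpha):
    """One backward-view TD(lambda) update; returns the new (w, E)."""
    v_next = 0.0 if terminal else features(s_next) @ w
    delta = r + gamma * v_next - features(s) @ w   # TD error delta_t
    E = gamma * lam * E + features(s)              # eligibility trace update
    w = w + alpha * delta * E                      # update all traced features
    return w, E
```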
Incremental Control Algorithms
Policy evaluation: approximate policy evaluation, $\hat{q}(\cdot, \cdot, \mathbf{w}) \approx q_\pi$. Policy improvement: ε-greedy policy improvement.
Here we approximate the action-value function, $\hat{q}(S, A, \mathbf{w}) \approx q_\pi(S, A)$, minimising the mean-squared error $J(\mathbf{w}) = \mathbb{E}_\pi\big[(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}))^2\big]$ with the SGD update $\Delta\mathbf{w} = \alpha\,(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}))\,\nabla_\mathbf{w}\hat{q}(S, A, \mathbf{w})$.
Linear approximation: represent state-action pairs by a feature vector $x(S, A)$, so $\hat{q}(S, A, \mathbf{w}) = x(S, A)^\top \mathbf{w}$ and $\Delta\mathbf{w} = \alpha\,(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}))\, x(S, A)$.
Algorithms
As in prediction, we must substitute a target for the unknown $q_\pi(S, A)$:
MC: the target is the return $G_t$, giving $\Delta\mathbf{w} = \alpha\,(G_t - \hat{q}(S_t, A_t, \mathbf{w}))\,\nabla_\mathbf{w}\hat{q}(S_t, A_t, \mathbf{w})$.
TD(0): the target is the TD target $R_{t+1} + \gamma\,\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$, giving $\Delta\mathbf{w} = \alpha\,(R_{t+1} + \gamma\,\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}))\,\nabla_\mathbf{w}\hat{q}(S_t, A_t, \mathbf{w})$.
The same applies with the forward and backward view TD(λ) targets. A linear SARSA sketch follows below.
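As a sketch of incremental control with linear action-value approximation, here is an episodic SARSA loop with an ε-greedy policy. The `env` interface (`reset()`/`step(a)` returning `(state, reward, done)`) and the `state_action_features` function are hypothetical assumptions, not part of the original notes.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def sarsa_linear(env, state_action_features, n_actions, n_features,
                 episodes=500, alpha=0.05, gamma=0.99, epsilon=0.1, seed=0):
    """Episodic SARSA with q_hat(s, a, w) = x(s, a)^T w."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)
    q = lambda s, a: state_action_features(s, a) @ w
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy([q(s, b) for b in range(n_actions)], epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r
            else:
                a_next = epsilon_greedy([q(s_next, b) for b in range(n_actions)],
                                        epsilon, rng)
                target = r + gamma * q(s_next, a_next)   # SARSA (TD(0)) target
            w += alpha * (target - q(s, a)) * state_action_features(s, a)
            if not done:
                s, a = s_next, a_next
    return w
```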
Gradient Temporal-Difference Learning
Batch Methods
Least Squares Prediction
Given a value function approximation $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$ and experience $\mathcal{D}$ consisting of $\langle$state, value$\rangle$ pairs, $\mathcal{D} = \{\langle s_1, v_1^\pi\rangle, \langle s_2, v_2^\pi\rangle, \dots, \langle s_T, v_T^\pi\rangle\}$:
which parameters $\mathbf{w}$ give the best fitting value function?
Least squares algorithm: find $\mathbf{w}$ minimising the sum of squared errors between the approximate function and the target values, $LS(\mathbf{w}) = \sum_{t=1}^{T}\big(v_t^\pi - \hat{v}(s_t, \mathbf{w})\big)^2 = \mathbb{E}_{\mathcal{D}}\big[(v^\pi - \hat{v}(s, \mathbf{w}))^2\big]$; a closed-form sketch for the linear case is shown below.
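For linear features this objective can be minimised in closed form rather than by iteration. The sketch below solves it directly with `np.linalg.lstsq`; the feature matrix and value targets shown are made-up placeholders for illustration.

```python
import numpy as np

# Stack the feature vectors of the stored states into a matrix X and the
# stored value targets into a vector v, then solve X w = v in the least
# squares sense.
X = np.array([[1.0, 0.0],      # x(s_1)
              [0.0, 1.0],      # x(s_2)
              [1.0, 1.0]])     # x(s_3)
v = np.array([0.5, 1.0, 1.4])  # stored value targets v^pi for s_1..s_3

w, *_ = np.linalg.lstsq(X, v, rcond=None)
print(w)  # weights minimising sum_t (v_t^pi - x(s_t)^T w)^2
```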
SGD with experience replay
- Sample a $\langle$state, value$\rangle$ pair from the experience: $\langle s, v^\pi\rangle \sim \mathcal{D}$
- Apply the SGD update: $\Delta\mathbf{w} = \alpha\,(v^\pi - \hat{v}(s, \mathbf{w}))\,\nabla_\mathbf{w}\hat{v}(s, \mathbf{w})$
This converges to the least squares solution $\mathbf{w}^\pi = \arg\min_{\mathbf{w}} LS(\mathbf{w})$; a sketch is given below.
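A minimal sketch of SGD with experience replay for prediction, assuming the experience is stored as a list of (feature vector, value target) pairs and $\hat{v}$ is linear; the data shown is made up for illustration.

```python
import numpy as np

def replay_sgd(D, n_features, steps=10_000, alpha=0.01, seed=0):
    """Repeatedly sample a stored <state, value> pair and take one SGD step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)
    for _ in range(steps):
        x, v_target = D[rng.integers(len(D))]    # sample <s, v^pi> ~ D
        w += alpha * (v_target - x @ w) * x      # SGD step on that pair
    return w

# Example usage with made-up experience:
D = [(np.array([1.0, 0.0]), 0.5), (np.array([0.0, 1.0]), 1.0)]
print(replay_sgd(D, n_features=2))
```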
Experience Replay in Deep Q-Networks
Least Squares Control
Least Squares Action-Value Function Approximation
Approximate the action-value function: $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$.
Minimise the least squares error between $\hat{q}(s, a, \mathbf{w})$ and $q_\pi(s, a)$, using experience generated by policy $\pi$ and consisting of $\langle$(state, action), value$\rangle$ pairs: $\mathcal{D} = \{\langle (s_1, a_1), v_1^\pi\rangle, \dots, \langle (s_T, a_T), v_T^\pi\rangle\}$.