Value Function Approximation
So far we have represented the value function with a lookup table:
- Every state $s$ has an entry $V(s)$
- Every state-action pair $(s,a)$ has an entry $Q(s,a)$
When the number of states is large, it is impractical to keep a separate entry $V(s)$ for every state. To solve this we use a value function approximator: a function with parameters $\mathbf{w}$ that maps a state to an estimate of its value, i.e. $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$.
Stochastic Gradient Descent
$$J(\mathbf{w}) = \mathbb{E}_\pi\!\left[(v_\pi(S) - \hat{v}(S, \mathbf{w}))^2\right]$$
Here $v_\pi$ is the true target value function and $\hat{v}$ is the approximate value function with parameters $\mathbf{w}$. We update $\mathbf{w}$ by gradient descent to minimize the cost $J(\mathbf{w})$. Sampling the gradient gives the stochastic update
$$\Delta\mathbf{w} = \alpha\,(v_\pi(S) - \hat{v}(S, \mathbf{w}))\,\nabla_\mathbf{w}\hat{v}(S, \mathbf{w})$$
The problem is that in reinforcement learning we do not know the target value function $v_\pi(s)$ in advance. We will see below how to handle this.
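A minimal sketch of this update for a generic differentiable approximator; the callables `value_fn` and `grad_value_fn` and the supplied `target` are illustrative assumptions, since $v_\pi$ itself is unknown and must be replaced by one of the targets discussed below.

```python
def sgd_update(w, x, target, value_fn, grad_value_fn, alpha=0.01):
    """One stochastic gradient step of w towards a supplied target value.

    w             -- parameter vector of the approximator
    x             -- feature representation of the sampled state
    target        -- stand-in for v_pi(s) (e.g. a return or a TD target)
    value_fn      -- callable (x, w) -> scalar estimate v_hat(s, w)
    grad_value_fn -- callable (x, w) -> gradient of v_hat with respect to w
    """
    error = target - value_fn(x, w)                 # v_pi(s) - v_hat(s, w)
    return w + alpha * error * grad_value_fn(x, w)  # Delta w
```

For a linear approximator, `grad_value_fn` simply returns the feature vector, as in the next subsection.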
Different Value Function Approximators
Linear Function Approximation
Feature Vector: We represent a state by a feature vector
$$\mathbf{x}(s) = (x_1(s), x_2(s), \ldots, x_n(s))^\top$$
Using this feature vector and a weight vector $\mathbf{w}$, the linear approximation is:
$$\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(s)\, w_j$$
Hence in this case the update becomes:
$$\Delta\mathbf{w} = \alpha\,(v_\pi(s) - \hat{v}(s, \mathbf{w}))\,\mathbf{x}(s)$$
Table Lookup Features: This is a special case of linear function approximation in which the feature vector contains one indicator per state, $x_j(s) = \mathbf{1}(s = s_j)$, so each weight is simply the value estimate of the corresponding state.
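A minimal sketch of the linear case with table-lookup features; the function names are illustrative.

```python
import numpy as np

def one_hot_features(state, n_states):
    """Table-lookup features: x_j(s) = 1 if s is the j-th state, else 0."""
    x = np.zeros(n_states)
    x[state] = 1.0
    return x

def v_hat(x, w):
    """Linear value estimate v_hat(s, w) = x(s)^T w."""
    return x @ w

def linear_update(w, x, target, alpha=0.1):
    """Delta w = alpha * (target - v_hat(s, w)) * x(s); the gradient of x^T w is x."""
    return w + alpha * (target - v_hat(x, w)) * x
```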
Incremental Prediction Algorithms
Here we discuss the targets we can substitute for $v_\pi(s)$ in the update above.
Monte-Carlo with Value Function Approximation
Use the return $G_t$ calculated from complete episodes in place of $v_\pi(s)$. This gives the training data
$$\langle (S_1, G_1), (S_2, G_2), \ldots, (S_T, G_T)\rangle$$
Hence, for linear Monte-Carlo policy evaluation, the update is:
$$\Delta\mathbf{w} = \alpha\,(G_t - \hat{v}(S_t, \mathbf{w}))\,\mathbf{x}(S_t)$$
Monte-Carlo evaluation converges to a local optimum, even with non-linear value function approximation.
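A minimal sketch of Monte-Carlo prediction with table-lookup features; `run_episode` is a hypothetical function that returns the states visited and rewards received in one episode under $\pi$.

```python
import numpy as np

def mc_evaluate(run_episode, n_states, n_episodes=1000, alpha=0.05, gamma=1.0):
    """Monte-Carlo prediction with linear (table-lookup) value approximation."""
    w = np.zeros(n_states)
    for _ in range(n_episodes):
        states, rewards = run_episode()      # one episode generated under pi
        G, returns = 0.0, []
        for r in reversed(rewards):          # compute returns G_t backwards
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for s, G_t in zip(states, returns):
            x = np.eye(n_states)[s]          # one-hot table-lookup features x(S_t)
            w += alpha * (G_t - x @ w) * x   # Delta w = alpha (G_t - v_hat) x(S_t)
    return w
```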
TD Learning with Value Function Approximation
The TD target $R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$.
Supervised learning can be applied using the training data:
$$\langle (S_1, R_2 + \gamma\hat{v}(S_2, \mathbf{w})), (S_2, R_3 + \gamma\hat{v}(S_3, \mathbf{w})), \ldots, (S_{T-1}, R_T)\rangle$$
$$\Delta\mathbf{w} = \alpha\,(R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}))\,\mathbf{x}(S_t)$$
Linear TD(0) converges (close) to the global optimum.
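A minimal sketch of one semi-gradient TD(0) step with the same features; `policy` and `env_step` are assumed callables.

```python
import numpy as np

def td0_update(w, s, policy, env_step, n_states, alpha=0.05, gamma=0.99):
    """One semi-gradient TD(0) step with linear (table-lookup) features."""
    a = policy(s)
    r, s_next, done = env_step(s, a)
    x = np.eye(n_states)[s]                     # x(S_t)
    v_next = 0.0 if done else np.eye(n_states)[s_next] @ w
    td_target = r + gamma * v_next              # R_{t+1} + gamma * v_hat(S_{t+1}, w)
    w = w + alpha * (td_target - x @ w) * x     # Delta w
    return w, s_next, done
```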
TD(λ) with Value Function Approximation
The λ-return $G_t^\lambda$ is also a biased sample of the true value $v_\pi(S_t)$.
Apply supervised learning with the training data:
$$\langle (S_1, G_1^\lambda), (S_2, G_2^\lambda), \ldots, (S_T, G_T^\lambda)\rangle$$
Forward view linear TD(λ):
$$\Delta\mathbf{w} = \alpha\,(G_t^\lambda - \hat{v}(S_t, \mathbf{w}))\,\mathbf{x}(S_t)$$
Backward view linear TD(λ):
$$\delta_t = R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$$
$$E_t = \gamma\lambda E_{t-1} + \mathbf{x}(S_t)$$
$$\Delta\mathbf{w} = \alpha\,\delta_t E_t$$
The forward and backward views of linear TD(λ) are equivalent.
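A minimal sketch of the backward view over one episode, keeping an eligibility-trace vector; the environment interface is assumed as in the TD(0) sketch.

```python
import numpy as np

def td_lambda_episode(w, s0, policy, env_step, n_states,
                      alpha=0.05, gamma=0.99, lam=0.9):
    """Backward-view linear TD(lambda) over one episode."""
    E = np.zeros(n_states)                      # eligibility trace E_t
    s, done = s0, False
    while not done:
        r, s_next, done = env_step(s, policy(s))
        x = np.eye(n_states)[s]
        v_next = 0.0 if done else np.eye(n_states)[s_next] @ w
        delta = r + gamma * v_next - x @ w      # TD error delta_t
        E = gamma * lam * E + x                 # E_t = gamma*lambda*E_{t-1} + x(S_t)
        w = w + alpha * delta * E               # Delta w = alpha * delta_t * E_t
        s = s_next
    return w
```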
Incremental Control Algorithms
- Policy Evaluation: approximate policy evaluation, $\hat{q}(\cdot, \cdot, \mathbf{w}) \approx q_\pi$
- Policy Improvement: $\epsilon$-greedy policy improvement
Here the objective is
$$J(\mathbf{w}) = \mathbb{E}_\pi\!\left[(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}))^2\right]$$
and the corresponding update is
$$\Delta\mathbf{w} = \alpha\,(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}))\,\nabla_\mathbf{w}\hat{q}(S, A, \mathbf{w})$$
Linear approximation:
$$\hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(s, a)\, w_j$$
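As one illustrative choice for discrete problems, the state-action features can be a one-hot (table-lookup) encoding over state-action pairs; a minimal sketch, assuming states and actions are indexed from 0.

```python
import numpy as np

def state_action_features(s, a, n_states, n_actions):
    """One-hot (table-lookup) features over state-action pairs."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def q_hat(s, a, w, n_states, n_actions):
    """Linear estimate q_hat(s, a, w) = x(s, a)^T w."""
    return state_action_features(s, a, n_states, n_actions) @ w
```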
Algorithms
Now we have to substitute a target for $q_\pi(S, A)$:
MC: The target is the return $G_t$.
$$\Delta\mathbf{w} = \alpha\,(G_t - \hat{q}(S_t, A_t, \mathbf{w}))\,\nabla_\mathbf{w}\hat{q}(S_t, A_t, \mathbf{w})$$
TD(0): The target is the TD target $R_{t+1} + \gamma\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$.
$$\Delta\mathbf{w} = \alpha\,(R_{t+1} + \gamma\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}))\,\nabla_\mathbf{w}\hat{q}(S_t, A_t, \mathbf{w})$$
The same substitution works for the forward and backward views of TD(λ).
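Combining these updates with $\epsilon$-greedy action selection gives SARSA control. A minimal sketch with a linear $\hat{q}$, assuming a hypothetical environment interface (`env_reset`, `env_step`) and a feature function `features(s, a)` such as the one above.

```python
import random
import numpy as np

def epsilon_greedy(w, feature_rows, eps=0.1):
    """Pick an action epsilon-greedily from q_hat(s, a, w) = x(s, a)^T w."""
    if random.random() < eps:
        return random.randrange(len(feature_rows))
    return int(np.argmax([x @ w for x in feature_rows]))

def sarsa_episode(w, env_reset, env_step, features, n_actions,
                  alpha=0.1, gamma=0.99, eps=0.1):
    """One episode of SARSA control with a linear action-value approximator."""
    s = env_reset()
    a = epsilon_greedy(w, [features(s, b) for b in range(n_actions)], eps)
    done = False
    while not done:
        r, s_next, done = env_step(s, a)
        x = features(s, a)
        if done:
            target = r                           # no bootstrap after the terminal step
        else:
            a_next = epsilon_greedy(
                w, [features(s_next, b) for b in range(n_actions)], eps)
            target = r + gamma * features(s_next, a_next) @ w
        w = w + alpha * (target - x @ w) * x     # gradient of x^T w is x
        if not done:
            s, a = s_next, a_next
    return w
```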
SARSA Control with Action-Value Function Approximation
Gradient Temporal-Difference Learning
Batch Methods
Least Squares Prediction
Given a value function approximation $\hat{v}(s, \mathbf{w})$ and experience $\mathcal{D}$ consisting of $\langle$state, value$\rangle$ pairs
$$\mathcal{D} = \{\langle s_1, v_1^\pi\rangle, \langle s_2, v_2^\pi\rangle, \ldots, \langle s_T, v_T^\pi\rangle\},$$
we want to learn the parameters $\mathbf{w}$ of the best-fitting value function $\hat{v}(s, \mathbf{w})$.
Least-squares algorithms find the $\mathbf{w}$ minimizing the sum-of-squares error between the approximate values and the target values:
$$LS(\mathbf{w}) = \sum_{t=1}^{T}\left(v_t^\pi - \hat{v}(s_t, \mathbf{w})\right)^2 = \mathbb{E}_\mathcal{D}\!\left[(v^\pi - \hat{v}(s, \mathbf{w}))^2\right]$$
SGD with Experience Replay
Repeatedly sample a state-value pair from the experience
$$\langle s, v^\pi\rangle \sim \mathcal{D}$$
and apply the SGD update
$$\Delta\mathbf{w} = \alpha\,(v^\pi - \hat{v}(s, \mathbf{w}))\,\nabla_\mathbf{w}\hat{v}(s, \mathbf{w})$$
This converges to the least-squares solution
$$\mathbf{w}^\pi = \arg\min_{\mathbf{w}} LS(\mathbf{w})$$
Experience Replay in Deep Q-Networks
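A minimal sketch of this kind of experience replay applied to action values: transitions are stored in a replay memory, minibatches are sampled from it, and the Q-targets are computed with an older frozen copy `w_minus` of the parameters (fixed Q-targets). The environment interface (`env_reset`, `env_step`, `featurize`) is assumed, and a linear Q-function is used purely for brevity.

```python
import random
from collections import deque
import numpy as np

def q_values(w, x):
    """Q-estimates for all actions; w has shape (n_actions, n_features)."""
    return w @ x

def replay_q_learning(env_reset, env_step, featurize, n_actions, n_features,
                      n_episodes=500, batch_size=32, alpha=1e-3, gamma=0.99,
                      eps=0.1, target_sync=100):
    """Q-learning with experience replay and fixed Q-targets (linear sketch)."""
    w = np.zeros((n_actions, n_features))           # online parameters
    w_minus = w.copy()                              # frozen target parameters
    memory = deque(maxlen=10_000)                   # replay memory D
    steps = 0
    for _ in range(n_episodes):
        s, done = env_reset(), False
        while not done:
            x = featurize(s)
            a = (random.randrange(n_actions) if random.random() < eps
                 else int(np.argmax(q_values(w, x))))          # eps-greedy w.r.t. w
            r, s_next, done = env_step(s, a)
            memory.append((x, a, r, featurize(s_next), done))  # store transition
            s = s_next
            steps += 1
            if len(memory) >= batch_size:
                for xb, ab, rb, xb_next, d in random.sample(list(memory), batch_size):
                    target = rb if d else rb + gamma * np.max(q_values(w_minus, xb_next))
                    w[ab] += alpha * (target - q_values(w, xb)[ab]) * xb   # SGD step
            if steps % target_sync == 0:
                w_minus = w.copy()                  # periodically refresh fixed targets
    return w
```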
Deep Q-Networks (DQN) use this replay-plus-fixed-targets scheme with a deep neural network as the action-value approximator.
Least Squares Control
Least Squares Action-Value Function Approximation
Approximate the action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$.
Minimize the least-squares error between $\hat{q}(s, a, \mathbf{w})$ and $q_\pi(s, a)$ over experience generated by policy $\pi$, consisting of $\langle$(state, action), value$\rangle$ pairs
$$\mathcal{D} = \{\langle (s_1, a_1), v_1^\pi\rangle, \langle (s_2, a_2), v_2^\pi\rangle, \ldots, \langle (s_T, a_T), v_T^\pi\rangle\}$$
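For a linear $\hat{q}$ this objective can be evaluated over the stored experience and, if desired, minimized directly via the normal equations. A minimal sketch; `featurize(s, a)` and the layout of `experience` (a list of `((s, a), v_target)` pairs) are illustrative assumptions.

```python
import numpy as np

def least_squares_error(w, experience, featurize):
    """LS(w) = sum over D of (v_pi - q_hat(s, a, w))^2 for a linear q_hat."""
    return sum((v_target - featurize(s, a) @ w) ** 2
               for (s, a), v_target in experience)

def least_squares_fit(experience, featurize):
    """Minimize LS(w) in closed form via the normal equations."""
    X = np.array([featurize(s, a) for (s, a), _ in experience])
    v = np.array([v_target for _, v_target in experience])
    return np.linalg.pinv(X.T @ X) @ (X.T @ v)   # w = (X^T X)^+ X^T v
```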