Value Function Approximation

So far we have represented the value function with a lookup table:

  • Every state $s$ has an entry $V(s)$

  • Every state-action pair $(s, a)$ has an entry $Q(s, a)$

With a large number of states it is impossible to keep a separate $V(s)$ entry for every state. To solve this we use a value function approximator: a function with parameters $w$ that maps a state to an estimate of its value, i.e. $\hat{v}(s, w) \approx v_\pi(s)$.

Stochastic Gradient Descent

$$J(w) = E_\pi\big[(v_\pi(s) - \hat{v}(s, w))^2\big]$$

Here, $v_\pi$ is the true target and $\hat{v}$ is the value approximation function with parameters $w$. We update $w$ using gradient descent in order to minimize the cost $J(w)$. Hence

$$\Delta w = \alpha\,(v_\pi(s) - \hat{v}(s, w))\,\nabla_w \hat{v}(s, w)$$
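This step is simply (half) the negative gradient of $J(w)$, with stochastic gradient descent sampling the expectation one state at a time:

$$-\tfrac{1}{2}\,\nabla_w J(w) = E_\pi\big[(v_\pi(s) - \hat{v}(s, w))\,\nabla_w \hat{v}(s, w)\big]$$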

The problem is that in reinforcement learning we do not know the true value function $v_\pi(s)$ beforehand. We will see below how to handle this.

Different Value Function Approximators

Linear Function Approximation

Feature Vector: We represent a state by a feature vector $x(s) = (x_1(s), x_2(s), \ldots, x_n(s))^T$, where each component $x_j(s)$ is a feature of state $s$.

Using this state feature vector and the weights, the linear approximation is:

$$\hat{v}(s, w) = x(s)^T w = \sum_j x_j(s)\, w_j$$

Hence, in this case the update becomes:

$$\Delta w = \alpha\,(v_\pi(s) - \hat{v}(s, w))\, x(s)$$
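As a quick illustration, here is a minimal sketch of this update in Python; the names `x_s` and `v_target` are my placeholders for the feature vector and whatever target value is available, not names from the notes:

```python
import numpy as np

def linear_sgd_step(w, x_s, v_target, alpha=0.01):
    """One SGD step for a linear value approximator v̂(s,w) = x(s)^T w."""
    v_hat = x_s @ w                              # current estimate v̂(s, w)
    return w + alpha * (v_target - v_hat) * x_s  # Δw = α (target − v̂) x(s)

# example usage with a 4-dimensional feature vector
w = np.zeros(4)
w = linear_sgd_step(w, x_s=np.array([1.0, 0.0, 0.5, 0.0]), v_target=2.0)
```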

Table Lookup Features: This is a special case of linear function approximation in which the feature vector has one indicator component per state, as shown below.
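Concretely, with states $s_1, \ldots, s_n$ the table-lookup feature vector is a one-hot indicator, so $\hat{v}(S, w) = x(S)^T w$ just selects one weight, and each weight plays the role of a table entry:

$$x^{\text{table}}(S) = \begin{pmatrix} \mathbf{1}(S = s_1) \\ \vdots \\ \mathbf{1}(S = s_n) \end{pmatrix}$$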

Incremental Prediction Algorithms

Here we discuss the different targets we can substitute for $v_\pi(s)$ in the update above.

Monte-Carlo with Value Function Approximation

Use the return $G_t$ calculated from the episodes in place of $v_\pi(s)$. This gives us $\langle (S_1, G_1), (S_2, G_2), \ldots, (S_T, G_T) \rangle$, which will be our training data.

Hence, for linear Monte-Carlo policy evaluation the update is:

$$\Delta w = \alpha\,(G_t - \hat{v}(S_t, w))\, x(S_t)$$
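A minimal sketch of Monte-Carlo prediction with a linear approximator; `episodes` is assumed to be an iterable of episodes, each a list of `(x_s, G_t)` pairs computed from sampled trajectories (my naming, not the notes'):

```python
import numpy as np

def mc_linear_prediction(episodes, n_features, alpha=0.01):
    """Monte-Carlo policy evaluation with linear value function approximation."""
    w = np.zeros(n_features)
    for episode in episodes:
        for x_s, G_t in episode:                # one (feature vector, return) pair per step
            w += alpha * (G_t - x_s @ w) * x_s  # Δw = α (G_t − v̂(S_t,w)) x(S_t)
    return w
```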

Monte-Carlo evaluation converges to a local optimum, even with non-linear value function approximation.

TD Learning with Value Function Approximation

The TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$ is a biased sample of the true value $v_\pi(S_t)$.

Supervised learning can be applied using the training data $\langle (S_1, R_2 + \gamma \hat{v}(S_2, w)),\ (S_2, R_3 + \gamma \hat{v}(S_3, w)), \ldots, (S_{T-1}, R_T) \rangle$:

$$\Delta w = \alpha\,(R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w))\, x(S_t)$$
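A minimal sketch of linear TD(0) prediction, assuming a hypothetical environment with a simplified `reset()` / `step(action)` interface returning `(next_state, reward, done)`, plus `policy(s)` and `features(s)` functions supplied by you:

```python
import numpy as np

def td0_linear_prediction(env, policy, features, n_features,
                          n_episodes=1000, alpha=0.01, gamma=0.99):
    """TD(0) policy evaluation with linear value function approximation."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            x_s = features(s)
            # bootstrap from the current estimate of the next state's value
            target = r if done else r + gamma * (features(s_next) @ w)
            w += alpha * (target - x_s @ w) * x_s
            s = s_next
    return w
```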

Linear TD(0) converges (close) to the global optimum.

TD(λ) with Value Function Approximation

The λ-return $G_t^\lambda$ is also a biased sample of the true value $v_\pi(S_t)$.

Apply supervised learning with the following data: $\langle (S_1, G_1^\lambda), (S_2, G_2^\lambda), \ldots, (S_T, G_T^\lambda) \rangle$

Forward view linear TD(λ):

$$\Delta w = \alpha\,(G_t^\lambda - \hat{v}(S_t, w))\, x(S_t)$$

Backward view linear TD(λ):

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)$$

$$E_t = \gamma \lambda E_{t-1} + x(S_t)$$

$$\Delta w = \alpha\, \delta_t\, E_t$$
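A minimal sketch of the backward view with accumulating eligibility traces, using the same hypothetical `env` / `policy` / `features` interface as the TD(0) sketch above:

```python
import numpy as np

def td_lambda_linear_prediction(env, policy, features, n_features,
                                n_episodes=1000, alpha=0.01,
                                gamma=0.99, lam=0.9):
    """Backward-view TD(λ) prediction with linear function approximation."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        E = np.zeros(n_features)                  # eligibility trace, reset per episode
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            x_s = features(s)
            v_next = 0.0 if done else features(s_next) @ w
            delta = r + gamma * v_next - x_s @ w  # TD error δ_t
            E = gamma * lam * E + x_s             # E_t = γλ E_{t−1} + x(S_t)
            w += alpha * delta * E                # Δw = α δ_t E_t
            s = s_next
    return w
```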

The forward and backward views of linear TD(λ) are equivalent.

Incremental Control Algorithms

  • Policy evaluation: approximate policy evaluation, $\hat{q}(\cdot, \cdot, w) \approx q_\pi$

  • Policy improvement: $\epsilon$-greedy policy improvement

Here the objective and update are:

$$J(w) = E_\pi\big[(q_\pi(S, A) - \hat{q}(S, A, w))^2\big]$$

$$\Delta w = \alpha\,(q_\pi(S, A) - \hat{q}(S, A, w))\,\nabla_w \hat{q}(S, A, w)$$

Linear approximation:

$$\hat{q}(s, a, w) = x(s, a)^T w = \sum_j x_j(s, a)\, w_j$$
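A minimal sketch of the linear $\hat{q}$ together with $\epsilon$-greedy policy improvement; `features(s, a)` is an assumed state-action feature function and `actions` a finite list of actions (both are my placeholders):

```python
import numpy as np

def q_hat(w, features, s, a):
    """Linear action-value estimate q̂(s, a, w) = x(s, a)^T w."""
    return features(s, a) @ w

def epsilon_greedy(w, features, s, actions, epsilon=0.1):
    """ε-greedy policy improvement over the approximate action values."""
    if np.random.rand() < epsilon:
        return actions[np.random.randint(len(actions))]          # explore
    return max(actions, key=lambda a: q_hat(w, features, s, a))  # exploit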

Algorithms

Now we have to substitute a target for $q_\pi(S, A)$.

MC: the target is the return $G_t$.

$$\Delta w = \alpha\,(G_t - \hat{q}(S, A, w))\,\nabla_w \hat{q}(S, A, w)$$

TD(0): the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w)$.

$$\Delta w = \alpha\,(R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w) - \hat{q}(S, A, w))\,\nabla_w \hat{q}(S, A, w)$$
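Putting these pieces together, here is a minimal sketch of this TD(0) control update in a SARSA-style loop, again under the hypothetical `env` / `features` / `actions` interface used above:

```python
import numpy as np

def td0_linear_control(env, features, actions, n_features,
                       n_episodes=1000, alpha=0.01, gamma=0.99, epsilon=0.1):
    """TD(0) control with linear q̂(s,a,w) = x(s,a)^T w and ε-greedy improvement."""
    def act(w, s):
        if np.random.rand() < epsilon:                          # explore
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: features(s, a) @ w)   # exploit

    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = act(w, s)
        while not done:
            s_next, r, done = env.step(a)
            x_sa = features(s, a)
            if done:
                target = r                                      # no bootstrap at terminal state
            else:
                a_next = act(w, s_next)
                target = r + gamma * (features(s_next, a_next) @ w)
            w += alpha * (target - x_sa @ w) * x_sa             # semi-gradient update
            if not done:
                s, a = s_next, a_next
    return w
```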

The same substitution works for the forward and backward views of TD(λ).

Gradient Temporal-Difference Learning

Batch Methods

Least Squares Prediction

Given a value function approximation $\hat{v}(s, w)$ and experience $D$ consisting of $\langle$state, value$\rangle$ pairs $D = \{\langle s_1, v_1^\pi \rangle, \langle s_2, v_2^\pi \rangle, \ldots, \langle s_T, v_T^\pi \rangle\}$.

The goal is to learn the parameters $w$ of the best-fitting value function $\hat{v}(s, w)$.

Least-squares algorithms find $w$ minimizing the sum-of-squares error between the approximate function and the target values:

$$LS(w) = \sum_{t=1}^{T} (v_t^\pi - \hat{v}(s_t, w))^2 = E_D\big[(v^\pi - \hat{v}(s, w))^2\big]$$
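For the linear case $\hat{v}(s, w) = x(s)^T w$ this objective can be minimized in closed form (a standard least-squares result; the derivation is mine, not in the notes above): setting the gradient of $LS(w)$ to zero gives

$$\sum_{t=1}^{T} x(s_t)\big(v_t^\pi - x(s_t)^T w\big) = 0 \quad\Longrightarrow\quad w = \left(\sum_{t=1}^{T} x(s_t)\, x(s_t)^T\right)^{-1} \sum_{t=1}^{T} x(s_t)\, v_t^\pi$$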

SGD with Experience Replay

  • Sample a state, value pair from the experience: $\langle s, v^\pi \rangle \sim D$

  • Apply the SGD update: $\Delta w = \alpha\,(v^\pi - \hat{v}(s, w))\,\nabla_w \hat{v}(s, w)$

This converges to the least squares solution:

$$w^\pi = \arg\min_w LS(w)$$
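A minimal sketch of SGD with experience replay for the linear case; `experience` is assumed to be an already-collected list of `(x_s, v_target)` pairs playing the role of $D$ (my naming):

```python
import numpy as np

def replay_sgd(experience, n_features, n_steps=10000, alpha=0.01):
    """SGD with experience replay for a linear value approximator."""
    w = np.zeros(n_features)
    for _ in range(n_steps):
        i = np.random.randint(len(experience))   # sample <s, v^π> ~ D
        x_s, v_target = experience[i]
        w += alpha * (v_target - x_s @ w) * x_s  # Δw = α (v^π − v̂(s,w)) x(s)
    return w
```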

Experience Replay in Deep Q-Networks

Least Squares Control

Least Squares Action-Value Function Approximation

Approximate the action-value function $\hat{q}(s, a, w) \approx q_\pi(s, a)$.

Minimize the least squares error between $\hat{q}(s, a, w)$ and $q_\pi(s, a)$ using experience generated by policy $\pi$, consisting of $\langle$(state, action), value$\rangle$ pairs $D = \{\langle (s_1, a_1), v_1^\pi \rangle, \langle (s_2, a_2), v_2^\pi \rangle, \ldots, \langle (s_T, a_T), v_T^\pi \rangle\}$.
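A minimal sketch of fitting the linear $\hat{q}$ to such a dataset in closed form; `dataset` is assumed to be a list of `(x_sa, v_target)` pairs, where `x_sa` is the state-action feature vector $x(s, a)$ (names are mine):

```python
import numpy as np

def least_squares_q_fit(dataset, n_features, ridge=1e-6):
    """Closed-form least squares fit of a linear q̂(s,a,w) = x(s,a)^T w."""
    A = ridge * np.eye(n_features)      # Σ x(s,a) x(s,a)^T, plus a small regulariser
    b = np.zeros(n_features)            # Σ x(s,a) v^π
    for x_sa, v_target in dataset:
        A += np.outer(x_sa, x_sa)
        b += x_sa * v_target
    return np.linalg.solve(A, b)        # w minimizing the squared error over D
```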
