
Model Free Prediction and Control

Doing prediction and control without knowing the model of the environment.

Prediction

Prediction in this context simply means estimating the state or state-action value function. It is the same as policy evaluation, i.e. "evaluating" the policy by finding the value function under that policy.

Monte-Carlo Policy Evaluation

In this method, to find the values of states, we run episodes and average the returns observed across all episodes for each state to get its value. Concretely, we compute $G_t = R_{t+1} + \gamma R_{t+2} + \dots$ from episodes ($S_1, A_1, R_2, S_2, A_2, R_3, \dots$). Hence $V(s)$ is the average of all the $G_t$ generated from the episodes.

Monte Carlo learning is unbiased with high variance (the opposite of TD learning, which is biased with low variance).

MC policy evaluation uses the empirical mean return instead of the expected return.

Two variants: First-Visit and Every-Visit.

First-Visit:

  • To evaluate state s

  • The first time-step t that state s is visited in an episode:

  • Increment counter N(s) ← N(s) + 1

  • Increment total return S(s) ← S(s) + G_t

  • Value is estimated by the mean return V(s) = S(s)/N(s)

  • By the law of large numbers, V(s) → v_π(s) as N(s) → ∞

Every-Visit:

  • To evaluate state s

  • Every time-step t that state s is visited in an episode:

  • Increment counter N(s) ← N(s) + 1

  • Increment total return S(s) ← S(s) + G_t

  • Value is estimated by the mean return V(s) = S(s)/N(s)

  • Again, V(s) → v_π(s) as N(s) → ∞

Incremental Update:

  • Update V(s) incrementally after each episode S_1, A_1, R_2, ..., S_T

  • For each state S_t with return G_t:

    $$N(S_t) \leftarrow N(S_t) + 1 \qquad V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$$

  • In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:

    $$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$$
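A minimal sketch of incremental every-visit Monte-Carlo evaluation. The episode generator `run_episode(policy)` returning a list of `(state, reward)` pairs is a hypothetical helper, not something defined in these notes:

```python
from collections import defaultdict

def mc_evaluate(run_episode, policy, gamma=0.99, num_episodes=1000):
    """Incremental every-visit Monte-Carlo policy evaluation (sketch)."""
    V = defaultdict(float)   # value estimates V(s)
    N = defaultdict(int)     # visit counts N(s)
    for _ in range(num_episodes):
        episode = run_episode(policy)        # [(S_t, R_{t+1}), ...] for one episode
        G = 0.0
        # iterate backwards so G accumulates the discounted return from t onwards
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            V[state] += (G - V[state]) / N[state]   # running-mean update
    return V
```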

TD Learning

Simplest TD learning algorithm: TD(0)

  • Update the value V(S_t) towards the estimated return $R_{t+1} + \gamma V(S_{t+1})$ instead of $G_t$ as in Monte Carlo:

    $$V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)$$

  • $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target.

  • $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.

This approximation is essentially the one-step application of the Bellman operator on the estimate of V. This idea is called bootstrapping and is a cornerstone of temporal difference learning.
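A minimal TD(0) evaluation loop. The environment interface `env.reset()` / `env.step(action) -> (next_state, reward, done)` and the `policy(state)` function are assumptions for the sketch, not part of these notes:

```python
from collections import defaultdict

def td0_evaluate(env, policy, alpha=0.1, gamma=0.99, num_episodes=1000):
    """TD(0) policy evaluation (sketch)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)      # assumed interface
            # bootstrap: the TD target uses the current estimate V(next_state)
            td_target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (td_target - V[state])        # move towards the TD target
            state = next_state
    return V
```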

Advantages & Disadvantages of Both:

  • TD can learn online after every step; MC must wait until the episode completes and the return is known.

  • TD works in continuing (non-terminating) environments; MC only works in episodic (terminating) environments.

Bias-Variance Trade Off

  • The return $G_t = R_{t+1} + \gamma R_{t+2} + \dots$ is an unbiased estimate of $v_\pi(S_t)$.

  • The TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$, because the return $G_t$ is empirical while the TD target relies on a guess of the value function.

  • The TD target has lower variance than the return $G_t$:

    • The return depends on many stochastic actions, transitions and rewards.

    • The TD target depends on only one stochastic action, transition and reward.

Batch MC and TD

MC and TD converge: V(s) → v_π(s) as experience → ∞.

Batch mode: only K episodes of experience are available:

$$S_1^1, A_1^1, R_2^1, \dots, S_{T_1}^1 \\ \vdots \\ S_1^K, A_1^K, R_2^K, \dots, S_{T_K}^K$$

  • Repeatedly sample an episode k ∈ [1, K]

  • Apply MC or TD(0) to episode k.

Certainty Equivalence

  • MC converges to the solution with minimum mean-squared error:

    • the best fit to the observed returns: $$\sum_{k=1}^K \sum_{t=1}^{T_k} \left(G_t^k - V(S_t^k)\right)^2$$

  • TD(0) converges to the solution of the maximum-likelihood Markov model:

    • the MDP that best fits the data: $$\widehat{P}_{s,s'}^a = \frac{1}{N(s,a)} \sum_{k=1}^K \sum_{t=1}^{T_k} \mathbf{1}(s_t^k, a_t^k, s_{t+1}^k = s, a, s') \qquad \widehat{R}_s^a = \frac{1}{N(s,a)} \sum_{k=1}^K \sum_{t=1}^{T_k} \mathbf{1}(s_t^k, a_t^k = s, a)\, r_t^k$$

  • TD exploits the Markov property.

  • MC does not exploit the Markov property.
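A rough sketch of the certainty-equivalence idea: estimating the maximum-likelihood model $\widehat{P}$ and $\widehat{R}$ by counting transitions in a batch of episodes. The episode format (lists of `(s, a, r, s_next)` tuples) and the function name are my own assumptions:

```python
from collections import defaultdict

def fit_mle_model(episodes):
    """Estimate P_hat(s' | s, a) and R_hat(s, a) by counting transitions in the batch."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = transition visits
    reward_sum = defaultdict(float)                  # summed rewards for (s, a)
    visits = defaultdict(int)                        # N(s, a)

    for episode in episodes:
        for s, a, r, s_next in episode:
            visits[(s, a)] += 1
            counts[(s, a)][s_next] += 1
            reward_sum[(s, a)] += r

    P_hat = {sa: {s2: c / visits[sa] for s2, c in nexts.items()}
             for sa, nexts in counts.items()}
    R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P_hat, R_hat
```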

Bootstrapping and Sampling

  • Bootstrapping: the update involves an estimate

    • MC does not bootstrap

    • DP bootstraps

    • TD bootstraps

  • Sampling: the update samples an expectation

    • MC samples

    • DP does not sample

    • TD samples

TD(λ)

n-step prediction

  • n = 1 (TD(0)): $G_t^1 = R_{t+1} + \gamma V(S_{t+1})$

  • n = 2: $G_t^2 = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$

  • and so on

  • n = ∞ (MC): $G_t^\infty = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$

λ-return

The λ-return $G_t^\lambda$ combines all the n-step returns $G_t^n$:

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^n$$

Here $\lambda = 0$ recovers TD(0) (one-step TD) and $\lambda = 1$ recovers MC.

Forward-View TD(λ)

Update the value function towards the λ-return:

$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^\lambda - V(S_t)\right)$$

Like MC, this needs complete episodes, since the λ-return contains the full (Monte-Carlo) return.
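A small sketch of computing the n-step returns and the λ-return for one finite episode, given a list of rewards (`rewards[k]` standing for $R_{k+1}$) and current value estimates. The function and variable names are my own:

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^n = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n}), truncated at episode end."""
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                      # bootstrap only if the episode has not terminated
        G += gamma ** n * values[end]
    return G

def lambda_return(rewards, values, t, gamma, lam):
    """G_t^lambda = (1 - lambda) sum_n lambda^(n-1) G_t^n, with the tail weight on the full return."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):        # n-step returns that still bootstrap
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)  # full MC return
    return G_lam
```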

Backward-View TD(λ) and Eligibility Traces

The backward view updates every state online, in proportion to the TD error and that state's eligibility trace (see Backward View Sarsa(λ) below for the action-value version).

Control

Control in this context means learning how to act, i.e. what actions to take in the environment (improving the policy).

Off-Policy Vs On-Policy

In off-policy methods the agent learns the value function of a target policy (e.g. the optimal policy) while acting according to a different behaviour policy. In on-policy methods the agent learns the value function of the very policy that dictates its behaviour.

On-Policy MC Control

Policy evaluation: Monte-Carlo policy evaluation of Q. Policy improvement: ε-greedy policy improvement.

ε-Greedy Exploration

Let there be m actions. We want to assign non-zero probability to every action, but the greedy action should get more probability: with probability 1 − ε choose the greedy action, and with probability ε choose an action uniformly at random.

$$\pi(a|s) = \begin{cases} \epsilon/m + 1 - \epsilon & \text{if } a^* = \arg\max_{a \in A} Q(s,a) \\ \epsilon/m & \text{otherwise} \end{cases}$$
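A minimal ε-greedy action-selection helper over a tabular Q (function and variable names are my own):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick the greedy action w.p. 1 - epsilon, otherwise a uniformly random action."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit current estimates
```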

Temporal-Difference Learning: Sarsa(λ)

$$Q(S,A) \leftarrow Q(S,A) + \alpha\left(R + \gamma Q(S', A') - Q(S,A)\right)$$

Every time-step: policy evaluation is the Sarsa update above, and policy improvement is ε-greedy policy improvement.

n-step Sarsa:

This is the same idea as the n-step return in TD(λ), but for action values.

  • n = 1 (Sarsa): $q_t^1 = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$

  • n = 2: $q_t^2 = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2})$

  • and so on

  • n = ∞ (MC): $q_t^\infty = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$

Forward View Sarsa(λ)

$$q_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^n \qquad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left(q_t^\lambda - Q(S_t, A_t)\right)$$

Backward View Sarsa(λ)

$$E_t(s,a) = \gamma \lambda E_{t-1}(s,a) + \mathbf{1}(S_t = s, A_t = a)$$

Q(s,a) is updated for every state s and action a, in proportion to the TD error $\delta_t$ and the eligibility trace $E_t(s,a)$:

$$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \qquad Q(s,a) \leftarrow Q(s,a) + \alpha \delta_t E_t(s,a)$$

Sarsa(λ) Algorithm:
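A rough tabular sketch of backward-view Sarsa(λ) with accumulating traces and ε-greedy improvement, reusing the assumed `env.reset()` / `env.step()` interface and the `epsilon_greedy` helper sketched earlier:

```python
from collections import defaultdict

def sarsa_lambda(env, actions, alpha=0.1, gamma=0.99, lam=0.9,
                 epsilon=0.1, num_episodes=500):
    """Backward-view Sarsa(lambda) with accumulating eligibility traces (sketch)."""
    Q = defaultdict(float)                     # Q[(s, a)]
    for _ in range(num_episodes):
        E = defaultdict(float)                 # eligibility traces, reset every episode
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s2, r, done = env.step(a)          # assumed to return (next_state, reward, done)
            a2 = epsilon_greedy(Q, s2, actions, epsilon)
            delta = r + gamma * Q[(s2, a2)] * (not done) - Q[(s, a)]
            E[(s, a)] += 1.0                   # bump the trace of the visited pair
            for sa in list(E):                 # update all (s, a) in proportion to delta * E
                Q[sa] += alpha * delta * E[sa]
                E[sa] *= gamma * lam           # decay traces
            s, a = s2, a2
    return Q
```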

Off-Policy Learning

Evaluate the target policy (generally the optimal policy) $\pi(a|s)$ to compute $v_\pi(s)$ or $q_\pi(s,a)$, while following the behaviour policy $\mu(a|s)$ to take actions when interacting with the environment.

$$\{S_1, A_1, R_2, \dots, S_T\} \sim \mu$$

This allows:

  • Learn from observing humans or other agents

  • Learn about the optimal policy while following an exploratory policy

  • Learn about multiple policies while following one policy

Importance Sampling

Estimate the expectation under one distribution using samples drawn from another:

$$\mathbb{E}_{X \sim P}[f(X)] = \sum P(X) f(X) = \sum Q(X) \frac{P(X)}{Q(X)} f(X) = \mathbb{E}_{X \sim Q}\!\left[\frac{P(X)}{Q(X)} f(X)\right]$$
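A tiny numerical sketch of importance sampling. The target P, proposal Q and test function are arbitrary choices for illustration: it estimates $\mathbb{E}_{X \sim P}[X^2]$ for $P = \mathcal{N}(1, 1)$ using samples from $Q = \mathcal{N}(0, 2^2)$ weighted by $P(X)/Q(X)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_pdf(x):   # density of N(1, 1)
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):   # density of N(0, 2^2)
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # samples from Q
weights = p_pdf(x) / q_pdf(x)                      # importance weights P(X)/Q(X)
estimate = np.mean(weights * x ** 2)               # should be close to E_P[X^2] = 1 + 1 = 2
print(estimate)
```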

Q-Learning: Off-Policy TD Control

Q-learning is a model-free algorithm, i.e. it focuses on learning the action-value function and uses these estimates to obtain an optimal policy.

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma \max_{a' \in A} Q(s', a') - Q(s,a)\right)$$

Where:

  • Q(s, a) is the estimate of the action-value function for the pair (s, a),

  • α ∈ [0, 1] is the learning rate that determines how "quickly" new information overrides old information,

  • r + γ max_{a'} Q(s', a') is the learned value, composed of the immediate one-step reward and the estimate of the optimal future value.

The idea behind this algorithm is to transform the Q∗ formula given in the equation below into an iterative approximation procedure. At each step, we update the current estimate of Q∗ using estimates of the optimal future value.

The update comes from the Bellman optimality equation for the action-value function:

$$Q_*(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} Q_*(s', a')$$
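A minimal tabular Q-learning loop, again reusing the assumed `env` interface and the `epsilon_greedy` helper sketched earlier:

```python
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.99, epsilon=0.1, num_episodes=1000):
    """Tabular Q-learning (off-policy TD control) sketch."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)   # behaviour policy: epsilon-greedy
            s2, r, done = env.step(a)                    # assumed interface
            # target policy is greedy: bootstrap from max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```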

Summary

Exploration Vs Exploitation

On the one hand, exploitation denotes the behavior of an agent that selects an action because it is believed to be the "best" one, i.e. the one supposed to yield the highest reward; this action is dictated by the greedy (optimal) policy. On the other hand, exploration is when all actions have the same probability of being chosen, so the agent literally explores all its possibilities. It is easy to see that using exclusively one of these behaviors is not efficient.

Why is pure exploitation or pure exploration not good?

Only exploration means that, at some point, the agent has correct estimates for all state-action pairs but never uses this knowledge to maximize its rewards. It kind of defeats the whole point of learning. Without exploration, the agent can achieve an optimal reward only if it already has correct state-action estimates. However, such estimates can only be achieved through exploration. For example, if we initialize all the estimates to 0, a pure-exploitation agent will choose one action and stick to this action if the reward is positive. If the reward is negative, it proceeds with another action and eventually ends up in the same situation. Imagine a system where all the rewards are positive. The first (randomly) chosen action gives the smallest possible reward; when in this state, the agent always chooses this action because its estimate is greater than the others (0). In the end, the agent is very far from the maximum reward it could have obtained if its estimates were more accurate.

With respect to Q-Learning:

In Watkins and Dayan (1992), it is proven that the Q-values converge to the optimal Q∗ if two conditions are met. One of them states that every state-action pair has to be visited infinitely often, so the agent's policy needs to respect this condition. For this, we could simply use a policy where actions are always chosen randomly with a uniform distribution over the action space. However, the global performance of the agent during the learning phase would then be poor; we want it to maximize its return. These two opposite behaviors are called exploration and exploitation. Hence, to learn a good policy using Q-learning we have to maintain a balance between exploitation and exploration.

Fix for This: ϵ-greedy

The optimal action a∗ is selected with a probability 1 − ϵ and a random action with a probability ϵ.

$$a_t^* = \arg\max_a Q_t(a) \qquad a_t = \begin{cases} a_t^* & \text{with probability } 1-\epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$

Deep Q-Networks

DQNs are deep learning models that combine deep convolutional neural networks with the Q-learning algorithm. An important contribution made by DQNs is the use of experience replay and a target network to stabilize the training of the Q action-value function approximation with deep neural networks. Q-network thus refers to a neural network function approximator of the Q-function:

$$Q(s, a; \theta) \approx Q^*(s, a)$$

where $\theta$ are the weights of the network.

Catastrophic Forgetting and Experience Replay

Catastrophic forgetting or catastrophic interference (McCloskey and Cohen (1989), Ratcliff (1990)) denotes the tendency of neural networks to suddenly forget information that they previously learned.

Experience replay (Lin, 1992) consists of maintaining a buffer of old experiences and training the network on them.

During the training, we use a buffer where previous experiences are stored. An experience consists of an observed state and action pair, the immediate reward obtained and the next state observed. When the buffer is full, new experiences usually replace the oldest ones in the buffer.

To train the network, we sample a batch of experiences from the buffer and use them to apply the usual backpropagation algorithm.

The advantage of this approach is that, by sampling these batches of experiences, we break the correlations between consecutive samples that arise when the network is trained in the usual online fashion.
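A minimal replay-buffer sketch (fixed capacity, oldest experiences dropped first, uniform sampling); the class and method names are my own:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) experiences."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest entries are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation of online experience
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```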

Deep Q-Learning

The deep Q-learning algorithm uses experience replay. An agent's experience at time step t is denoted by $e_t = (s_t, a_t, r_t, s_{t+1})$, a tuple consisting of the current state $s_t$, the chosen action $a_t$, the reward $r_t$ and the next state $s_{t+1}$. The experiences for all time steps are stored in a replay memory, over many episodes. We then apply minibatch updates to samples of experiences.

Since the network serves as an approximator for the Q-function, each output neuron of this network corresponds to one valid action, and every action is mapped to an output neuron. Thus, after a feedforward pass of that network, the outputs are the estimated Q-values of the state-action pair defined by the input and the output’s corresponding action.
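A rough sketch of one deep Q-learning training step using the replay buffer above and a frozen target network (PyTorch; `q_net`, `target_net` and `optimizer` are hypothetical objects assumed to be created elsewhere):

```python
import torch
import torch.nn.functional as F

def dqn_train_step(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.99):
    """One minibatch update of the Q-network against targets from the frozen target network."""
    batch = buffer.sample(batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)

    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    q_values = q_net(states).gather(1, actions).squeeze(1)   # Q(s_t, a_t; theta)
    with torch.no_grad():                                    # targets come from the target network
        max_next = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * max_next * (1.0 - dones)

    loss = F.smooth_l1_loss(q_values, targets)               # Huber loss, commonly used for DQN
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```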


