Off Policy vs On Policy
On Policy

In this class of methods, the policy being learned is the same policy that interacts with the environment to generate the training data. Hence the policy being learned must also encode exploration of the environment (for example, by acting epsilon-greedily rather than purely greedily). The policy being learned is also the one used to approximate the target return, so the model is trained to match a return estimate produced by its own policy, as the SARSA sketch below illustrates.
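As a concrete illustration, here is a minimal tabular SARSA sketch on a hypothetical 5-state chain environment. The environment, hyperparameters, and helper names (`step`, `epsilon_greedy`) are illustrative assumptions, not part of the original notes; the point is the update rule, whose bootstrap target uses the next action the learned policy actually takes.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2   # hypothetical toy setup; actions: 0 = left, 1 = right
rng = np.random.default_rng(0)

def step(s, a):
    """Toy chain: reaching the right end yields reward 1 and resets to state 0."""
    if a == 1 and s == N_STATES - 1:
        return 0, 1.0
    return (s + 1, 0.0) if a == 1 else (max(s - 1, 0), 0.0)

def epsilon_greedy(Q, s, eps):
    # The learned policy itself must explore, hence the epsilon-greedy choice.
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))

Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, eps = 0.1, 0.99, 0.1

s = 0
a = epsilon_greedy(Q, s, eps)
for _ in range(5000):
    s_next, r = step(s, a)
    a_next = epsilon_greedy(Q, s_next, eps)
    # On-policy TD target: bootstraps with Q(s', a'), where a' is the action
    # the current policy actually takes next, so the model is trained to match
    # a return estimate produced by its own policy.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    s, a = s_next, a_next

print(Q)  # learned values should favour action 1 (moving right) in every state
```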
Off Policy

In this class of algorithms, the policy used to interact with the environment and collect data (the behaviour policy) is not the one we are interested in optimizing (the target policy). That is, you use one policy to explore the environment, but you want to learn a different policy that maximizes the expected return, so the data generated by the exploration policy is used to improve the target policy. Here the target return is typically approximated using the greedy policy, as in Q-learning, where the bootstrap target takes the maximum action value in the next state regardless of which action the behaviour policy actually took; see the sketch that follows.
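For contrast, here is a minimal Q-learning sketch reusing the toy environment and constants (`step`, `N_STATES`, `N_ACTIONS`, `rng`) from the SARSA example above; again, everything beyond the update rule itself is an illustrative assumption. Note that the behaviour policy is purely random, yet the greedy target policy is still learned from its data.

```python
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma = 0.1, 0.99

s = 0
for _ in range(5000):
    a = int(rng.integers(N_ACTIONS))   # behaviour policy: uniform random exploration
    s_next, r = step(s, a)
    # Off-policy TD target: bootstraps with the greedy value max_a' Q(s', a'),
    # i.e. the target policy, regardless of which action the behaviour
    # policy actually took.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

print(Q)  # the greedy policy is recovered despite purely random behaviour
```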