Reward Hacking

The agent/model learns an unintended (undesirable) behaviour to achieve a high reward.

This happens because:

  • We expected the given reward to induce a particular kind of behaviour that we want the agent to execute to complete the task.

    • But the agent completes the task (achieves high reward) using a behaviour we did not want, for a few reasons.

  • Reward hacking can happen because the reward doesn't have a 1-to-1 mapping with the policy. Multiple policies can achieve the same reward.

  • It is related to the problem of reward misspecification.

  • For example, given a bipedal agent where the reward is 1 for covering some distance forward (intended to correspond to the task of walking forward), multiple policies are possible:

    • Walk

    • Crawl

    • Roll

    • But out of all these, we may have intended only the walking policy; the others are undesirable. See the sketch after this list.
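
As a minimal sketch of this idea (all function names, thresholds, and state variables here are hypothetical, not taken from any specific environment), the distance-only reward below pays any policy that moves the agent forward, whether it walks, crawls, or rolls. The second function shows one partial mitigation: also requiring a proxy for "walking", such as keeping the torso above a minimum height.

```python
def distance_reward(prev_x: float, curr_x: float, threshold: float = 1.0) -> float:
    """Reward 1 once the agent has moved `threshold` metres forward, else 0.

    Nothing here checks *how* the distance was covered, so crawling or
    rolling policies are rewarded exactly like the intended walking policy.
    """
    return 1.0 if (curr_x - prev_x) >= threshold else 0.0


def shaped_reward(prev_x: float, curr_x: float, torso_height: float,
                  min_torso_height: float = 0.8) -> float:
    """A (still imperfect) mitigation: gate the distance reward on a proxy
    for the intended gait, e.g. an upright torso above some height."""
    moved = distance_reward(prev_x, curr_x)
    upright = 1.0 if torso_height >= min_torso_height else 0.0
    return moved * upright


if __name__ == "__main__":
    # A crawling policy covers the distance but keeps the torso low:
    print(distance_reward(0.0, 1.5))                  # 1.0 -- reward hacked
    print(shaped_reward(0.0, 1.5, torso_height=0.3))  # 0.0 -- crawling no longer pays
```

Note that even the shaped version can still be hacked (e.g. by hopping or dragging one leg while staying upright); it only narrows the set of policies that achieve high reward, which is exactly the reward misspecification problem mentioned above.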
