Reward Hacking

The agent/model learns an unintended (undesirable) behaviour to achieve a high reward.

This happens because:

  • We expected the given reward to induce a particular kind of behaviour that we want the agent to execute to complete the task.

    • But the agent completes the task (achieves high reward) using a behaviour we did not want, for a few reasons.

  • Reward hacking can happen because the reward doesn't have a 1-to-1 mapping with the policy. Multiple policies can achieve the same reward.

  • It is related to the problem of reward misspecification.

  • For example, given a bipedal agent where the reward is 1 for covering some distance forward (intended to correspond to the task of walking forward), multiple policies are possible:

    • Walk

    • Crawl

    • Roll

    • But out of all these, we may have intended only the walking policy; the others are undesirable. See the sketch after this list.
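
As a minimal sketch of this idea (all function names, thresholds, and state variables here are hypothetical, not taken from any specific environment), the distance-only reward below pays any policy that moves the agent forward, whether it walks, crawls, or rolls. The second function shows one partial mitigation: also requiring a proxy for "walking", such as keeping the torso above a minimum height.

```python
def distance_reward(prev_x: float, curr_x: float, threshold: float = 1.0) -> float:
    """Reward 1 once the agent has moved `threshold` metres forward, else 0.

    Nothing here checks *how* the distance was covered, so crawling or
    rolling policies are rewarded exactly like the intended walking policy.
    """
    return 1.0 if (curr_x - prev_x) >= threshold else 0.0


def shaped_reward(prev_x: float, curr_x: float, torso_height: float,
                  min_torso_height: float = 0.8) -> float:
    """A (still imperfect) mitigation: gate the distance reward on a proxy
    for the intended gait, e.g. an upright torso above some height."""
    moved = distance_reward(prev_x, curr_x)
    upright = 1.0 if torso_height >= min_torso_height else 0.0
    return moved * upright


if __name__ == "__main__":
    # A crawling policy covers the distance but keeps the torso low:
    print(distance_reward(0.0, 1.5))                  # 1.0 -- reward hacked
    print(shaped_reward(0.0, 1.5, torso_height=0.3))  # 0.0 -- crawling no longer pays
```

Note that even the shaped version can still be hacked (e.g. by hopping or dragging one leg while staying upright); it only narrows the set of policies that achieve high reward, which is exactly the reward misspecification problem mentioned above.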
