Q Learning is an off-policy RL method — the policy that
generates the data does not need to be the same policy that you
are improving using it. In the discrete tabular setting, this can be solved
via simple dynamic programming.
The Q function $Q(s, a)$ should tell you the discounted
sum of rewards for taking action $a$ in state $s$, and thus can be framed recursively as a local
update.
Given state $s$, action $a$, reward $r$ for taking $a$ at $s$, next state $s'$, and discount factor $\gamma$:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$
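As a minimal sketch of this backup as a local, tabular update (the `env.step` interface, table sizes, and hyperparameters below are illustrative assumptions, not anything canonical):

```python
import numpy as np

# Minimal tabular Q-learning sketch. The env interface (step returning
# (next_state, reward, done)), table sizes, and hyperparameters are
# illustrative assumptions.
n_states, n_actions = 16, 4
gamma, alpha, eps = 0.99, 0.1, 0.1
Q = np.zeros((n_states, n_actions))

def q_learning_step(env, state):
    # epsilon-greedy behavior policy (any behavior policy works, since Q-learning is off-policy)
    if np.random.rand() < eps:
        action = np.random.randint(n_actions)
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward, done = env.step(action)
    # local update toward the Bellman target r + gamma * max_a' Q(s', a')
    target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
    Q[state, action] += alpha * (target - Q[state, action])
    return next_state, done
```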
Thus, the recursion can be written as a least squares loss over the parameters $\theta$ of an approximate Q function $Q_\theta$:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s')}\left[\Big(Q_\theta(s, a) - \big(r + \gamma \max_{a'} Q_\theta(s', a')\big)\Big)^2\right]$$
In a discrete action space, $\max_{a'} Q_\theta(s', a')$ is
computable as we can enumerate all possible actions. In a continuous
action space, this is not possible. Thus, we must replace this
exhaustive max with a learned “actor” that takes actions, with the Q
function taking the role of “critic”.
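Before moving to the continuous case, here is a rough sketch of the least squares loss above for discrete actions, where the max over $a'$ is just an exhaustive max over the Q network's output head; the network sizes, batch layout, and choice of PyTorch are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the least squares TD loss when actions are discrete, so the
# max over a' is an exhaustive max over the Q network's output head.
# Sizes, batch layout, and the use of PyTorch are assumptions.
obs_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def td_loss(s, a, r, s_next, done):
    # s: [B, obs_dim] float, a: [B] long action indices, r/done: [B] float
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q_theta(s, a)
    with torch.no_grad():                                     # target held fixed
        max_q_next = q_net(s_next).max(dim=1).values          # enumerate all actions
        target = r + gamma * (1.0 - done) * max_q_next
    return F.mse_loss(q_sa, target)
```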
In DDPG, the actor learns a simple deterministic mapping $\mu_\phi$ from state $s$
to action $a$, with noise added for exploration during data collection, i.e.

$$a = \mu_\phi(s) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

during training and

$$a = \mu_\phi(s)$$

during inference. Thus, for a given batch of data, the critic can be
optimized via a modified version of the least squares loss above, i.e.

$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s')}\left[\Big(Q_\theta(s, a) - \big(r + \gamma\, Q_\theta(s', \mu_\phi(s'))\big)\Big)^2\right]$$
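A rough sketch of this critic update, again in PyTorch with assumed network sizes and batch layout; the target networks and replay buffer used in practice are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the DDPG critic update: the exhaustive max over a' is replaced
# by the deterministic actor's action at the next state. Sizes and batch
# layout are assumptions; target networks and the replay buffer are omitted.
obs_dim, act_dim, gamma = 8, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def q(s, a):
    return critic(torch.cat([s, a], dim=-1)).squeeze(-1)      # Q_theta(s, a)

def critic_loss(s, a, r, s_next, done):
    with torch.no_grad():
        a_next = actor(s_next)                                # mu_phi(s') stands in for the max
        target = r + gamma * (1.0 - done) * q(s_next, a_next)
    return F.mse_loss(q(s, a), target)
```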
The actor is then optimized to maximize the $Q$ value; this can be done by
simply minimizing the negative of the $Q$ function, i.e.

$$\mathcal{L}(\phi) = -\,\mathbb{E}_{s}\left[Q_\theta(s, \mu_\phi(s))\right]$$
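And a sketch of the actor update, continuing from the critic sketch above (it reuses the hypothetical `actor` and `q` defined there). Only the actor's parameters $\phi$ should be handed to the actor's optimizer, so the gradient flows through the critic without moving $\theta$; the Gaussian noise in `select_action` simply stands in for whatever exploration noise is used during collection:

```python
import torch

# Continues the critic sketch above, reusing the hypothetical `actor` and `q`.
def actor_loss(s):
    # minimize -E[Q_theta(s, mu_phi(s))]; pass only actor.parameters() to the
    # actor's optimizer so the critic's weights stay fixed during this step
    return -q(s, actor(s)).mean()

def select_action(s, noise_std=0.1, explore=True):
    with torch.no_grad():
        a = actor(s)                              # a = mu_phi(s) at inference
    if explore:                                   # add noise during data collection
        a = a + noise_std * torch.randn_like(a)
    return a
```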