Reinforcement Learning

Standard Q Learning — Discrete Space

Q Learning is an off-policy RL method: the policy that generates the data does not need to be the same as the policy being improved with it. In the discrete, tabular setting, the problem can be solved via simple dynamic programming. Q(s, a) should estimate the expected discounted sum of future rewards, and so it can be framed recursively as a local update.

Given state s_t, action a_t, the reward r_{t+1} and next state s_{t+1} observed after taking a_t at s_t, and discount factor \gamma:

Q_{\textup{new}}(s_t, a_t) \gets \left(r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') \right)

Equivalently, this can be written as a least-squares loss function L_Q:

L_Q(s_t, a_t) = \left(Q(s_t, a_t) - \left(r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') \right) \right)^2
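
As a minimal sketch, the tabular update can be implemented in a few lines of Python/NumPy. The environment sizes and the learning rate alpha (which smooths the hard assignment above, as is standard in practice) are assumptions for illustration.

import numpy as np

# Assumed sizes for a small discrete environment (illustrative only).
n_states, n_actions = 16, 4
gamma = 0.99  # discount factor
alpha = 0.1   # learning rate; smooths the hard assignment in the update above

Q = np.zeros((n_states, n_actions))

def q_update(s_t, a_t, r_next, s_next):
    # Bootstrapped target: r_{t+1} + gamma * max_a' Q(s_{t+1}, a')
    target = r_next + gamma * np.max(Q[s_next])
    td_error = target - Q[s_t, a_t]
    Q[s_t, a_t] += alpha * td_error   # move Q(s_t, a_t) toward the target
    return td_error ** 2              # the squared loss L_Q, useful for monitoring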

Actor Critic — Handling Continuous Space

In a discrete action space, \max_{a'} is computable because we can enumerate all possible actions. In a continuous action space, this is not possible. Thus, we replace the exhaustive max with a learned “actor” that proposes actions, with the Q function taking the role of the “critic”.
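
For concreteness, a minimal actor/critic pair might look like the following in PyTorch; the architectures, hidden sizes, and names here are assumptions for illustration, not a prescribed design.

import torch
import torch.nn as nn

class Actor(nn.Module):
    # Deterministic mapping from state to a continuous action.
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # squash actions to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    # Maps a (state, action) pair to a scalar Q value.
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))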

Deep Deterministic Policy Gradient (DDPG)

In DDPG, the actor learns a simple deterministic mapping from state s_t to action a_t, with noise added for exploration during data collection, i.e.

a_t = \mu_\theta(s_t) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)

during training and

a_t = \mu_\theta(s_t)

during inference. Thus, for a given batch of data, the critic can be optimized via a modified Q loss, i.e.

L_Q(s_t, a_t) = \left(Q(s_t, a_t) - \left(r_{t+1} + \gamma Q(s_{t+1}, \mu_\theta(s_{t+1})) \right) \right)^2
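
Using the Actor/Critic modules sketched above, the critic loss can be written directly as the mean squared error against the bootstrapped target. The argument names are assumptions, and for brevity this omits the target networks and replay buffer that DDPG uses in practice for stability.

import torch
import torch.nn.functional as F

def critic_loss(critic, actor, s, a, r_next, s_next, gamma=0.99):
    # r_next is expected with shape (batch, 1) to match the critic output.
    # Target: r_{t+1} + gamma * Q(s_{t+1}, mu_theta(s_{t+1})); no gradient flows through it.
    with torch.no_grad():
        a_next = actor(s_next)
        target = r_next + gamma * critic(s_next, a_next)
    # Squared TD error between Q(s_t, a_t) and the target.
    return F.mse_loss(critic(s, a), target)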

and then the actor is optimized to maximize the Q value; this can be done by simply minimizing the negative of the Q function, i.e.

L_\mu(s_t, \theta) = -Q(s_t, \mu_\theta(s_t))
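
Continuing the same sketch, the actor loss is just the negated critic value at the actor's own action, and one update step interleaves the two losses. How the optimizers are wired up here is an assumption about a typical setup, not a prescribed recipe.

def actor_loss(critic, actor, s):
    # L_mu = -Q(s_t, mu_theta(s_t)), averaged over the batch.
    return -critic(s, actor(s)).mean()

def ddpg_step(actor, critic, actor_opt, critic_opt, s, a, r_next, s_next):
    # Critic step: regress Q toward the bootstrapped target.
    c_loss = critic_loss(critic, actor, s, a, r_next, s_next)
    critic_opt.zero_grad()
    c_loss.backward()
    critic_opt.step()

    # Actor step: only actor parameters are in actor_opt, so the critic is unchanged here.
    a_loss = actor_loss(critic, actor, s)
    actor_opt.zero_grad()
    a_loss.backward()
    actor_opt.step()

During data collection, actions would be sampled as actor(s) plus Gaussian noise, matching the exploration equation above; at inference time the noise is dropped.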